
GNU Ocrad is an OCR (Optical Character Recognition) program implemented
as a filter and based on a feature extraction method. It reads a bitmap
image in pbm format and outputs text in ISO-8859-1 (Latin-1) charset.
Also includes a layout analyser able to separate the columns or blocks
of text normally found on printed pages.
It can be used as a stand-alone console application, or as a backend to
other programs.

See the file INSTALL for compilation and installation instructions.

Try "ocrad --help" for usage instructions.

Caveats.
Scan directly in b/w mode. Convert from grayscale only if you know what
you are doing.
For better results the characters should be at least 20 pixels high.
Merged characters are always a problem. Try to avoid them.
Very bold or very light (broken) characters are also a problem.
Always see with your own eyes the pbm file before blaming Ocrad for the
results. Remember the saying, "garbage in, garbage out".

Ideas, comments, patches, donations (hardware, money, etc), etc, are welcome.

---------------------------

Debug levels ( option -D )
100 - Show raw block list, unordered.
 99 - Show recursive block list, unordered.
 98 - Show main block list, unordered.
 97 - Show recursive block list, ordered.
 96 - Show main block list, ordered.
 95..90 - reserved.
 89 - Show all blocks from every character.
 88 - Show main black blocks from every character.
 87 - Show guess list for every character.
 86 - Show best guess for every character.

---------------------------

OCR Results File (ORF)

Calling ocrad with option -x produces an orf file, that is, a parsable
file containing the ocr results. The format is as follows:

- All lines starting with '#' are ignored.
- The first valid line has the form 'source file filename'. Where
  'filename' is the name of the pbm file being processed ('-' for stdin).
  This is the only line guaranteed to exist for every input file read
  without errors. If the file, or any block or line, has no text, the
  corresponding part in the orf file will be missing.
- The second valid line has the form 'total blocks n'. Where 'n' is the
  total number of text blocks in the source image.

For each text block in the source image, the following data follows:
- A line in the form 'block i x y w h'. Where 'i' is the block number
  and 'x y w h' are the block position and size as described below for
  character boxes.
- A line in the form 'lines n'. Where 'n' is the number of lines in this
  block.

For each line in every text block, the following data follows:
- A line in the form 'line i chars n height h'. Where 'i' is the line
  number, 'n' is the number of characters in this line, and 'h' is the
  mean height of the characters in this line (in pixels).
- N lines (one for every character) in the form "x y w h b; g[, 'c'v]...".
  'x' = the left border (x-coordinate) of the char bounding box in the
        source image (in pixels).
  'y' = the top border (y-coordinate).
  'w' = the width of the bounding box.
  'h' = the height of the bounding box.
  'b' = the percent of black pixels in the bounding box.
  'g' = the number of different recognition guesses for this character.
  The result characters follow after the number of guesses in the form
  of a comma-separated list of pairs. Every pair is formed by the actual
  recognised char enclosed in single quotes, followed by the confidence
  value, without space between them.
  The higher the value of confidence, the more confident is the result.

Running './ocrad -x test1.orf examples/test1.pbm' in the source directory
will give you an example orf file.
