Training Tesseract

  1. Create a bunch of .tif files containing numbers in a calculator-like font. Name the files as follows:

    calc.7digitregular.exp0.tif

    (where calc is the language, 7digitregular is the font, and 0 is the file number; you can make up your own names as long as they follow the convention). If you created the images in Paint, tesseract will probably give you all sorts of warnings. I used GIMP instead (GNU Image Manipulation Program, an open-source graphics editor).
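
The naming convention from step 1 can be checked mechanically before any training is run. The sketch below is only an illustration: the regular expression and the helper name parse_training_name are hypothetical, not part of Tesseract, and the pattern assumes that neither the language nor the font name contains a dot.

```python
import re

# Pattern for the <lang>.<fontname>.exp<num>.tif convention from step 1.
# Assumption: language and font names contain no dots.
NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")

def parse_training_name(filename):
    """Return (lang, font, num), or None if the name breaks the convention."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("font"), int(match.group("num"))

print(parse_training_name("calc.7digitregular.exp0.tif"))
print(parse_training_name("calc-7digitregular-0.tif"))
```

A name that returns None here would not fit the convention used by the commands in the following steps.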

  2. (optional) Blur the images.

  3. Open CMD and run the following command:

    tesseract calc.7digitregular.exp0.tif calc.7digitregular.exp0 batch.nochop makebox

    Here tesseract searches for characters in your image and guesses what those characters are. Most probably it will guess wrong. The characters that it managed to find are stored in the file calc.7digitregular.exp0.box. If the file is empty, tesseract didn’t find anything, and you can’t work with this! Try creating an image with different number strings, spacing the numbers differently, or blurring the image.

  4. Open the .box file in a text editor. The first column holds the characters that Tesseract thinks it found, and the other columns hold the coordinates (in pixels) of each character's bounding box. Most probably it guessed wrong about half of the time. Just delete the wrong characters and type the correct ones instead.
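
For larger box files, the correction in step 4 can also be scripted. This is a minimal sketch, assuming each line has the layout that Tesseract 3 box files normally use: the character, then left, bottom, right and top pixel coordinates, then a page number. The helper names parse_box_line and fix_char and the sample line are made up for illustration.

```python
def parse_box_line(line):
    """Split one .box line into (char, left, bottom, right, top, page)."""
    parts = line.split()
    char = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    # Assumption: a missing page column is treated as page 0.
    page = int(parts[5]) if len(parts) > 5 else 0
    return char, left, bottom, right, top, page

def fix_char(line, correct_char):
    """Replace only the guessed character; the coordinates stay untouched."""
    fields = line.split()
    fields[0] = correct_char
    return " ".join(fields)

sample = "8 12 20 40 68 0"  # hypothetical line: Tesseract guessed "8"
print(parse_box_line(sample))
print(fix_char(sample, "3"))
```

Only the first column is edited, which mirrors what you would do by hand in a text editor.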

  5. Train Tesseract (run this for every image; each run produces a .tr file):

    tesseract calc.7digitregular.exp6.tif calc.7digitregular.exp6 box.train

  6. Generate the unicharset file, the set of characters that can be encountered in your “language”, on the basis of the .box files (list all of them):

    unicharset_extractor calc.7digitregular.exp6.box

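  7. Create a separate file named font_properties that defines the font properties, with one line per font. Each flag is 0 or 1, and the font name must match the one used in the file names:

    Fontname <italic> <bold> <fixed> <serif> <fraktur>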
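
As an illustration of the font_properties format, the sketch below builds one line per font. The helper font_properties_line is hypothetical, and the flag values chosen for the two fonts are assumptions for the example, not measured properties of any real font.

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Format one line: Fontname <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])

# Assumed flags: the regular font has none set, the italic font only <italic>.
lines = [
    font_properties_line("7digitregular"),
    font_properties_line("7digititalics", italic=True),
]
print("\n".join(lines))
```

Saving the printed lines to a file called font_properties gives the file that the -F option in the next step expects.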

  8. Run the shape clustering and build the prototype files from the .tr files:

    shapeclustering -F font_properties -U unicharset calc.7digitregular.exp0.tr calc.7digitregular.exp6.tr … <list all your .tr files>
    mftraining -F font_properties -U unicharset -O calc.unicharset calc.7digitregular.exp0.tr calc.7digitregular.exp6.tr … <list all your .tr files>
    cntraining calc.7digitregular.exp0.tr calc.7digitregular.exp6.tr … <list all your .tr files>
    
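  9. Rename all the files that have just been created (unicharset, shapetable, normproto, inttemp, and pffmtable) so that they start with the prefix "calc." (for example, inttemp becomes calc.inttemp). Then execute this, including the trailing dot:

    combine_tessdata calc.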
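
The renaming in step 9 is easy to get wrong by hand, so here is a small sketch of it. The function rename_with_prefix is hypothetical. It deliberately skips a file when the target name already exists (for instance if calc.unicharset is already present), which is a design assumption of this sketch rather than something Tesseract requires. The demo runs in a throwaway directory with empty stand-in files.

```python
import os
import tempfile

# File names listed in step 9.
GENERATED = ["unicharset", "shapetable", "normproto", "inttemp", "pffmtable"]

def rename_with_prefix(lang, directory="."):
    """Rename generated files to <lang>.<name>; return the names renamed."""
    renamed = []
    for name in GENERATED:
        src = os.path.join(directory, name)
        dst = os.path.join(directory, lang + "." + name)
        # Skip missing files and never overwrite an existing target.
        if os.path.exists(src) and not os.path.exists(dst):
            os.rename(src, dst)
            renamed.append(name)
    return renamed

# Demo with empty stand-in files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name in GENERATED:
    open(os.path.join(demo_dir, name), "w").close()
print(rename_with_prefix("calc", demo_dir))
print(sorted(os.listdir(demo_dir)))
```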

  10. You can now use your newly trained set (pass -l calc) to recognize more fonts or more examples, and then repeat steps 3-9:

    tesseract calc.7digitregular.exp5.tif calc.7digitregular.exp5 -l calc batch.nochop makebox
    tesseract calc.7digititalics.exp1.tif calc.7digititalics.exp1 -l calc batch.nochop makebox
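
With several images, typing the commands of steps 5 to 9 by hand gets tedious. The sketch below only assembles those command lines so they can be reviewed before anything runs. The function training_commands is hypothetical, the commands and options are exactly the ones shown in the steps above, and the manual renaming from step 9 still has to happen before the last command.

```python
def training_commands(lang, stems):
    """Build the command lines of steps 5-9 for file stems (names without .tif)."""
    cmds = []
    # Step 5: one box.train run per image.
    for stem in stems:
        cmds.append(["tesseract", stem + ".tif", stem, "box.train"])
    # Step 6: character set from all .box files.
    cmds.append(["unicharset_extractor"] + [s + ".box" for s in stems])
    # Step 8: clustering over all .tr files.
    tr_files = [s + ".tr" for s in stems]
    common = ["-F", "font_properties", "-U", "unicharset"]
    cmds.append(["shapeclustering"] + common + tr_files)
    cmds.append(["mftraining"] + common + ["-O", lang + ".unicharset"] + tr_files)
    cmds.append(["cntraining"] + tr_files)
    # Step 9: run this only after renaming the generated files.
    cmds.append(["combine_tessdata", lang + "."])
    return cmds

stems = ["calc.7digitregular.exp0", "calc.7digitregular.exp6"]
for cmd in training_commands("calc", stems):
    print(" ".join(cmd))
```

Each inner list could be passed to subprocess.run once tesseract and its training tools are on the PATH; printing first makes it easy to spot a wrong file name.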