Optical character recognition (OCR) is a difficult and finicky problem. In the open-source world, there are relatively few high-quality OCR programs to choose from. This document compares three different Linux OCR programs: Ocrad, GOCR, and Tesseract.
All three programs could be compiled and installed out of the box using the standard "./configure", "make", and "make install" commands. Note that Tesseract 2.00 requires downloading two tarballs: the engine itself and a language-specific data set.
From each original color image, I also produced two thresholded bitonal images. The first bitonal image was obtained by naively thresholding the color image at a grey level of 0.5. The second bitonal image was obtained by a custom method that first applied a highpass filter with a radius of 50 pixels, and then thresholded at a grey value of 0.02. The custom filtering and thresholding were done with the mkbitmap program.
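The difference between the two thresholding methods can be illustrated with a minimal Python sketch. It is not mkbitmap's algorithm: the box-shaped local mean, the recentering on mid-grey, and the small radius and threshold used below are all assumptions chosen to keep the example short. Grey values run from 0 (black) to 1 (white), and the sample row imitates a page whose background darkens from left to right, with text at positions 2 and 7.

```python
def threshold(grey, level=0.5):
    """Return a bitonal image: 1 (black) where grey < level, else 0 (white)."""
    return [[1 if v < level else 0 for v in row] for row in grey]


def box_mean(grey, radius):
    """Local mean over a (2*radius+1)-square window, clipped at the border."""
    h, w = len(grey), len(grey[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(grey[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def highpass(grey, radius):
    """Subtract the local mean, recenter on 0.5 (an assumption), clamp to [0, 1]."""
    mean = box_mean(grey, radius)
    return [[min(1.0, max(0.0, v - m + 0.5)) for v, m in zip(row, mrow)]
            for row, mrow in zip(grey, mean)]


# Hypothetical one-row "page": background fades from 0.93 to 0.48,
# and the two text pixels (positions 2 and 7) are 0.3 darker than it.
page = [[0.93, 0.88, 0.53, 0.78, 0.73, 0.68, 0.63, 0.28, 0.53, 0.48]]

naive = threshold(page, 0.5)                 # misses position 2, blackens 9
custom = threshold(highpass(page, 2), 0.4)   # black at positions 2 and 7 only
print(naive)
print(custom)
```

Because the highpass step removes the slowly varying background, a single global threshold can then separate text from paper across the whole row, which the naive threshold cannot do here.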
Here are the six images on which OCR was performed:
For reference, here are the actual transcriptions:
|Input||Ocrad||GOCR||Tesseract|
|Test 1, color||Output (Latin-1)||Output (UTF-8)||Output (UTF-8)|
|Test 1, naive bitonal||Output (Latin-1)||Output (UTF-8)||Output (UTF-8)|
|Test 1, custom bitonal||Output (Latin-1)||Output (UTF-8)||Output (UTF-8)|
|Test 2, color||Output (Latin-1)||Output (UTF-8)||Output (UTF-8)|
|Test 2, naive bitonal||Output (Latin-1)||Output (UTF-8)||Output (UTF-8)|
|Test 2, custom bitonal||Output (Latin-1)||Output (UTF-8)||Output (UTF-8)|
You can see the actual errors, marked up in red, by clicking on each link.
Each box shows the number of errors, followed by the percentage of errors in parentheses. Note that lower numbers are better. For reference, Test 1 has 315 words, and Test 2 has 740 words in total.
|Error rate (words)||Ocrad||GOCR||Tesseract|
|Test 1, color||63 (20.00%)||94 (29.84%)||294 (93.33%)|
|Test 1, naive bitonal||72 (22.85%)||89 (28.25%)||21 (6.66%)|
|Test 1, custom bitonal||85 (26.98%)||85 (26.98%)||20 (6.34%)|
|Test 2, color||281 (37.97%)||409 (55.27%)||664 (89.72%)|
|Test 2, naive bitonal||222 (30.00%)||366 (49.45%)||5 (0.67%)|
|Test 2, custom bitonal||168 (22.70%)||350 (47.29%)||3 (0.40%)|
|Error rate (characters)||Ocrad||GOCR||Tesseract|
|Test 1, color||94 (5.32%)||123 (6.96%)||1327 (75.14%)|
|Test 1, naive bitonal||97 (5.49%)||119 (6.73%)||31 (1.75%)|
|Test 1, custom bitonal||101 (5.71%)||144 (8.15%)||26 (1.47%)|
|Test 2, color||446 (12.57%)||944 (26.62%)||2609 (73.57%)|
|Test 2, naive bitonal||297 (8.37%)||801 (22.58%)||8 (0.22%)|
|Test 2, custom bitonal||200 (5.64%)||709 (19.99%)||5 (0.14%)|
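The article does not say how the errors were counted. One common approach, shown here purely as an assumption, is to take the edit distance (substitutions, insertions, and deletions) between the reference transcription and the OCR output, computed over words for the word error rate and over characters for the character error rate. The sample strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def error_rate(errors, total):
    """Errors as a percentage of the reference length."""
    return 100.0 * errors / total


# Hypothetical reference text and OCR output.
reference = "the quick brown fox jumps"
ocr_output = "the qu1ck brown f0x jumps"

word_errors = edit_distance(reference.split(), ocr_output.split())
char_errors = edit_distance(reference, ocr_output)

print(word_errors, error_rate(word_errors, len(reference.split())))
print(char_errors, error_rate(char_errors, len(reference)))
print(error_rate(63, 315))  # Ocrad, Test 1, color: 63 errors in 315 words
```

Note that the percentages in the tables appear to be truncated to two decimals rather than rounded (for example, 21 of 315 is shown as 6.66%).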
|Running time||Ocrad||GOCR||Tesseract|
|Test 1, color||17.36||66.78||21.00|
|Test 1, naive bitonal||4.76||46.68||18.11|
|Test 1, custom bitonal||4.14||61.80||14.37|
|Test 2, color||15.69||149.73||44.15|
|Test 2, naive bitonal||5.67||126.35||27.34|
|Test 2, custom bitonal||4.77||224.52||25.42|
The accuracy of Ocrad was equal to or better than that of GOCR in all cases, and in fact was equal in only one case.
Both Ocrad and GOCR perform very badly on italic text. It seems that these programs have been trained on non-slanted fonts only, and have no mechanism to correct for slant. On the other hand, Tesseract handles italics easily, with one notable exception: it seems to choke on the italic word "if". This last fact accounts for 7 of the 16 word-errors in Test 1.
Pre-thresholding is a necessity for Tesseract, but it also improves the output of Ocrad and GOCR on Test 2, where the custom thresholding method works better than the naive one. On Test 1, by contrast, pre-thresholding does not help Ocrad; it actually makes its output slightly worse, and GOCR's results are mixed (slightly fewer word errors, but more character errors with the custom method). A possible conclusion is that Ocrad and GOCR work best on inputs where each letter is clearly separated.
In terms of runtime, Ocrad is very fast, Tesseract is tolerable, and GOCR is very slow. Ocrad is much faster on bitonal images than on non-bitonal images, which suggests that it spends most of its time converting greyscale to bitonal. This might be an obvious area for improvement.
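A runtime comparison of this kind can be reproduced with a small timing harness. The sketch below is an assumption about how one might measure it, not the procedure used for the figures above: it runs a command a few times and keeps the best wall-clock time. The command shown is a placeholder so that the example runs anywhere; in practice it would be replaced by an OCR invocation such as `["ocrad", "test1.pbm"]`, where the file name is hypothetical.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv several times; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder command (starts the Python interpreter and exits immediately).
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.2f} s")
```

Taking the minimum over several runs reduces the influence of disk caching and other background activity on the measurement.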