Peter Selinger: Review of Linux OCR software

Written Aug 18, 2007. Updated Aug 27, 2007.

Optical character recognition (OCR) is a difficult and finicky problem. In the open-source world, there are relatively few choices of quality OCR software. This document compares three different Linux OCR programs: Ocrad, GOCR, and Tesseract.

The programs

I tested three open-source OCR programs:

Ocrad 0.17. Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under GPL.
GOCR 0.44. GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under GPL.
Tesseract 2.00. Tesseract is an OCR engine that was developed by Hewlett Packard in the 1980s and 1990s and was state-of-the-art at the time. Under the sponsorship of Google, Tesseract was made open source in 2006. It is released under an Apache license.

All three programs could be compiled and installed out of the box using the standard "./configure", "make", and "make install" commands. Note that Tesseract 2.00 requires downloading two tarballs: the engine itself and a language-specific data set.

The test data

The test data consists of two pages of text, which were scanned in color at 600dpi. The pages were edited slightly to remove confidential information, and were cropped to a bounding box around the text area. Each page contains a single column of text.

From each original color image, I also produced two thresholded bitonal images. The first bitonal image was obtained by naively thresholding the color image at a grey level of 0.5. The second bitonal image was obtained by a custom method which first applied a highpass filter with a radius of 50 pixels, and then thresholded at a grey value of 0.02. The custom filtering and thresholding were done using the mkbitmap program, with the parameters -x -f 50 -t 0.02. The intention of the custom thresholding method was to produce a lighter image, thereby increasing separation between individual letters.
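As an illustration of the naive method only, here is a minimal Python sketch of thresholding at a fixed grey level. It assumes the image has already been reduced to grey values between 0 (black) and 1 (white), stored as a list of rows; the function name threshold and the convention that 1 means black in the output are assumptions made for this example, not part of any of the tools mentioned. The highpass filtering step of the custom method is not reproduced here.

```python
def threshold(grey, level=0.5):
    """Convert a greyscale image to a bitonal one.

    grey  -- list of rows, each a list of grey values in [0, 1]
             (assumed convention: 0 = black, 1 = white)
    level -- pixels strictly darker than this value become black

    Returns a list of rows of 0/1 values, where 1 marks a black pixel.
    """
    return [[1 if pixel < level else 0 for pixel in row] for row in grey]


if __name__ == "__main__":
    sample = [
        [0.10, 0.45, 0.55],
        [0.90, 0.50, 0.00],
    ]
    for row in threshold(sample):
        print(row)
```

A lower level leaves fewer black pixels and therefore gives a lighter image, which is the effect the custom method was aiming for.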

Here are the six images on which OCR was performed:

The images were losslessly converted to PNM and TIF to suit the input format of each of the tested programs.

For reference, here are the actual transcriptions:

Test 1 (UTF-8)
Test 2 (ASCII)

Results

I ran each of the three programs on each of the six images, with default parameters. I recorded the number of errors (in words), the number of errors (in characters), and the runtime (in CPU-seconds).

Raw output

The following chart gives, for each combination of input file and OCR program, a link to the actual output of the program. Note that the files produced by GOCR and Tesseract are UTF-8 encoded, whereas the output of Ocrad is Latin-1. Your web browser may not detect these encodings automatically.

Raw output	Ocrad	GOCR	Tesseract
Test 1, color	Output (Latin-1)	Output (UTF-8)	Output (UTF-8)
Test 1, naive bitonal	Output (Latin-1)	Output (UTF-8)	Output (UTF-8)
Test 1, custom bitonal	Output (Latin-1)	Output (UTF-8)	Output (UTF-8)
Test 2, color	Output (Latin-1)	Output (UTF-8)	Output (UTF-8)
Test 2, naive bitonal	Output (Latin-1)	Output (UTF-8)	Output (UTF-8)
Test 2, custom bitonal	Output (Latin-1)	Output (UTF-8)	Output (UTF-8)
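If the outputs are to be compared with the same tools, it can help to bring them all to one encoding first. The following is a minimal Python sketch that re-encodes Latin-1 bytes as UTF-8; the function name and the file names in the usage note are hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8.

    Every byte value is a valid Latin-1 character, so the decoding
    step cannot fail, whatever the input.
    """
    return data.decode("latin-1").encode("utf-8")
```

For example, one could read a file with open("ocrad-output.txt", "rb"), pass its contents through latin1_to_utf8, and write the result to a new file opened in "wb" mode.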

Error rate (words)

The following chart shows, for each output, the number and percentage of word errors. Each omission, insertion, and substitution of a word has been counted as an error. Punctuation is counted as belonging to the preceding word, so "landlord." instead of "landlord," counts as one word error. Multiple mistakes within a single word count as a single error. Mere substitution of a similar Unicode character, e.g. an em dash (U+2014) instead of "-", has not been counted as an error.
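One way to automate a count of this kind is an edit distance over words. The Python sketch below is an assumption about how such a count could be computed, not a description of how the figures in this review were obtained. It splits both texts on whitespace, so punctuation stays attached to the preceding word, and counts the minimum number of word omissions, insertions, and substitutions. It does not implement the exemption for similar Unicode characters. The function names are chosen for this example.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of omissions, insertions, and substitutions
    needed to turn the sequence `reference` into `hypothesis`."""
    # previous[j] = distance between the reference items seen so far
    # and the first j items of the hypothesis
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # item omitted by the OCR
                current[j - 1] + 1,      # item inserted by the OCR
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_errors(reference_text, ocr_text):
    """Word-level errors; punctuation stays attached to its word."""
    return edit_distance(reference_text.split(), ocr_text.split())


def char_errors(reference_text, ocr_text):
    """Character-level errors, computed the same way on characters."""
    return edit_distance(reference_text, ocr_text)


if __name__ == "__main__":
    print(word_errors("the landlord, said no", "the landlord. said no"))
```

Because a whole word is compared at once, several mistakes inside one word still count as a single word error. Note that a minimum edit distance can come out lower than a count made by inspecting the errors one by one.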

You can see the actual errors, marked up in red, by clicking on each link.

Each box shows the number of errors, followed by the percentage of errors in parentheses. Note that lower numbers are better. For reference, Test 1 has 315 words, and Test 2 has 740 words in total.

Error rate (words)	Ocrad	GOCR	Tesseract
Test 1, color	63 (20.00%)	94 (29.84%)	294 (93.33%)
Test 1, naive bitonal	72 (22.85%)	89 (28.25%)	21 (6.66%)
Test 1, custom bitonal	85 (26.98%)	85 (26.98%)	20 (6.34%)
Test 2, color	281 (37.97%)	409 (55.27%)	664 (89.72%)
Test 2, naive bitonal	222 (30.00%)	366 (49.45%)	5 (.67%)
Test 2, custom bitonal	168 (22.70%)	350 (47.29%)	3 (.40%)
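The percentages can be recomputed from the error counts and the word totals given above. The figures in the chart are consistent with truncation to two decimal places rather than rounding: for instance, 72/315 is 22.857...%, which appears as 22.85%. The Python sketch below reproduces this using integer arithmetic; the function name is an assumption made for this example.

```python
def percent_truncated(errors, total):
    """Return errors/total as a percentage string, truncated (not
    rounded) to two decimal places."""
    hundredths = errors * 10000 // total  # hundredths of a percent
    return f"{hundredths // 100}.{hundredths % 100:02d}"


if __name__ == "__main__":
    # Ocrad, Test 1, naive bitonal: 72 errors out of 315 words
    print(percent_truncated(72, 315))
```

The sketch prints a leading zero for values below 1%, whereas the chart writes such values as ".67%" and ".40%".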

Error rate (characters)

The following chart shows the number and percentage of character errors. As in the previous chart, lower numbers are better. As before, the actual errors can be viewed by clicking on the links. The total number of characters in Test 1 is 1766, and in Test 2 it is 3546.

Error rate (characters)	Ocrad	GOCR	Tesseract
Test 1, color	94 (5.32%)	123 (6.96%)	1327 (75.14%)
Test 1, naive bitonal	97 (5.49%)	119 (6.73%)	31 (1.75%)
Test 1, custom bitonal	101 (5.71%)	144 (8.15%)	26 (1.47%)
Test 2, color	446 (12.57%)	944 (26.62%)	2609 (73.57%)
Test 2, naive bitonal	297 (8.37%)	801 (22.58%)	8 (.22%)
Test 2, custom bitonal	200 (5.64%)	709 (19.99%)	5 (.14%)

Runtime

The total runtime in CPU-seconds (user mode plus kernel mode) was measured using the time(1) program. The tests were performed on a 900MHz Transmeta Crusoe Processor.

Runtime (seconds)	Ocrad	GOCR	Tesseract
Test 1, color	17.36	66.78	21.00
Test 1, naive bitonal	4.76	46.68	18.11
Test 1, custom bitonal	4.14	61.80	14.37
Test 2, color	15.69	149.73	44.15
Test 2, naive bitonal	5.67	126.35	27.34
Test 2, custom bitonal	4.77	224.52	25.42
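A comparable measurement of user plus kernel CPU time can also be scripted. The following Python sketch is an alternative to time(1), not the method used for the figures above; it relies on os.times(), whose child-process fields are available on Unix systems. The function name cpu_seconds is an assumption made for this example.

```python
import os
import subprocess
import sys


def cpu_seconds(command):
    """Run `command` (a list of arguments) and return the CPU time,
    user mode plus kernel mode, consumed by it and its children."""
    before = os.times()
    # Discard the program's standard output; only the timing matters.
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    after = os.times()
    user = after.children_user - before.children_user
    kernel = after.children_system - before.children_system
    return user + kernel


if __name__ == "__main__":
    # Stand-in command; replace it with the OCR invocation to be timed.
    print(cpu_seconds([sys.executable, "-c", "pass"]))
```

The resolution of these counters depends on the operating system, so very short runs may report a value close to zero.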

Discussion of test results

In terms of accuracy, Tesseract vastly outperforms both Ocrad and GOCR on bitonal images. But strangely, Tesseract is completely useless on input that is not bitonal, so one should always threshold the input first. The actual thresholding method used seems almost irrelevant, which makes me wonder why Tesseract doesn't simply threshold its non-bitonal input internally.

The accuracy of Ocrad was equal to or better than that of GOCR in all cases; the two tied only once, on the word error count for Test 1, custom bitonal.

Both Ocrad and GOCR perform very badly on italic text. It seems that these programs have been trained on non-slanted fonts only, and have no mechanism to correct for slant. On the other hand, Tesseract handles italics easily, with one notable exception: it seems to choke on the italic word "if". This last fact accounts for 7 of the 16 word-errors in Test 1.

Pre-thresholding is a necessity for Tesseract, but it also improves the output of Ocrad and GOCR with respect to Test 2. Here, the custom thresholding method works better than the naive one. On the other hand, with respect to Test 1, pre-thresholding does not improve the output; it actually makes it slightly worse. A possible conclusion is that Ocrad and GOCR work best on inputs where each letter is clearly separated.

In terms of runtime, Ocrad is very fast, Tesseract is tolerable, and GOCR is very slow. Ocrad is much faster on bitonal images than non-bitonal images, which makes it appear that it spends most of its time converting greyscale to bitonal. This might be an obvious area of improvement.
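The claim about Ocrad can be checked against the runtime chart with a little arithmetic. The Python sketch below uses the Ocrad runtimes for the color and naive bitonal inputs, and assumes that the whole difference between the two is due to handling the non-bitonal input, which the measurements alone cannot confirm. The names used are chosen for this example.

```python
# Ocrad runtimes in CPU-seconds, copied from the runtime chart:
# (color input, naive bitonal input)
OCRAD_RUNTIME = {
    "Test 1": (17.36, 4.76),
    "Test 2": (15.69, 5.67),
}


def extra_fraction(color_time, bitonal_time):
    """Fraction of the color-input runtime that is not needed when
    the input is already bitonal."""
    return 1.0 - bitonal_time / color_time


if __name__ == "__main__":
    for test, (color, bitonal) in OCRAD_RUNTIME.items():
        print(f"{test}: {extra_fraction(color, bitonal):.0%}")
```

In both tests the fraction is above one half, which is consistent with the statement that most of the time goes into the conversion.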

Conclusion

Of course, it must be stressed that the test results reported here are derived from only two scanned pages. It is possible that for other inputs, the programs rank differently. However, based on the tests reported on this page, here is a summary of my conclusions:

Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal.
Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.
GOCR gives poor output at a slow speed.

Peter Selinger / Department of Mathematics and Statistics / Dalhousie University
selinger@mathstat.dal.ca / PGP key