Tesseract OCR is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top three evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. The engine reads a binary, grey, or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images; libtiff can be added to read compressed images.
Supported Platforms
The developers regularly test on the following platforms:
· Ubuntu 6.06 (x86/32, x86/64)
· Ubuntu 6.10 (x86/32, x86/64)
· Windows (x86/32)
Additionally, we believe that the code should run on these other platforms, but we don't have the resources to test on them regularly:
· Recent Linux distributions (x86/32, x86/64)
· Mac OS X (x86, PPC)
If you're interested in supporting other platforms or languages, please get in touch with Ray Smith.
What's New in This Release:
· Preparations for thread safety:
    · Changed TessBaseAPI methods to be non-static.
    · Created a class hierarchy for the directories to hold instance data, and began moving code into the classes.
    · Moved thresholding code to a separate class.
· Added major new page layout analysis module.
· Added hOCR output.
· Added Leptonica as the main library for image I/O and handling. It is currently optional, but linking with Leptonica will be mandatory in future releases.
· Ambiguity table rewritten to allow definite replacements in place of fix_quotes.
· Added TessdataManager to combine data files into a single file.
· Some dead code deleted.
· VC++6 is no longer supported. It can't cope with the use of templates.
· Many more languages added.
· Converted most of the function header comments to Doxygen format.
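The thread-safety preparations listed above (non-static API methods, instance data held in classes) can be illustrated with a small sketch. The two classes below are hypothetical toys and not the real TessBaseAPI; their names, methods, and the "text:" prefix are assumptions made purely for illustration. The sketch only shows why moving state from static members into instances keeps two recognition jobs from overwriting each other.

```cpp
#include <cassert>
#include <string>

// Hypothetical "before" shape: all state is static, so every caller
// shares a single image. This is NOT the real Tesseract API.
class StaticToyOcr {
 public:
  static void SetImage(const std::string& image) { image_ = image; }
  static std::string Recognize() { return "text:" + image_; }

 private:
  static std::string image_;  // one copy for the whole program
};
std::string StaticToyOcr::image_;

// Hypothetical "after" shape: methods are non-static and the image
// lives in the instance, so each engine object has its own state.
class InstanceToyOcr {
 public:
  void SetImage(const std::string& image) { image_ = image; }
  std::string Recognize() const { return "text:" + image_; }

 private:
  std::string image_;  // one copy per engine object
};

// Simulates two jobs whose calls interleave, as they could on two threads.
// With static state, the second SetImage overwrites the first job's image.
std::string InterleaveStatic() {
  StaticToyOcr::SetImage("page1");   // job 1 sets its image
  StaticToyOcr::SetImage("page2");   // job 2 overwrites the shared image
  return StaticToyOcr::Recognize();  // job 1 resumes and sees job 2's image
}

// The same interleaving with one engine object per job.
std::string InterleaveInstance() {
  InstanceToyOcr first;
  InstanceToyOcr second;
  first.SetImage("page1");
  second.SetImage("page2");  // touches only the second engine
  return first.Recognize();  // job 1 still sees its own image
}

// Small self-check of the behavior described in the comments above.
void CheckInterleaving() {
  assert(InterleaveStatic() == "text:page2");
  assert(InterleaveInstance() == "text:page1");
}
```

Per-instance state alone does not make a library thread safe, which is presumably why the release notes describe these changes as preparations. It does remove the shared data that would otherwise have to be guarded by locks.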