OCRopus is an open source document analysis and OCR system.
OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
Note: To use the application, run "ocropus-batch png" from the directory containing the text images.
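For scripted use, the note above can be wrapped in a small Python helper. This is only a minimal sketch: the command "ocropus-batch png" is taken verbatim from the note, while the helper names (build_command, run_batch) and the dry_run switch are hypothetical and not part of OCRopus. The meaning of the "png" argument is not documented here, so it is passed through unchanged.

```python
import subprocess
from pathlib import Path


def build_command(argument="png"):
    """Return the argv list for the command quoted in the note.

    "argument" is passed through as-is; its exact meaning is an
    assumption, since the note only shows the literal value "png".
    """
    return ["ocropus-batch", argument]


def run_batch(image_dir, argument="png", dry_run=True):
    """Run ocropus-batch from inside image_dir (hypothetical helper).

    With dry_run=True (the default) nothing is executed and the
    command is only returned, so the sketch works without OCRopus
    installed.
    """
    image_dir = Path(image_dir)
    if not image_dir.is_dir():
        raise NotADirectoryError(str(image_dir))
    cmd = build_command(argument)
    if not dry_run:
        # cwd makes the command run "from the directory containing
        # the text images", as the note requires.
        subprocess.run(cmd, cwd=str(image_dir), check=True)
    return cmd
```

Calling run_batch("scans", dry_run=False) would launch the command inside a directory named "scans" (a placeholder name) and raise an error if the program exits with a non-zero status.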
What's New in This Release:
· OCRopus has been turned into a library
· there is a new set of command line programs for book-level recognition
· there is a new line recognizer
· there is a new component model
· OCRopus supports book-level retraining and adaptation
· there are new preprocessing functions
· there is a new language modeling system
· there are many improvements to layout analysis
· OpenFST support is now optional
· there is TIFF support
· Lua support has been factored into a separate repository (ocroscript)
· there is a separate, new Python binding