OCRFeeder is a document layout analysis and optical character recognition system.
Given a set of images, it automatically outlines their contents, distinguishes between graphics and text, and performs OCR on the latter. It can generate documents in multiple formats, the main one being ODT.
OCRFeeder features a complete GTK graphical user interface that lets users correct any unrecognized characters, define or correct bounding boxes, set paragraph styles, clean the input images, import PDFs, save and load projects, and export everything to multiple formats.
Installation on Ubuntu:
The only packages that need to be installed on Ubuntu 8.10 are PyGoocanvas and Unpaper; the rest of the dependencies are already present in a fresh install of this version of Ubuntu. The Ocrad engine is also installed, for the reasons explained in the previous section.
To install PyGoocanvas, Ocrad and Unpaper, the following command should be executed as superuser:
apt-get install python-pygoocanvas ocrad unpaper
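Before moving on, it can be useful to confirm that the command line tools were actually installed. The following is a minimal Python sketch, not part of OCRFeeder: the helper name missing_tools is hypothetical, and it assumes the Ocrad and Unpaper packages provide executables named ocrad and unpaper on the PATH. It does not check the PyGoocanvas library.

```python
import shutil


def missing_tools(tools=("ocrad", "unpaper")):
    """Return the names from `tools` that are not found on the PATH.

    Hypothetical helper: the default names assume the executables
    installed by the Ocrad and Unpaper packages.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty list means every listed tool was found; any names returned still need to be installed.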
After all of the packages finish installing, OCRFeeder is ready to be
installed. To install it, all that is needed is to run the setup.py script as
superuser:
python setup.py install
OCRFeeder can now be run from the desktop menu or by running the *ocrfeeder* command. On the GNOME desktop, if the menu entry does not show OCRFeeder's icon, update the icon cache with the following command (as superuser):
gtk-update-icon-cache -f -t /usr/share/icons/hicolor
Command Line Usage:
This section explains how to use OCRFeeder from the command line.
The command line interface of OCRFeeder is aimed at users who want to perform quick, unattended conversions of document images to editable formats. It also makes the project usable from other applications.
Two parameters are mandatory:
1) the path to each document image to be processed is given after the parameter
--images;
2) the name of the document to be generated is given after the parameter
--o.
For example:
ocrfeeder-cli --images ~/image1.png ~/image2.jpeg \
    --o converted_document
The pages of the generated documents honor the order of the given paths.
It is also possible to specify the format of the document to be generated
(HTML or ODT) with the option --format. In case no format is specified,
the images will be exported to ODT. Continuing with the example above:
ocrfeeder-cli --images ~/image1.png ~/image2.jpeg --format HTML \
    --o converted_document
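Since the command line interface is meant to be usable from other applications, the invocation above can be wrapped in a small script. The following is a minimal Python sketch, not part of OCRFeeder itself: the helper names build_cli_command and convert are hypothetical, and it assumes that ocrfeeder-cli is on the PATH and accepts the --images, --o and --format options exactly as described in this section.

```python
import os
import subprocess


def build_cli_command(images, output, fmt=None):
    """Build the ocrfeeder-cli argument list (hypothetical helper)."""
    if not images:
        raise ValueError("at least one image path is required")
    # Expand "~" here, because no shell is involved when running a list.
    command = ["ocrfeeder-cli", "--images"]
    command += [os.path.expanduser(path) for path in images]
    if fmt is not None:
        # This section names HTML and ODT as the available formats.
        if fmt not in ("HTML", "ODT"):
            raise ValueError("format must be 'HTML' or 'ODT'")
        command += ["--format", fmt]
    command += ["--o", output]
    return command


def convert(images, output, fmt=None):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_cli_command(images, output, fmt), check=True)
```

Calling convert(["~/image1.png", "~/image2.jpeg"], "converted_document", "HTML") would run the same command as the example above. The image paths are kept in the order given, which matters because the page order of the generated document follows it.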
OCRFeeder Studio (the graphical user interface part) can also be launched
from the command line. Two options can be used to load images as soon as
the program starts: --images, which adds the images given as the option's
arguments, and --dir, which adds all the images under a given directory
path. The options can be used individually or combined, for example:
ocrfeeder --images ~/image1.png ~/image2.jpeg \
    --dir ~/Desktop
For any usage, the options and parameters can be given in any order.
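Launching OCRFeeder Studio with preloaded images can be scripted in the same spirit. This is only a sketch under stated assumptions: build_studio_command and launch_studio are hypothetical names, and the ocrfeeder command is assumed to be on the PATH and to accept --images and --dir as described above.

```python
import os
import subprocess


def build_studio_command(images=(), directory=None):
    """Build the argument list for launching OCRFeeder Studio (hypothetical helper)."""
    command = ["ocrfeeder"]
    if images:
        command.append("--images")
        # Expand "~" here, because no shell is involved when running a list.
        command.extend(os.path.expanduser(path) for path in images)
    if directory is not None:
        command += ["--dir", os.path.expanduser(directory)]
    return command


def launch_studio(images=(), directory=None):
    """Start the GUI without waiting for it to exit."""
    return subprocess.Popen(build_studio_command(images, directory))
```

Both arguments are optional, which mirrors the fact that the two options can be used individually or combined.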
Requirements:
· Python
· PyGTK
· PIL
· PyGooCanvas
· AFPL Ghostscript
· Unpaper
What's New in This Release:
· Now the content boxes can be dragged by their limits to extend their bounds
· Add "sane" missing dependency
· Change some mnemonics in the menu to avoid clashes (fixes gb#645983) (thanks to Łukasz Jernaś)
· Resets the favorite engine when it does not exist
· Prevent errors when adding nonexistent images
· Focus box's editor text area automatically (gb#635308)
· Clarify the help output about the --images option
New and Updated Translations:
· Marek Černocký [cs]
· Joe Hansen [da]
· Mario Blättermann [de]
· Daniel Mustieles [es]
· Claude Paroz [fr]
· Gianvito Cavasoli [it]
· Łukasz Jernaś [pl]
· Djavan Fagundes [pt_BR]
· Matej Urbančič [sl]
· Aron Xu [zh_CN]