Transcribo is a modular, easy-to-use and powerful cross-platform tool for converting various file formats into accurate plain text. What might seem a somewhat strange goal in the age of PDF and HTML turns out to be very useful, e.g., for output devices that can only handle plain text, such as Braille embossers. Indeed, Transcribo was designed with the objective of printing documents in high-quality Braille. However, Transcribo should be useful in any context where plain text in complex layouts is needed.
Transcribo has been designed to separate the processing of the input file from the actual rendering algorithm. Hence, one can speak of two layers: in the input layer (first layer), format-specific frontends parse the input streams and feed them into the renderer (second layer). More specifically, the frontends for the supported input formats:
* parse the input file,
* derive the layout structure and
* call the renderer, which
o generates a proprietary tree representation of the document, and
o traverses the tree, creating a line-by-line representation.
* Thereafter, the renderer's paginator is called to insert white space for margins, add page breaks, create headers and footers, etc.
* Finally, the paginated line-by-line representation is assembled into a plain text file.
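The steps above can be illustrated with a minimal, self-contained sketch. It is not Transcribo's actual code: every name in it (Block, parse_plain_text, render_lines, paginate, assemble, transcribe) is hypothetical, and the toy frontend only understands paragraphs separated by blank lines. The form feed used as a page separator is also an assumption.

```python
from dataclasses import dataclass, field
import textwrap


@dataclass
class Block:
    """One node of the (hypothetical) document tree."""
    kind: str                      # e.g. "document" or "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def parse_plain_text(source):
    """Toy frontend: blank-line separated chunks become paragraph blocks."""
    root = Block("document")
    for chunk in source.split("\n\n"):
        chunk = " ".join(chunk.split())   # normalize internal white space
        if chunk:
            root.children.append(Block("paragraph", chunk))
    return root


def render_lines(node, width):
    """Traverse the tree depth-first and return a list of output lines."""
    lines = []
    if node.text:
        lines.extend(textwrap.wrap(node.text, width))
        lines.append("")                  # blank line after each block
    for child in node.children:
        lines.extend(render_lines(child, width))
    return lines


def paginate(lines, page_height, left_margin):
    """Split lines into pages; add a left margin and a page-number footer.

    page_height must be at least 2 (one body line plus the footer).
    """
    body_height = page_height - 1         # last line of a page is the footer
    pages = []
    for start in range(0, len(lines), body_height):
        body = lines[start:start + body_height]
        body += [""] * (body_height - len(body))   # pad a short last page
        page = [(" " * left_margin + line).rstrip() for line in body]
        page.append(" " * left_margin + "- %d -" % (len(pages) + 1))
        pages.append(page)
    return pages


def assemble(pages):
    """Join the pages into one string, separated by form feeds."""
    return "\f".join("\n".join(page) for page in pages)


def transcribe(source, width=30, page_height=6, left_margin=2):
    """Run the whole pipeline: parse, render, paginate, assemble."""
    tree = parse_plain_text(source)
    lines = render_lines(tree, width)
    return assemble(paginate(lines, page_height, left_margin))
```

Calling transcribe("First paragraph.\n\nSecond paragraph.") returns a single string that could be written to a file or sent to an output device. Keeping the paginator separate from the tree traversal mirrors the split described above: the traversal only decides how each block is broken into lines, while margins, page breaks and footers are applied afterwards.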
The renderer allows a specific translator and wrapper to be attached to each content block (paragraph, heading, reference, etc.) in order to perform translations and achieve the required text layout. In combination with frontends for mark-up languages, this feature gives the user very fine-grained control over the output.
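As a rough illustration of attaching a translator and a wrapper to each type of content block, consider the following sketch. It is an assumption about how such a mechanism could look, not Transcribo's real API: the STYLES table, the block type names and the functions identity, make_wrapper and render_block are all hypothetical, and str.upper merely stands in for a real translator such as a Braille translator.

```python
import textwrap


def identity(text):
    """Default translator: leave the text unchanged."""
    return text


def make_wrapper(width, indent=""):
    """Return a wrapper that breaks text into lines of at most `width` columns."""
    def wrap(text):
        return textwrap.wrap(
            text, width, initial_indent=indent, subsequent_indent=indent
        )
    return wrap


# Hypothetical style table: block type -> (translator, wrapper).
STYLES = {
    "heading": (str.upper, make_wrapper(40)),
    "paragraph": (identity, make_wrapper(40)),
    "block_quote": (identity, make_wrapper(40, indent="    ")),
}


def render_block(kind, text):
    """Translate the text first, then wrap the translated result."""
    translator, wrapper = STYLES[kind]
    return wrapper(translator(text))
```

In this sketch the translation runs before wrapping, so the line breaks are computed on the translated text. That ordering matters whenever a translation changes the length of the text, as a contracted Braille translation would.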
Currently there are frontends for reStructuredText and plain text. Additional frontends for formats such as LaTeX, OpenOffice, RTF and HTML would be useful.
Installation and usage:
Transcribo is developed with Python 2.6. It should run on older versions, possibly with small changes. There are no dependencies. However, if you want to use the translation features for Braille, you may wish to install a Braille translator such as liblouis or YABT. In addition, if you want to use the frontend for reStructuredText, you will need Docutils, because that frontend is essentially a Docutils writer component. Use the transcribo-rst.py script, a Docutils frontend tool, to generate plain text from rST documents. Without Docutils, you can only generate plain text from plain text, using the transcribo-txt.py script. Type python transcribo-txt.py --help to see the command line options.
Transcribo is a pure Python package. It is installed by unpacking the archive and typing something like the following at the shell prompt:
cd <package dir>
python setup.py install
Then run one of the scripts in the scripts/ or test/ subdirectory (see above).
2.2 Using the rST frontend
The module transcribo.rST.py is a Docutils writer component. See the Docutils documentation for background information. It supports a reasonable subset of the rST features. Implemented features include paragraphs, sections, section numbers (basic support), bullet lists, enumerations, block quotes, line blocks, references (page references are on the wish list), strong and emphasis (represented by capitalized letters), and inline literals. To translate an rST document into plain text, use the transcribo-rst.py frontend tool. Use the command line or the configuration file to modify the page width and the translator to be used (default is None). All other configuration settings are contained in transcribo.renderer.styles.py.
What's New in This Release:
· unified command line frontend using argparse (a dependency under Python 2.6)
· new generic configuration system named yaconfig with cascading style sheets using PyYAML (new dependency)
· supports multiple YAML files which are successively mixed into a tree of nested dictionaries
· multiple inheritance from any node specified by absolute or local paths (relative paths not fully supported)
· supports string interpolation similar to configparser from the stdlib (this feature is not used though)
· more rST features, including:
  o references and targets (not yet footnotes)
  o table of contents with or without page numbers
  o definition lists, literal blocks and transitions
  o use of the class directive to change hyphenation, wrapper, translator etc. on the fly
· readers (the components that read the input files, such as rST) are fully configurable through cascading style sheets in YAML format. In the case of rST, this means that Docutils' own configuration system is no longer visible to the Transcribo user. Note that Transcribo, when used with the rST reader, acts as a Docutils writer component.
· no longer depends on a Braille translator such as YABT
· hard page breaks improved; can be used with rST reader through style sheets: break page after end of section etc.
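The cascading configuration mentioned in the list above, in which multiple YAML files are successively mixed into a tree of nested dictionaries, can be sketched as follows. This is not yaconfig's code. It assumes the YAML files have already been parsed into plain dictionaries (PyYAML is not used here), it leaves out inheritance and string interpolation, and the function names mix_in and cascade as well as all configuration keys are hypothetical.

```python
import copy


def mix_in(base, overlay):
    """Return a new tree in which `overlay` is merged into `base`.

    Nested dictionaries are merged key by key; any other value in
    `overlay` replaces the corresponding value in `base`.
    """
    result = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = mix_in(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def cascade(*sheets):
    """Mix the given style sheets together; later sheets win."""
    merged = {}
    for sheet in sheets:
        merged = mix_in(merged, sheet)
    return merged


# Stand-ins for the parsed content of two style sheets (keys are hypothetical).
defaults = {"page": {"width": 40, "height": 25}, "translator": None}
braille = {"page": {"width": 32}, "translator": "some-braille-translator"}

config = cascade(defaults, braille)
```

Here the second sheet overrides only the page width and the translator, while the page height from the first sheet survives. Because each merge works on copies, the input dictionaries are left unchanged and can be reused in other combinations.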