catdoc 0.94.2

catdoc is a program that reads one or more Microsoft Word files and writes the text contained in them to standard output. It thus does for .doc files what the Unix cat command does for plain ASCII files.

The catdoc project now also includes xls2csv, a program that converts an Excel spreadsheet into a comma-separated values file. The newest addition to the catdoc suite is catppt, a program that extracts readable text from PowerPoint files.

Optionally, catdoc can translate some non-ASCII characters into the corresponding TeX escape sequences and convert character sets from the Windows ANSI code page or Unicode to the local code page of the target machine.

It also has a database of substitution sequences, used for symbols that are not present in the target encoding. So if you try to read a Russian Word file under the C locale, you'll get a transliteration.
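The substitution idea can be sketched in a few lines of Python. This is only an illustration, not catdoc's code: the `SUBSTITUTIONS` table, the `to_target` function, and the `?` fallback are all hypothetical, and the table holds just a handful of entries rather than catdoc's real database.

```python
# Hypothetical substitution table: NOT catdoc's real database, only a few
# illustrative entries mapping Cyrillic letters to ASCII replacements.
SUBSTITUTIONS = {
    "П": "P", "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "ж": "zh", "щ": "shch",
}


def to_target(text: str, encoding: str = "ascii", fallback: str = "?") -> str:
    """Keep characters the target encoding can represent, substitute the rest."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)  # representable in the target encoding: keep it
            out.append(ch)
        except UnicodeEncodeError:
            # Not representable: use the substitution sequence if one is
            # known, otherwise the fallback marker.
            out.append(SUBSTITUTIONS.get(ch, fallback))
    return "".join(out)


print(to_target("Привет"))            # ASCII target: transliterated
print(to_target("Привет", "koi8-r"))  # Cyrillic-capable target: unchanged
```

With an ASCII target every Cyrillic letter goes through the table, while a target such as koi8-r passes the same text through untouched.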

Under Unix it uses the nl_langinfo function to find out which output encoding to use. Under DOS it uses the appropriate DOS function, which gets the code page value from the COUNTRY statement in config.sys.
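For readers who want to see what the Unix lookup returns on their own machine, Python's standard `locale` module wraps the same C library call. The sketch below is an illustration only, not catdoc's implementation; the function name `output_encoding` and the `ascii` default are assumptions.

```python
import locale


def output_encoding(default: str = "ascii") -> str:
    """Ask the C library which character set the current locale uses."""
    # Adopt the locale from the environment (LANG, LC_ALL, ...); without
    # this call the process may stay in the plain "C" locale.
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # the environment names an unsupported locale: keep the current one

    # nl_langinfo is only available on Unix-like platforms.
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET) or default
    return default


print(output_encoding())
```

The printed name depends entirely on the environment the script runs in.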

catdoc can also read RTF files and even plain text, so it can be used as a general-purpose encoding converter. (Because catdoc is a Russian program, by default it converts cp1251 to koi8-r when running under Unix and to cp866 when running under DOS.)
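The default conversions mentioned here amount to decoding bytes from one code page and re-encoding them in another. A minimal Python sketch of that round trip, using Python's built-in codecs rather than catdoc's own charset tables (the `recode` function and its parameters are hypothetical):

```python
def recode(data: bytes, source: str = "cp1251", target: str = "koi8-r") -> bytes:
    """Re-encode bytes from the source code page into the target code page."""
    text = data.decode(source)
    # Characters missing from the target encoding become "?" instead of
    # raising an error.
    return text.encode(target, errors="replace")


windows_bytes = "Привет".encode("cp1251")
unix_bytes = recode(windows_bytes)                 # cp1251 -> koi8-r
dos_bytes = recode(windows_bytes, target="cp866")  # cp1251 -> cp866
print(unix_bytes.decode("koi8-r"), dos_bytes.decode("cp866"))
```

The same text comes out as different byte sequences for each target, which is why the source and target encodings both have to be known.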

catdoc has rudimentary table handling. In TeX mode it inserts & when it encounters a field delimiter and \\ when it encounters the end of a table row. No table headers are produced, though.
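As a rough picture of what such TeX-mode table output looks like, the sketch below joins the cells of each row with & and closes each row with the conventional TeX row terminator `\\`. The `tex_rows` function and the list-of-lists input are hypothetical, and the exact spacing of catdoc's real output is not claimed here.

```python
def tex_rows(rows):
    """Render rows of cell strings as TeX table body lines."""
    lines = []
    for cells in rows:
        # "&" separates cells; "\\" ends the row. No header is generated,
        # and special characters inside cells are not escaped.
        lines.append(" & ".join(cells) + r" \\")
    return "\n".join(lines)


print(tex_rows([["Name", "Qty"], ["Apples", "3"]]))
```

The surrounding tabular environment and column specification still have to be written by hand.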

catdoc doesn't even try to preserve MS Word character formatting. Its goal is to extract plain text and let you read it and, perhaps, reformat it with TeX according to TeXnical rules that most Word users haven't even heard of.

xls2csv does roughly the same for Excel files. It extracts the data and leaves out all formatting information and formulas. The idea is that you want to see the data, not the way it was created.

There is a Tcl/Tk GUI script, wordview, for viewing Word and RTF files using catdoc. Since the internal representation of a Tcl string is UTF-8 and most systems now have Unicode fonts, you'll probably be able to read a document in any language using this script.

Last updated on: June 11th, 2012, 9:12 GMT
License type: GPL (GNU General Public License)
Developed by: Victor Wagner