Uplug 0.2.0c

Uplug is a collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora. Several tools have been integrated into Uplug.

Pre-processing tools include a sentence splitter, a tokenizer, and external part-of-speech taggers and shallow parsers. The following external tools are used: the Grok system for English (tagging and chunking) and the morphological analyzer ChaSen for Japanese.

Other tools, such as the TreeTagger, can easily be added. Translated documents can be sentence-aligned using the length-based approach by Gale & Church. Words and phrases can be aligned using the clue alignment approach and GIZA++, a toolbox for training statistical alignment models.
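
The length-based idea behind sentence alignment can be illustrated with a short sketch. This is not Uplug's own code. It is a simplified Python illustration in the spirit of Gale & Church, in which sentence lengths are counted in characters and a dynamic program picks the cheapest sequence of alignment "beads" (1-1, 1-0, 0-1, 2-1, 1-2, 2-2). The names align and bead_cost, the prior probabilities, and the constants C and S2 are assumptions chosen for this example.

```python
import math

# Bead types (source sentences, target sentences) with prior probabilities.
# All values below are assumptions for this sketch, not taken from Uplug.
PRIORS = {(1, 1): 0.89, (1, 0): 0.00495, (0, 1): 0.00495,
          (2, 1): 0.0445, (1, 2): 0.0445, (2, 2): 0.011}
C = 1.0    # assumed expected target/source length ratio
S2 = 6.8   # assumed variance of the length difference per character


def bead_cost(len_src, len_tgt, prior):
    """Negative log probability of pairing spans with these character lengths."""
    if len_src == 0 and len_tgt == 0:
        return -math.log(prior)
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(prior) - math.log(max(tail, 1e-300))


def align(src, tgt):
    """Return the cheapest list of (source_indices, target_indices) beads."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic program: extend every reachable cell by each bead type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace back from the end of both documents to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling align with two lists of sentences returns index pairs, so a bead such as ([0, 1], [0]) means that the first two source sentences were matched to the first target sentence. The sketch uses no lexical information at all, which is why a length-based aligner is usually only a first step before word alignment.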

What's New in This Release:

· robust conversion of encodings in tag.pl/toktag.pl/chunk.pl
· added TreeTagger startup scripts for es and nl; replaced "nbsp" with " "
· robust conversion between encodings in bitext-indexer.pl/opus-indexer.pl
· added startup scripts for Spanish and Dutch TreeTagger models
· updated startup scripts for other TreeTagger models according to the latest TreeTagger distribution
· fixed hunalign (bug in converting alignment output to XML)
· added missing ';' at line 40 in Uplug.pm

Last updated on: April 3rd, 2008, 16:38 GMT
Price: Free
Developed by: Joerg Tiedemann
License type: GPL (GNU General Public License)
Category: Text Editing&Processing \ Markup
