Text::Ngrams is a flexible Ngram analysis (for characters, words, and more).
SYNOPSIS
For default character n-gram analysis of string:
use Text::Ngrams;
my $ng3 = Text::Ngrams->new;
$ng3->process_text('abcdefg1235678hijklmnop');
print $ng3->to_string;
my @ngramsarray = $ng3->get_ngrams;
One can also feed tokens manually:
use Text::Ngrams;
my $ng3 = Text::Ngrams->new;
$ng3->feed_tokens('a');
$ng3->feed_tokens('b');
$ng3->feed_tokens('c');
$ng3->feed_tokens('d');
$ng3->feed_tokens('e');
$ng3->feed_tokens('f');
$ng3->feed_tokens('g');
$ng3->feed_tokens('h');
We can choose n-grams of various sizes, e.g.:
my $ng = Text::Ngrams->new( windowsize => 6 );
or different types of n-grams, e.g.:
my $ng = Text::Ngrams->new( type => byte );
my $ng = Text::Ngrams->new( type => word );
my $ng = Text::Ngrams->new( type => utf8 );
To process a list of files:
$ng->process_files('somefile.txt', 'otherfile.txt');
This module implement text n-gram analysis, supporting several types of analysis, including character and word n-grams.
The module Text::Ngrams is very flexible. For example, it allows a user to manually feed a sequence of any tokens. It handles several types of tokens (character, word), and also allows a lot of flexibility in automatic recognition and feed of tokens and the way they are combined in an n-gram. It counts all n-gram frequencies up to the maximal specified length. The output format is meant to be pretty much human-readable, while also loadable by the module.
The module can be used from the command line through the script ngrams.pl provided with the package.
Product's homepage
Requirements:
· Perl