Text::Bloom can evaluate Bloom signature of a set of terms.
my $b = Text::Bloom->new();
$b->Compute( qw( foo bar baz ) );
my $sig = $b->WriteToString();
$b->WriteToFile( 'afile.sig' );
my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
my $b3 = Text::Bloom->new();
$b3->Compute( qw( foo bar barbaz ) );
my $sim = $b->Similarity( $b2 );
my $b4 = Text::Bloom::NewFromString( $sig );
Text::Bloom applies the Bloom filtering technique to the statistical analysis of documents.
The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.
Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions, computed by the HashV function.
Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.
The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.
The signature may be used for:
testing whether a given set of terms is present in the document,
computing which fraction of terms are common to two documents.
The bit representation may be written to and read from a file. Text::Bloom prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.
The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression /[0-9a-z]+/.
There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV.