Softpedia
 



    Text::Bloom 1.07

    Developer: Andrea Spinelli and Walter Vannini
    License / Price: Perl Artistic License / FREE
    Last Updated: August 15th, 2007, 06:05 GMT
    Category: ROOT / Programming / Perl Modules


    Text::Bloom description

    Text::Bloom evaluates the Bloom signature of a set of terms.

    SYNOPSIS

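    use Text::Bloom;

    my $b = Text::Bloom->new();
    $b->Compute( qw( foo bar baz ) );                  # build the signature
    my $sig = $b->WriteToString();                     # serialize to a string
    $b->WriteToFile( 'afile.sig' );                    # or to a file
    my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );  # read it back
    my $b3 = Text::Bloom->new();
    $b3->Compute( qw( foo bar barbaz ) );              # a second, overlapping term set
    my $sim = $b->Similarity( $b2 );                   # compare two signatures
    my $b4 = Text::Bloom::NewFromString( $sig );       # restore from the string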


    Text::Bloom applies the Bloom filtering technique to the statistical analysis of documents.

    The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.
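The quantization step described above can be sketched in plain Perl. This is only an illustration under stated assumptions, not the module's own code: the helper name `quantize` is hypothetical, and the digit order (most significant character first) is an assumption. The prime is 2^32 - 5, the greatest prime below 2^32.

```perl
use strict;
use warnings;

# Greatest prime below 2**32 (that is, 2**32 - 5).
my $P = 4_294_967_291;

my $DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz';

# Hypothetical helper: read a term as a base-36 number, reduced modulo $P.
# Horner's rule keeps every intermediate value below 36 * $P.
sub quantize {
    my ($term) = @_;
    die "term must contain only 0-9 and a-z\n"
        unless $term =~ /\A[0-9a-z]+\z/;
    my $value = 0;
    for my $ch ( split //, $term ) {
        $value = ( $value * 36 + index( $DIGITS, $ch ) ) % $P;
    }
    return $value;
}

printf "%s -> %d\n", $_, quantize($_) for qw( foo bar baz );
```

Short terms map to their plain base-36 value; longer terms wrap around modulo p, so every result stays in the range 0..p-1.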

    A family of hash functions, computed by the HashV function, maps each quantized value to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17.
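As a rough illustration of such a hash family, the sketch below uses d linear functions of the form ((a * x + b) mod p) mod size. This is not how HashV is implemented, which the text does not specify; the function name `hash_indices`, the value d = 4, and the coefficient pairs are all assumptions chosen for the example.

```perl
use strict;
use warnings;

my $P    = 4_294_967_291;    # greatest prime below 2**32
my $SIZE = 1 << 17;          # length of the bit vector (2**17)
my $D    = 4;                # number of hash functions (assumed value)

# Hypothetical (a, b) pairs, one per hash function. Each a stays below
# 2**20 so that a * x + b remains far below 2**53 for any x < 2**32.
my @COEFF = (
    [ 40_503,  17 ],
    [ 69_069,  101 ],
    [ 314_159, 257 ],
    [ 271_829, 1_009 ],
);

# Map one quantized value to $D indices in the range 0 .. $SIZE - 1.
sub hash_indices {
    my ($x) = @_;
    return map { ( ( $_->[0] * $x + $_->[1] ) % $P ) % $SIZE }
        @COEFF[ 0 .. $D - 1 ];
}

my @idx = hash_indices(20_328);    # 20328 is "foo" read as a base-36 number
print "@idx\n";
```

The same input always yields the same d indices, which is what lets two independently computed signatures be compared bit by bit.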

    Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.

    Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.
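The effect of collisions on the bit count can be seen with a small sketch built on Perl's `vec`. The helper names `make_signature` and `popcount` are hypothetical, and the indices are made up rather than produced by a real hash family.

```perl
use strict;
use warnings;

my $SIZE = 1 << 17;    # number of bits in the signature

# Build a bit vector with the given indices set to 1, all others 0.
sub make_signature {
    my (@indices) = @_;
    my $bits = "\0" x ( $SIZE / 8 );
    vec( $bits, $_, 1 ) = 1 for @indices;
    return $bits;
}

# Count the bits set to 1 using unpack's checksum form.
sub popcount {
    my ($bits) = @_;
    return unpack( '%32b*', $bits );
}

# Made-up indices standing in for n * d hashed values; 42 occurs twice.
my @indices = ( 5, 42, 42, 131_071 );
my $sig     = make_signature(@indices);
printf "%d indices, %d bits set\n", scalar @indices, popcount($sig);
```

Four indices set only three bits here because one collides, which is why n * d is an upper bound rather than an exact count.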

    The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.

    The signature may be used for:

    testing whether a given set of terms is present in the document,
    computing which fraction of terms are common to two documents.
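Both uses reduce to bitwise operations on the signatures. The sketch below is one plausible way to do this and is not taken from the module: the helper names are hypothetical, and the similarity measure shown (bits in common divided by bits set in either signature) is an assumption, since the text does not state which formula Similarity uses.

```perl
use strict;
use warnings;

my $SIZE = 1 << 17;

# Build a bit vector with the given indices set to 1.
sub make_signature {
    my (@indices) = @_;
    my $bits = "\0" x ( $SIZE / 8 );
    vec( $bits, $_, 1 ) = 1 for @indices;
    return $bits;
}

# Count the bits set to 1.
sub popcount { return unpack( '%32b*', $_[0] ) }

# True if every index is set in the signature. A true result can be a
# false positive; a false result is always correct.
sub contains_all {
    my ( $sig, @indices ) = @_;
    for my $i (@indices) {
        return 0 unless vec( $sig, $i, 1 );
    }
    return 1;
}

# Assumed measure: bits set in both, divided by bits set in either.
sub similarity {
    my ( $x, $y ) = @_;
    my $union = popcount( $x | $y );
    return $union ? popcount( $x & $y ) / $union : 0;
}

my $doc1 = make_signature( 1, 2, 3, 4 );    # made-up indices
my $doc2 = make_signature( 3, 4, 5, 6 );
printf "contains: %d, similarity: %.3f\n",
    contains_all( $doc1, 1, 2 ), similarity( $doc1, $doc2 );
```

Because the comparison works on bits rather than terms, the result estimates the fraction of shared terms and is inflated slightly by collisions.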

    The bit representation may be written to and read from a file. Text::Bloom prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.
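The storage scheme can be sketched as follows. The header layout here (a four-byte tag, a compression flag, and a payload length) is invented for illustration and is not Text::Bloom's actual file format; the function names are hypothetical as well. Only `compress` and `uncompress` from Compress::Zlib are real library calls, and they are used only when the module loads.

```perl
use strict;
use warnings;

# Compression is optional: use Compress::Zlib only if it can be loaded.
my $HAVE_ZLIB = eval { require Compress::Zlib; 1 } ? 1 : 0;

# Hypothetical header: magic tag, compression flag, payload length (9 bytes).
sub pack_signature {
    my ($bits) = @_;
    my $payload = $HAVE_ZLIB ? Compress::Zlib::compress($bits) : $bits;
    return pack( 'a4 C N', 'BLM0', $HAVE_ZLIB, length $payload ) . $payload;
}

# Reverse of pack_signature: check the tag, then decompress if flagged.
sub unpack_signature {
    my ($data) = @_;
    my ( $magic, $zipped, $len ) = unpack( 'a4 C N', $data );
    die "not a signature\n" unless $magic eq 'BLM0';
    my $payload = substr( $data, 9, $len );
    return $zipped ? Compress::Zlib::uncompress($payload) : $payload;
}

# A sparse signature, as a small document would produce.
my $bits = "\0" x ( ( 1 << 17 ) / 8 );
vec( $bits, 42, 1 ) = 1;

my $stored = pack_signature($bits);
printf "raw: %d bytes, stored: %d bytes\n", length $bits, length $stored;
```

The resulting string can be written to a filehandle opened in binary mode. A sparse vector is mostly zero bytes, which is why compression helps most for small documents.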

    The hash function is a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore consist only of the characters 0-9 and a-z, that is, match the regular expression /^[0-9a-z]+$/.

    There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV.



    Requirements:

    · Perl

      


    TAGS:

    document analysis | Bloom signature | Bloom filtering | Bloom | signature | filtering
