NoAho

Version 0.9.02, MIT/X Consortium License
Non-Overlapping Aho-Corasick Trie

NoAho provides fast, non-overlapping simultaneous multiple keyword search.

Features:
- 'Short' and 'long' (longest matching key) searches, available both as one-off lookups and as iteration over all non-overlapping keyword matches in a text.
- Works with both unicode and str in Python 2, and unicode in Python 3 (it's all UCS4 under the hood).
- Allows you to associate an arbitrary Python object payload with each keyword, and supports dict operations len(), [], and 'in' for the keywords (though no del or traversal).
- Compiles the trie (generates the Aho-Corasick failure links) on demand, so you can freely mix adding keywords and searching text.
- Can be used commercially: it is released under the minimal MIT license.
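
To make the 'short' versus 'long' distinction and the dict-style operations concrete, here is a minimal pure-Python sketch. It is not the NoAho API or implementation: the class name TinyTrie and its methods are hypothetical, it has no Aho-Corasick failure links (it simply rescans from each start position), and it assumes that 'short' means stopping at the first keyword completed from the leftmost matching start, while 'long' means extending to the longest keyword from that same start.

```python
class TinyTrie:
    """Hypothetical stand-in for illustration only; not the NoAho API."""

    _END = object()  # marker key under which a node stores its payload

    def __init__(self):
        self._root = {}
        self._count = 0

    def __setitem__(self, key, payload):
        node = self._root
        for ch in key:
            node = node.setdefault(ch, {})
        if self._END not in node:
            self._count += 1
        node[self._END] = payload

    def __getitem__(self, key):
        node = self._root
        for ch in key:
            node = node[ch]  # raises KeyError if the path is absent
        return node[self._END]  # raises KeyError if key is only a prefix

    def __contains__(self, key):
        try:
            self[key]
        except KeyError:
            return False
        return True

    def __len__(self):
        return self._count

    def find(self, text, start=0, longest=True):
        """Return (start, end, payload) for the leftmost match, or None."""
        for i in range(start, len(text)):
            node, best = self._root, None
            for j in range(i, len(text)):
                node = node.get(text[j])
                if node is None:
                    break
                if self._END in node:
                    best = (i, j + 1, node[self._END])
                    if not longest:
                        break  # 'short': stop at the first completed keyword
            if best is not None:
                return best
        return None

    def findall(self, text, longest=True):
        """Yield non-overlapping matches by resuming after each match end."""
        pos = 0
        while True:
            hit = self.find(text, pos, longest)
            if hit is None:
                return
            yield hit
            pos = hit[1]


trie = TinyTrie()
trie["key"] = "short payload"
trie["keyword"] = "long payload"

text = "a keyword here"
short_hit = trie.find(text, longest=False)  # (2, 5, 'short payload')
long_hit = trie.find(text, longest=True)    # (2, 9, 'long payload')
all_long = list(trie.findall("keykeyword"))
```

Because the sketch resumes each search at the end of the previous match, matches never overlap, which is the same trade-off the package describes.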

Anti-Features:
- Will not find overlapping keywords (e.g. given keywords "abcde" and "defgh", it will not find "defgh" in "abcdefgh", but would find both in "abcdedefgh") unless you move along the string manually, one character at a time, which would defeat the purpose. The package 'Acora' is an alternative for this use case.
- Lacking overlap, find[all]_short is kind of useless.
- Lacks the key iteration and deletion parts of the mapping (dict) protocol.
- Not tested for memory leaks (should be OK, but ...).
- No test case for unicode in Python 2 (it was tested manually, however).
- Unicode characters are represented as UCS4 and each character has its own hash table, so it is relatively memory-heavy.
- Requires a C++ compiler.
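
The overlap limitation in the first item can be reproduced with the standard library alone. The sketch below uses Python's re module as a stand-in, which is an assumption made purely for illustration: it is not how NoAho or Acora work internally, and the function names are hypothetical. A plain alternation scanned with re.finditer is non-overlapping, while wrapping it in a zero-width lookahead tries every start position, which corresponds to moving along the string one character at a time.

```python
import re

keywords = ["abcde", "defgh"]

# Longest keywords first, so the alternation prefers the longest key
# when several keywords start at the same position.
alternation = "|".join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)
)


def non_overlapping(text):
    """Each match consumes its characters, so later matches cannot overlap it."""
    return [(m.start(), m.group()) for m in re.finditer(alternation, text)]


def overlapping(text):
    """A zero-width lookahead is tried at every position, so overlaps are found."""
    pattern = "(?=(%s))" % alternation
    return [(m.start(), m.group(1)) for m in re.finditer(pattern, text)]


print(non_overlapping("abcdefgh"))    # [(0, 'abcde')]
print(non_overlapping("abcdedefgh"))  # [(0, 'abcde'), (5, 'defgh')]
print(overlapping("abcdefgh"))        # [(0, 'abcde'), (3, 'defgh')]
```

In "abcdefgh" the non-overlapping scan consumes "abcde" and resumes at "fgh", so "defgh" is never seen; in "abcdedefgh" the two keywords do not share characters and both are reported.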

Bug reports and patches welcome of course!
Last updated on March 21st, 2012