Sherlock Holmes is a universal search engine: a system for gathering and indexing textual data (text files, web pages, etc.), both locally and over the network.
Here are some key features of "Sherlock Holmes":
· Gathers documents via HTTP or from the local file system.
· Parses text files, HTML, PDF, and several other formats (such as MS Word and PostScript) using external parsers.
· The whole system is modular, so adding your own data sources or parsers is just a matter of plugging in the right module (well, usually also writing it).
· Works well in mixed-charset environments.
· Treats multiple occurrences of the same file (even with minor changes) as a single document with multiple URLs.
· Everything is highly configurable. You can write filtering rules in a special language that lets you tweak configuration variables depending on the document being processed.
· Searches for words, phrases, and boolean expressions, including in filenames and link texts.
· Proximity search and proximity weighting of regular searches.
· Recognition of languages, easy integration of stemmers and synonym dictionaries.
· Spelling checker based on word frequencies observed in the indexed data, which hints to users that their query might be misspelled.
· Search results include context in each document.
· Scales well to tens of millions of documents on normal PC hardware.
· The user interface (the front end) is completely separated from the rest of the system, making it easy to modify and to embed the search engine in existing applications.
· Downloaded files and indices are compressed to save space.
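
To illustrate the idea behind a spelling checker driven by word frequencies in the indexed data, here is a minimal Python sketch. It is not Sherlock Holmes's actual implementation: the function names (`build_frequencies`, `edits1`, `suggest`), the single-edit candidate search, and the `ratio` threshold are all assumptions made for this example.

```python
from collections import Counter
import re


def build_frequencies(documents):
    """Count how often each lowercase word occurs in the indexed texts."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def edits1(word):
    """Return all strings one edit (delete, swap, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, counts, ratio=10):
    """Return a likelier spelling, or None if the word looks fine.

    Hypothetical rule: hint only when a neighbouring word is at least
    `ratio` times more frequent in the index than the query word itself.
    """
    word = word.lower()
    candidates = [w for w in edits1(word) if w in counts and w != word]
    if not candidates:
        return None
    # Pick the most frequent neighbour; break ties alphabetically.
    best = max(candidates, key=lambda w: (counts[w], w))
    if counts[best] >= ratio * max(counts[word], 1):
        return best
    return None
```

With frequencies built from a corpus in which "search" appears far more often than "serch", `suggest("serch", counts)` would return "search", while a query for "search" itself produces no hint. The threshold keeps rare but legitimate words from being flagged merely because a common word happens to be one edit away.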