Sherlock Holmes 4.0
A universal search engine.
Sherlock Holmes is a universal search engine, a system for gathering and indexing textual data (text files, web pages, etc.), both locally and over the network.
- Gathers documents via HTTP or from the local file system.
- Parses text files, HTML, PDF, and several other formats (such as MS Word and PostScript) using external parsers.
- The whole system is modular, so adding your own data sources or parsers is just a matter of plugging in the right module (well, usually also writing it).
- Works well in mixed-charset environments.
- Considers multiple occurrences of the same file (even with minor changes) a single document with multiple URLs.
- Everything is highly configurable. You can write filtering rules in a special language which lets you tweak configuration variables depending on the document being processed.
- Searching for words, phrases, and boolean expressions. Searching in file names and link texts.
- Proximity search and proximity weighting of regular searches.
- Recognition of languages, easy integration of stemmers and synonym dictionaries.
- Spelling checker based on word frequencies observed in the indexed data, hinting to users that their query might be misspelled.
- Search results include the context of the matches in each document.
- Scales well to tens of millions of documents on normal PC hardware.
- The user interface (the front end) is completely separated from the rest of the system, making it easy to modify and also to embed the search engine in existing applications.
- Downloaded files and indices are compressed to save space.
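
The frequency-based spelling checker mentioned in the list above can be illustrated with a small sketch. This is not Sherlock Holmes code and does not claim to reproduce its algorithm: the function names (word_frequencies, edits1, suggest), the edit-distance-1 candidate set, and the ratio threshold are all assumptions chosen for illustration. The idea shown is only that a query word is flagged when a very similar word occurs far more often in the indexed data.

```python
from collections import Counter
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: plain ASCII words only


def word_frequencies(documents):
    """Count how often each lowercase word occurs in the given texts."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def edits1(word):
    """Return all strings one edit (delete, swap, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, freq, ratio=10):
    """Return a likely correction, or None if the word looks fine.

    A candidate is suggested only if it is at least `ratio` times more
    frequent in the indexed data than the query word itself (hypothetical
    threshold, picked for this sketch).
    """
    word = word.lower()
    candidates = [w for w in edits1(word) if w in freq and w != word]
    if not candidates:
        return None
    best = max(candidates, key=lambda w: freq[w])
    if freq[best] >= ratio * max(freq.get(word, 0), 1):
        return best
    return None


if __name__ == "__main__":
    indexed = ["search engine " * 20 + "serach"]
    freq = word_frequencies(indexed)
    hint = suggest("serach", freq)
    if hint:
        print("Did you mean: " + hint + "?")
```

Comparing frequencies rather than consulting a fixed dictionary means the hints adapt to whatever vocabulary the indexed documents actually use, which is the property the feature list describes.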