Xapian is an Open Source Probabilistic Information Retrieval library, released under the GPL. Xapian is written in C++, with bindings to allow use from other languages (Perl, Java, Python, PHP, and TCL are currently supported; Guile and C# are being worked on).
Xapian is designed to be a highly adaptable toolkit to allow developers to easily add advanced indexing and search facilities to their own applications.
If you're after a packaged search engine for your website, you should take a look at Omega, which is an application we supply built upon Xapian. But unlike most other website search solutions, Xapian's versatility allows you to extend Omega to meet your needs as they grow.
Here are some key features of Xapian and Omega:
· Free Software/Open Source - licensed under the GPL.
· Highly portable - runs on Linux, MacOS X, many other Unix platforms, and Microsoft Windows.
· Written in C++. Perl bindings are available in the module Search::Xapian on CPAN. Java JNI bindings are included in the xapian-bindings module. We also support SWIG, which can generate bindings for 13 languages. At present those for Python, PHP4, and TCL are working. Guile and C# are being worked on.
· Ranked probabilistic search - important words get more weight than unimportant words, so the most relevant documents are more likely to come near the top of the results list.
· Relevance feedback - given one or more documents, Xapian can suggest the most relevant index terms to expand a query, suggest related documents, categorise documents, etc.
· Phrase and proximity searching - users can search for words occurring in an exact phrase or within a specified number of words, either in a specified order, or in any order.
· Full range of structured boolean search operators ("stock NOT market", etc). The results of the boolean search are ranked by the probabilistic weights. Boolean filters can also be applied to restrict a probabilistic search.
· Supports stemming of search terms (e.g. a search for "football" would match documents which mention "footballs" or "footballer"). This helps to find relevant documents which might otherwise be missed. Stemmers are currently included for Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
· Supports database files > 2GB - essential for scaling to large document collections.
· Platform independent data formats - you can build a database on one machine and search it on another.
· Allows simultaneous update and searching. New documents become searchable right away.
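To make the ranked and boolean search features above concrete, here is a toy sketch in plain Python. It does not use Xapian or its bindings, and it is not how Xapian computes weights: the simple inverse-document-frequency formula, the sample documents, and the names DOCS, idf_weights, and search are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical sample collection; ids and texts are made up for illustration.
DOCS = {
    1: "the stock market fell as investors sold shares",
    2: "chicken stock is the base of the soup",
    3: "the football match ended and the market square filled with fans",
}


def tokenize(text):
    """Split text into lowercase words (no stemming in this sketch)."""
    return text.lower().split()


def idf_weights(docs):
    """Give each term a weight that grows as the term gets rarer.

    Assumed formula: log(1 + N / df), where N is the number of documents
    and df is the number of documents containing the term.
    """
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def search(docs, query_terms, exclude_terms=()):
    """Rank documents by summed term weights, after a boolean NOT filter."""
    weights = idf_weights(docs)
    excluded = set(exclude_terms)
    results = []
    for doc_id, text in docs.items():
        terms = set(tokenize(text))
        if terms & excluded:
            continue  # boolean filter: drop documents with an excluded term
        score = sum(weights[t] for t in query_terms if t in terms)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first, so the most relevant documents come near the top.
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(search(DOCS, ["the", "soup"]))
    print(search(DOCS, ["stock"], exclude_terms=["market"]))
```

With this sample data, a query for "the soup" ranks document 2 first because "soup" is rarer than "the", and "stock NOT market" returns only document 2.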
As well as the library, we supply a number of small example programs, and a larger application called Omega, which combines an indexer with a CGI-based search front-end:
· The indexer supplied can index HTML, PHP, PDF, PostScript, and plain text. Adding support for indexing other formats is easy where conversion filters are available (e.g. Microsoft Word). This indexer works using the filing system, but we also provide a script to allow the htdig web crawler to be hooked in, allowing remote sites to be searched using Omega.
· You can also index data from any SQL or other RDBMS supported by the Perl DBI module. That includes MySQL, PostgreSQL, SQLite, Sybase, MS SQL, LDAP, and ODBC.
· CGI search front-end supplied with highly customisable appearance. This can also be customised to output results in XML or CSV, which is useful if you are dynamically generating pages (e.g. with PHP or mod_perl) and just want raw search results which you can process in your own page layout code.
What's New in This Release:
· This version fixes some minor bugs and adds a few new features.