Apache Nutch 2.2.1
Free, open-source web search software based on the Apache Lucene library.
It adds web-specific components, such as a crawler, a link-graph database, and parsers for HTML and other document formats. It is developed and distributed by the Apache Software Foundation in two separate branches.
Apache Nutch is modular and pluggable: it provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, for example Apache Tika for parsing.
Apache Nutch can run on a single machine, but it is more powerful when running on a Hadoop cluster. Pluggable indexing is available for Apache Solr, Elasticsearch, and others.
What's New in This Release:
- NUTCH-1591 Incorrect conversion of ByteBuffer to String (Jason Howes via lewismc)
- NUTCH-1571 SolrInputSplit doesn't implement Writable and crawl script doesn't pass crawlId to generate and updatedb tasks (yuanyun.cn via lewismc)
- NUTCH-1126 JUnit test for urlfilter-prefix (Talat UYARER via markus)
- NUTCH-1585 Ensure duplicate tags do not exist in microformat-reltag tag set (lewismc)