Apache Nutch 2.2.1

Free, open source web search software based on the Apache Lucene library
The Apache Nutch project is a free, open source, scalable and highly extensible web crawler that builds on the Apache Lucene (Java) library.

It adds web-specific components, such as a crawler, a link-graph database, and parsers for HTML and other document formats. It is developed and distributed by the Apache Software Foundation in two separate branches.

Being modular and pluggable, Apache Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations; Apache Tika, for example, can be plugged in for parsing.
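
To make the pluggable design more concrete, here is a minimal conceptual sketch of an extension point with one custom implementation. It is written in Python for brevity and is not Nutch's actual Java API: the names UrlFilter, PrefixUrlFilter and apply_filters are hypothetical and only illustrate the general pattern of swapping implementations behind a shared interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class UrlFilter(ABC):
    """Hypothetical extension point: return the URL to keep it, None to drop it."""

    @abstractmethod
    def filter(self, url: str) -> Optional[str]:
        ...


class PrefixUrlFilter(UrlFilter):
    """Custom implementation: keeps only URLs starting with a configured prefix."""

    def __init__(self, prefixes: Iterable[str]) -> None:
        self.prefixes = tuple(prefixes)

    def filter(self, url: str) -> Optional[str]:
        return url if url.startswith(self.prefixes) else None


def apply_filters(urls: Iterable[str], filters: List[UrlFilter]) -> List[str]:
    """Run every URL through each filter in order; a None result drops it."""
    kept = []
    for url in urls:
        current: Optional[str] = url
        for f in filters:
            current = f.filter(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept
```

The crawler core in such a design only knows about the abstract interface, so a new filter can be added without touching the code that calls it.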

Apache Nutch can run on a single machine, but it is more powerful when running on a Hadoop cluster. Pluggable indexing exists for Elasticsearch, Apache Solr, etc.

last updated on:
October 7th, 2013, 21:25 GMT
license type:
Apache License 2.0
developed by:
Sami Siren

What's New in This Release:
  • NUTCH-1591 Incorrect conversion of ByteBuffer to String (Jason Howes via lewismc)
  • NUTCH-1571 SolrInputSplit doesn't implement Writable and crawl script doesn't pass crawlId to generate and updatedb tasks (yuanyun.cn via lewismc)
  • NUTCH-1126 JUnit test for urlfilter-prefix (Talat UYARER via markus)
  • NUTCH-1585 Ensure duplicate tags do not exist in microformat-reltag tag set (lewismc)