Nutch project is Web searching software which builds on Lucene Java, adding Web specifics such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
What's New in This Release: [ read full changelog ]
· The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.0. This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in memory data store and various high profile SQL stores. After some two years of development Nutch v2.0 also offers all of the mainstream Nutch functionality and it builds on Apache Solr™ adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika™ for HTML and an array other document formats. Nutch v2.0 shadows the latest stable mainstream release (v1.5.X) based on Apache Hadoop™ and covers many use cases from small crawls on a single machine to large scale deployments on Hadoop clusters.