Apache Nutch 2.2.1

Free and open source Web search software based on the Apache Lucene library
Apache Nutch is a free, open source, scalable and highly extensible web crawler that builds on the Java version of the Apache Lucene library.

It adds Web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. It is developed and distributed by the Apache Software Foundation in two separate branches.

Being modular and pluggable, Apache Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, for example Apache Tika for parsing.
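
To give a rough feel for what a pluggable extension point looks like, here is a minimal Java sketch. The `UrlFilter` interface, `PrefixUrlFilter` and `FilterChain` are hypothetical names written for this illustration; they are not Nutch's actual plugin API, whose real extension points are defined in the Nutch source tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PluginSketch {

    /** Hypothetical extension point: return the URL to keep it, or null to drop it. */
    public interface UrlFilter {
        String filter(String url);
    }

    /** Hypothetical plugin: keeps only URLs that start with one of the given prefixes. */
    public static class PrefixUrlFilter implements UrlFilter {
        private final List<String> prefixes;

        public PrefixUrlFilter(List<String> prefixes) {
            this.prefixes = new ArrayList<>(prefixes);
        }

        @Override
        public String filter(String url) {
            for (String prefix : prefixes) {
                if (url.startsWith(prefix)) {
                    return url;
                }
            }
            return null;
        }
    }

    /** Runs every registered filter in order; a URL survives only if all accept it. */
    public static class FilterChain {
        private final List<UrlFilter> filters = new ArrayList<>();

        public FilterChain register(UrlFilter filter) {
            filters.add(filter);
            return this;
        }

        public String apply(String url) {
            String current = url;
            for (UrlFilter f : filters) {
                if (current == null) {
                    return null; // an earlier filter already rejected the URL
                }
                current = f.filter(current);
            }
            return current;
        }
    }

    public static void main(String[] args) {
        FilterChain chain = new FilterChain()
            .register(new PrefixUrlFilter(Arrays.asList("http://example.org/")))
            // A second, ad-hoc filter supplied as a lambda: drop PDF links.
            .register(url -> url.endsWith(".pdf") ? null : url);

        System.out.println(chain.apply("http://example.org/index.html"));
        System.out.println(chain.apply("http://example.org/paper.pdf"));
        System.out.println(chain.apply("http://other.example/"));
    }
}
```

The point of the design is that the chain only knows the interface, so new filters can be registered without changing the code that runs them.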

Apache Nutch can run on a single machine, but it is most powerful when running on a Hadoop cluster. Pluggable indexing is available for Apache Solr, Elasticsearch and other back ends.

last updated on: October 7th, 2013, 21:25 GMT
price: FREE!
developed by: Sami Siren
license type: Apache License 2.0
category: ROOT \ Internet \ HTTP (WWW)

What's New in This Release:
  • NUTCH-1591 Incorrect conversion of ByteBuffer to String (Jason Howes via lewismc)
  • NUTCH-1571 SolrInputSplit doesn't implement Writable and crawl script doesn't pass crawlId to generate and updatedb tasks (yuanyun.cn via lewismc)
  • NUTCH-1126 JUnit test for urlfilter-prefix (Talat UYARER via markus)
  • NUTCH-1585 Ensure duplicate tags do not exist in microformat-reltag tag set (lewismc)
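
The changelog does not say how NUTCH-1591 was fixed, so the following is only a sketch of one common way a ByteBuffer-to-String conversion goes wrong in Java, and of a safer alternative. The class name `ByteBufferText` and its `decode` helper are made up for this example, and UTF-8 is an assumed encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferText {

    /** Decodes only the bytes between the buffer's position and limit, as UTF-8. */
    public static String decode(ByteBuffer buffer) {
        // duplicate() shares the content but has its own position and limit,
        // so the caller's buffer is left untouched.
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] backing = "xxhelloyy".getBytes(StandardCharsets.UTF_8);
        // A buffer whose readable window is the middle five bytes of the array.
        ByteBuffer window = ByteBuffer.wrap(backing, 2, 5);

        // Naive conversion: array() returns the whole backing array,
        // ignoring the buffer's position and limit.
        String naive = new String(window.array(), StandardCharsets.UTF_8);
        String careful = decode(window);

        System.out.println(naive);   // xxhelloyy
        System.out.println(careful); // hello
    }
}
```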