Apache Nutch

2.2.1 Apache License 2.0
  2,946 downloads

Free and open source web search software based on the Apache Lucene library

Apache Nutch is a free, open source, scalable and highly extensible web crawler that builds on the Apache Lucene (Java) library.

It adds web-specific components, such as a crawler, a link-graph database, and parsers for HTML and other document formats. It is developed and distributed by the Apache Software Foundation in two separate branches.

Being modular and pluggable, Apache Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, for example Apache Tika for parsing.

Apache Nutch can run on a single machine, but it is more powerful when running on a Hadoop cluster. Pluggable indexing exists for Elasticsearch, Apache Solr, and others.
Last updated on October 7th, 2013

#Web searching #HTML parser #Web crawler #Apache #Nutch #Web #searching
