Apache Nutch For Linux

Last updated: Jun 22, 2015. License: Apache License 2.0

Free and open source web search software based on the Apache Lucene library.

Description

The Apache Nutch project is a free, open source, scalable and highly extensible web crawler that builds on the Java version of the Apache Lucene library.

It adds web-specific components, such as a crawler, a link-graph database, and parsers for HTML and other document formats. It is developed and distributed by the Apache Software Foundation in two separate branches (1.x and 2.x).

Apache Nutch is modular and pluggable: it provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, for example Apache Tika for parsing.
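To make the idea of pluggable parse and scoring steps concrete, here is a minimal sketch in Python. It is not Nutch's actual Java API: the names Page, ParseFilter, ScoringFilter, TitleParser, ShortUrlBoost and process are all hypothetical, chosen only to illustrate how a crawler can call interchangeable implementations through a small interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Page:
    """A fetched page. Hypothetical stand-in, not a Nutch class."""
    url: str
    html: str
    score: float = 1.0
    fields: dict = field(default_factory=dict)


class ParseFilter(ABC):
    """Extension point: extract fields from a page (hypothetical interface)."""

    @abstractmethod
    def parse(self, page: Page) -> None: ...


class ScoringFilter(ABC):
    """Extension point: compute a new score for a page (hypothetical interface)."""

    @abstractmethod
    def score(self, page: Page) -> float: ...


class TitleParser(ParseFilter):
    """Naive plugin that stores the text between <title> and </title>."""

    def parse(self, page: Page) -> None:
        start = page.html.find("<title>")
        end = page.html.find("</title>")
        if start != -1 and end > start:
            page.fields["title"] = page.html[start + len("<title>"):end].strip()


class ShortUrlBoost(ScoringFilter):
    """Plugin that lowers the score of pages deeper in the URL path."""

    def score(self, page: Page) -> float:
        # Subtract the two slashes of "scheme://" to get the path depth.
        depth = page.url.rstrip("/").count("/") - 2
        return page.score / (1 + max(depth, 0))


def process(page: Page, parsers: list, scorers: list) -> Page:
    """Run every registered plugin in order; the core never names a concrete plugin."""
    for parser in parsers:
        parser.parse(page)
    for scorer in scorers:
        page.score = scorer.score(page)
    return page
```

For example, process(Page("http://example.org/a/b", "<title>Hi</title>"), [TitleParser()], [ShortUrlBoost()]) fills in the title field and rescales the score. Swapping in a different parser or scorer requires no change to process, which is the property the plugin interfaces are meant to provide.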

Apache Nutch can run on a single machine, but it is more powerful when running on a Hadoop cluster. Pluggable indexing is available for Elasticsearch, Apache Solr and other back ends.

What's new in Apache Nutch 2.3:

NUTCH-1779 Apply formatting to the code (lewismc)
NUTCH-1907 Incorrect output of Outlinks to Hosts within HostDbUpdateReducer (lewismc)
NUTCH-1856 Document webpage.avsc and host.avsc (lewismc)
NUTCH-1834 GeneratorMapper behavior depends on log level (Gerhard Gossen via snagel)
