Apache Nutch Changelog

New in Apache Nutch 2.3 (Jun 22, 2015)

  • NUTCH-1779 Apply formatting to the code (lewismc)
  • NUTCH-1907 Incorrect output of Outlinks to Hosts within HostDbUpdateReducer (lewismc)
  • NUTCH-1856 Document webpage.avsc and host.avsc (lewismc)
  • NUTCH-1834 GeneratorMapper behavior depends on log level (Gerhard Gossen via snagel)
  • NUTCH-1899 upgrade restlet lib to prevent build failure (talat)
  • NUTCH-1797 remove unused package o.a.n.html (Saurabh Chhajed via snagel)
  • NUTCH-1888 Specify HTMLMapper to use in TikaParser (Halil Simsek via jnioche)
  • NUTCH-1897 Easier debugging of plugin XML errors (markus)
  • NUTCH-1823 Upgrade to elasticsearch 1.4.1 (Phu Kieu, markus, lewismc)
  • NUTCH-1829 Generator : unable to distinguish real errors (Mathieu Bouchard, jnioche, snagel)
  • NUTCH-1778 Generator not logging number of URLs in batch correctly (jnioche via snagel)
  • NUTCH-1877 Suffix URL filter to ignore query string by default (markus via snagel)
  • NUTCH-1825 protocol-http may hang for certain web pages (Phu Kieu via snagel)
  • NUTCH-1483 Can't crawl filesystem with protocol-file plugin (Rogério Pereira Araújo, Mengying Wang, snagel)
  • NUTCH-1885 Protocol-file should treat symbolic links as redirects (Mengying Wang, snagel)
  • NUTCH-1880 URLUtil should not add additional slashes for file URLs (snagel)
  • NUTCH-1879 Regex URL normalizer should remove multiple slashes after file: protocol (snagel)
  • NUTCH-1820 remove field "orig" which duplicates "id" (lewismc, snagel)
  • NUTCH-1843 Upgrade to Gora 0.5 (talat, lewismc, Kiril Menshikov, drazzib)
  • NUTCH-1883 bin/crawl: use function to run bin/nutch and check exit value (snagel)
  • NUTCH-1882 ant eclipse target to add output path to src/test (snagel)
  • NUTCH-1827 Port NUTCH-1467 and NUTCH-1561 to 2.x (snagel)
  • NUTCH-1876 Upgrade to Crawler Commons 0.5 (jnioche)
  • NUTCH-1866 ant eclipse target should not delete runtime (nimafl via lewismc)
  • NUTCH-1859 Make Nutch webapp port configurable (Nima Falaki via lewismc)
  • NUTCH-1848 Bug in DashboardPage.html instances counter (Nima Falaki via lewismc)
  • NUTCH-841 Create a Wicket-based Web Application for Nutch (Fjodor Vershinin via lewismc)
  • NUTCH-1832 Make Nutch work without an indexer (mattmann via lewismc)
  • NUTCH-1840 the describe function in SolrIndexWriter is not correct (kaveh minooie via jnioche)
  • NUTCH-1837 Upgrade to Tika 1.6 (lewismc)
  • NUTCH-1829 Generator : unable to distinguish real errors (Mathieu Bouchard via jnioche)
  • NUTCH-1828 bin/crawl : incorrect handling of nutch errors (Mathieu Bouchard via jnioche)
  • NUTCH-1693 TextMD5Signature computed on textual content (Tien Nguyen Manh, markus via snagel)
  • NUTCH-1409 remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip (Matthias Agethle via snagel)
  • NUTCH-1819 batchId in GeneratorJob (Fjodor Vershinin via lewismc)
  • NUTCH-1708 use same id when indexing and deleting redirects (snagel)
  • NUTCH-1817 Remove pom.xml from source (jnioche)
  • NUTCH-1811 bin/nutch junit to use junit 4 test runner (snagel)
  • NUTCH-1776 Log incorrect plugin.folder file path (Diaa via snagel)
  • NUTCH-1566 bin/nutch to allow whitespace in paths (tejasp, snagel)
  • NUTCH-1605 MIME type detector recognizes xlsx as zip file (snagel)
  • NUTCH-385 Improve description of thread related configuration for Fetcher (jnioche, lufeng)
  • NUTCH-1798 Crawl script not calling index command correctly (Aaron Bedward via jnioche)
  • NUTCH-1769 REST API refactoring (Fjodor Vershinin via lewismc)
  • NUTCH-1633 slf4j is provided by hadoop and should not be included in the job file (kaveh minooie via jnioche)
  • NUTCH-1787 update and complete API doc overview page (snagel)
  • NUTCH-1767 remove special treatment of "params" in relative links (snagel)
  • NUTCH-1718 redefine http.robots.agent as "additional agent names" (snagel, Tejas Patil, Daniel Kugel)
  • NUTCH-1796 Ensure Gora object builders are used as opposed to empty constructors (snagel via lewismc)
  • NUTCH-1590 [SECURITY] Frame injection vulnerability in published Javadoc (jnioche)
  • NUTCH-1736 Can't fetch page if http response header contains Transfer-Encoding:chunked (ysc via jnioche)
  • NUTCH-1782 NodeWalker to return current node (markus)
  • NUTCH-1781 Update gora-*-mapping.xml and gora.properties to reflect Gora 0.4 (lewismc)
  • NUTCH-1768 Upgrade to ElasticSearch 1.1.0 (jnioche)
  • NUTCH-1634 readdb -stats shows the result twice (kaveh minooie via jnioche)
  • NUTCH-1780 ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file (kaveh minooie via lewismc)
  • NUTCH-1676 Add rudimentary SSL support to protocol-http (jnioche, markus)
  • NUTCH-1674 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index (Tien Nguyen Manh and Alparslan Avcı via jnioche)
  • NUTCH-1714 Upgrade to Gora 0.4 (Alparslan Avcı via jnioche)
  • NUTCH-1752 Cache robots.txt rules per protocol:host:port (snagel)
  • NUTCH-1613 Timeouts in protocol-httpclient when crawling same host with >2 threads (brian44 via jnioche)
  • NUTCH-1182 fetcher to log hung threads (snagel)
  • NUTCH-1618 Turn speculative execution off for Fetching (talat)
  • NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser (talat)
  • NUTCH-1725 CleaningJob's reducer does not commit deleted docs. (ilhamikalkan via talat)
  • NUTCH-1728 indexer-solr plugin does not delete docs from Solr (ilhamikalkan via talat)
  • NUTCH-1753 Eclipse dependency problem for 2.x (talat)
  • NUTCH-1720 Duplicate lines in HttpBase.java (Walter Tietze via jnioche)
  • NUTCH-797 URL not properly constructed when link target begins with a "?" (Doug Cook, Robert Hohman, Stondet, ab via snagel)
  • NUTCH-1759 Upgrade to Crawler Commons 0.4 (jnioche)
  • NUTCH-1700 Remove deprecated code in src/plugin/creativecommons/build.xml (lewismc)
  • NUTCH-1761 Crawl script fails to find job file if not started from inside bin dir (David Hosking, jnioche)
  • NUTCH-1603 ZIP parser complains about truncated PDF file (snagel via lewismc)
  • NUTCH-1743 parsechecker to show outlinks (snagel)
  • NUTCH-1732 Better cmd line parsing for NutchServer (Fjodor Vershinin via lewismc)
  • NUTCH-1751 Empty anchors should not index (Sertac TURKEL via lewismc)
  • NUTCH-1733 parse-html to support HTML5 charset definitions (snagel)
  • NUTCH-1727 Configurable length for Tlds (Sertac TURKEL via lewismc)
  • NUTCH-1738 Expose number of URLs generated per batch in GeneratorJob (Talat UYARER via lewismc)
  • NUTCH-1671 indexchecker to add digest field (snagel, lufeng)
  • NUTCH-1645 Junit Test Case for Adaptive Fetch Schedule class (Yasin Kılınç, lufeng, Sertac TURKEL via snagel)
  • NUTCH-1478 Parse-metatags and index-metadata plugin for Nutch 2.x series (kiran, Nguyen anh Tien, Talat UYARER, Vangelis Karvounis via lewismc)
  • NUTCH-1729 Upgrade to Tika 1.5 (jnioche)
  • NUTCH-1721 Upgrade to Crawler Commons 0.3 (tejasp)
  • NUTCH-1719 DomainStatistics fails in 2.x because URL is not unreversed (Gerhard Gossen via lewismc)
  • NUTCH-1253 Incompatible neko and xerces versions (snagel, lewismc, Talat UYARER)
  • NUTCH-1715 RobotRulesParser adds additional '*' to the robots name (tejasp)
  • NUTCH-356 Plugin repository cache can lead to memory leak (Enrico Triolo, Doğacan Güney via markus)
  • NUTCH-1164 Write JUnit tests for protocol-http (Sertac TURKEL via tejasp)
  • NUTCH-1710 Add gora package logging to log4j.properties (lewismc)
  • NUTCH-1655 Indexer Plugin for Elastic Search (Talat UYARER via lewismc)
  • NUTCH-1699 Tika Parser - Image Parse Bug (Mehmet Zahid Yüzügüldü, snagel via lewismc)
  • NUTCH-1568 port pluggable indexing architecture to 2.x (Talat UYARER via lewismc)
  • NUTCH-1672 Inlinks are added twice in DbUpdateReducer (Tien Nguyen Manh via lewismc)
  • NUTCH-1667 Updatedb always ignore batchId (Tien Nguyen Manh via lewismc)
  • NUTCH-1695 NutchDocument.toString() (markus via lewismc)
  • NUTCH-1696 Enable use of (Gora) SNAPSHOT dependencies (lewismc)
  • NUTCH-1681 In URLUtil.java, toUNICODE method does not work correctly (İlhami KALKAN, snagel, markus via lewismc)
  • NUTCH-1673 Title isn't reset in MoreIndexingFilter (Nguyen Manh Tien via lewismc)
  • NUTCH-1621 Remove deprecated class o.a.n.crawl.Crawler (Rui Gao via jnioche)
  • NUTCH-1651 modifiedTime and prevmodifiedTime never set (Talat UYARER via lewismc)
  • NUTCH-1360 Support the storing of IP address connected to when web crawling (ferdy, lewismc, Yasin Kılınç)
  • NUTCH-1588 Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x (Talat UYARER via lewismc)
  • NUTCH-1650 Adaptive Fetch Scheduler interval Wrong Set (Talat UYARER via lewismc)
  • NUTCH-1413 Record response time (Yasin KILINC, Talat UYARER, snagel via lewismc)
  • NUTCH-1125 JUnit test for tld (Sertac TURKEL via lewismc)
  • NUTCH-1124 JUnit test for scoring-opic (Talat UYARER via lewismc)
  • NUTCH-1641 Log timings for main jobs (jnioche)
  • NUTCH-1556 enabling updatedb to accept batchId (kaveh minooie, Feng)
  • NUTCH-1619 Writes Dmoz Description and Title information to db with snippet argument (Yasin Kılınç via feng)
  • NUTCH-1631 Display Document Count Added To Solr Server (Furkan KAMACI via lewismc)
  • NUTCH-1629 Injector skips empty lines in seed files (kaveh minooie via jnioche)
  • NUTCH-1624 Typo in WebTableReader line 486 (kaveh minooie via lewismc)
  • NUTCH-1294 IndexClean job with solr implementation. (Dan Rosher, lewismc, Claudiu Chis via feng)
  • NUTCH-911 protocol-file to return proper protocol status (Peter Lundberg via snagel)
  • NUTCH-1587 misspelled property "threshold" in conf/log4j.properties (snagel)
  • NUTCH-1604 ProtocolFactory not thread-safe (jnioche)
  • NUTCH-1595 Upgrade to Tika 1.4 (jnioche, markus)
  • NUTCH-1594 count variable is never changed in ParseUtil class (Canan via Feng)

New in Apache Nutch 2.2.1 (Oct 7, 2013)

  • NUTCH-1591 Incorrect conversion of ByteBuffer to String (Jason Howes via lewismc)
  • NUTCH-1571 SolrInputSplit doesn't implement Writable and crawl script doesn't pass crawlId to generate and updatedb tasks (yuanyun.cn via lewismc)
  • NUTCH-1126 JUnit test for urlfilter-prefix (Talat UYARER via markus)
  • NUTCH-1585 Ensure duplicate tags do not exist in microformat-reltag tag set (lewismc)
  • NUTCH-1475 Index-More Plugin -- A better fall back value for date field (James Sullivan, snagel via lewismc)
  • NUTCH-1420 Get rid of the dreaded � (markus + lewismc)
  • NUTCH-1578 Upgrade to Hadoop 1.2.0 (markus)
  • NUTCH-1522 Upgrade to Tika 1.3 (jnioche)

New in Apache Nutch 2.0 (Jul 10, 2012)

  • The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.0. This release offers users an edition focused on large-scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in-memory data store and various high-profile SQL stores. After some two years of development, Nutch v2.0 also offers all of the mainstream Nutch functionality, and it builds on Apache Solr™, adding web-specifics such as a crawler, a link-graph database and parsing support handled by Apache Tika™ for HTML and an array of other document formats. Nutch v2.0 shadows the latest stable mainstream release (v1.5.X) based on Apache Hadoop™ and covers many use cases, from small crawls on a single machine to large-scale deployments on Hadoop clusters.

New in Apache Nutch 1.3 (Jun 23, 2011)

  • NUTCH-995 Generate POM file using the Ivy makepom task (mattmann, jnioche, Gabriele Kahlout)
  • NUTCH-1003 task 'package' does not reflect the new organisation of the code (jnioche)
  • NUTCH-994 Fine tune Solr schema (markus)
  • NUTCH-997 IndexingFilters to store Date objects instead of Strings (jnioche)
  • NUTCH-996 Indexer adds solr.commit.size+1 docs (markus)
  • NUTCH-983 Upgrade SolrJ to 3.1 (markus, jnioche)
  • NUTCH-989 Index-basic plugin and Solr schema now use date fieldType for tstamp field (markus)
  • NUTCH-888 Remove parse-rss and add tests for rss to parse-tika (jnioche)
  • NUTCH-991 SolrDedup must issue a commit (markus)
  • NUTCH-986 SolrDedup fails due to incorrect date format (markus)
  • NUTCH-977 SolrMappingReader uses hardcoded configuration parameter name for mapping file (markus)
  • NUTCH-976 Rename properties solrindex.* to solr.* (markus)
  • NUTCH-890 Fix IllegalAccessError with slf4j used in Solrj (markus)
  • NUTCH-891 Subcollection plugin won't require blacklist any more (markus)
  • NUTCH-972 CrawlDbMerger doesn't break on non-existent input (Gabriele Kahlout via jnioche)
  • NUTCH-967 Upgrade to Tika 0.9 (jnioche)
  • NUTCH-975 Fix missing/wrong headers in source files (markus)
  • NUTCH-963 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (Claudio Martella, markus)
  • NUTCH-825 Publish nutch artifacts to central maven repository (mattmann, jnioche)
  • NUTCH-962 max. redirects not handled correctly: fetcher stops at max-1 redirects (Sebastian Nagel via ab)
  • NUTCH-921 Reduce dependency of Nutch on config files (ab)
  • NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
  • NUTCH-872 Change the default fetcher.parse to FALSE (ab)
  • NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
  • NUTCH-964 Upgraded Xerces to 2.91, ERROR conf.Configuration - Failed to set setXIncludeAware (markus)
  • NUTCH-927 Fetcher.timelimit.mins is invalid when depth is greater than 1 (Wade Lau via jnioche)
  • NUTCH-824 Crawling - File Error 404 when fetching file with an hexadecimal character in the file name (Michela Becchi via jnioche)
  • NUTCH-954 Strict application of Content-Length limit for http protocols (Alexis Detreglode via jnioche)
  • NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)
  • NUTCH-935 basicurlnormalizer removes unnecessary /./ in URLs (Stondet via markus)
  • NUTCH-912 MoreIndexingFilter does not parse docx and xlsx date formats (Markus Jelsma, jnioche)
  • NUTCH-886 A .gitignore file for Nutch (dogacan)
  • NUTCH-930 Remove remaining dependencies on Lucene API (ab)
  • NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
  • NUTCH-936 LanguageIdentifier should not set empty lang field on NutchDocument (Markus Jelsma via jnioche)
  • NUTCH-787 ScoringFilters should not override the injected score (jnioche)
  • NUTCH-949 Conflicting ANT jars in classpath (jnioche)
  • NUTCH-863 Benchmark and a testbed proxy server (ab)
  • NUTCH-844 Improve NutchConfiguration (ab)
  • NUTCH-845 Native hadoop libs not available through maven (ab)
  • NUTCH-843 Separate the build and runtime environments (ab)
  • NUTCH-821 Use ivy in nutch builds (Enis Soztutar, jnioche)
  • NUTCH-837 Remove search servers and Lucene dependencies (ab)
  • NUTCH-836 Remove deprecated parse plugins (jnioche)
  • NUTCH-939 Added -dir command line option to SolrIndexer (Claudio Martella via ab)
  • NUTCH-948 Remove Lucene dependencies (ab)

New in Apache Nutch 1.0 (Mar 29, 2009)

  • This release includes several major feature improvements, such as a new indexing framework, a new scoring framework, and Apache Solr integration, to mention just a few.