Apache Nutch
Apache Nutch is a highly extensible and scalable open-source web crawler software project developed under the auspices of the Apache Software Foundation.[1] Designed for production-ready web crawling and diverse data acquisition tasks, it enables fine-grained configuration and accommodates a wide variety of sources through its modular architecture.[1] Originally initiated in 2001 by Doug Cutting and Mike Cafarella as an effort to build a large-scale open-source web search engine capable of indexing billions of pages, Nutch began as a single-machine crawler but quickly evolved to address scalability challenges.[2] By 2004, inspired by Google's distributed file system research, the project incorporated components that later spun off into Apache Hadoop, including the Nutch Distributed File System (which became HDFS) and MapReduce integration for batch processing vast data volumes.[2] Nutch entered the Apache Incubator in January 2005, released its first version as a Lucene sub-project in August 2005, and graduated to a top-level Apache project on April 21, 2010, marking its maturation into a robust tool for distributed web crawling.[3] Key to its scalability is its reliance on Apache Hadoop for handling large-scale operations, allowing it to process petabytes of data across clusters while supporting smaller, non-distributed jobs.[1] Nutch features pluggable integrations, such as Apache Tika for content parsing, and indexing with Apache Solr or Elasticsearch, making it versatile for building custom search engines or data pipelines.[1] Its extensibility is provided through stable interfaces for implementing custom components like parsers, HTML filters, scoring algorithms, and generators, ensuring adaptability to specific use cases without altering core functionality.[1] As of July 2025, the latest stable release is version 1.21, which includes enhancements for improved performance and compatibility in modern environments.[4]
Overview and History
Project Overview
Apache Nutch is a highly extensible and scalable open-source web crawler software project designed for data acquisition tasks such as web crawling and indexing.[1] It enables the batch processing of web-scale data, making it suitable for handling large volumes of information in a distributed environment.[5] Originating in the early 2000s as an effort to develop an open-source alternative to proprietary search engines, Nutch stems from the Apache Lucene project, leveraging its text indexing capabilities to form the foundation for web-scale crawling.[6] The project's modularity allows developers to customize it for building bespoke search engines or data pipelines, with support for plugins that integrate parsing tools like Apache Tika and indexing backends such as Apache Solr or Elasticsearch.[1] As of November 2025, the current stable release is version 1.21, made available on July 20, 2025, which includes enhancements for improved plugin compatibility and performance in distributed setups.[4] This release underscores Nutch's ongoing role in the Apache ecosystem as a mature tool for production-ready web data extraction.[7]
Historical Development
Apache Nutch was founded in 2001 by Doug Cutting and Mike Cafarella as an open-source web search engine project, built in Java on Cutting's earlier Lucene search library, with the goal of creating a scalable alternative to proprietary systems like Google.[2] The project drew inspiration from Google's early distributed computing approaches, aiming to enable large-scale web crawling without reliance on closed-source technology.[8] Early development focused on core crawling capabilities, initially limited to single-machine operation, with the first public releases appearing in 2003–2004; by June 2003 a demonstration system had already indexed roughly 100 million pages.[9] In 2003–2004, inspired by Google's Google File System paper, the project incorporated the Nutch Distributed File System (NDFS) to handle the massive datasets produced by crawling at that scale.
In January 2005, Nutch entered the Apache Incubator to formalize its governance under the Apache Software Foundation.[3] It graduated from the Incubator in June 2005 to become a subproject of Apache Lucene.[10] A major milestone occurred in early 2006, when the NDFS and MapReduce components were extracted from Nutch due to their broader applicability beyond search, forming the independent Apache Hadoop project later that year.[11] This separation sharpened Nutch's focus on crawling while leveraging Hadoop for distributed processing. Nutch 1.0 followed in March 2009, marking its first stable release as an Apache project.[3] Nutch achieved top-level Apache project status on April 21, 2010, gaining independence from Lucene.[12]
The 1.x series continued with releases like 1.2 in 2010, emphasizing dependency upgrades for Hadoop, Solr, and Tika.[3] In the 2010s, the 2.x branch emerged, adopting Apache Gora in 2012 for flexible NoSQL storage backends like HBase, enabling better support for diverse data persistence in distributed environments.[13] Recent 1.x releases have prioritized production readiness, including 1.13–1.16 from 2017–2019 for bug fixes and integrations, 1.18 in January 2021, 1.19 in August 2022, 1.20 in April 2024, and 1.21 in July 2025 with improved plugin support and Hadoop compatibility.[5][14]
Technical Architecture
Core Components
Apache Nutch's core components form the foundational elements of its modular architecture, enabling efficient management of web crawling data and extensibility for diverse use cases. These components include the crawl database, segments, link database, plugin system, and storage backends, each designed to handle specific aspects of data storage, processing, and customization in a scalable manner.[15]
The crawl database (crawldb) serves as a centralized repository for metadata associated with URLs discovered during crawling. It stores critical information such as the status of each URL (e.g., unfetched, fetched, or redir_temp), fetch timestamps, page scores for prioritization, and content signatures for duplicate detection. This database is initially populated through the injection of seed URLs and subsequently updated to incorporate new discoveries from parsed segments, ensuring a comprehensive record of the crawl's progress. The crawldb is implemented using Hadoop-compatible file formats such as SequenceFiles, supporting storage on HDFS for distributed environments or local filesystems for smaller setups.[15]
Segments represent temporary storage units that encapsulate the outputs of individual crawling rounds, facilitating batch processing without permanent data accumulation until merging. Each segment is a timestamped directory (e.g., crawl/segments/20250720120000) containing subdirectories for fetch lists, fetched content, parse data, and fetch logs, including raw HTML, extracted text, and outlinks. These structures allow Nutch to process discrete batches of URLs, with fetched content stored in a compressed format and parsed outputs including metadata like title and content digest for further indexing. Segments are generated from the crawldb and are transient, often merged periodically to update the main databases.[15]
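Both structures can be inspected from the command line with Nutch's reader tools. The following is a minimal sketch, assuming a crawl directory named crawl on the local filesystem or HDFS; the paths are illustrative:
  # Summarize URL statuses and score distribution in the crawl database
  bin/nutch readdb crawl/crawldb -stats
  # List the segments generated so far
  bin/nutch readseg -list -dir crawl/segments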
The link database (linkdb) maintains a graph-like structure of hyperlinks extracted from crawled pages, focusing on inverted links to support analysis and optimization. It records incoming links (inlinks) for each URL, including the source URL and anchor text, which enables link-based scoring mechanisms such as PageRank for prioritizing high-value pages. This component also aids in deduplication by cross-referencing link patterns and is built by inverting outlinks from segments, providing a consolidated view of the web graph for enhanced crawl efficiency.[16][15]
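Once built, the linkdb can be queried to see which pages link to a given URL. A minimal sketch, assuming a linkdb at crawl/linkdb; the target URL and output directory are placeholders:
  # Show the inlinks (source URL and anchor text) recorded for one page
  bin/nutch readlinkdb crawl/linkdb -url https://example.org/some/page
  # Or dump the whole inverted link graph to text for offline analysis
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump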
Nutch's plugin system provides a modular framework for extending core functionality through custom implementations, allowing developers to tailor behaviors without modifying the base code. Based on an Eclipse 2.x-inspired architecture, plugins define extension points for components like URL filters (e.g., regex-based exclusion), protocol fetchers (e.g., HTTP or FTP handlers), and content parsers (e.g., integrating Apache Tika for multimedia). Plugins are dynamically loaded by specifying their names in the plugin.includes property within the configuration file, enabling runtime activation of interfaces such as ScoringFilter for custom ranking or ParseFilter for metadata extraction. This system ensures Nutch's adaptability to specialized crawling needs, such as domain-specific parsing or politeness policies.[17]
Storage backends underpin Nutch's data persistence, leveraging Apache Hadoop's distributed file system (HDFS) for handling large-scale crawls across clusters. For smaller setups, local filesystems suffice, but production environments favor HDFS for its fault-tolerant, scalable storage of segments and databases. Nutch 1.x uses file-based structures like SequenceFiles and MapFiles for persisting the crawldb, linkdb, and other data on HDFS or local storage.[15]
Crawling Workflow
The crawling workflow in Apache Nutch operates as a batch-oriented process, executed in iterative rounds to systematically discover, fetch, and process web content while managing scale and politeness to hosts. This workflow is orchestrated primarily through the bin/crawl script, which automates the sequence of steps, allowing configuration for parameters such as crawl depth and the maximum number of URLs to process per iteration.[18] The process relies on key data structures like the crawldb, which stores URL metadata, statuses, and scores, to track progress across batches.[15]
The workflow begins with the inject step, where seed URLs, typically provided in a directory of text files, are loaded into the crawldb. This step assigns initial metadata, such as fetch times and scores, to these URLs, marking them as unvisited (status db_unfetched) and preparing them for discovery. For example, the command bin/nutch inject crawldb urls integrates the seeds while applying URL filters to exclude invalid or undesired links.[15] This foundational step ensures the crawler starts from a controlled set of entry points, avoiding random or incomplete coverage.
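A minimal sketch of seeding a fresh crawl; the directory layout and the example.org URL are placeholders rather than part of the Nutch distribution:
  # Create a seed list with one URL per line
  mkdir -p urls
  echo "https://example.org/" > urls/seed.txt
  # Inject the seeds into a new (or existing) crawl database
  bin/nutch inject crawl/crawldb urls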
Next, the generate step selects a subset of URLs from the crawldb for fetching in the current batch. It prioritizes URLs based on scores (e.g., from link analysis or recency), generating fetch lists organized into segments, temporary directories that group URLs by host or domain to enforce politeness policies such as minimum delays between requests to the same site (configurable via fetcher.server.delay). For example, the command bin/nutch generate crawldb segments -topN 100000 limits output to the top N scored URLs and creates a new timestamped segment; generated URLs are marked in the crawldb so that subsequent generate runs do not produce overlapping fetch lists. This step caps the number of URLs per host to balance load, typically using reducers in Hadoop mode to partition by domain.
In the fetch step, Nutch downloads the content of URLs in the generated segments using pluggable fetchers, primarily for HTTP/HTTPS protocols. It respects robots.txt directives, handles redirects, retries failed requests (up to a configurable limit), and applies user-agent strings for identification. Raw content and headers are stored in segment subdirectories such as crawl_fetch and content, while failures update the crawldb status accordingly, with transient errors marked for retry and permanently unreachable pages marked as GONE. For instance, with 50 threads, the command bin/nutch fetch segments/<segment> -threads 50 processes the batch efficiently, incorporating delays to comply with host policies.[15]
The parse step follows, where fetched content is analyzed by pluggable parsers, such as Apache Tika for HTML, to extract plain text, outgoing links, and metadata (e.g., title, media type). Outgoing links (outlinks) to newly discovered URLs are extracted and stored in the parse data files (e.g., parse_data), along with plain text in parse_text. These outlinks are later used to update the crawldb and can be inverted to build the linkdb. Parse data also includes metadata like signatures for deduplication. The command bin/nutch parse segments/<segment> skips already parsed content, ensuring efficiency in subsequent batches. This step enriches the crawldb with link structures for scoring and enables the discovery of deeper web layers.
Finally, the update step merges results from parsing back into the crawldb, recalculating scores based on factors like inlink count and link analysis (e.g., PageRank-inspired methods), updating fetch statuses (e.g., to SUCCESS or REDIRECT), and incorporating new URLs from parses. Optional deduplication occurs here or later via content signatures (e.g., MD5 hashes) to eliminate duplicates across batches. The command bin/nutch updatedb crawldb segments/<segment> completes the round, preparing the crawldb for the next iteration. Scores are adjusted to prioritize fresh or authoritative content, with caps on inlinks to manage storage. The linkdb can be updated separately using the bin/nutch invertlinks command to invert outlinks from segments into inlinks.
The entire workflow runs in discrete batches via the bin/crawl script, for example bin/crawl -i -s urls crawl 3, which injects the seeds from urls, repeats the generate-fetch-parse-update cycle for the requested number of rounds (here three), and optionally indexes after each round, while topN and per-host limits control scope and resource use. This batched approach allows for incremental crawling, where each round builds on the previous crawldb state, facilitating restarts and monitoring without losing progress. Politeness and scoring ensure ethical, effective exploration, adapting to web dynamics over multiple iterations.[18]
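For finer control, the same round can be driven step by step with the individual commands described above. The following sketch of one manual round assumes the crawl data lives under crawl/ and that seeds have already been injected; the topN value and thread count are illustrative:
  # Generate a fetch list of the top-scored URLs into a new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  # Pick up the segment directory that was just created
  segment=$(ls -d crawl/segments/* | tail -1)
  # Fetch and parse the segment, then fold the results back into the crawldb
  bin/nutch fetch "$segment" -threads 50
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"
  # Invert the segment's outlinks into the link database
  bin/nutch invertlinks crawl/linkdb "$segment"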
Key Features and Capabilities
Extensibility and Plugins
Apache Nutch employs a modular plugin architecture inspired by the Eclipse 2.x framework, allowing developers to extend core functionality without modifying the base codebase. Plugins are discovered and loaded at runtime through configuration in the nutch-site.xml file, where the plugin.includes property holds a regular expression (a pipe-separated list of plugin names) specifying which plugins to activate. This system uses plugin.xml manifest files within each plugin directory to declare extensions and dependencies, enabling dynamic installation of components as listeners to predefined extension points.[17][19]
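In practice, activation amounts to overriding plugin.includes in conf/nutch-site.xml. A minimal sketch, with an illustrative (not exhaustive) expression of plugin names:
  <!-- conf/nutch-site.xml: overrides values from nutch-default.xml -->
  <configuration>
    <property>
      <name>plugin.includes</name>
      <!-- pipe-separated regular expression of plugin directories to load at runtime -->
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
  </configuration>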
Key built-in plugins provide essential crawling behaviors: protocol plugins for HTTP and HTTPS handle web fetching and honor robots.txt directives, the urlfilter-regex plugin excludes unwanted URLs by regular expression, the language-identifier plugin detects page language, the parse-tika plugin parses HTML and many other document formats, and URL normalizer plugins standardize links. These plugins implement interfaces such as URLFilter for filtering URLs before fetching (e.g., excluding certain domains or paths), HtmlParseFilter (ParseFilter) for extracting and modifying metadata during parsing, and IndexingFilter for enriching documents before indexing. Nutch includes numerous extension points (including 12 core interfaces) for pre- and post-processing stages, facilitating customizations like topical focused crawling with keyword-based URL filters or handling multimedia content through specialized protocol extensions.[20][21]
Custom plugin development allows tailored adaptations, such as integrating JavaScript rendering with the community-maintained Selenium protocol plugin to capture dynamic content from AJAX-heavy sites, or implementing custom scoring filters via the ScoringFilter interface to prioritize pages based on relevance metrics. Developers activate these via updates to nutch-site.xml and build plugins using Ant, with each plugin carrying its own build.xml and ivy.xml for dependencies, ensuring seamless integration into the crawling pipeline. For parsing diverse formats, Nutch delegates to external tools such as Apache Tika through its built-in parse plugins.[22][23][1]
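Each plugin also ships a plugin.xml manifest that registers its implementation against an extension point. The sketch below describes a hypothetical custom URL filter; the plugin id, jar, and class names are invented for illustration, and the structure mirrors that of the bundled filter plugins:
  <!-- src/plugin/urlfilter-example/plugin.xml (hypothetical plugin) -->
  <plugin id="urlfilter-example" name="Example URL Filter" version="1.0.0" provider-name="example.org">
    <runtime>
      <library name="urlfilter-example.jar">
        <export name="*"/>
      </library>
    </runtime>
    <requires>
      <import plugin="nutch-extensionpoints"/>
    </requires>
    <extension id="org.example.nutch.urlfilter" name="Example URL Filter"
               point="org.apache.nutch.net.URLFilter">
      <implementation id="ExampleURLFilter" class="org.example.nutch.ExampleURLFilter"/>
    </extension>
  </plugin>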
As of July 2025, Apache Nutch 1.21 introduced enhancements to extensibility, including a new protocol-smb plugin for accessing SMB shares and consolidation of plugin extension names for better maintainability.[4][24]
Indexing and Storage Options
In its 2.x branch, Apache Nutch employs Apache Gora as a storage abstraction layer to manage data persistence for crawled content, supporting distributed backends such as HBase, Cassandra, and Accumulo for scalable operations in production environments.[3] For development and testing, a local Derby database can be configured via the SQL backend, though this option has been deprecated in favor of NoSQL alternatives in recent versions. In the 1.x series, the crawl database (crawldb) and link database (linkdb), which track URL status and hyperlinks respectively, are typically serialized as SequenceFiles and stored on the Hadoop Distributed File System (HDFS) when using Hadoop-integrated setups, ensuring efficient batch processing and fault tolerance.[25]
Following the parsing phase, Nutch's indexing process utilizes IndexWriter plugins to export processed documents to external search engines, primarily Apache Solr or Elasticsearch, enabling full-text search capabilities.[26] Key document fields generated during indexing include the URL for unique identification, title for metadata, content for searchable text, and a boost score to influence ranking based on relevance factors like page authority.[27] These fields are populated through a series of indexing filters that normalize and enrich the data, such as adding hostname and anchor text from inbound links, before batch submission to the chosen backend.[27]
For raw data access without immediate indexing, Nutch provides the SegmentReader tool (invoked as bin/nutch readseg), which allows reading and inspecting crawl segments directly from HDFS or the local filesystem in a structured format.[28] Additionally, integration scripts and plugins facilitate exporting crawl data to archival formats like JSON for programmatic consumption or WARC (Web ARChive) for long-term preservation and interoperability with tools like the Common Crawl project.[29][30]
Customization of indexing is achieved through schema mapping in Solr cores, where Nutch's default schema.xml, featuring fields like text_general for analyzed general-purpose text, is copied and adapted to match specific indexing requirements.[15] For handling large-scale indexes exceeding single-node capacity, SolrCloud mode distributes the index across a cluster, supporting sharding and replication while integrating seamlessly with Nutch's Hadoop-based workflows.[15] As of July 2025, Apache Nutch version 1.21 includes bug fixes and improvements to the Elasticsearch IndexWriter plugin, facilitating integration with modern Elasticsearch versions for improved search performance.[4][26][31]
Scalability and Performance
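A sketch of pushing parsed segments to a search backend with the indexing job; it assumes the Solr or Elasticsearch connection has already been configured in conf/index-writers.xml, and the segment path is illustrative:
  # Index all segments, filtering and normalizing URLs and removing documents now gone
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -filter -normalize -deleteGone
  # Dump one segment's parsed text and metadata to plain files for inspection or export
  bin/nutch readseg -dump crawl/segments/20250720120000 segment_dump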
Distributed Processing with Hadoop
Apache Nutch integrates deeply with Apache Hadoop to enable distributed web crawling, implementing its core crawling steps (generate, fetch, parse, and update) as MapReduce jobs. This architecture allows Nutch to process vast amounts of data across a cluster of commodity hardware, leveraging Hadoop's distributed file system (HDFS) for storage and YARN for resource management in versions 2.x and later. Specifically, Nutch requires Hadoop 2.x or 3.x to support YARN-based job scheduling, ensuring efficient allocation of computational resources for large-scale operations.[32]
In the job distribution process, the fetch and parse phases are parallelized across multiple nodes, where input splits are derived from segmented URL lists stored in the crawldb. Each mapper processes a subset of URLs, fetching content or parsing responses independently, while combiners perform partial aggregation to reduce data shuffling between the map and reduce phases, for instance in the dedup job, which identifies duplicates by content digest. This distribution enables Nutch to handle high-volume crawling tasks efficiently, scaling horizontally by adding nodes to the Hadoop cluster.[32]
Fault tolerance is inherent in Nutch's Hadoop integration, with automatic retries for failed tasks such as network timeouts during fetching, and persistence of intermediate states like crawl segments on HDFS. If a job fails midway, Nutch can resume from the last completed step without restarting the entire crawl, minimizing data loss and downtime in distributed environments.[33]
Configuration for distributed processing occurs primarily through Hadoop's configuration files such as core-site.xml, mapred-site.xml, and yarn-site.xml, where parameters like fs.defaultFS specify the HDFS namenode location, yarn.resourcemanager.address points jobs at the YARN resource manager (Hadoop 2.x and later), and io.sort.factor tunes the merge behavior of MapReduce sorting to optimize performance. Nutch also supports a local mode for single-node testing, running the Hadoop code in-process or in a pseudo-distributed single-node setup without a full cluster.[33][34]
Following the 2006 spin-off of Hadoop from the Nutch project, Nutch 1.x was optimized to fully leverage the Hadoop ecosystem, enhancing its ability to manage petabyte-scale crawls through refined MapReduce pipelines and HDFS integration. This evolution positioned Nutch as a robust tool for batch-oriented, large-volume data acquisition in production settings. As of version 1.21 (July 2025), enhancements further improve performance and compatibility with modern Hadoop environments.[32][4]
Scaling Strategies
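A minimal sketch of the cluster-side properties referenced above; the host names and ports are placeholders, and the files belong to the Hadoop configuration directory rather than to Nutch itself:
  <!-- core-site.xml: where the distributed filesystem lives -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
  <!-- yarn-site.xml: where MapReduce jobs are submitted -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager.example.org:8032</value>
  </property>
In deploy mode, the Nutch job archive built under runtime/deploy is then submitted to this cluster by the same bin/nutch and bin/crawl scripts used in local mode.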
Apache Nutch supports vertical scaling on single machines by increasing JVM heap size and fetcher threads, making it suitable for smaller crawls involving fewer than 1 million pages. For instance, the default heap size of 4 GB can be raised further using options like -Xmx8g in the Nutch run script, while fetcher threads can be configured via fetcher.threads.fetch (default 50) to optimize resource utilization without distributed processing.[35][15] This approach is effective for prototyping or limited-domain crawls but becomes inefficient beyond modest scales due to hardware constraints.
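A sketch of single-machine tuning, under the assumption that NUTCH_HEAPSIZE is the heap variable (in megabytes) read by the bin/nutch launcher; the thread count shown is illustrative rather than a recommendation:
  # Give the Nutch JVM an 8 GB heap (value in MB)
  export NUTCH_HEAPSIZE=8000
  # Override the configured fetcher thread pool for this run
  bin/nutch fetch crawl/segments/20250720120000 -threads 100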
Horizontal scaling leverages Hadoop clusters with multiple DataNodes to distribute workload across machines, enabling crawls of hundreds of millions of pages. Configuration involves setting up HDFS for storage and MapReduce for job distribution, with URL partitioning by domain (via partition.url.mode=byDomain) to prevent hotspots where a single reducer handles disproportionate traffic from popular sites. In practice, clusters of 16 nodes have fetched over 400 million pages by spawning parallel instances and limiting URLs per host to 1,000, ensuring balanced load and avoiding overload on high-traffic domains.[36][37]
Politeness and throttling mechanisms are configurable to respect site policies and avoid bans, with delays set via fetcher.server.delay (e.g., 1-2 seconds per host) and concurrent fetches limited by topN (e.g., 1,000 highest-scored URLs per segment) and fetcher.threads.per.queue (typically 1 for strict politeness). These settings enforce crawl delays from robots.txt and cap session times (e.g., 45 minutes per fetch) to manage queue backlogs from sites with long delays, promoting ethical crawling at scale.[37][15][38]
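A sketch of the corresponding overrides in conf/nutch-site.xml; the values echo the examples above and should be tuned per deployment (all properties sit inside the <configuration> element):
  <property>
    <name>partition.url.mode</name>
    <value>byDomain</value> <!-- spread fetch lists by domain to avoid reducer hotspots -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value> <!-- seconds to wait between requests to the same host -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value> <!-- one thread per host queue for strict politeness -->
  </property>
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>45</value> <!-- abandon the remaining fetch list after 45 minutes -->
  </property>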
Performance is monitored through Hadoop job counters, such as the per-status success counts reported by the fetcher job for tracking fetch success rates (often exceeding 95% in optimized runs) and the corresponding counters of the parse job for parse efficiency. Common bottlenecks include I/O-intensive operations in the parse phase, where high disk usage from segment storage can slow processing; tuning involves monitoring these counters via job logs to identify and mitigate issues like failed parses or network timeouts.[39]
While Nutch can handle billions of pages through batch processing—designed to fetch several billion monthly—it struggles with real-time requirements due to its MapReduce-based workflow, which processes in discrete segments rather than continuous streams. For advanced setups needing streaming capabilities, hybrid approaches integrate Nutch with frameworks like Apache Storm via projects such as StormCrawler to enable near-real-time crawling and processing.[40][41][42]