Heritrix
Heritrix is an open-source, extensible, web-scale web crawler designed specifically for high-quality web archiving.[1] Developed by the Internet Archive, it systematically captures and preserves digital web content to ensure long-term accessibility for researchers, historians, and future generations.[2] The project derives its name from an archaic term for "heiress," reflecting its purpose of inheriting and safeguarding digital cultural artifacts.[1] Heritrix emphasizes archival integrity through features such as respect for robots.txt directives and META robots tags, adaptive politeness policies to avoid overloading target sites, and customizable crawling behaviors via Java and scripting.[2] It supports embedding within larger systems and produces output in formats like WARC (Web ARChive), facilitating integration with preservation tools.[1]
First released on January 5, 2004 (version 0.2.0), Heritrix has evolved through major updates, including version 1.0.0 in August 2004, 2.0.0 in February 2008, and 3.0.0 in December 2009, with ongoing community-driven improvements addressing bugs and enhancements; the latest stable release is version 3.12.0, issued on October 30, 2025.[1][3] Licensed under the Apache License 2.0, it is freely available for redistribution and modification, hosted on platforms like GitHub and SourceForge.[2]
Heritrix powers significant web archiving initiatives, serving as the core technology for services like Archive-It, where it enables large-scale, customizable harvesting for partner organizations worldwide.[4] Its robust design has made it a standard tool in the field, used by institutions to create comprehensive snapshots of the evolving internet.[1]
Introduction
Overview
Heritrix is an open-source, extensible, web-scale web crawler implemented in Java, specifically engineered for web archiving to capture and preserve digital content across large portions of the internet.[2] It enables institutions to systematically harvest web resources while respecting archival standards, producing outputs such as ARC and WARC files for long-term storage and replay.[2] Developed primarily by the Internet Archive in collaboration with Nordic national libraries, Heritrix emphasizes modularity and configurability to support diverse archiving needs.[5] The project is licensed under the Apache License 2.0, permitting free use, modification, and distribution by researchers, libraries, and organizations worldwide.[6] Heritrix runs on Linux or other Unix-like systems, with Windows not regularly tested or supported, and requires Java 17 or later for operation.[7] The latest stable release, version 3.12.0, was issued on October 30, 2025.[8]
History and Development
Heritrix's development began in 2003 as a collaborative effort between the Internet Archive and several national libraries, including those in the Nordic region, to create an open-source web crawler tailored for archival purposes based on shared specifications.[9][10] This initiative addressed the need for a robust, extensible tool capable of handling large-scale web preservation, drawing on prior web-crawling experience.[11] The first official release occurred on January 5, 2004, as version 0.2.0, marking the project's public debut and enabling initial testing and adoption by archiving institutions.[1] Subsequent refinements led to version 1.0.0 in mid-2004, which incorporated feedback from international workshops and focused on stabilizing core crawling functionality for broader use.[12] The Internet Archive has driven major updates and maintenance since the project's inception.[9]
A pivotal milestone came with the release of version 3.0.0 on December 5, 2009, introducing improved extensibility through modular architecture and better support for distributed crawling, which facilitated adaptation to evolving web technologies.[1] Ongoing development has emphasized web-scale performance, with regular updates addressing scalability, security, and integration challenges. Since 2008, the project has benefited from open-source community involvement via the GitHub repository (internetarchive/heritrix3), where contributors worldwide submit improvements, bug fixes, and extensions to sustain its relevance in archival crawling.[2]
Design and Features
Core Principles
Heritrix prioritizes archival quality by implementing polite crawling practices that respect site policies and minimize disruption to web servers. It adheres strictly to robots.txt directives to avoid disallowed paths and employs rate-limiting mechanisms through its frontier manager to prevent server overload, ensuring requests are spaced appropriately based on configured politeness delays. This focus on completeness rather than speed allows for thorough capture of web content, including metadata and linked resources, while avoiding aggressive fetching that could compromise data fidelity or site availability.[2][13]
A core tenet of Heritrix is its extensibility, achieved through a modular plugin architecture that enables users to customize key aspects of the crawling process. This design supports pluggable modules for URI generation, content processing, and frontier management, allowing adaptations for specialized archiving scenarios without altering the core codebase. Implemented in Java, and requiring version 17 or later as of release 3.12.0 (October 2025), this architecture lets developers extend the crawler to handle unique requirements, such as filtering or transforming data during capture.[13][14][8]
Heritrix is engineered for scalability in web-scale operations, supporting distributed crawling across multiple machines to manage vast collections while preserving data integrity. Through tools like the Heritrix Cluster Controller, it coordinates instances on separate hosts, enabling parallel processing of URI queues and efficient load balancing without duplicating effort or losing archival context. This approach ensures robust performance for large-scale archives, such as national web collections, by maintaining consistent state and error recovery across the cluster.[2][13]
As an open-source project under the Apache License 2.0, Heritrix embodies a collaborative ethos that invites community-driven enhancements to address evolving archiving needs. Contributions from users worldwide have improved its handling of modern web technologies, such as extracting links from JavaScript code, and its behavior in diverse deployment environments, fostering an ecosystem of extensions that broadens its applicability beyond traditional static archiving.[2][1]
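To make the pluggable-module idea concrete, the following minimal Java sketch shows how interchangeable filter modules might be composed and swapped via configuration; it is illustrative only and deliberately avoids Heritrix's actual classes (the UriDecideRule interface and both rules here are hypothetical stand-ins).
    // Illustrative sketch only: a simplified, hypothetical plug-in point modelled
    // loosely on the pluggable-module idea; it does not use Heritrix classes.
    import java.util.List;
    import java.util.function.Predicate;

    public class PluggableFilterSketch {
        // A "decide rule" style module: accept or reject a candidate URI string.
        interface UriDecideRule extends Predicate<String> {}

        // Two interchangeable modules; a crawl could swap these via configuration.
        static UriDecideRule hostRule(String host) {
            return uri -> uri.contains("://" + host + "/") || uri.endsWith("://" + host);
        }
        static UriDecideRule rejectBinaries() {
            return uri -> !(uri.endsWith(".zip") || uri.endsWith(".exe"));
        }

        public static void main(String[] args) {
            List<UriDecideRule> rules = List.of(hostRule("example.org"), rejectBinaries());
            String candidate = "http://example.org/reports/2004.zip";
            boolean inScope = rules.stream().allMatch(r -> r.test(candidate));
            System.out.println(candidate + " in scope? " + inScope); // false: .zip rejected
        }
    }
In Heritrix itself, comparable decisions are made by configured DecideRule beans evaluated against each candidate URI, as described in the next subsection.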
Key Capabilities
Heritrix provides robust support for configurable crawling scopes, enabling precise control over the web areas to archive. Crawls begin with seed-based URI selection, where users specify starting URLs—either singly or in batches from text files—that serve as the initial points of exploration. These seeds can be modified after job creation for flexibility. Scope rules, implemented via DecidingScope and various DecideRules, govern inclusion and exclusion decisions; for instance, SurtPrefixedDecideRule uses Sort-friendly URI Reordering Transform (SURT) prefixes to restrict crawling to defined domains (e.g., "http://(org,foo,www,)/"), hosts, or path segments, while rules like MatchesRegexDecideRule or NotMatchesFilePatternDecideRule apply regular expressions for fine-grained filtering. Additionally, Heritrix handles dynamic sites through the ExtractorJS processor, which extracts potential URIs from JavaScript source code without executing it, discovering links that may be embedded in scripts.[13][15]
In terms of resource handling, Heritrix comprehensively captures HTTP responses, associated metadata (such as status codes, MIME types, and content lengths), and embedded assets including images, videos, and other media referenced in HTML. This ensures archival fidelity by storing full payloads in formats like WARC, with built-in deduplication via URI uniqueness filters to avoid redundant downloads. MIME-type filtering, via modules such as ContentTypeRegExpFilter or ContentTypeMatchesRegexDecideRule, permits post-fetch decisions based on content type (e.g., a pattern like "text/html.*" to keep only HTML), optimizing storage and focusing on relevant content like textual documents over binaries.[13][15]
Performance optimizations in Heritrix facilitate efficient, large-scale operations through multi-threaded crawling, where the maxToeThreads parameter controls concurrent fetch threads—recommended at 150-200 for balanced throughput—and politeness policies enforce delays (e.g., via delay-factor and max-delay-ms) to respect server loads. Checkpointing writes the crawler's internal state to stable storage at user-defined intervals, supporting resumable jobs that recover from interruptions or failures without restarting from seeds; it is invoked via the web UI, REST API, or command line, with experimental fast modes for quicker saves. Integration with scalable storage systems enables terabyte-scale archives, as evidenced by Internet Archive crawls producing over 80 terabytes of data from billions of URIs in single runs.[13][16][17]
Compliance features ensure ethical and standards-compliant operation, with support for HTTP/1.1, HTTP/2, and HTTP/3 via the FetchHTTP2 module (introduced in version 3.10.0), full HTTPS handling including port 443 detection, and robots.txt policy enforcement (classic, ignore, or custom modes). Built-in authentication covers HTTP Basic and Digest via RFC 2617 credentials with retry logic for 401 responses, as well as HTML form-based logins for POST/GET interactions on protected sites. Proxy configurations are supported through command-line flags like --proxy-host and --proxy-port, or per-job settings like httpProxyHost, allowing operation in networked environments while maintaining archival integrity.[13][15][18]
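The SURT-prefix scoping described above can be illustrated with a short, self-contained Java sketch; the toSurt method below is a simplified approximation (it ignores ports, userinfo, and other normalization that a full SURT implementation performs) and is not Heritrix's own code.
    // Simplified illustration of SURT-prefix scoping: a URI's host is reversed into
    // comma-separated tokens so that prefix comparison naturally groups a domain
    // together with all of its subdomains.
    import java.net.URI;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    public class SurtPrefixSketch {
        // Convert "http://www.foo.org/path" to "http://(org,foo,www,)/path" (simplified).
        static String toSurt(String url) {
            URI u = URI.create(url);
            List<String> labels = Arrays.asList(u.getHost().toLowerCase().split("\\."));
            Collections.reverse(labels);
            return u.getScheme() + "://(" + String.join(",", labels) + ",)"
                    + (u.getRawPath().isEmpty() ? "/" : u.getRawPath());
        }

        public static void main(String[] args) {
            String surtPrefix = "http://(org,foo,";           // covers foo.org and its subdomains
            String candidate  = "http://www.foo.org/images/a.png";
            System.out.println(toSurt(candidate));            // http://(org,foo,www,)/images/a.png
            System.out.println(toSurt(candidate).startsWith(surtPrefix)); // true -> in scope
        }
    }
Because the host labels are reversed, a single string prefix such as "http://(org,foo," matches foo.org and every subdomain beneath it, which is what makes SURT prefixes convenient for domain-level scoping.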
Architecture
Core Components
Heritrix's core architecture revolves around modular components that facilitate scalable web crawling and archiving, with the system designed in Java to allow pluggable extensions for customization. These components handle URI management, processing, boundary enforcement, and data persistence, ensuring efficient operation across distributed environments.[19]
The Frontier serves as the central queue manager for URIs, maintaining the state of discovered, queued, and processed URIs while preventing duplicates and enforcing crawl order. It prioritizes URIs based on predefined scopes, politeness policies—such as delays between requests to the same host—and resource availability, distributing work to multiple threads or machines for parallelism. The Frontier is pluggable, with implementations like BdbFrontier using Berkeley DB for persistent storage of queue data and supporting checkpointing for recovery after interruptions. It interacts with other components by receiving initial seeds and newly discovered URIs, then selecting the next URI for processing while logging events in recoverable formats.[13][20][19]
Processor chains form a sequential pipeline of modular processors that handle the lifecycle of each URI, from precondition checks to final storage. Organized into five primary chains—pre-fetch (for validation like DNS resolution and robots.txt compliance), fetch (for content retrieval via protocols like HTTP), extractor (for link discovery and metadata parsing), write/index (for data serialization), and post-processing (for cleanup or additional analysis)—these chains apply ordered operations using pluggable modules such as FetchHTTP for HTTP requests or ExtractorHTML for parsing hyperlinks. Each processor updates the URI's state and can modify or reject it, enabling adaptability for tasks like handling JavaScript or authentication. This chain-based design allows developers to insert custom processors without altering core logic, promoting extensibility in large-scale crawls.[19][13][20]
The URI generator and scope enforcer work together to define crawl boundaries and populate the Frontier with valid targets. The URI generator initializes the crawl by seeding URIs from configuration files or external sources, while the scope enforcer applies rules—such as regex patterns, SURT (Sort-friendly URI Reordering Transform) prefixes, or domain restrictions—to filter discovered links and ensure only in-scope URIs are queued. Pluggable scope implementations, like BroadScope for permissive crawling or SurtPrefixScope for targeted domains, integrate with decide rules to accept, reject, or defer URIs dynamically during processing. This duo maintains crawl focus, avoiding off-topic expansion and respecting resource limits set in the crawl order configuration.[13][20]
The storage manager provides interfaces for persisting crawl data, including fetched content, metadata, and logs, through pluggable backends that support formats like ARC or WARC. It coordinates writing via processors in the write chain, such as ARCWriterProcessor, which handles file creation, compression, and rotation based on size or time limits, directing output to specified directories or remote systems. Configurable options allow multiple storage paths, hostname-based segmentation, and integration with external archives, ensuring durability and scalability for terabyte-scale collections. This component also manages server caches for shared data like IP resolutions or robots policies, reducing redundant operations across URIs.[13][19]
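The chained-processor design can be sketched in a few lines of plain Java; the Processor and CandidateUri types below are simplified stand-ins, not Heritrix's Processor or CrawlURI classes, and each stage merely records what a real pre-fetch, fetch, extractor, or writer module would do.
    // A minimal sketch of the chained-processor idea (pre-fetch, fetch, extract, write),
    // using plain Java types rather than Heritrix's actual classes.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ProcessorChainSketch {
        static class CandidateUri {               // stand-in for a crawl URI and its state
            final String uri;
            final Map<String, Object> state = new HashMap<>();
            CandidateUri(String uri) { this.uri = uri; }
        }
        interface Processor { void process(CandidateUri c); }

        public static void main(String[] args) {
            // Each stage records what it would have done; real processors fetch, parse, write.
            List<Processor> chain = List.of(
                c -> c.state.put("robotsChecked", true),                         // pre-fetch chain
                c -> c.state.put("fetched", "HTTP/1.1 200 OK"),                  // fetch chain
                c -> c.state.put("outlinks", List.of("http://example.org/about")), // extractor chain
                c -> c.state.put("written", "WARC record")                       // write/index chain
            );
            CandidateUri c = new CandidateUri("http://example.org/");
            chain.forEach(p -> p.process(c));
            System.out.println(c.uri + " -> " + c.state);
        }
    }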
Crawling Mechanisms
Heritrix executes crawl jobs through a structured operational workflow that begins with initialization from seed URIs, which serve as the starting points for discovery and are enqueued into the crawler's frontier for processing.[13] The frontier, a core component responsible for URI management, organizes these URIs into host-specific queues to ensure orderly processing and prevent overwhelming individual servers.[15] As the crawl progresses, URIs are dequeued and fetched primarily over HTTP (versions 1.0 through HTTP/3) using the FetchHTTP processor within a configurable fetch chain.[15] Upon successful retrieval, link extraction occurs through dedicated processors that parse the content to identify and enqueue new candidate URIs, while disposition decisions—such as storing in-scope content, discarding out-of-scope items, or scheduling revisits—are made via decide rules like AcceptDecideRule or MatchesRegexDecideRule.[13] This lifecycle repeats iteratively until the job reaches configured limits, such as byte quotas or time boundaries, emphasizing a breadth-first approach to systematic web exploration.[15]
Politeness enforcement is integral to Heritrix's design to respect server resources and comply with web standards, achieved by imposing configurable delays between requests to the same host.[15] The frontier maintains separate queues per host, processing only one URI at a time per queue to avoid overload, with minimum delays (e.g., 3,000 ms via minDelayMs) and maximum delays (e.g., 30,000 ms via maxDelayMs) modulated by a delay factor (e.g., 5.0) that scales the wait according to how long the previous fetch took.[15] These settings, defined in the job's crawler-beans.cxml configuration, can be overridden with configuration sheets for domain-specific politeness levels, ensuring the crawler adheres to robots.txt directives and minimizes disruption to target sites.[13] By limiting concurrent threads (e.g., up to 50 via maxToeThreads), Heritrix balances crawl speed with ethical considerations, reducing the risk of IP blocking.[15]
Error handling in Heritrix focuses on robustness against network transients and anomalies, employing automated retries for fetch failures.[15] Transient errors, such as timeouts or temporary server unavailability, trigger up to a maximum number of retries (e.g., 30 via maxRetries), with a configurable wait before each reattempt (e.g., 900 seconds via retryDelaySeconds) to allow recovery rather than retrying immediately.[15] Persistent failures result in logging to specialized files, including uri-errors.log for URI-specific issues and runtime-errors.log for exceptions, enabling post-crawl analysis of anomalies like HTTP 4xx/5xx codes.[16] Problematic URIs are quarantined by assigning failure status codes (e.g., -8 for retry exhaustion) and removing them from active queues, preventing repeated attempts on irretrievable resources while allowing manual intervention if needed.[13]
Resumability ensures long-running crawls can withstand interruptions without data loss, supported by periodic checkpointing of the crawl state.[16] Checkpoints, saved to a designated directory at intervals (e.g., every 60 minutes via checkpointIntervalMinutes), capture the frontier's queue, processed URIs, and configuration, allowing jobs to pause via the web interface and resume from the latest checkpoint using command-line flags like --checkpoint latest.[16] In case of crashes, recovery files such as frontier.recover.gz placed in the job's action directory facilitate state restoration, with options for partial recovery (e.g., via frontier.include.gz for selective URI inclusion) to handle large-scale operations efficiently.[16] This mechanism, rooted in Berkeley DB journaling for the frontier, supports seamless continuation even after hours-long downtimes, maintaining crawl integrity across sessions.[13]
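As a rough illustration of how the politeness settings above interact, the following self-contained Java sketch computes the wait before the next request to a host as the previous fetch duration scaled by the delay factor and clamped between the minimum and maximum delays; this is a simplified model using the example values quoted above, not Heritrix's internal scheduler.
    // Back-of-the-envelope model of delay-factor politeness (simplified sketch).
    public class PolitenessDelaySketch {
        static long nextDelayMs(long lastFetchDurationMs, double delayFactor,
                                long minDelayMs, long maxDelayMs) {
            long scaled = (long) (lastFetchDurationMs * delayFactor);
            return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
        }

        public static void main(String[] args) {
            // Example values from the text: factor 5.0, min 3,000 ms, max 30,000 ms.
            System.out.println(nextDelayMs(200,   5.0, 3000, 30000)); // 3000  (minimum applies)
            System.out.println(nextDelayMs(1500,  5.0, 3000, 30000)); // 7500  (scaled value)
            System.out.println(nextDelayMs(20000, 5.0, 3000, 30000)); // 30000 (maximum applies)
        }
    }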
Output Formats
ARC Files
The ARC file format was created at the Internet Archive, specified by Mike Burner and Brewster Kahle in a document dated September 15, 1996, as a straightforward container for storing web crawl records in a single file to simplify management of archived digital resources.[21] This format emerged to handle the growing volume of web content captured during early archiving efforts, aggregating multiple resources like HTML pages, images, and other HTTP responses into sequential blocks without requiring separate files for each item.[22]
The structure of an ARC file begins with a version block identifying the file details and record fields, followed by one or more document records.[21] Each document record starts with a header line specifying the URI, IP address, archive date, content type (MIME), and length, succeeded by the raw HTTP response, including headers and payload.[22] By default, the format applies no compression, though implementations like Heritrix often gzip individual records or the entire file for efficiency, resulting in extensions like .arc.gz.[23] This concatenated design supports linear reading but lacks an internal index, relying on external tools for navigation.
In Heritrix, ARC was the default output format in versions before 3.x, enabling sequential storage of crawled data directly to disk without embedded indexing or metadata beyond basic headers.[23] It facilitated efficient bulk archiving by consolidating resources, with Heritrix typically limiting files to around 100 MB of compressed data for practical storage and processing.[21] Despite its simplicity and widespread early adoption, the format's constraints—such as rigid support primarily for HTTP data and inability to capture complex relationships or non-web content—prompted its deprecation in favor of the more extensible WARC standard.[22]
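The record layout described above can be illustrated with a small Java sketch that composes an ARC-style header line from a URL, IP address, capture timestamp, MIME type, and record length; it is a simplified illustration of the format, not a production ARC writer.
    // Illustrative only: composing the space-separated ARC record header line
    // described above (URL, IP, 14-digit capture timestamp, MIME type, length).
    import java.nio.charset.StandardCharsets;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;

    public class ArcHeaderSketch {
        static String arcRecordHeader(String url, String ip, ZonedDateTime captured,
                                      String mime, byte[] httpResponse) {
            String ts = captured.withZoneSameInstant(ZoneOffset.UTC)
                                .format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"));
            return String.join(" ", url, ip, ts, mime, String.valueOf(httpResponse.length));
        }

        public static void main(String[] args) {
            byte[] body = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
                    .getBytes(StandardCharsets.US_ASCII);
            System.out.println(arcRecordHeader("http://example.org/", "93.184.216.34",
                    ZonedDateTime.of(2004, 1, 5, 0, 0, 0, 0, ZoneOffset.UTC), "text/html", body));
            // prints: http://example.org/ 93.184.216.34 20040105000000 text/html 57
        }
    }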
WARC Files
The WARC (Web ARChive) file format serves as Heritrix's primary output format. First standardized as ISO 28500 in 2009 and revised as ISO 28500:2017, it addresses limitations of earlier archiving methods by providing a more flexible and extensible structure for preserving web content and related metadata.[24] The format extends beyond traditional web crawls to support broader digital preservation needs, enabling the storage of diverse data objects in a single, concatenated file.[24]
Each WARC file consists of a sequence of self-contained records. Every record begins with a WARC header section—formatted as HTTP-like key-value pairs including mandatory fields such as WARC-Record-ID (a unique URI for the record), WARC-Type (specifying types like "response" for HTTP replies, "metadata" for descriptive information, or "revisit" for unchanged content), Content-Length (indicating payload size), and WARC-Date (timestamp of creation)—followed by a blank line and the record's content block, with two newlines separating consecutive records.[24] The format defines eight record types to capture various aspects of a crawl, such as requests, responses, and conversion metadata, and supports compression via gzip (typically applied per record) to reduce storage while maintaining accessibility.[24]
In Heritrix, WARC has been the default output since version 3.0, replacing older formats and integrating with the crawler's architecture to allow concurrent writing of records from multiple threads, with fields like WARC-Concurrent-To linking related records created at the same time.[2][25] This implementation enriches the captured metadata, including software details (e.g., "heritrix/3.x") and robots policy adherence, which supports advanced replay, analysis, and quality assurance in archival systems.[24]
Key advantages of WARC in Heritrix include its ability to handle non-web data objects through extensible record types, deduplication via unique record IDs and revisit records that avoid redundant storage of identical payloads, and support for partial file recovery, since individual records are independently parseable even if the overall file is damaged.[24] These features make WARC particularly suited for large-scale, long-term web archiving, offering greater robustness than its predecessor ARC.[24]
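The following self-contained Java sketch assembles a single WARC-style "response" record using the header fields named above; it is a hand-rolled illustration of the record layout (version line, headers, blank line, payload, two-newline separator) rather than a validated writer, and real workflows should use a dedicated WARC library.
    // Minimal illustration of one WARC "response" record; not a production writer.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    public class WarcRecordSketch {
        static byte[] responseRecord(String targetUri, String warcDate, byte[] httpResponse)
                throws IOException {
            String headers = "WARC/1.0\r\n"
                    + "WARC-Type: response\r\n"
                    + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
                    + "WARC-Date: " + warcDate + "\r\n"
                    + "WARC-Target-URI: " + targetUri + "\r\n"
                    + "Content-Type: application/http; msgtype=response\r\n"
                    + "Content-Length: " + httpResponse.length + "\r\n"
                    + "\r\n";                                              // blank line before payload
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(headers.getBytes(StandardCharsets.US_ASCII));
            out.write(httpResponse);
            out.write("\r\n\r\n".getBytes(StandardCharsets.US_ASCII));     // record separator
            return out.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            byte[] http = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII);
            byte[] record = responseRecord("http://example.org/", "2004-01-05T00:00:00Z", http);
            System.out.write(record);
            System.out.flush();
        }
    }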
Tools and Usage
Command-Line Interfaces
Heritrix provides a primary command-line interface through the heritrix executable, located in the $HERITRIX_HOME/bin directory, which serves as the main tool for launching the crawler engine, managing crawl jobs, and performing basic operations. This executable allows users to start the Heritrix instance with options for authentication, port binding, and job directories, enabling terminal-based control suitable for automated environments. For instance, to launch Heritrix with the web UI enabled on the default port 8443 and credentials admin:admin, the command is $HERITRIX_HOME/bin/heritrix -a admin:admin. Additional options include -j /path/to/jobs to specify the jobs directory (default: $HERITRIX_HOME/jobs), -p 8443 to set the web UI port, and -r jobname to automatically run a specified job upon launch and exit on completion.[16]
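For headless automation, the same launch can be driven from another program; the hedged Java sketch below uses ProcessBuilder with the options documented above, and the installation path and credentials are placeholders.
    // Sketch of launching the heritrix executable from Java; paths and credentials
    // are illustrative placeholders, not defaults.
    import java.io.File;
    import java.io.IOException;

    public class LaunchHeritrixSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            String heritrixHome = "/opt/heritrix";      // placeholder for $HERITRIX_HOME
            ProcessBuilder pb = new ProcessBuilder(
                    heritrixHome + "/bin/heritrix",
                    "-a", "admin:admin",                // web UI credentials
                    "-j", heritrixHome + "/jobs",       // jobs directory
                    "-p", "8443");                      // web UI port
            pb.directory(new File(heritrixHome));
            pb.inheritIO();                             // show launcher output in this terminal
            Process p = pb.start();
            System.out.println("Heritrix launcher exited with code " + p.waitFor());
        }
    }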
Job creation and configuration in Heritrix are primarily handled through editable configuration files rather than direct CLI subcommands, allowing definition of seeds, crawl scopes, and settings via XML or properties formats for reproducibility and scripting. Jobs are organized in directories under the jobs path, where a new job can be initialized by copying a profile directory (such as the bundled default profile) and customizing the crawler-beans.cxml file using Spring bean overrides. For seeds, users edit the longerOverrides bean to specify URLs; a simple crawl might define a single seed like <prop key="seeds.textSource.value">http://example.com</prop>, limiting scope to that host with default settings for bandwidth and politeness delays. In contrast, complex crawls involve multiple seeds, such as <prop key="seeds.textSource.value">http://www.myhost1.net http://www.myhost2.net http://www.myhost3.net/pictures</prop>, along with scope rules in the scope bean to include or exclude patterns (e.g., via URI regex filters) and properties like http.maxBytesPerResponse set to 10485760 for file size limits. These configurations can be prepared in scripts for batch job setup, with the heritrix executable then launching the engine to load and execute them.[26][16]
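Scripted job preparation along these lines can be sketched in Java as well; the example below copies a profile's crawler-beans.cxml and substitutes the seed list in the seeds.textSource.value override, assuming (hypothetically) that the property appears exactly once and that the file paths shown exist.
    // Hedged sketch of batch job setup: patch the seed list in a copied crawler-beans.cxml.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class PrepareJobSketch {
        public static void main(String[] args) throws IOException {
            Path profile = Path.of("/opt/heritrix/jobs/profile/crawler-beans.cxml"); // placeholder
            Path jobDir  = Files.createDirectories(Path.of("/opt/heritrix/jobs/dailyjob"));
            String beans = Files.readString(profile);
            String seeds = "http://www.myhost1.net http://www.myhost2.net";
            // Replace the seed override value shown in the text (assumes one occurrence).
            beans = beans.replaceFirst(
                    "(<prop key=\"seeds\\.textSource\\.value\">)[^<]*(</prop>)",
                    "$1" + java.util.regex.Matcher.quoteReplacement(seeds) + "$2");
            Files.writeString(jobDir.resolve("crawler-beans.cxml"), beans);
            System.out.println("Wrote job configuration to " + jobDir);
        }
    }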
Monitoring and control of running jobs are facilitated through log files and the action directory mechanism, providing CLI-accessible ways to track progress and intervene without relying on the web UI. Runtime metrics, such as URIs processed, bytes downloaded, and bandwidth usage, are logged periodically to progress-statistics.log in the job directory, which can be tailed in a terminal (e.g., tail -f jobs/myjob/progress-statistics.log) for real-time observation; for example, entries might show "URIs: 15000, Bytes: 2.5GB, Avg. KB/s: 500" after an hour of crawling. For termination, users place an empty .abort file in the job's action subdirectory, which Heritrix polls every 30 seconds (configurable) to stop the crawl gracefully, moving the file to done upon processing. Other control files include .seeds for dynamically adding seed URLs (one per line) and .schedule for enqueuing specific URIs with directives like F+ http://example.com to force inclusion.[16]
The action directory supports scripting for batch operations, enabling automation of job management in terminal workflows. For example, a shell script can generate a .seeds file with multiple URLs from a list, copy it to the action directory to inject seeds mid-crawl, or use .schedule files in loops for targeted enqueuing during long-running jobs. This facility allows integration with system schedulers like cron for periodic crawls; a basic cron entry might execute ./start-crawl.sh at midnight, where the script launches heritrix -r dailyjob after preparing configurations, ensuring unattended operation for recurring archival tasks. While the web UI offers a graphical alternative for interactive monitoring, the CLI tools emphasize scriptable, headless control for production environments.[16]
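A minimal Java sketch of the action-directory pattern is shown below: it stages a .seeds file under a temporary name and then renames it into the job's action/ subdirectory so the crawler never sees a half-written file; the job path and seed URLs are placeholders.
    // Sketch of injecting additional seeds into a running job via the action directory.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.List;

    public class InjectSeedsSketch {
        public static void main(String[] args) throws IOException {
            Path actionDir = Path.of("/opt/heritrix/jobs/myjob/action");   // placeholder job path
            List<String> newSeeds = List.of(
                    "http://www.myhost1.net/news/",
                    "http://www.myhost2.net/reports/");
            Path tmp = actionDir.resolve("extra-seeds.tmp");
            Files.write(tmp, newSeeds);                                    // stage the file first
            Files.move(tmp, actionDir.resolve("extra.seeds"),              // then publish it
                    StandardCopyOption.ATOMIC_MOVE);
            System.out.println("Queued " + newSeeds.size() + " seeds for pickup.");
        }
    }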
Web-Based Interface
Heritrix provides a web-based user interface (WUI) accessible via HTTPS on port 8443 by default, allowing users to interactively configure, launch, and monitor crawl jobs from a browser.[7] The interface binds to localhost unless otherwise specified with the -b option and uses digest authentication for security, with default credentials of username "admin" and password "admin" that can be customized via command-line flags or a credentials file to support multiple users under a single administrative role.[16] Upon login, the dashboard offers an overview of active and pending jobs, including status indicators (e.g., "Running" or "Holding"), real-time statistics such as bytes downloaded and URI counts, and access to URI queues managed by the frontier component.[7][13]
Configuration within the WUI occurs through visual panels that enable editing of Spring beans defining crawl parameters, including scopes for URI inclusion/exclusion rules, politeness policies to control request rates per host, and processor chains for handling fetched content.[15] Users can add or modify seeds, set metadata like operator contact details, and preview changes in real time before building and launching a job, with recommendations to pause ongoing crawls for non-atomic updates to avoid inconsistencies.[7] A scripting console complements these panels, allowing programmatic adjustments to running jobs through short scripts (for example in Groovy or ECMAScript) for advanced customization.[15]
Monitoring features in the WUI include dynamic displays of crawl progress, such as rates of URI discovery and download volumes, error counts by type (e.g., connection failures), and resource usage metrics like memory and thread activity, often presented in tabular or graphical formats updated on page refresh.[13] An integrated URI inspector allows examination of queued or processed URIs, including their status, disposition, and referral paths, while checkpoint management options facilitate saving crawl states for recovery or resumption.[13] Administrative controls support oversight of multiple concurrent users through shared access under the admin role, with audit trails captured in job-specific logs such as crawl.log for actions, progress-statistics.log for metrics, and alerts.log for errors, enabling collaborative archiving efforts with traceable activity.[16] The interface times out after inactivity for security and relies on a self-signed certificate, requiring browser acceptance for initial access.[7] This visual management approach complements command-line operations for users preferring graphical interaction during web archiving tasks.[7]
Output Processing Tools
Heritrix generates output primarily in WARC (Web ARChive) format, with legacy support for ARC files. As of version 3.12.0 (October 2025), post-crawl processing relies on compatible external tools rather than extensively bundled utilities, focusing on integration with the broader web archiving ecosystem.[8][2]
For WARC files, the warctools suite—developed by the Internet Archive—provides essential command-line utilities for inspection and manipulation. Key components include warcdump, which produces human-readable summaries of records (headers and payloads) for debugging; the suite also supports creating metadata records through its Python APIs. These tools autodetect WARC or ARC input, enabling versatile workflows for extracting and validating archived content.[27] For example, warcdump input.warc.gz generates text output suitable for scripting, and combining it with tools like grep allows filtering of HTTP responses.
Legacy tools from earlier versions (pre-3.x), such as ARCReader for ARC file metadata extraction in pseudo-CDX format and scripts like htmlextractor for link verification or hoppath.pl for path analysis, may be available in older distributions but are not emphasized in current documentation. For modern use, users are recommended to employ warctools or ecosystem tools like CDX indexers from the Wayback Machine for scalable processing, such as record integrity checks or indexing.[27] Advanced analytics, including full-text search or deduplication, require external software.[27]
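As a rough, library-free illustration of such inspection workflows, the Java sketch below streams a gzip-compressed WARC file and prints the WARC-Type and WARC-Target-URI header of each record; the input file name is a placeholder, and a real pipeline would normally use warctools or a dedicated WARC parsing library instead.
    // Minimal WARC header scan: lists record type and target URI per record.
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class WarcHeaderScanSketch {
        public static void main(String[] args) throws IOException {
            String warcFile = args.length > 0 ? args[0] : "input.warc.gz";  // placeholder name
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(warcFile)),
                    StandardCharsets.ISO_8859_1))) {          // lenient decoding of binary payloads
                String line;
                String type = null;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("WARC-Type:")) {
                        type = line.substring("WARC-Type:".length()).trim();
                    } else if (line.startsWith("WARC-Target-URI:")) {
                        System.out.println(type + "  "
                                + line.substring("WARC-Target-URI:".length()).trim());
                    }
                }
            }
        }
    }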