BagIt
BagIt is a set of hierarchical file layout conventions designed to support the storage and transfer of arbitrary digital content, consisting of a "bag" that packages payload files alongside descriptive metadata tags without requiring knowledge of the content's semantics.[1] A bag is structured as a base directory containing a required "data" subdirectory for the payload (treated as opaque octet streams) and tag files that include a "bagit.txt" file declaring the BagIt version, one or more manifest files listing payload file paths with cryptographic checksums (such as MD5 or SHA-256) for integrity verification, and optionally a "bag-info.txt" file with basic metadata like contact information or creation date.[1] This format enables reliable validation of the bag's completeness and integrity through checksum comparisons, supports direct access to individual files without unpacking the entire structure, and imposes no limits on file sizes or counts, making it suitable for large-scale digital preservation and data exchange.[1]
Originally developed by the Library of Congress in collaboration with the California Digital Library around 2007 to facilitate the transfer of born-digital materials, BagIt evolved through community contributions and was formalized as version 1.0 in RFC 8493 by the Internet Engineering Task Force (IETF) in October 2018.[2] It has been widely adopted by cultural heritage institutions, research repositories, and data archives for packaging diverse content types, including digitized collections, research datasets, and software distributions, often integrated with tools like the open-source Bagger application for creation and validation.[2] Extensions such as BagIt Profiles allow for customizable requirements on metadata and structure to meet specific institutional or community needs, enhancing interoperability while maintaining the core specification's simplicity and extensibility.[3]
Introduction
Definition and Purpose
BagIt is a set of hierarchical file layout conventions designed to support disk-based storage and network transfer of arbitrary digital content.[1] It provides a simple, standardized way to package digital files without imposing restrictions on the types or formats of the content included, making it applicable to a wide range of data such as documents, images, audio, video, and software artifacts.[1]
The core purpose of BagIt is to package payload files (the primary digital content) together with integrity checks that detect data loss or corruption introduced during transfer or long-term archiving.[1] A "bag" serves as a self-describing container that bundles both the payload and descriptive metadata, known as tags, allowing recipients to verify the completeness and accuracy of the package without requiring specialized knowledge of the payload's semantics or structure.[1] This approach ensures that the bag can be handled by generic tools, promoting interoperability across systems and institutions.
In the context of digital preservation, BagIt addresses key challenges by focusing on bit-level integrity, where the exact reproduction of digital objects is paramount to maintaining their authenticity over time.[1] By incorporating mechanisms for checksum-based validation, it allows for straightforward detection of alterations or errors, reducing the risk of silent data corruption in archival environments without necessitating complex proprietary software.[1]
Key Components
A BagIt package is structured around four primary components that facilitate the secure packaging, transfer, and verification of digital content: the payload, tag files, manifest files, and the optional fetch file. These elements form a self-describing hierarchy that prioritizes data integrity without imposing restrictions on the content type or format.[1]
The payload represents the core content of the bag, consisting of all files and subdirectories within the data/ directory at the root of the package. This component holds the arbitrary digital objects (such as documents, images, videos, or datasets) that are the primary focus for preservation, archiving, or transmission. By isolating the payload in this dedicated directory, BagIt ensures that the actual data remains distinct from descriptive or verification metadata, simplifying management and validation processes.[1]
Tag files serve as metadata descriptors located directly in the bag's root directory, providing contextual information about the package as a whole. The mandatory bagit.txt file declares the BagIt version (e.g., 1.0) and the character encoding (typically UTF-8) used by the bag's other tag files, establishing the foundational rules for interpretation; payload files are not covered by this declaration, since they are treated as opaque octet streams. An optional bag-info.txt file supplements this with human-readable details, such as the creation date, contact information, or payload size, enhancing usability without affecting the bag's validity. These tags enable quick assessment of the package's provenance and properties.[1]
Manifest files are checksum-based inventories essential for integrity assurance, also residing at the root level and named according to the hashing algorithm used, such as manifest-sha256.txt. Each manifest lists every file in the payload by its path relative to the bag's base directory (so each entry begins with data/), paired with its computed checksum, allowing recipients to verify that no files have been altered, lost, or corrupted during transfer or storage. Optional tag manifests (e.g., tagmanifest-sha256.txt) extend this verification to the tag files themselves. The selection of checksum algorithms, like SHA-256 for robust collision resistance, supports reliable detection of modifications.[1]
The optional fetch.txt file addresses scenarios with large or distributed datasets by specifying how to acquire payload files not included locally in the data/ directory. It contains entries detailing remote URLs, expected file sizes, and target paths within the payload, enabling the construction of "partial bags" that defer downloading until needed, which is particularly useful for bandwidth-constrained environments or collaborative data sharing. When present, this file ensures the bag remains complete upon fetching, maintaining overall integrity through the associated manifests.[1]
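As a rough illustration of how these components fit together, the following sketch writes a minimal bag using only the Python standard library. The function name make_minimal_bag and the sample payload are hypothetical; the sketch assumes payload paths contain no characters that need percent-encoding, and it writes neither bag-info.txt nor fetch.txt.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(base: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest."""
    data_dir = base / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest paths are relative to the bag's base directory and use "/".
        manifest_lines.append(f"{digest}  data/{rel_path}\n")
    (base / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (base / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = make_minimal_bag(
            Path(tmp) / "example-bag", {"docs/readme.txt": b"hello\n"}
        )
        # List everything the bag contains, relative to its base directory.
        for path in sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*")):
            print(path)
```

The payload lives entirely under data/, while bagit.txt and the manifest sit beside it in the base directory, which is the separation the components above describe.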
Specification
Bag Structure
A BagIt bag consists of a hierarchical directory structure designed for the storage and transfer of digital content. The root directory, also referred to as the base directory, must contain the mandatory "data" subdirectory along with tag files, such as the bagit.txt file that declares the BagIt version (detailed further in the Metadata and Manifest Files section). Optional tag directories may also reside at the root level to organize additional tag files; payload content, however, belongs only under "data".
The "data" subdirectory exclusively houses the payload files, which represent the primary digital content being packaged. These files retain their original relative paths from the source collection and may be organized with arbitrary nesting of subdirectories within "data/" to accommodate complex hierarchies, such as those found in software distributions or multimedia archives. Payload files are treated as opaque octet streams, with no inherent restrictions on their types, sizes, or internal structures beyond the overall bag constraints.
All file paths referenced within the bag utilize forward slashes (/) as the directory separator to ensure cross-platform compatibility, irrespective of the underlying operating system. The specification assumes the use of regular files and directories to maintain a portable and verifiable structure. The bagit.txt file itself is encoded in UTF-8 without a byte order mark (BOM), and UTF-8 is the recommended encoding for the remaining tag files, which avoids the compatibility issues that other encodings could introduce.
While BagIt is fundamentally a filesystem-based format, it supports serialization into monolithic archives, such as ZIP files, for efficient network transfer or storage. In a serialized bag, the archive must preserve the exact directory layout upon extraction, including the root-level tags and "data" subdirectory, to ensure compliance with the core structure.[1]
Metadata and Manifest Files
The BagIt specification requires a file named bagit.txt at the root of every bag to declare the BagIt version and the character encoding used for all tag files. This file must consist of exactly two lines: the first specifying the version in the format BagIt-Version: <major>.<minor>, such as BagIt-Version: 1.0, and the second indicating the encoding in the format Tag-File-Character-Encoding: <encoding-name>, with UTF-8 strongly recommended and no byte order mark (BOM) permitted.[1]
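The two-line format is simple enough to check with a pair of regular expressions. The sketch below is one possible reading of the rules just described, not a reference parser; parse_bagit_txt is a hypothetical helper name, and it assumes exactly one space after each colon.

```python
import re

# Assumed strict forms of the two declaration lines.
_VERSION_RE = re.compile(r"^BagIt-Version: (\d+)\.(\d+)$")
_ENCODING_RE = re.compile(r"^Tag-File-Character-Encoding: (\S+)$")


def parse_bagit_txt(raw: bytes) -> tuple[tuple[int, int], str]:
    """Return ((major, minor), encoding) from the bytes of a bagit.txt file."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raise ValueError("bagit.txt must not start with a byte order mark")
    lines = raw.decode("utf-8").splitlines()
    if len(lines) != 2:
        raise ValueError(f"expected exactly 2 lines, found {len(lines)}")
    version = _VERSION_RE.match(lines[0])
    encoding = _ENCODING_RE.match(lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bagit.txt declaration")
    return (int(version.group(1)), int(version.group(2))), encoding.group(1)


if __name__ == "__main__":
    sample = b"BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    print(parse_bagit_txt(sample))  # ((1, 0), 'UTF-8')
```

The returned encoding name is what a reader would then use to decode the bag's other tag files.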
An optional bag-info.txt file provides human-readable metadata about the bag in a simple key-value format, where each line consists of a label followed by a colon, optional whitespace, and a value. Common fields include Source-Organization (the entity creating the bag), Bagging-Date (in YYYY-MM-DD format), and Payload-Oxum (an octet-stream sum in the format <total-octets>.<file-count>, representing the total size and number of payload files, e.g., 279164409832.1198). This file preserves the order of entries and uses the encoding declared in bagit.txt.[1]
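The Payload-Oxum value can be computed by walking the payload directory, summing file sizes, and counting files. The sketch below assumes the bag is already laid out on disk; payload_oxum is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def payload_oxum(bag_dir: Path) -> str:
    """Return '<total-octets>.<file-count>' for the files under bag_dir/data."""
    files = [p for p in (bag_dir / "data").rglob("*") if p.is_file()]
    total_octets = sum(p.stat().st_size for p in files)
    return f"{total_octets}.{len(files)}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data" / "sub").mkdir(parents=True)
        (bag / "data" / "a.txt").write_bytes(b"hello")          # 5 octets
        (bag / "data" / "sub" / "b.txt").write_bytes(b"goodbye")  # 7 octets
        print(payload_oxum(bag))  # 12.2
```

Because it needs only file sizes, comparing this value against the one recorded in bag-info.txt is a cheap sanity check that can run before any checksums are computed.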
Manifest files are required and provide integrity information for the payload by listing each payload file's relative path alongside its checksum, ensuring the content remains unaltered during transfer or storage. At least one such file must exist, named manifest-<algorithm>.txt (e.g., manifest-md5.txt), with each line formatted as <checksum> <relative-path>, where the checksum is a hexadecimal digest, conventionally written in lowercase, separated from the path by whitespace (e.g., d41d8cd98f00b204e9800998ecf8427e data/file.txt). Supported algorithms are MD5, SHA-1, SHA-256, and SHA-512; tools that create or validate bags are required to support SHA-256 and SHA-512. The file must include exactly one entry per payload file. Multiple manifest files using different algorithms may coexist. Filepaths in manifests must percent-encode Line Feed (LF), Carriage Return (CR), Carriage Return Line Feed (CRLF), or percent sign (%) characters following RFC 3986.[1]
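The line format and the percent-encoding rule can be sketched as a pair of encode and decode helpers. The names below are hypothetical, and the code assumes SHA-256 and that only %, CR, and LF are escaped, as described above.

```python
import hashlib
import re

# Only these three characters are escaped in manifest file paths.
_ESCAPES = {"%": "%25", "\r": "%0D", "\n": "%0A"}


def encode_path(path: str) -> str:
    """Percent-encode %, CR and LF in a single pass over the path."""
    return "".join(_ESCAPES.get(ch, ch) for ch in path)


def decode_path(path: str) -> str:
    """Reverse encode_path, leaving every other %XX sequence untouched."""
    return re.sub(
        r"%(25|0[AD])",
        lambda m: chr(int(m.group(1), 16)),
        path,
        flags=re.IGNORECASE,
    )


def manifest_line(path: str, content: bytes) -> str:
    """Build one '<checksum>  <path>' entry using SHA-256."""
    return f"{hashlib.sha256(content).hexdigest()}  {encode_path(path)}"


def parse_manifest_line(line: str) -> tuple[str, str]:
    """Split an entry into (lowercased checksum, decoded path)."""
    checksum, path = line.rstrip("\r\n").split(None, 1)
    return checksum.lower(), decode_path(path)


if __name__ == "__main__":
    line = manifest_line("data/50% off.txt", b"")
    print(line)
    print(parse_manifest_line(line))
```

Encoding and decoding each run in one pass so that an escaped percent sign is never interpreted a second time.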
Optional tagmanifest files, named tagmanifest-<algorithm>.txt, extend integrity checks to the tag files themselves (excluding payload files) using the same format and supported algorithms as payload manifests. For example, a line might read 3b5f06b0b7d3f5a3e0a4d5f6e7b8c9d0 bag-info.txt, verifying files like bagit.txt and the payload manifests. Because a tag manifest records the checksums of the other tag files, it is generated after them, and it does not list itself or other tag manifests.[1]
An optional fetch.txt file allows for incomplete bags by specifying files to be fetched from remote URLs. Each line follows the format url length filepath, where url is the location to fetch from, length is the expected octet count (or "-" if unspecified), and filepath is the destination path relative to the bag's base directory, which therefore begins with data/. This file uses the encoding declared in bagit.txt and enables bags to reference large or external payloads without including them initially.[1]
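A fetch.txt entry can be split into its three fields as sketched below. FetchEntry and parse_fetch_line are hypothetical names, the example URL is a placeholder, and the sketch assumes the URL contains no literal whitespace and that the destination path begins with data/; it does not percent-decode the path or download anything.

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the length field is "-"
    filepath: str


def parse_fetch_line(line: str) -> FetchEntry:
    """Parse one 'url length filepath' line from fetch.txt."""
    # Split on the first two whitespace runs; the path may contain spaces.
    url, length, filepath = line.rstrip("\r\n").split(None, 2)
    if not filepath.startswith("data/"):
        raise ValueError(f"fetch target is outside the payload: {filepath!r}")
    return FetchEntry(url, None if length == "-" else int(length), filepath)


if __name__ == "__main__":
    entry = parse_fetch_line(
        "https://example.org/big.tif 1048576 data/images/big.tif\n"
    )
    print(entry.url, entry.length, entry.filepath)
```

A real fetcher would go on to download each URL, compare the received size against length when it is given, and then rely on the payload manifests for the checksum comparison.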
These metadata and manifest files collectively enable basic validation of bag integrity, as detailed in the validation process.[1]
Validation Process
The validation process for a BagIt package consists of two main stages: confirming completeness and verifying validity, ensuring the package's structure and content integrity.[1]
Completeness validation begins by checking the presence of essential components in the root directory: the bagit.txt file declaring the BagIt version, at least one payload manifest file (e.g., manifest-sha256.txt), and the data/ subdirectory containing the payload files. Beyond these required elements, the root directory may hold optional tag manifests, bag-info.txt, fetch.txt, and other tag files, but payload files belong only under data/. Furthermore, every file referenced in the payload manifests must physically exist within the data/ directory, and under version 1.0 every payload file must appear in every payload manifest (earlier versions required only that it appear in at least one). If a fetch.txt file is present, the bag remains incomplete until the listed remote files are retrieved and added to the data/ directory.[1]
Validity validation proceeds only after completeness is established and involves recalculating checksums for all files listed in the payload manifests using the specified algorithm (e.g., SHA-256) and comparing them against the values in the manifests. If tag manifests are included, their checksums must similarly be verified against the corresponding tag files. This step confirms that no alterations or corruptions have occurred in the payload or metadata. For bags with fetch.txt, optional remote validation requires downloading files from the provided URLs, checking their byte lengths if specified, integrating them into the data/ directory, and then performing the checksum comparison as part of the payload manifests.[1]
Bags failing completeness checks are classified as incomplete, often due to absent required elements, missing files, or unresolved fetches, while complete bags with checksum discrepancies are deemed invalid. Validation tools are expected to provide detailed reporting of specific failures, such as which files or checksums failed, to facilitate remediation.[1]
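The two stages can be sketched as a single function that returns completeness problems and checksum problems separately. This is a simplified illustration rather than a conformant validator: check_bag and read_manifest are hypothetical names, the code applies the version 1.0 rule that every payload file appears in every payload manifest, and it ignores tag manifests, fetch.txt, and percent-encoded paths.

```python
import hashlib
import tempfile
from pathlib import Path


def read_manifest(path: Path) -> dict[str, str]:
    """Map each listed file path to its lowercased checksum."""
    entries = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, filepath = line.split(None, 1)
            entries[filepath] = checksum.lower()
    return entries


def check_bag(bag: Path) -> tuple[list[str], list[str]]:
    """Return (completeness_errors, checksum_errors) for an unpacked bag."""
    incomplete = []
    if not (bag / "bagit.txt").is_file():
        incomplete.append("missing bagit.txt")
    data = bag / "data"
    if not data.is_dir():
        incomplete.append("missing data/ directory")
    manifests = {m: read_manifest(m) for m in sorted(bag.glob("manifest-*.txt"))}
    if not manifests:
        incomplete.append("no payload manifest")
    on_disk = set()
    if data.is_dir():
        on_disk = {p.relative_to(bag).as_posix() for p in data.rglob("*") if p.is_file()}
    for manifest, entries in manifests.items():
        listed = set(entries)
        incomplete += [f"{manifest.name}: missing {f}" for f in sorted(listed - on_disk)]
        incomplete += [f"{manifest.name}: unlisted {f}" for f in sorted(on_disk - listed)]
    if incomplete:
        # Checksums are only compared once the bag is known to be complete.
        return incomplete, []
    invalid = []
    for manifest, entries in manifests.items():
        algorithm = manifest.stem.split("-", 1)[1]  # "manifest-sha256" -> "sha256"
        for filepath, expected in sorted(entries.items()):
            # Reads each file fully into memory; stream in chunks for large payloads.
            actual = hashlib.new(algorithm, (bag / filepath).read_bytes()).hexdigest()
            if actual != expected:
                invalid.append(f"{manifest.name}: checksum mismatch for {filepath}")
    return incomplete, invalid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "bag"
        (bag / "data").mkdir(parents=True)
        (bag / "data" / "a.txt").write_bytes(b"hello")
        (bag / "bagit.txt").write_text(
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
        )
        digest = hashlib.sha256(b"hello").hexdigest()
        (bag / "manifest-sha256.txt").write_text(f"{digest}  data/a.txt\n", encoding="utf-8")
        print(check_bag(bag))  # ([], [])
```

A non-empty first list corresponds to an incomplete bag, and an empty first list with a non-empty second list corresponds to a complete but invalid one.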
Best practices recommend conducting validation immediately following any transfer of the bag to identify transmission errors or corruption early in the workflow. For serialized bags (those packed into a single archive file such as ZIP or TAR), the package must first be deserialized into its full directory structure before applying the completeness and validity checks.[1][4]
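Deserialization can be sketched for the ZIP case with the standard zipfile module. The helper name deserialize_bag is hypothetical, and the sketch assumes that the destination directory starts out empty and that the archive unpacks to a single base directory containing bagit.txt.

```python
import tempfile
import zipfile
from pathlib import Path


def deserialize_bag(archive: Path, dest: Path) -> Path:
    """Extract a ZIP-serialized bag into dest and return its base directory."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    top_level = list(dest.iterdir())
    if len(top_level) != 1 or not top_level[0].is_dir():
        raise ValueError("archive should unpack to exactly one base directory")
    base = top_level[0]
    if not (base / "bagit.txt").is_file():
        raise ValueError("unpacked directory has no bagit.txt")
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "mybag.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr(
                "mybag/bagit.txt",
                "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
            )
            zf.writestr("mybag/data/a.txt", "hello")
        base = deserialize_bag(archive, Path(tmp) / "unpacked")
        print(base.name)  # mybag
```

The returned base directory is what the completeness and validity checks would then be run against.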
History
Development Origins
BagIt originated in 2007 at the Library of Congress (LOC), where it was developed to facilitate the reliable transfer of digital collections between diverse systems and organizations.[5] This effort was driven by the need to handle growing volumes of digital content, from gigabytes to petabytes, in a way that ensured integrity during handoffs without requiring complex software installations.[2] The specification was co-created with the California Digital Library (CDL), reflecting collaborative needs for standardized packaging in digital preservation workflows.[6]
The core philosophy behind BagIt draws from the simple "bag it and tag it" approach, where digital content is bundled into a "bag" for transport and accompanied by "tags" (machine-readable metadata) for description, verification, and automated processing.[6] This minimalist design emphasized ease of use across institutions, allowing content creators and recipients to validate packages using standard file system tools and checksums.[7]
Key early motivations stemmed from LOC's digitization initiatives, particularly the National Digital Newspaper Program (NDNP), which required error-free delivery of large newspaper collections from external partners.[2] Other projects at LOC faced similar challenges in managing transfers via network or physical media, often involving terabytes of data from web archiving and cultural heritage sources.[5] These drivers highlighted the limitations of unstructured methods, pushing for a convention that separated payload files from metadata while enabling quick integrity checks.[7]
By 2008, an initial informal specification had emerged, formalizing the structure from ad-hoc scripts used in NDNP transfers and evolving into a de facto standard for content packaging.[5] This early version outlined basic directory layouts, manifest files, and validation steps, laying the groundwork for broader adoption in preservation practices.[6]
Version History
The development of the BagIt specification began with a series of informal drafts in 2008, emerging from collaborations between the Library of Congress and the California Digital Library to facilitate reliable digital content transfer.[8] The initial draft-00 was released on March 24, 2008, introducing core concepts such as hierarchical file packaging with manifests and checksums for integrity verification.[8] Subsequent drafts refined these elements; for instance, draft-01 on May 30, 2008, simplified tag manifests, while draft-02 on July 11, 2008, standardized path separators and introduced the Payload-Oxum metadata field to describe payload content volume.[8] Further iterations, including draft-05 in April 2011, added support for tag directories and clarified validity rules, culminating in version 0.97 released on April 2, 2012, which introduced optional multiple manifests for different checksum algorithms and optional UTF-8 encoding declaration in tag files.[8]
Version 1.0 of the BagIt specification was formalized as Internet Engineering Task Force (IETF) Request for Comments (RFC) 8493 in October 2018, marking its transition from draft status to a stable standard.[1] This release tightened the syntax requirements around character encoding: bagit.txt itself must be UTF-8 without a byte order mark, and UTF-8 is the recommended encoding for the remaining tag files, to ensure consistent character handling across implementations.[1] It also clarified serialization rules for bags, required that all payload files be listed in every manifest to prevent partial listings, and recommended SHA-512 as the default checksum algorithm while retaining support for MD5 and SHA-1 for legacy compatibility.[1] These changes aimed to enhance interoperability and robustness without breaking existing bags.
Separately from the core specification, the BagIt Profiles specification was introduced in 2015 (version 1.0), before RFC 8493 was published, as an extension mechanism to define custom rules for bags, such as required metadata fields or file organization, without modifying the core BagIt standard.[3] This allowed communities to enforce domain-specific constraints while maintaining BagIt conformance. The profiles evolved through subsequent releases, with version 1.2.0 introducing and requiring the BagIt-Profile-Version field, and the latest version 1.4.0 released in November 2023 incorporating refinements for better validation and serialization of profile documents.[3]
The specification emphasizes backward compatibility, requiring implementations to support both version 0.97 and 1.0 bags; for example, version 1.0 parsers must tolerate multiple linear whitespace characters around colons in bag-info.txt headers, a leniency present in earlier versions.[1] Upgrades from pre-1.0 bags can occur in place by adding new manifests with stronger checksums, ensuring minimal disruption to existing workflows.[1]
Implementations and Tools
Software Libraries
The BagIt Java library, known as bagit-java, is the original implementation developed by the Library of Congress to support the creation, manipulation, and validation of BagIt packages.[9] Originating from the library's early adoption of the BagIt specification around 2007, it has been maintained and updated to align with evolving standards, with version 5.x representing a complete rewrite using modern Java practices for improved internationalization and linting capabilities (last updated June 2018).[10][9] It supports BagIt versions from 0.93 to 0.97, providing APIs for generating manifests, tag files, and checksums, as well as validating package integrity without built-in serialization to archive formats.[9] This library forms the basis for extensions like the BagIt Library (BIL), which integrates into broader workflows for automated bag handling.[11]
The bagit-python library, also developed by the Library of Congress, offers a Python module and command-line interface for handling BagIt operations, available via PyPI for easy integration into scripts and applications (latest version 1.9.0 as of October 2024).[12] It enables bag creation with custom metadata, parallel checksum computation for efficiency, and comprehensive validation against the IETF BagIt specification, making it suitable for both programmatic and utility-based use.[13] Key APIs include make_bag() for packaging directories with tags like contact information, and is_valid() for integrity checks, ensuring cross-platform compatibility on systems with Python installed.[13] While its exact inception ties to the Library's post-2010 digital preservation efforts, it remains actively maintained to support serialization and fixity generation in preservation pipelines.[14]
Other implementations extend BagIt support to additional languages, including bagit.rb for Ruby, which provides library and command-line tools for bag creation, manifest generation, remote file fetching via fetch.txt, and validation per BagIt spec v0.97.[15] Similarly, bagit-js for Node.js facilitates creation, modification, and validation of BagIt containers by wrapping core functionality, though it relies on underlying Python dependencies for full operation, promoting use in JavaScript environments.[16] For scientific data applications, the research-object/bagit-ro extension defines a profile that serializes Research Objects as BagIt archives, embedding rich metadata like RO-Crate JSON files within the data directory while leveraging BagIt's checksums and structure for integrity and transfer.[17] These libraries emphasize standardized APIs for bagging files, generating manifests and tags, and validating packages, ensuring cross-platform reliability in diverse development workflows.[15][16][17]
Graphical and Command-Line Tools
Bagger is a Java-based graphical user interface (GUI) tool developed by the Library of Congress for creating and validating BagIt packages, enabling non-programmers to package digital content through an intuitive interface (last updated April 2018).[18] It supports drag-and-drop functionality for adding files to the bag's data directory, interactive entry of metadata such as bag-info.txt fields (e.g., contact information, creation date), and generation of required manifest files with checksums for integrity verification.[19] Users can validate bags by checking checksums and structure compliance, with visual feedback on errors, and export completed bags to compressed formats like ZIP or TAR for easy transfer.[20] As an open-source application licensed under the Apache License 2.0, Bagger is freely available and runs on multiple platforms including Windows, macOS, and Linux.[18]
For command-line operations, bagit.py serves as a Python-based utility from the Library of Congress, providing scripting capabilities for automated bagging and validation processes suitable for batch handling of multiple files or directories.[13] The tool installs via pip and offers commands like bagit.py /path/to/directory to convert a directory into a bag in place, including a SHA-256 checksum manifest, or bagit.py --validate /path/to/bag to verify completeness and integrity without a GUI.[12] It supports fetch.txt generation for remote file inclusion, profile adherence checking against BagIt extensions (e.g., BagIt Profiles), and detailed reporting of validation results to stdout or files, facilitating integration into workflows for large-scale digital preservation.[13] Released under the public domain, bagit.py is lightweight and cross-platform, relying on underlying Python libraries for core functionality.[12]
Archivematica, an open-source digital preservation platform, includes built-in BagIt support through integration with bagit-python, allowing users to ingest, process, and validate BagIt packages via its web-based interface without separate tool installation.[21] This extension handles unzipped or zipped bags during transfer, automatically verifying manifests, generating PREMIS metadata for preservation events, and ensuring compliance with BagIt specifications during archival information package (AIP) creation.[22] Features include reporting on validation outcomes and fetch handling for incomplete bags, with all components licensed under AGPLv3 for free community use.[21]
These tools collectively emphasize accessibility for end-users, with capabilities for profile checking (e.g., against BagIt Profiles for serializations like ZIP), fetch implementation to acquire external payloads, and comprehensive reporting on bag status, all while remaining free and open-source to promote widespread adoption in digital archiving.[13][19]