QPDF

QPDF is a free and open-source command-line program and C++ library designed for structural, content-preserving transformations on PDF files.^[1] It enables users to inspect and manipulate the internal structure of PDF documents, supporting operations such as linearization for web optimization, encryption and decryption with passwords, merging or splitting files, and handling object streams, all while preserving the visual appearance of the original content.^[2] Unlike tools focused on rendering or content extraction, QPDF emphasizes low-level PDF syntax handling without high-level abstractions for page content.^[1]

Key features of QPDF include the ability to copy objects between PDF files, create new PDFs from user-supplied content, and convert between compressed and uncompressed object streams.^[2] It is lightweight, with few external dependencies, making it suitable for developers integrating PDF manipulation into applications or scripts.^[1] The tool does not support viewing PDFs, extracting text, or converting to other formats, positioning it as a specialized utility for structural analysis and modification.^[3]

Created by Jay Berkenbilt in 2001 and developed primarily by him since, QPDF has evolved into a mature project hosted on GitHub, with releases distributed via SourceForge.^[3] Licensed under the Apache License, Version 2.0, it is included in most major Linux distributions and various software repositories, reflecting its reliability and adoption in open-source ecosystems.^[1] Comprehensive documentation, including manuals and examples, is available to guide both command-line usage and library integration.^[2]

Overview

Description

QPDF is a free, open-source command-line program and C++ library designed for performing structural, content-preserving transformations on PDF files.^[1]^[2] It enables low-level manipulation of PDF syntax, objects, and streams, such as normalizing structures or copying elements between files, while ensuring the visual appearance and content integrity remain unchanged.^[1]^[3] The tool fully supports PDF versions up to 1.7 and provides partial compatibility with PDF 2.0, particularly for features like encryption schemes defined in ISO 32000-2.^[4]^[5]

Unlike PDF viewers or editors, QPDF does not render documents, extract text, or convert files to other formats; instead, it concentrates on backend operations like object inspection, cross-reference table management, and syntax validation without interpreting higher-level content.^[1]^[2] This makes it suitable for tasks requiring precise control over PDF internals, such as optimization for web delivery or integration into automated workflows, where users supply the content directly.^[3]

QPDF is dual-licensed under the Apache License 2.0, which has been the primary license since version 7.0, and the Artistic License 2.0 for compatibility with earlier releases.^[4]^[3] It is multiplatform, supporting Linux, Windows, and macOS, and is available as native packages in most major Linux distributions and other software repositories.^[1]^[6]

Development

Jay Berkenbilt, a software engineer, is the lead developer and primary maintainer of QPDF. He created the project in 2001 as a personal tool for PDF manipulation needs while employed at Apex CoVantage and continued to modify it there until 2005.^[7] After leaving Apex CoVantage in 2005, Berkenbilt retained ownership of the software with the company's permission and subsequently released it as open source, moving it from private use to public availability. The project has been hosted on SourceForge since 2008, with an active development mirror on GitHub established in 2014 to facilitate version control and collaboration.^[1]^[3]

Maintenance relies on primary maintainers Jay Berkenbilt and Manfred Holger (since 2022), supplemented by community contributions submitted via GitHub issues and pull requests, following the guidelines in the project's contributing documentation. QPDF has been packaged in major Linux distributions, such as Debian and Ubuntu, since 2008, enabling widespread adoption without formal organizational backing.^[8] Each release includes a comprehensive manual for users and developers, and updates are announced through the qpdf-announce mailing list on SourceForge. The latest release as of November 2025 is version 12.3.0 (October 22, 2025).

Although it lacks a dedicated organization, QPDF is integrated into external tools, notably the R package qpdf, which uses its C++ library for PDF splitting, combining, and compression tasks.^[9]^[10]

Features

Command-line transformations

The qpdf command-line tool enables a range of structural transformations on PDF files while preserving their visual content and logical structure. These operations include reorganizing objects for optimized delivery, encrypting or decrypting files, modifying page layouts, and reducing file size through compression techniques. All transformations are applied in memory, ensuring that the output PDF remains semantically equivalent to the input unless explicitly altered.^[5]

Linearization optimizes PDFs for web viewing by rearranging objects to enable progressive rendering, where the first page loads quickly even on slow connections. The --linearize option performs this in two passes: the first identifies key offsets, and the second writes the file with a linearization dictionary and cross-reference streams adjusted for streaming. For example, the command qpdf input.pdf --linearize output.pdf generates a web-optimized version. This feature is particularly useful for large documents intended for online distribution, as it reduces initial load times without altering content.^[11]

Encryption and decryption provide security controls using owner and user passwords, with support for key lengths of 40, 128, or 256 bits. 256-bit encryption always uses AES; 40-bit encryption uses RC4, and 128-bit encryption uses RC4 unless AES is requested with --use-aes=y. Because RC4 is considered weak, writing such files requires the --allow-weak-crypto flag. The --encrypt option takes the form qpdf input.pdf --encrypt user-password owner-password key-length [restrictions] -- output.pdf, where the optional restrictions limit actions such as printing or editing; for instance, qpdf input.pdf --encrypt userpass ownerpass 256 -- output.pdf applies 256-bit AES encryption with distinct user and owner passwords. Decryption uses --decrypt, as in qpdf --password=userpass --decrypt input.pdf output.pdf, where --password supplies the password needed to open the file.
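
Because the order of the positional arguments to --encrypt is easy to get wrong, a wrapper program may prefer to assemble the argument list in one place. The following is a minimal sketch and not part of qpdf: build_encrypt_args is a hypothetical helper, and adding --allow-weak-crypto for every key length below 256 is a simplifying assumption (it is only strictly needed when RC4 is used).

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a qpdf API): builds the argument vector for
//   qpdf INPUT --encrypt USER-PASSWORD OWNER-PASSWORD KEY-LENGTH -- OUTPUT
// The positional order after --encrypt is user password, owner password,
// then key length; the "--" token ends the encryption options.
std::vector<std::string> build_encrypt_args(
    std::string const& input,
    std::string const& output,
    std::string const& user_password,
    std::string const& owner_password,
    int key_length)
{
    std::vector<std::string> args{"qpdf"};
    if (key_length < 256) {
        // Assumption: shorter keys imply RC4, which qpdf treats as weak.
        args.push_back("--allow-weak-crypto");
    }
    args.push_back(input);
    args.push_back("--encrypt");
    args.push_back(user_password);
    args.push_back(owner_password);
    args.push_back(std::to_string(key_length));
    args.push_back("--");
    args.push_back(output);
    return args;
}
```

The resulting vector can be handed to whatever process-spawning facility the application already uses; passing arguments as separate elements avoids shell quoting problems with passwords.
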
Passwords containing non-ASCII characters are transcoded as needed: 40-bit and 128-bit encryption expect PDFDocEncoding, while 256-bit encryption uses UTF-8.^[12]

Page operations allow basic manipulations like rotation, splitting, and merging without affecting the content streams. Rotation is achieved with --rotate=[+|-]angle[:page-range], where angles are 0, 90, 180, or 270 degrees; for example, qpdf input.pdf --rotate=+90:1-5 output.pdf rotates the first five pages 90 degrees clockwise. Splitting uses --split-pages[=n], which writes a separate file for every n pages (one page per file by default); for example, qpdf input.pdf --split-pages=2 output-%d.pdf replaces %d with the zero-padded page range of each group, producing names such as output-1-2.pdf and output-3-4.pdf for a four-page input. Merging and selective page extraction employ --pages, starting from an empty file if needed: qpdf --empty --pages file1.pdf 1-3 file2.pdf 7-5 -- output.pdf combines pages 1-3 from file1.pdf with pages 7, 6, and 5 from file2.pdf, since a range whose first number is larger than its second is taken in reverse order. Direct scaling or n-up printing (arranging multiple pages per sheet) is not natively supported in the CLI and requires external tools or library integration.^[13]^[14]

Compression and optimization focus on reducing file size by filtering streams and consolidating objects, primarily using the deflate (FlateDecode) algorithm. The --compress-streams=y flag ensures all eligible streams are compressed, as in qpdf input.pdf --compress-streams=y output.pdf. Further savings come from --object-streams=generate, which packs indirect objects into compressed object streams: qpdf input.pdf --object-streams=generate output.pdf. Unreferenced objects are dropped whenever qpdf rewrites a file, because only objects reachable from the trailer are written. The --check option, by contrast, validates the file's structure without producing an output file: qpdf --check input.pdf. Streams using other filters such as LZWDecode can be decoded with --decode-level=generalized before recompression, and --recompress-flate recompresses streams that are already Flate-compressed. Image optimization with --optimize-images converts non-JPEG images to DCT (JPEG) if smaller, using a default quality of 75, adjustable via --jpeg-quality.
These options support other standard PDF filters but prioritize content preservation over aggressive alteration.^[15]^[16]

PDF creation via the CLI generates minimal files from existing content or empty templates, but requires user-supplied graphics operators for custom streams, as qpdf does not generate visual content itself. Starting with --empty produces a PDF with no pages: qpdf --empty output.pdf. Pages can be added to it using --pages; for instance, qpdf --empty --pages input.pdf 1 -- output.pdf creates a single-page file from the first page of input.pdf. This approach is suited for assembling basic documents but relies on input files for actual content streams.^[17]

Despite these capabilities, qpdf's CLI transformations have limitations. Content streams are preserved faithfully, but objects that are not reachable from the document trailer are discarded on rewrite unless --preserve-unreferenced is given. The tool does not execute JavaScript or create digital signatures, and its handling of interactive forms is limited, because it focuses on structural changes rather than semantic interpretation.^[18]
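
The numeric page ranges accepted by --pages and --rotate can be illustrated with a short sketch. This is a simplified model written for this article, not qpdf's own parser: expand_page_range and resolve_endpoint are hypothetical names, and only comma-separated single pages, low-high ranges, reversed high-low ranges, and z for the last page are handled. Other parts of qpdf's range syntax are deliberately left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Resolve one endpoint of a range: a positive page number or "z" (last page).
static int resolve_endpoint(std::string const& token, int last_page)
{
    if (token == "z") {
        return last_page;
    }
    std::size_t used = 0;
    int value = std::stoi(token, &used);  // throws if token is not numeric
    if (used != token.size() || value < 1 || value > last_page) {
        throw std::invalid_argument("bad page number: " + token);
    }
    return value;
}

// Expand a specification such as "1-3,7-5" into an ordered list of page
// numbers. A range whose first endpoint is larger than its second is
// emitted in descending order.
std::vector<int> expand_page_range(std::string const& spec, int last_page)
{
    std::vector<int> pages;
    std::size_t start = 0;
    while (start <= spec.size()) {
        std::size_t comma = spec.find(',', start);
        if (comma == std::string::npos) {
            comma = spec.size();
        }
        std::string item = spec.substr(start, comma - start);
        std::size_t dash = item.find('-');
        int first = resolve_endpoint(item.substr(0, dash), last_page);
        int last = (dash == std::string::npos)
            ? first
            : resolve_endpoint(item.substr(dash + 1), last_page);
        int step = (first <= last) ? 1 : -1;
        for (int p = first; p != last + step; p += step) {
            pages.push_back(p);
        }
        start = comma + 1;
    }
    return pages;
}
```

For a ten-page document, expand_page_range("1-3,7-5", 10) yields pages 1, 2, 3, 7, 6, 5, which matches the ordering described for the merge example above.
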

Library capabilities

The libqpdf C++ library provides the core programmatic interface for QPDF, enabling developers to embed PDF manipulation capabilities directly into applications. The primary class, QPDF, handles high-level document operations, including loading and processing PDF files: processFile reads from a named file, while companion methods such as processMemoryFile and processInputSource accept in-memory buffers and custom input sources. Indirect objects are resolved automatically through an internal cache.^[19] Low-level access to PDF elements is provided by the QPDFObjectHandle class, which represents objects such as dictionaries, arrays, and streams and allows them to be inspected, modified, and created, for example by parsing PDF syntax with QPDFObjectHandle::parse or by registering a new indirect object with QPDF::makeIndirectObject.^[19] For stream handling, the Pipeline abstract base class and its subclasses (e.g., Pl_Buffer, Pl_StdioFile) enable parsing and writing of content streams, supporting filtered and raw data processing during read and write operations.^[19]

Key API operations include reading and writing PDFs through QPDF in conjunction with QPDFWriter, which outputs files in standard PDF format, in linearized form for web optimization, or in QDF mode, a normalized form intended for inspection and editing in a text editor. A separate JSON serialization of the document structure (qpdf JSON version 2) was introduced in version 11.0.0, with options for writing stream data to external files.^[20] Advanced features encompass decryption and re-encryption of protected files using user or owner passwords, tracking of object offsets and generations through QPDF's internal cross-reference table, and creation of new indirect objects with reserved numbers to avoid conflicts during merges.^[19] The library offers partial support for PDF 2.0 syntax, including recognition of explicitly UTF-8 encoded strings as specified in the standard.^[4]

Integration with libqpdf involves including headers from the include/qpdf directory and linking against the compiled library; since version 11, building uses CMake, and the qpdf::libqpdf target simplifies dependency management across platforms.^[3] C-language bindings, exposed via qpdf-c.h, have been available since version 10, allowing direct use in C programs or languages that interface with shared libraries.^[21] Separate QPDF instances can be used concurrently from different threads, but a single instance should not be shared between threads. This makes the library suitable as a backend for R's qpdf package, which performs document transformations in statistical workflows, and for custom processors that require in-process batch operations without invoking external shell commands.^[20]

Performance-wise, libqpdf is designed for efficiency with large files, capable of processing PDFs exceeding available system memory by reading and writing without fully loading the document into RAM, particularly when using JSON output to externalize streams.^[4] It employs stream-based I/O, including options for memory-mapped file access in certain configurations, to minimize overhead during object resolution and content stream operations on substantial documents.^[19]
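
The chaining idea behind the Pipeline classes can be sketched in plain C++ without the library. The classes below are hypothetical stand-ins written for illustration (Stage, CountingStage, and BufferStage are not libqpdf names and do not reproduce its interfaces); they only show how each stage receives bytes, does its own work, and forwards the data to the next stage.

```cpp
#include <cstddef>
#include <string>

// Hypothetical base class: a stage accepts bytes and may pass them on.
class Stage
{
  public:
    explicit Stage(Stage* next = nullptr) : next_(next) {}
    virtual ~Stage() = default;
    virtual void write(char const* data, std::size_t len) = 0;
    // Signal end of data down the chain.
    virtual void finish()
    {
        if (next_) {
            next_->finish();
        }
    }

  protected:
    void forward(char const* data, std::size_t len)
    {
        if (next_) {
            next_->write(data, len);
        }
    }

  private:
    Stage* next_;  // not owned; the caller keeps every stage alive
};

// Counts the bytes that pass through, then forwards them unchanged.
class CountingStage : public Stage
{
  public:
    explicit CountingStage(Stage* next = nullptr) : Stage(next) {}
    void write(char const* data, std::size_t len) override
    {
        count_ += len;
        forward(data, len);
    }
    std::size_t count() const { return count_; }

  private:
    std::size_t count_ = 0;
};

// Accumulates everything it receives in memory.
class BufferStage : public Stage
{
  public:
    explicit BufferStage(Stage* next = nullptr) : Stage(next) {}
    void write(char const* data, std::size_t len) override
    {
        buffer_.append(data, len);
        forward(data, len);
    }
    std::string const& contents() const { return buffer_; }

  private:
    std::string buffer_;
};
```

A chain is built back to front: construct the final BufferStage first, pass its address to a CountingStage, and write to the counting stage. Each write then updates the count and lands in the buffer, which is the same composition pattern that lets a filter stage sit in front of a file or memory sink.
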

Usage

Command-line examples

The qpdf command-line tool provides a versatile interface for performing content-preserving transformations on PDF files. Below are practical examples of common operations, demonstrating syntax for basic tasks. These examples assume the tool is installed and accessible via the command line; input and output files should be specified with full paths as needed.^[5]

To normalize PDF syntax while preserving the original stream data, use the following command, which reads the input file, applies minimal transformations to ensure compliance with PDF standards, and writes the result to an output file without altering stream filters:

qpdf --stream-data=preserve input.pdf output.pdf

This is useful for repairing or standardizing PDFs without recompressing or decoding streams, maintaining the file's original structure where possible.^[22]^[23]

To merge multiple PDF files into a single output, start from an empty base PDF and append pages from the source files:

qpdf --empty --pages file1.pdf file2.pdf -- output.pdf

This command generates output.pdf containing all pages from file1.pdf followed by those from file2.pdf. Because the primary input is empty, no document-level information such as metadata or outlines is carried over from the source files. Additional files can be listed in the same way for multi-file merges.^[24]

Splitting a PDF into individual page files is achieved with the --split-pages option, which numbers the output files automatically:

qpdf --split-pages input.pdf output-%d.pdf

This writes one file per page, replacing %d in the output name with a zero-padded page number (for example output-01.pdf through output-12.pdf for a 12-page input); if the name contains no %d, the number is inserted before the .pdf extension. Passing a count, as in --split-pages=2, groups that many pages into each output file.^[24]^[23]

Decrypting a password-protected PDF requires specifying the password and the --decrypt flag:

qpdf --password=pass --decrypt input.pdf output.pdf

Here, pass is replaced with the actual password; the command removes encryption from input.pdf and saves the unencrypted version as output.pdf, preserving all other content. For files encrypted with an empty user password, --decrypt alone suffices.^[25]^[23]

Inspecting the internal structure, such as the trailer dictionary, aids in debugging or analysis:

qpdf --show-object=trailer input.pdf

This dumps the trailer dictionary to stdout, revealing entries such as the reference to the root catalog and the object count, without modifying the file. Other objects can be shown with --show-object=obj,gen.^[26]^[23]

For advanced optimization, such as linearizing a PDF for web viewing while enabling stream compression, combine flags as follows:

qpdf --linearize --compress-streams=y input.pdf output.pdf

Linearization reorganizes the file for byte-serving, allowing partial downloads, and compression reduces size by applying Flate filters to uncompressed streams.^[16]^[23]

Error handling and validation can be performed using flags like --check, which verifies the PDF's structure and encryption:

qpdf --check input.pdf

This exits with status 0 if no problems are found, 2 if errors are found, and 3 if only warnings are found, reporting any issues along the way. To write a transformed file to stdout instead of a named file, give - as the output file name, as in qpdf --stream-data=preserve input.pdf - > output.pdf. Warnings can be suppressed with --no-warn if needed during processing.^[27]^[23]
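
A program that launches qpdf and inspects its exit status might interpret the result along the following lines. This is a small sketch under the exit-code convention described above (0 for success, 2 for errors, 3 for warnings only); QpdfResult, classify_exit_code, and completed_without_errors are hypothetical names chosen for illustration, and obtaining the status from the child process is left to the caller.

```cpp
// Outcome categories for a finished qpdf process.
enum class QpdfResult { Success, Error, Warning, Unknown };

// Map a process exit status to an outcome. Status 1 is not produced by
// qpdf itself, so it is reported as Unknown along with any other value
// (for example, a status set by the shell when qpdf could not be started).
QpdfResult classify_exit_code(int code)
{
    switch (code) {
    case 0:
        return QpdfResult::Success;
    case 2:
        return QpdfResult::Error;
    case 3:
        return QpdfResult::Warning;
    default:
        return QpdfResult::Unknown;
    }
}

// True when the run finished cleanly or with warnings only.
bool completed_without_errors(int code)
{
    QpdfResult result = classify_exit_code(code);
    return result == QpdfResult::Success || result == QpdfResult::Warning;
}
```

Treating warnings separately from errors lets a batch job keep going on recoverable inputs while still logging which files deserve a closer look.
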

Library integration

To integrate the QPDF library into C++ applications, begin by including the necessary header file and linking against the library during compilation. The primary header for core functionality is <qpdf/QPDF.hh>, which provides access to the QPDF class for loading and manipulating PDF files. Headers are installed in the include/qpdf directory and are included with the qpdf/ prefix.^[20]

A basic setup involves creating a QPDF object and processing an input file, which parses the PDF structure while preserving content. For example:

cpp
#include <qpdf/QPDF.hh>
#include <exception>
#include <iostream>

int main() {
    QPDF doc;
    try {
        // Parse the file; a password may be passed as a second argument
        doc.processFile("input.pdf");
        std::cout << "PDF loaded successfully." << std::endl;
    } catch (std::exception const& e) {
        // qpdf reports PDF errors as QPDFExc, which derives from std::runtime_error
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}

This uses processFile to load the file; a password for encrypted PDFs can be passed as an optional second argument. The method throws an exception (QPDFExc for PDF-level errors) if the file is invalid or inaccessible.^[20]

Common operations include writing the modified PDF to output and applying encryption. Output is handled via the QPDFWriter class, included from <qpdf/QPDFWriter.hh>. To write a version with annotations flattened into the page content, call flattenAnnotations on a QPDFPageDocumentHelper for the document and then write it with QPDFWriter:

cpp
#include <qpdf/QPDFPageDocumentHelper.hh>
#include <qpdf/QPDFWriter.hh>

// Merge annotation appearances into the page content before writing
QPDFPageDocumentHelper(doc).flattenAnnotations();
QPDFWriter writer(doc, "output.pdf");
writer.write();

For encryption, configure parameters before writing, such as using 256-bit AES encryption with user and owner passwords:

cpp
QPDFWriter writer(doc, "encrypted.pdf");
// Arguments: user and owner passwords, six permission flags, print level, encrypt metadata
writer.setR6EncryptionParameters(
    "userpass", "ownerpass", true, true, true, true, true, true, qpdf_r3p_full, true);
writer.write();

This applies Revision 6 (256-bit AES) encryption. Opening the file with the user password is subject to the configured permissions, while the owner password grants full access. Permissions can restrict printing, copying, or editing as needed.^[19]

Object manipulation leverages QPDFObjectHandle, a smart pointer-based class for accessing and modifying PDF elements without manual memory management. To access pages, retrieve the root dictionary and its /Pages key:

cpp
QPDFObjectHandle root = doc.getRoot();
QPDFObjectHandle pages = root.getKey("/Pages");
// /Count holds the total number of pages in the page tree
int num_pages = pages.getKey("/Count").getIntValueAsInt();

Adding a page appends it to the document using addPage, typically with a QPDFObjectHandle from another PDF or a newly created page object:

cpp
QPDF other_doc;
other_doc.processFile("other.pdf");
// Take the first page of other.pdf rather than guessing an object number
QPDFObjectHandle new_page = other_doc.getAllPages().at(0);
doc.addPage(new_page, false);  // false appends at the end; true inserts at the front

This copies the page while resolving dependencies. Smart pointers in QPDFObjectHandle ensure automatic reference counting and deallocation.^[20]

Error handling relies on exceptions, chiefly QPDFExc, which are thrown for parsing errors, invalid objects, or encryption issues. Wrap operations in try-catch blocks and use doc.anyWarnings() to check for non-fatal issues after processing. For validation, processFile inherently checks file integrity; inputs that cannot be read or recovered trigger exceptions. Memory management is handled via RAII with smart pointers, avoiding explicit deletes.^[20]

For build integration, use CMake to find and link the library, assuming QPDF is installed via its build system or a package manager:

cmake
cmake_minimum_required(VERSION 3.16)
project(MyApp LANGUAGES CXX)
find_package(qpdf REQUIRED)
add_executable(myapp main.cc)
target_link_libraries(myapp qpdf::libqpdf)

This locates headers and libraries automatically. For cross-language use, pikepdf exposes the C++ library to Python, and R interfaces with it through the qpdf package.^[28]^[20]^[29]

Best practices include loading inputs with processFile before any manipulation so that malformed PDFs are caught early. When generating output for PDF 1.5 or later, consider QPDFWriter::setObjectStreamMode(qpdf_o_generate), which writes object streams together with a compressed cross-reference stream. Avoid direct editing of compressed streams, as it risks corrupting content; instead, use high-level methods like addPage or replaceObject to maintain structural integrity.^[20]

History

Origins and early development

QPDF was originally created in 2001 by Jay Berkenbilt during his employment at Apex CoVantage, initially to support structural analysis of PDF files for research and internal tools.^[7] The software addressed a need for low-level PDF handling in environments where rendering was not required, such as decryption and object inspection.^[7] From 2001 to 2005, Berkenbilt made periodic modifications to QPDF for personal and work-related purposes, enhancing its utility for non-rendering tasks like examining PDF internals without proprietary dependencies.^[7] A key motivation was the scarcity of free tools for such manipulations, particularly to process encrypted academic PDFs that were otherwise inaccessible without commercial software.^[7] In 2005, after leaving Apex CoVantage—with the company's permission to retain ownership—Berkenbilt continued developing QPDF.^[7] The first public release occurred on April 29, 2008, as version 2.0 on SourceForge under the Artistic License 2.0, with basic command-line interface and library support geared toward Linux users; Windows compatibility was absent until subsequent versions.^[4]

Major releases and evolution

QPDF's development has seen several major releases since its initial public availability, each introducing significant enhancements to functionality, compatibility, and build processes. Version 2.0, released in April 2008, was the first public release; support for Windows platforms and the ability to build as a dynamic link library (DLL) followed in later releases, broadening its accessibility beyond Unix-like systems.^[4]

Version 7.0, released on September 15, 2017, shifted the project's licensing to the Apache 2.0 license from the previous Artistic License 2.0, facilitating greater adoption in open-source ecosystems while improving encryption handling to better conform to PDF 1.7 specifications, including enhanced support for security parameters and key derivation.^[4]^[3]

Version 10.0, released in 2020, introduced performance enhancements, support for external encryption libraries like OpenSSL, and expansions to the C API for better integration. It also provided initial handling for aspects of PDF 2.0.^[4]

Version 11.0, released on September 10, 2022, represented a pivotal evolution by introducing version 2 of qpdf's JSON output for structured manipulation of PDF internals, a comprehensive expansion of the C API, partial compatibility with PDF 2.0 (ISO 32000-2), and a transition to the CMake build system for more modern and portable compilation. It also incorporated security patches for vulnerabilities including those affecting earlier versions like CVE-2021-25786.^[4]^[30]

Subsequent releases focused on refinement and security. Version 12.0 in 2025 included API changes and further optimizations.
Version 12.2.0, released on May 4, 2025, emphasized robustness with enhancements to object resolution for handling complex or damaged PDFs and performance improvements tailored for processing large files, reducing memory usage and processing time in high-volume scenarios.^[4]

Over time, QPDF has evolved from a primarily command-line interface (CLI) tool for basic PDF transformations to a robust C++ library emphasizing structural preservation and extensibility, driven by community contributions through its GitHub repository where issues and pull requests facilitate ongoing maintenance.^[3] This shift is evident in its integration into broader ecosystems, such as the qpdf R package available on CRAN since 2019, which leverages the library for PDF manipulation within statistical computing workflows.^[10]