Apache PDFBox
Apache PDFBox is an open-source Java library developed under the Apache Software Foundation for creating, manipulating, and extracting content from PDF documents.[1]
Originally initiated in 2002 by Ben Litchfield as a project on SourceForge, PDFBox entered the Apache Incubator on February 7, 2008, and later graduated to become a top-level Apache project.[2][3]
The library supports a range of functionalities, including extracting Unicode text from PDFs, splitting and merging PDF files, filling and extracting data from PDF forms, validating PDFs against standards such as PDF/A-1b using the Preflight tool, printing PDFs via the Java API, saving PDFs as images in formats like PNG and JPEG, creating PDFs with embedded fonts and images, and digitally signing PDFs.[1]
Released under the Apache License version 2.0, PDFBox is community-driven with contributions from volunteers and includes command-line utilities for common tasks.[1][4]
As of October 2025, the latest stable releases are version 3.0.6 and the long-term support branch 2.0.35.[1]
Introduction
Overview
Apache PDFBox is a pure-Java, open-source library designed for working with PDF documents, enabling the creation of new PDFs, manipulation of existing ones through operations such as splitting, merging, altering, rendering, printing, and verifying, as well as extraction of text, images, and metadata.[1] It provides a comprehensive API for programmatic PDF handling without requiring native dependencies, making it suitable for integration into Java-based applications. The library emphasizes compliance with PDF specifications, supporting standards from version 1.0 through 1.7, with partial compatibility for emerging features in ISO 32000-2 (PDF 2.0).[1][5]
Key use cases for Apache PDFBox include document automation in enterprise environments, such as batch processing of large volumes of PDFs for archival or reporting systems, and content extraction to facilitate indexing in search engines like Apache Lucene, where it integrates directly to convert PDF content into searchable text.[6] It is also employed in tools like Apache Tika for metadata and text retrieval, supporting broader text analysis workflows in projects such as Apache Nutch for web crawling and indexing.[2]
In comparison to similar libraries like iText and PDFClown, Apache PDFBox stands out for its Apache License 2.0, which imposes no commercial restrictions or copyleft requirements, allowing unrestricted use in proprietary software without licensing fees.[1] While iText offers advanced features under an AGPL license that may necessitate commercial upgrades for closed-source applications, PDFBox prioritizes broad accessibility and community-driven development.[7]
Licensing and Development
Apache PDFBox is distributed under the Apache License 2.0, a permissive open-source license that grants users a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works, distribute, perform, display, and sublicense the software, provided that appropriate notices are retained.
The project has operated as a top-level project of the Apache Software Foundation since October 21, 2009, governed by a Project Management Committee (PMC) responsible for technical decisions, release approvals, and community management. The PMC comprises 21 members, including Ben Litchfield, who created the project in 2002, and the current chair, Andreas Lehmkühler.[8]
Development occurs through the official Apache Git repository, mirrored on GitHub at apache/pdfbox for broader accessibility, with issue tracking managed via Apache Jira and builds handled using Maven.[9][10] The project adheres to Apache Software Foundation security protocols, including private reporting of vulnerabilities to the Apache Security Team and periodic reviews to address risks in PDF processing.[11]
As of 2025, the GitHub mirror garners approximately 2,900 stars, reflecting significant community interest, while the project boasts over 100 contributors historically and 21 active committers driving ongoing enhancements.[4] It integrates seamlessly with other Apache projects, such as Tika for content extraction and Lucene for PDF indexing.[12]
Sustained by volunteer efforts from its global developer community, Apache PDFBox receives foundational support from the Apache Software Foundation, eliminating the need for corporate sponsorship or funding mandates.
Features
Core Capabilities
Apache PDFBox provides foundational support for creating new PDF documents from scratch through its core PDDocument class, which initializes an empty document structure compliant with PDF specifications. Users can add pages using PDPage objects and incorporate text, simple graphics, and basic elements via the PDPageContentStream class, enabling the drawing of lines, shapes, and text strings directly onto pages. This process allows for the embedding of standard fonts and images to build complete documents without relying on external templates.[1]
For content extraction, PDFBox offers robust tools to retrieve textual content using the PDFTextStripper class, which processes PDF pages to output Unicode text while ignoring layout formatting for plain extraction. Images can be accessed and saved from document resources via PDResources and related methods in PDPage, supporting formats like JPEG and PNG. Metadata, including title, author, and creation date, is accessible through the PDDocumentInformation class, providing a structured view of document properties.[1]
Document loading and saving are handled efficiently with the Loader class in PDFBox 3.0, where Loader.loadPDF parses existing PDF files from various input sources like files or streams into a PDDocument object. Modifications can then be persisted using the save methods of PDDocument, which support output to files, streams, or incremental updates to minimize rewriting. Basic validation is facilitated through the Preflight module, which checks PDF syntax, structure, and compliance with ISO standards such as PDF/A-1b (ISO 19005-1), identifying issues like invalid objects or missing required elements.[13]
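The load and save cycle described above can be sketched without touching the file system. This is a minimal illustration that assumes PDFBox 3.x on the classpath; the class and method names (RoundTripSketch, createPdf, describe) are hypothetical, and the document is kept in a byte array only to make the example self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;

// Hypothetical sketch class, not part of PDFBox.
final class RoundTripSketch {

    /** Builds a one-page PDF in memory and returns its bytes. */
    static byte[] createPdf(String title, String author) throws IOException {
        try (PDDocument document = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            document.addPage(new PDPage());
            PDDocumentInformation info = document.getDocumentInformation();
            info.setTitle(title);
            info.setAuthor(author);
            document.save(out); // save also accepts a File
            return out.toByteArray();
        }
    }

    /** Parses the bytes again and reports the metadata and page count. */
    static String describe(byte[] pdfBytes) throws IOException {
        try (PDDocument document = Loader.loadPDF(pdfBytes)) {
            PDDocumentInformation info = document.getDocumentInformation();
            return info.getTitle() + " by " + info.getAuthor()
                    + ", pages=" + document.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describe(createPdf("Quarterly Report", "Alice")));
    }
}
```

For documents on disk, the same two calls take a File instead of a byte array or stream.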
In terms of performance, as of PDFBox 3.0, the library supports efficient handling of large documents through incremental parsing, which reduces the initial memory footprint, and new IO classes utilizing java.nio and memory-mapped files for faster reading. These optimizations enable processing of sizable files with lower RAM consumption by reusing sources and caching limited pages in memory, while supporting sequential access for improved loading times.[13]
Advanced Functions
Apache PDFBox enables sophisticated document manipulation through utilities such as PDFMergerUtility, which merges multiple PDF documents by appending pages, optimizing resource usage, and handling structure trees in legacy mode to preserve document integrity.[14] The library also supports splitting documents via the Splitter class in the same package, dividing a single PDF into multiple files based on page ranges or other criteria.[15] Overlaying content onto existing PDFs is facilitated by the Overlay class, which superimposes pages from an overlay document onto target pages while managing font subsets and alignment.[16] Form filling is managed through PDAcroForm, which represents interactive forms and allows programmatic population of fields, import of FDF data, and export of form values.[17]
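As a rough illustration of the splitting and merging utilities, the sketch below works on blank in-memory documents. It assumes PDFBox 3.x; SplitMergeSketch and its helper methods are hypothetical names chosen for the example, and real code would usually save each part to a file rather than only counting pages.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// Hypothetical sketch class, not part of PDFBox.
final class SplitMergeSketch {

    /** Hypothetical helper: an in-memory document with the given number of blank pages. */
    static PDDocument blankDocument(int pages) {
        PDDocument document = new PDDocument();
        for (int i = 0; i < pages; i++) {
            document.addPage(new PDPage());
        }
        return document;
    }

    /** Splits the source into parts of pagesPerPart pages and returns each part's page count. */
    static int[] splitPageCounts(PDDocument source, int pagesPerPart) throws IOException {
        Splitter splitter = new Splitter();
        splitter.setSplitAtPage(pagesPerPart);
        List<PDDocument> parts = splitter.split(source);
        int[] counts = new int[parts.size()];
        for (int i = 0; i < counts.length; i++) {
            // Each part is a document of its own and has to be closed by the caller.
            try (PDDocument part = parts.get(i)) {
                counts[i] = part.getNumberOfPages();
            }
        }
        return counts;
    }

    /** Appends both sources to a fresh target document and returns the resulting page count. */
    static int mergedPageCount(PDDocument first, PDDocument second) throws IOException {
        try (PDDocument target = new PDDocument()) {
            PDFMergerUtility merger = new PDFMergerUtility();
            merger.appendDocument(target, first);
            merger.appendDocument(target, second);
            return target.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument five = blankDocument(5);
             PDDocument three = blankDocument(3);
             PDDocument two = blankDocument(2)) {
            System.out.println(Arrays.toString(splitPageCounts(five, 2)));
            System.out.println(mergedPageCount(three, two));
        }
    }
}
```

The source document is kept open until all parts have been processed, since the parts are built from the source's pages.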
Security features in PDFBox include encryption and decryption via StandardProtectionPolicy, which applies password-based protection with configurable key lengths and permissions, such as restricting printing or editing.[18] For digital signatures, ExternalSigningSupport provides an interface for external signing workflows, enabling retrieval of document bytes for hashing and subsequent embedding of CMS/PKCS#7 signatures to ensure authenticity and non-repudiation.[19]
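The password protection described above might look as follows in code. This is a sketch under the assumption of PDFBox 3.x and a Java runtime that permits 256-bit AES; EncryptionSketch, its method names, and the passwords are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy;

// Hypothetical sketch class, not part of PDFBox.
final class EncryptionSketch {

    /** Creates a one-page PDF that is encrypted and forbids printing for ordinary users. */
    static byte[] createProtectedPdf(String ownerPassword, String userPassword) throws IOException {
        try (PDDocument document = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            document.addPage(new PDPage());

            AccessPermission permissions = new AccessPermission(); // starts with everything allowed
            permissions.setCanPrint(false);

            StandardProtectionPolicy policy =
                    new StandardProtectionPolicy(ownerPassword, userPassword, permissions);
            policy.setEncryptionKeyLength(256); // assumption: 256-bit keys are acceptable here
            document.protect(policy);

            document.save(out); // encryption is applied while saving
            return out.toByteArray();
        }
    }

    /** Opens the PDF with the user password and reports whether printing is permitted. */
    static boolean userMayPrint(byte[] pdfBytes, String userPassword) throws IOException {
        try (PDDocument document = Loader.loadPDF(pdfBytes, userPassword)) {
            return document.getCurrentAccessPermission().canPrint();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = createProtectedPdf("owner-secret", "user-secret");
        System.out.println("user may print: " + userMayPrint(pdf, "user-secret"));
    }
}
```

Opening the same bytes with the owner password grants full permissions, and a wrong password makes Loader.loadPDF fail with an IOException subclass.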
Rendering capabilities extend to converting PDFs into visual formats using PDFRenderer, which generates BufferedImage outputs at specified DPI and color types, suitable for image extraction or display in Swing components via Graphics2D contexts.[20] Printing is supported through PDFPrintable, an implementation of the Java Printable interface that handles page scaling, orientation, and borders for direct output to printers.[21]
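A minimal rendering sketch, assuming PDFBox 3.x, is shown below. RenderSketch and its methods are hypothetical names; the blank US Letter page only keeps the example self-contained, and any loaded PDDocument could be passed instead.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical sketch class, not part of PDFBox.
final class RenderSketch {

    /** Renders the first page (index 0) as an RGB image at the given resolution. */
    static BufferedImage renderFirstPage(PDDocument document, float dpi) throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        return renderer.renderImageWithDPI(0, dpi, ImageType.RGB);
    }

    /** Encodes the image as PNG using the standard ImageIO API. */
    static byte[] toPng(BufferedImage image) throws IOException {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = new PDDocument()) {
            document.addPage(new PDPage(PDRectangle.LETTER)); // 612 x 792 points
            BufferedImage image = renderFirstPage(document, 144f);
            System.out.println(image.getWidth() + " x " + image.getHeight()
                    + ", PNG bytes: " + toPng(image).length);
        }
    }
}
```

Because PDF user space uses 72 units per inch, rendering at 144 DPI doubles the pixel dimensions of the page.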
Accessibility enhancements involve adding structural tags via PDStructureTreeRoot, the root node of the document's logical structure tree, which organizes content into a hierarchy compliant with PDF/UA standards for screen readers.[22] Alternative text for images is set through properties in PDStructureElement or annotations, while the document's natural language is declared on the document catalog, using language tags that name the primary language and, optionally, a region (for example en-US).[23][24]
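The sketch below shows, under the assumption of PDFBox 3.x, how the catalog language, the structure tree root, and an alternative description could be set. TaggingSketch and tagFigure are hypothetical names. The example deliberately stops short of linking the structure element to marked content on a page, so on its own it does not produce a PDF/UA-conformant file.

```java
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDMarkInfo;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

// Hypothetical sketch class, not part of PDFBox.
final class TaggingSketch {

    /** Declares the document language and adds a Figure element carrying alternative text. */
    static PDStructureElement tagFigure(PDDocument document, String language, String altText) {
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        catalog.setLanguage(language); // for example "en-US"

        // Flag the document as tagged.
        PDMarkInfo markInfo = new PDMarkInfo();
        markInfo.setMarked(true);
        catalog.setMarkInfo(markInfo);

        // Create the logical structure tree and attach one element to it.
        PDStructureTreeRoot root = new PDStructureTreeRoot();
        catalog.setStructureTreeRoot(root);
        PDStructureElement figure = new PDStructureElement("Figure", root);
        figure.setAlternateDescription(altText);
        root.appendKid(figure);
        return figure;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = new PDDocument()) {
            document.addPage(new PDPage());
            PDStructureElement figure = tagFigure(document, "en-US", "Company logo");
            System.out.println(document.getDocumentCatalog().getLanguage()
                    + ": " + figure.getAlternateDescription());
        }
    }
}
```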
Error handling for malformed PDFs is addressed by the lenient parsing mode in COSParser, which skips invalid constructs during document loading to enable recovery, and custom parsers like PDFStreamParser with forceParsing flags to process unparseable streams incrementally.[25][26]
Architecture and Components
Core Libraries
The core libraries of Apache PDFBox form the foundational programmatic components for PDF processing in Java applications. These libraries provide the essential models, parsing mechanisms, and utilities for handling PDF structures, fonts, and metadata without relying on external dependencies beyond standard Java APIs.
The central library, pdfbox (package org.apache.pdfbox), serves as the primary interface for PDF document manipulation. It includes high-level classes such as PDDocument, which represents the in-memory model of an entire PDF file, enabling loading, creation, and saving of documents, and PDPage, which models individual pages with properties like size, resources, and content streams. This library supports parsing PDF syntax, generating new PDFs from scratch, and extracting contents like text and images, all built on a modular document object model that adheres to the PDF specification. It requires Java 8 or later for compatibility with modern JVM features.[1][27]
The fontbox library (package org.apache.fontbox) specializes in font handling, providing low-level access to font files and integration with PDF rendering. It parses Type 1 and TrueType font programs to extract glyphs, metrics, and embedding data, and so underpins pdfbox font classes such as PDType1Font, which represents the built-in PDF fonts (e.g., Helvetica, Times-Roman). A key feature is font subsetting, implemented via TTFSubsetter, which embeds only the glyphs used in a document to reduce file sizes and improve performance. This library operates independently but is tightly coupled with pdfbox for text rendering in PDFs.[28]
The xmpbox library (package org.apache.xmpbox) focuses on Extensible Metadata Platform (XMP) support, Adobe's standard for embedding structured metadata in PDFs using RDF/XML format. It enables parsing, validation, and creation of XMP packets, including schemas for Dublin Core, EXIF, and custom properties, allowing developers to add descriptive data like author, keywords, and rights information directly into PDF files. This library ensures compliance with XMP specifications for interoperability with tools like Adobe Acrobat.[29]
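As a hedged illustration of the xmpbox API, the following sketch builds an XMP packet with a Dublin Core schema and parses it back. It assumes the xmpbox artifact (3.x) is on the classpath; XmpSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.transform.TransformerException;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.DublinCoreSchema;
import org.apache.xmpbox.xml.DomXmpParser;
import org.apache.xmpbox.xml.XmpParsingException;
import org.apache.xmpbox.xml.XmpSerializer;

// Hypothetical sketch class, not part of xmpbox.
final class XmpSketch {

    /** Builds an XMP packet containing a Dublin Core title and creator. */
    static byte[] buildPacket(String title, String creator) throws TransformerException {
        XMPMetadata xmp = XMPMetadata.createXMPMetadata();
        DublinCoreSchema dublinCore = xmp.createAndAddDublinCoreSchema();
        dublinCore.setTitle(title);
        dublinCore.addCreator(creator);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The last argument requests the surrounding xpacket wrapper.
        new XmpSerializer().serialize(xmp, out, true);
        return out.toByteArray();
    }

    /** Parses a packet and returns the Dublin Core title. */
    static String readTitle(byte[] packet) throws XmpParsingException {
        XMPMetadata xmp = new DomXmpParser().parse(packet);
        return xmp.getDublinCoreSchema().getTitle();
    }

    public static void main(String[] args) throws Exception {
        byte[] packet = buildPacket("Annual Report", "Alice");
        System.out.println(readTitle(packet));
    }
}
```

To store such a packet in a PDF, the bytes would typically be imported into a PDMetadata stream that is then set on the document catalog.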
These libraries exhibit clear interdependencies: pdfbox relies on fontbox for accurate text rendering and font embedding during PDF generation or extraction, and on xmpbox for processing metadata streams within documents. Together, the core JAR files—pdfbox, fontbox, and xmpbox—total approximately 4 MB in size for version 3.0.x, making them lightweight for integration into applications.[30][31]
Regarding API stability, Apache PDFBox maintains backward compatibility within major versions through deprecation policies, where breaking changes are announced via migration guides and marked for removal in advance, ensuring minimal disruption for existing codebases across minor releases.[13]
Apache PDFBox includes a suite of command-line tools, accessible via the PDFBox-CLI interface, which enable straightforward PDF manipulation and analysis without requiring custom programming. Note that in version 3.0 and later, these tools use a subcommand structure (e.g., 'merge' instead of 'PDFMerger'). They are packaged in the standalone pdfbox-app-x.y.z.jar file, downloadable from the official Apache repository, and executed using the Java command-line interface, such as java -jar pdfbox-app-3.0.6.jar <subcommand> [options]. For instance, the ExtractText tool extracts textual content from PDF documents into formats like plain text, HTML, or Markdown, supporting parameters for page ranges (-startPage and -endPage), encoding (default UTF-8), and output files (-o).[32][27]
Other key tools include PDFMerger, which combines multiple input PDF files into a single output document via commands like java -jar pdfbox-app-3.0.6.jar merge -o merged.pdf -i file1.pdf -i file2.pdf, preserving page order and structure. The PDFDebugger utility facilitates visual inspection of a PDF's internal object structure, launched with java -jar pdfbox-app-3.0.6.jar debug input.pdf, and supports password-protected files (-password) for decryption during analysis. Additional tools cover encryption/decryption, image extraction, document splitting, and printing, all leveraging the core parsing capabilities for efficient operation.[32]
The Preflight module serves as a dedicated validation tool within PDFBox, checking PDF documents for compliance with the PDF/A-1 standard (ISO 19005-1), primarily conformance level PDF/A-1b, to ensure long-term archival integrity. Its parser and validation processes follow the ISO 19005-1 requirements, which target reliable preservation of a document's visual content, including text and images. The module's configuration class allows the validation processes to be customized, for example for metadata verification and structural checks. Preflight generates detailed reports on errors, such as missing fonts or invalid color profiles, making it essential for quality assurance in document workflows.[1]
An optional supporting component is the jbig2-imageio plugin, which extends PDFBox's image processing to handle JBIG2-compressed images, a lossless format often used in scanned documents to achieve high compression ratios without quality loss. Integrated as a Java ImageIO plugin, it enables reading and decoding JBIG2 streams embedded in PDFs, particularly useful for legacy or high-volume scanning applications; users must include the jbig2-imageio-x.y.z.jar in their classpath for activation. This plugin, originally from Levigo and now maintained under Apache, addresses limitations in standard image handling for bi-level content.[33][34]
While these modules and tools provide ready-to-use functionality, they have inherent limitations: many operations, such as text extraction or validation, are read-only and do not support in-place modifications, and advanced customizations—like conditional overlays or scripted workflows—necessitate developing Java applications with the core libraries. Invocation requires a compatible Java environment (JDK 8+), and certain features, like printing, depend on system-level permissions.[32][30]
History
Origins and Early Development
Apache PDFBox originated in 2002 as an open-source project initiated by Ben Litchfield on SourceForge, primarily designed to extract text from PDF documents for indexing by the Apache Lucene search engine. This effort addressed the need for a free, Java-based alternative to proprietary PDF libraries, such as Adobe's PDF development kit, which were restrictive for integration into open-source search applications. Litchfield served as the primary maintainer, with early development focusing on basic PDF parsing capabilities to support text extraction and content access.[2]
Prior to its integration into the Apache Software Foundation, PDFBox evolved under the BSD license, releasing initial versions that emphasized core parsing functions. Version 0.7.0 marked a significant early milestone by introducing foundational tools for PDF manipulation and integration with search technologies. Community contributions were facilitated through SourceForge forums, where users provided feedback and enhancements to improve reliability for search engine requirements.
On February 7, 2008, PDFBox was accepted into the Apache Incubator, leaving its SourceForge hosting and relicensing to the Apache License 2.0 to align with Foundation standards.[35] The project graduated to top-level Apache status in October 2009. This move formalized its development under Apache governance while preserving its focus on open accessibility for PDF processing in Java environments.[35]
Major Releases and Milestones
The development of Apache PDFBox has progressed through distinct series of releases, each introducing significant enhancements in functionality, compliance, and performance. The 1.x series, spanning from 2009 to 2016, marked the project's transition to full Apache status and focused on foundational capabilities. The inaugural Apache release, version 0.8.0-incubating, arrived in September 2009, establishing PDFBox as an open-source Java library for PDF manipulation under the Apache Software Foundation. Subsequent releases in this series built upon this base, with version 1.0.0 released on February 15, 2010, introducing core features such as PDF creation, content extraction, and basic rendering support. By version 1.8.13 in December 2016, the series had stabilized text extraction processes through numerous bug fixes, including resolutions for issues like leading space handling in text conversion and infinite loops in parsing, enhancing reliability for document processing tasks.[35][36][37]
The 2.x series, from 2016 to 2025, represented a major architectural evolution, emphasizing modularity and expanded standards compliance. Version 2.0.0, released on March 21, 2016, restructured the library into separate modules, including pdfbox-tools for utilities, pdfbox-examples for demonstrations, and pdfbox-debugger for inspection, facilitating easier integration and maintenance. This release introduced full Unicode font support with subsetting, improved PDF/A validation via the Preflight tool, and matured digital signature capabilities through updated dependencies like Bouncy Castle, enabling robust signing and verification of PDF documents. A key milestone was the integration with Apache Tika beginning around 2008, where PDFBox served as the primary parser for PDF content and metadata extraction in Tika's toolkit, boosting its adoption in broader document processing ecosystems. The series continued with iterative updates for security and stability, culminating in version 2.0.35 on October 2, 2025, which included critical security fixes and minor enhancements.[38][39][40]
Initiated in 2023, the 3.x series advanced compliance with modern PDF specifications and optimized resource usage. Version 3.0.0, released on August 17, 2023, aligned with PDF 2.0 (ISO 32000-2) requirements through incremental parsing and refined handling of advanced features like transparency and embedded content. It introduced performance optimizations, such as a new IO module leveraging java.nio for memory-mapped files, reducing memory footprint during parsing and rendering compared to prior versions. Accessibility received attention with enhancements to text extraction and glyph metrics in the PDFDebugger tool. Breaking API changes included the removal of the legacy static instances for Standard 14 fonts, requiring migration to new constructors like PDType1Font(Standard14Fonts.FontName), and the elimination of the outdated Charsets utilities in favor of java.nio.charset.StandardCharsets; comprehensive migration guides were provided to ease upgrades from 2.x. The latest release, 3.0.6 on October 16, 2025, further bolstered accessibility features and incorporated bug fixes for form handling and rendering.[41][13][42]
Usage
Basic Implementation
To integrate Apache PDFBox into a Java application, the library requires Java 8 or higher, with Java 11 or later recommended for optimal performance and compatibility in version 3.x releases.[13] Basic operations, such as creating or extracting text from PDFs, do not require external fonts, as the library includes support for the standard 14 PDF base fonts like Helvetica and Times-Roman.
For project setup using Maven, include the following dependency in the pom.xml file to use the latest stable version:
```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.6</version>
</dependency>
```
Essential imports for basic implementation include `org.apache.pdfbox.pdmodel.PDDocument`, `org.apache.pdfbox.pdmodel.PDPage`, `org.apache.pdfbox.pdmodel.PDPageContentStream`, `org.apache.pdfbox.pdmodel.font.PDType1Font`, `org.apache.pdfbox.pdmodel.font.Standard14Fonts`, and `org.apache.pdfbox.text.PDFTextStripper`.
Creating a new PDF document involves instantiating a `PDDocument`, adding a `PDPage`, and drawing content via a `PDPageContentStream`. The following example demonstrates generating a simple single-page PDF with text:
```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;

public class SimplePDFCreator {
    public static void main(String[] args) {
        // try-with-resources closes the document even if drawing or saving fails
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);
            // The content stream must be closed before the document is saved
            try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
                contentStream.beginText();
                // 3.x removed the static PDType1Font.HELVETICA constant;
                // Standard 14 fonts are created through the constructor instead
                contentStream.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 12);
                contentStream.newLineAtOffset(25, 700);
                contentStream.showText("Hello, World! This is a basic PDF created with Apache PDFBox.");
                contentStream.endText();
            }
            document.save(new File("simple.pdf"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
This code uses try-with-resources statements to automatically close the PDDocument and PDPageContentStream, preventing resource leaks. The save method writes the document to a File or an OutputStream, and any failures, such as invalid paths or I/O errors, are caught as IOException. The COSVisitorException that save methods threw in the 1.x releases no longer exists, so IOException is the only checked exception to handle.[13]
Extracting text from an existing PDF relies on the PDFTextStripper class, which processes the document and either writes its text to a Writer (writeText) or returns it as a String (getText). For position-aware extraction, a subclass can override writeString, which receives the TextPosition objects that carry the coordinates of each piece of text. The example below loads a PDF and extracts all text:
```java
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SimpleTextExtractor {
    public static void main(String[] args) {
        // In 3.x, Loader.loadPDF accepts a File, a byte[] or a RandomAccessRead
        try (PDDocument document = Loader.loadPDF(new File("input.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            StringWriter writer = new StringWriter();
            // writeText extracts the text of all pages into the Writer
            stripper.writeText(document, writer);
            String text = writer.toString();
            System.out.println(text);
            // For positions, subclass PDFTextStripper and override
            // writeString(String, List<TextPosition>)
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
Here, Loader.loadPDF is the entry point in version 3.x for loading documents, replacing the PDDocument.load methods of 2.x, which were removed. It accepts a File, a byte array, or a RandomAccessRead; an InputStream has to be wrapped first, for example in a RandomAccessReadBuffer. The writeText method handles the extraction, writing directly to the provided Writer, while IOException covers parsing or I/O issues. Try-with-resources ensures the document is closed after processing, even if extraction fails due to malformed PDFs.[13]
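Position-aware extraction can be sketched by subclassing PDFTextStripper and overriding its writeString hook. The example assumes PDFBox 3.x; PositionSketch, chunks, and sample are hypothetical names, and the sample document exists only so the sketch runs without an input file.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical sketch class, not part of PDFBox.
final class PositionSketch extends PDFTextStripper {

    private final List<String> chunks = new ArrayList<>();

    PositionSketch() throws IOException {
        super();
    }

    /** Called by the stripper for each piece of text, together with its glyph positions. */
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        if (!textPositions.isEmpty()) {
            TextPosition first = textPositions.get(0);
            // Record the text with the adjusted coordinates of its first glyph.
            chunks.add(String.format(Locale.ROOT, "%s @ (%.1f, %.1f)",
                    text, first.getXDirAdj(), first.getYDirAdj()));
        }
        super.writeString(text, textPositions); // keep the normal text output
    }

    List<String> chunks() {
        return chunks;
    }

    /** Hypothetical helper: a one-page document showing the given text. */
    static PDDocument sample(String text) throws IOException {
        PDDocument document = new PDDocument();
        PDPage page = new PDPage();
        document.addPage(page);
        try (PDPageContentStream content = new PDPageContentStream(document, page)) {
            content.beginText();
            content.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 12);
            content.newLineAtOffset(25, 700);
            content.showText(text);
            content.endText();
        }
        return document;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = sample("Hello")) {
            PositionSketch stripper = new PositionSketch();
            stripper.getText(document); // drives the writeString callbacks
            stripper.chunks().forEach(System.out::println);
        }
    }
}
```

The stripper invokes writeString once per run of text it has grouped together, so the granularity of the recorded chunks depends on how the page content is laid out.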
Advanced Applications and Examples
Apache PDFBox enables several advanced applications beyond basic PDF creation and manipulation, including digital signing, interactive form handling, printing integration, content overlays, and standards validation. These features leverage specialized modules and classes within the library, allowing developers to address complex requirements in document processing workflows, such as secure authentication, automated data entry, and compliance checking.[1]
Digital signing is facilitated through the PDSignature class in the pdmodel.interactive.digitalsignature package, which represents a digital signature attached to a PDF document. This supports the creation of signatures compliant with PDF standards, enabling secure verification of document integrity and authenticity. For instance, developers can load a PDF, generate a signature using external cryptographic providers, and embed it into the document catalog for validation. The process involves setting signature attributes like the byte range and filter, as detailed in the Adobe PDF specification referenced in the API documentation.
Interactive form handling is managed via the PDAcroForm class, which provides access to AcroForm structures for extracting or populating field data in PDF forms. This allows automation of form filling, such as importing values from external data sources into text fields, checkboxes, or radio buttons, and flattening forms to render them non-editable. A representative example involves retrieving the form from the document catalog, iterating over fields with getFields(), and setting values using setValue() before saving the modified document. This capability is particularly useful in enterprise applications for processing submitted forms programmatically.[1]
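The form-filling flow described above might be sketched as follows, assuming PDFBox 3.x. FormFillSketch, fill, and sampleForm are hypothetical names; sampleForm builds a minimal one-field form only so the example does not depend on an existing PDF.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTextField;

// Hypothetical sketch class, not part of PDFBox.
final class FormFillSketch {

    /** Sets each named field to its value and returns how many fields were found. */
    static int fill(PDDocument document, Map<String, String> values) throws IOException {
        PDAcroForm form = document.getDocumentCatalog().getAcroForm();
        if (form == null) {
            return 0; // the document has no interactive form
        }
        int filled = 0;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            PDField field = form.getField(entry.getKey());
            if (field != null) {
                field.setValue(entry.getValue());
                filled++;
            }
        }
        return filled;
    }

    /** Hypothetical helper: a one-page document with a single text field. */
    static PDDocument sampleForm(String fieldName) throws IOException {
        PDDocument document = new PDDocument();
        PDPage page = new PDPage();
        document.addPage(page);

        // The form needs a default font so that field appearances can be generated.
        PDResources resources = new PDResources();
        resources.put(COSName.getPDFName("Helv"),
                new PDType1Font(Standard14Fonts.FontName.HELVETICA));
        PDAcroForm form = new PDAcroForm(document);
        form.setDefaultResources(resources);
        form.setDefaultAppearance("/Helv 12 Tf 0 g");
        document.getDocumentCatalog().setAcroForm(form);

        PDTextField field = new PDTextField(form);
        field.setPartialName(fieldName);
        PDAnnotationWidget widget = field.getWidgets().get(0);
        widget.setRectangle(new PDRectangle(50, 700, 200, 20));
        widget.setPage(page);
        page.getAnnotations().add(widget);
        form.getFields().add(field);
        return document;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = sampleForm("name")) {
            int filled = fill(document, Collections.singletonMap("name", "Alice"));
            System.out.println("fields filled: " + filled);
        }
    }
}
```

Calling flatten() on the PDAcroForm afterwards would merge the field appearances into the page content and remove the interactive fields.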
Printing integration is supported by the printing package, which includes PDFPrintable for rendering PDF pages onto Java's Graphics2D context and PDFPageable for handling multi-page documents in the standard Java printing API. Developers can scale, orient, and print PDFs directly from applications, with options for fit-to-page or actual size rendering. An example usage loads a PDDocument, wraps it in a PDFPageable instance, sets that on a PrinterJob with setPageable(), and then calls the job's print() method. This enables seamless incorporation of PDF printing into desktop or server-based software.
Content overlays allow superimposing one PDF onto another, useful for adding watermarks, headers, or stamps across pages. The Overlay class in the multipdf package handles this by mapping source pages to target positions (foreground or background) and applying rotations if needed. For example, to overlay a logo PDF onto specific pages of a base document, one specifies page mappings via a Map<Integer, String> and invokes overlay() on the loaded documents. Command-line support via OverlayPDF extends this to batch processing.[32]
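A possible shape for an overlay operation is sketched below, assuming PDFBox 3.x. OverlaySketch, stampAllPages, and textDocument are hypothetical names, and the example passes already-loaded PDDocument objects rather than file names, so the per-page map is left empty.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.multipdf.Overlay;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;

// Hypothetical sketch class, not part of PDFBox.
final class OverlaySketch {

    /** Draws the stamp document over every page of the base document and returns the result. */
    static byte[] stampAllPages(PDDocument base, PDDocument stamp) throws IOException {
        try (Overlay overlay = new Overlay();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            overlay.setInputPDF(base);
            overlay.setAllPagesOverlayPDF(stamp);
            overlay.setOverlayPosition(Overlay.Position.FOREGROUND);
            // No file-based per-page overlays are used, hence the empty map.
            PDDocument result = overlay.overlay(new HashMap<Integer, String>());
            // Save before the Overlay is closed, while its resources are still available.
            result.save(out);
            return out.toByteArray();
        }
    }

    /** Hypothetical helper: a document whose pages all show the given text. */
    static PDDocument textDocument(String text, int pages) throws IOException {
        PDDocument document = new PDDocument();
        for (int i = 0; i < pages; i++) {
            PDPage page = new PDPage();
            document.addPage(page);
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 24);
                content.newLineAtOffset(100, 400);
                content.showText(text);
                content.endText();
            }
        }
        return document;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument base = textDocument("Body text", 2);
             PDDocument stamp = textDocument("DRAFT", 1)) {
            byte[] stamped = stampAllPages(base, stamp);
            try (PDDocument result = Loader.loadPDF(stamped)) {
                System.out.println("pages: " + result.getNumberOfPages());
            }
        }
    }
}
```

Switching the position to Overlay.Position.BACKGROUND would place the stamp underneath the existing page content instead.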
Standards validation, particularly for PDF/A compliance, is performed using the preflight package, which includes PreflightDocument to check documents against predefined rulesets like PDF/A-1b. This process parses the PDF structure, validates metadata, fonts, and color spaces, and reports violations. A typical application loads a document into a PreflightDocument with a configuration for the target format, then calls validate() to generate a report, aiding archival and long-term preservation workflows. The command-line Preflight tool automates this for file validation.[32]
Additional advanced examples include converting PDFs to images via the PDFRenderer class, which renders pages to BufferedImage for formats like PNG or JPEG, supporting high-resolution output for archiving or web display. Merging and splitting large documents is handled by the PDFMergerUtility and Splitter classes, enabling efficient processing of multi-document workflows. These utilities demonstrate PDFBox's scalability for enterprise-scale applications.