Pandoc
Pandoc is a free and open-source universal document converter that enables the transformation of files between a wide array of markup formats, including lightweight formats like Markdown and reStructuredText, as well as more structured ones such as HTML, LaTeX, Microsoft Word (docx), EPUB, and over 50 others for both input and output.[1] First released in 2006, Pandoc was developed by John MacFarlane as a Haskell library and command-line tool, providing robust support for advanced features like embedded LaTeX mathematics (rendered via MathJax or other engines), automatic citations and bibliographies using Citation Style Language (CSL), footnotes, tables, and metadata handling.[1] Its modular architecture allows customization through Lua filters, templates, and extensions, making it a versatile foundation for publishing workflows, academic writing, and automated document processing.[2] Licensed under the GNU General Public License (GPL), Pandoc emphasizes extensibility and cross-platform compatibility, with ongoing development ensuring compatibility with evolving standards in document markup and rendering.[1]
History and Development
Origins and Initial Release
Pandoc was developed by John MacFarlane, a professor of philosophy at the University of California, Berkeley, as a Haskell library and command-line tool designed to convert between various markup formats.[3][4] The project originated around 2004 as a personal experiment while MacFarlane was learning Haskell, initially motivated by the need for a reliable Markdown parser to manage his lecture notes and academic documents.[4] He sought to address the shortcomings of existing tools, such as reStructuredText, which lacked sufficient flexibility for scholars transitioning between formats like plain text, web-ready outputs, and typesetting systems without compromising document structure.[4] The initial public release occurred in 2006, distributed as a tarball on MacFarlane's website. This early version emphasized basic conversions among Markdown, HTML, and LaTeX, enabling users to parse and transform documents via an abstract syntax tree (AST) for format-agnostic processing.[4] From its inception, Pandoc has been licensed under the GNU General Public License version 2 or later, ensuring open-source accessibility and encouraging community contributions for academic and publishing workflows.[5]
Evolution and Major Milestones
Pandoc's development has progressed steadily since its version 1.0 release in September 2008, which introduced extended Markdown support as a core feature for versatile document conversion.[6] Over the subsequent years, the project evolved through numerous updates, up to the stable release of version 3.8.2.1 on October 20, 2025, which fixed regressions in citation processing and improved format compatibility, including better handling of Org mode dynamic blocks and Docx nested comments.[7] This progression reflects ongoing refinements to core parsing, writing, and filtering mechanisms, driven by user feedback and changes in markup standards.
Key milestones mark significant expansions in functionality. Version 2.0, released in October 2017, introduced built-in Lua filter support, enabling users to manipulate Pandoc's abstract syntax tree without external dependencies, a feature that greatly enhanced customization for complex document transformations. Version 2.11, released in October 2020, built citation processing into Pandoc itself through the --citeproc option, replacing the separate pandoc-citeproc filter and allowing bibliographies and references to be handled directly within Pandoc. Version 3.0, released in January 2023, added a dedicated figure element to the AST and moved the command-line executable into a separate pandoc-cli package; output for newer formats such as Typst followed in later 3.x releases.
The project's growth has been fueled by community contributions through its GitHub repository at jgm/pandoc, where volunteers submit bug fixes, new extensions, and feature proposals, with thousands of issues addressed since inception.[2] Primary maintenance is led by John MacFarlane, supported by a network of contributors who ensure cross-platform availability on Unix-like systems and Windows via precompiled binaries.[8]
Technical Foundation
Implementation in Haskell
Pandoc is implemented primarily in the Haskell programming language, whose strong static type system catches many errors at compile time, so that whole classes of runtime errors common in dynamically typed languages are ruled out before parsing and conversion code ever runs.[4] This type safety has been instrumental in maintaining the tool's reliability over its development, as the compiler enforces consistency and prevents common mistakes during code changes.[4] The core of Pandoc consists of a Haskell library that handles the conversion logic, allowing it to be embedded in other Haskell applications for custom document processing, while the command-line executable is built directly from this library for standalone use.[5] This dual structure enables developers to leverage Pandoc's functionality programmatically, such as integrating it into web services or scripts that require markup transformations.[5] Users can compile and install Pandoc from source using Haskell's build tools, such as Cabal or Stack; for instance, with Stack, one runs stack install pandoc-cli after setting up the environment, which handles dependencies and builds the executable.[8] Alternatively, pre-built binaries are available for major platforms including Windows (via MSI or ZIP), macOS (via package installer or ZIP), and Linux (via tarball or DEB packages), providing a straightforward installation without requiring a Haskell toolchain.[8]
Haskell's functional programming paradigm further enhances Pandoc's design by supporting pure functions for document transformations, which promotes modularity and predictability in handling complex markup structures, reducing the risk of side effects in conversions.[4] This approach aligns with the tool's goal of reliable, composable operations across diverse formats.[4]
Architecture and Core Components
Pandoc employs a modular architecture that facilitates the conversion of documents between diverse formats through a pipeline of parsing, intermediate representation, and rendering stages. Input documents are processed by reader modules, which parse the source text into a universal abstract syntax tree (AST), serving as the central intermediate representation that captures the semantic structure of the content rather than its visual styling. This AST is then transformed by writer modules to generate output in the target format. Because every conversion passes through the same AST, supporting M input formats and N output formats requires only M readers and N writers rather than M × N dedicated converters.[5][9]
The AST, defined in the pandoc-types Haskell package, structures documents as a Pandoc value comprising metadata (via a Meta map for elements like title and authors) and a list of block-level elements, with inline elements nested within blocks to represent textual content semantically. Blocks encompass structural components such as paragraphs (Para), headers (Header), lists (OrderedList or BulletList), code blocks (CodeBlock), and tables (Table), while inlines include text strings (Str), emphasis (Emph), links (Link), and citations (Cite). This design prioritizes semantic preservation, allowing Pandoc to maintain document intent across formats, though it may discard format-specific styling details like precise font metrics. Users can inspect the AST directly using the pandoc -t native command, which outputs the parsed structure in a human-readable form.[10][5]
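The block and inline structure described above can be made concrete with a small Python sketch. It assumes the JSON serialization of the AST that pandoc -t json emits, in which each element is an object with a "t" (type) field and usually a "c" (contents) field. The document below is hand-written for illustration, the API version numbers are placeholders, and the helper names inline_text and outline are hypothetical rather than part of Pandoc.

```python
# A hand-written document in the shape of pandoc's JSON AST.
# The "pandoc-api-version" value is illustrative only.
DOC = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        # Header contents: [level, [id, classes, key-value pairs], inlines]
        {"t": "Header", "c": [1, ["intro", [], []], [{"t": "Str", "c": "Intro"}]]},
        # Para contents: a list of inlines
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Emph", "c": [{"t": "Str", "c": "world"}]},
        ]},
    ],
}


def inline_text(inlines):
    """Flatten a list of inline elements into plain text (a small subset only)."""
    parts = []
    for node in inlines:
        tag = node["t"]
        if tag == "Str":
            parts.append(node["c"])
        elif tag in ("Space", "SoftBreak"):
            parts.append(" ")
        elif tag in ("Emph", "Strong", "SmallCaps"):
            # These inlines simply wrap a nested list of inlines.
            parts.append(inline_text(node["c"]))
    return "".join(parts)


def outline(doc):
    """Return (block type, header level or None, text) for headers and paragraphs."""
    result = []
    for block in doc["blocks"]:
        if block["t"] == "Header":
            level, _attr, inlines = block["c"]
            result.append(("Header", level, inline_text(inlines)))
        elif block["t"] == "Para":
            result.append(("Para", None, inline_text(block["c"])))
    return result
```

The sketch handles only a handful of element types; a real document would also contain lists, tables, links, and other nodes that this walker silently skips.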
Key components include the reader modules, which handle parsing for input formats by converting raw text into the AST while interpreting semantic cues like markup syntax; writer modules, which traverse the AST to produce formatted output tailored to the destination format; and the template system, which applies customizable layouts to standalone documents via variables for elements like headers, footers, and tables of contents. Templates are managed through the Text.Pandoc.Templates module and can be overridden with user-provided files, ensuring flexibility in output presentation without altering core conversion logic.[5][9]
Error handling in Pandoc emphasizes graceful degradation, where unsupported features in input or output formats are skipped or approximated, accompanied by warnings that alert users to potential information loss, such as unrenderable complex tables or embedded media. These warnings can be suppressed with flags like --quiet or treated as failures via --fail-if-warnings. In API usage, failures are represented as values of the PandocError type, raised within Pandoc's PandocMonad abstraction so that calling code can handle them programmatically.[5][9]
The system's modularity supports extensibility, permitting the development of custom readers and writers directly in Haskell via the Pandoc API or through Lua for lightweight modifications like filters that traverse and alter the AST between parsing and rendering phases. This design allows integration of new formats or behaviors without recompiling the core tool, leveraging Haskell's type safety for robust implementations.[5][9]
Core Functionality
Document Conversion Process
Pandoc's document conversion process is initiated through its command-line interface, where users specify input files, output formats, and various options to transform documents between supported formats. The basic syntax follows the structure pandoc [options] [input-file]..., allowing multiple input files to be concatenated and processed together. If no input files are provided, Pandoc reads from standard input (stdin), and by default, it writes output to standard output (stdout); the -o option specifies an output file, such as pandoc -o output.html input.md. This design supports piping in Unix-like environments, enabling seamless integration into scripts or workflows, for example, cat input.md | pandoc -o output.pdf.[5]
Several key options allow users to customize the conversion. The --from FORMAT and --to FORMAT flags explicitly specify the input and output formats, overriding Pandoc's automatic detection based on file extensions; for instance, --from markdown parses the input as Markdown, while --to html generates HTML output. The --template FILE option applies a custom template to the output, useful for tailoring document layouts in formats like HTML or LaTeX. Additionally, --metadata KEY=VAL sets document variables, such as title or author, which can influence rendering, like pandoc --metadata title="My Document" input.md -o output.html. These options ensure flexible control over the conversion without altering the core process.[5]
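For scripted workflows, the options above can be assembled programmatically. The following Python sketch builds an argument list from the flags named in this section (--from, --to, --template, --metadata, -o); the function name build_pandoc_command and its parameters are hypothetical, and the sketch only constructs the command rather than running it.

```python
import shlex


def build_pandoc_command(inputs, output=None, from_format=None,
                         to_format=None, template=None, metadata=None):
    """Assemble a pandoc argument list; nothing is executed here."""
    cmd = ["pandoc"]
    if from_format:
        cmd += ["--from", from_format]
    if to_format:
        cmd += ["--to", to_format]
    if template:
        cmd += ["--template", template]
    # Each metadata entry becomes its own --metadata KEY=VAL pair.
    for key, value in (metadata or {}).items():
        cmd += ["--metadata", f"{key}={value}"]
    if output:
        cmd += ["-o", output]
    # Input files come last; with none given, pandoc reads from stdin.
    cmd += list(inputs)
    return cmd


if __name__ == "__main__":
    example = build_pandoc_command(
        ["input.md"],
        output="output.html",
        from_format="markdown",
        to_format="html",
        metadata={"title": "My Document"},
    )
    # Print a shell-quoted form of the command for inspection.
    print(shlex.join(example))
```

Passing the resulting list to subprocess.run(cmd, check=True) would execute the conversion on a system where pandoc is installed; using a list rather than a single string avoids shell-quoting problems with values such as titles that contain spaces.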
The underlying process flow involves three main stages: parsing the input into an abstract syntax tree (AST), optional application of filters to modify the AST, and generation of the output from the AST. During parsing, Pandoc's readers convert the source document into a structured AST representation, capturing semantic elements regardless of the input format. Filters, if specified via the --filter option, can then transform this AST—such as adding or removing elements—before writers render it into the target format, preserving structural integrity like headings, lists, and tables across conversions. This AST-based approach ensures that the document's logical structure is maintained, though stylistic details like fonts or margins may vary depending on the output format's capabilities.[5]
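The three stages can be illustrated with a deliberately tiny Python model. Everything in it is a stand-in: the reader understands only "# " headings and plain paragraphs, the AST is a simplified list of dictionaries rather than Pandoc's real types, and the names read_simple, shift_headers, write_html, write_plain, and convert are hypothetical. It is meant only to show how one intermediate representation lets filters and writers be combined freely.

```python
import html


def read_simple(text):
    """Toy reader: '# ' lines become headers, other non-empty lines paragraphs."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# "):
            blocks.append({"t": "Header", "level": 1, "text": line[2:]})
        else:
            blocks.append({"t": "Para", "text": line})
    return blocks


def shift_headers(blocks, by=1):
    """Toy filter: demote every header by `by` levels, leaving other blocks alone."""
    return [
        {**b, "level": b["level"] + by} if b["t"] == "Header" else b
        for b in blocks
    ]


def write_html(blocks):
    """Toy writer: render the block list as HTML."""
    out = []
    for b in blocks:
        text = html.escape(b["text"])
        if b["t"] == "Header":
            out.append(f"<h{b['level']}>{text}</h{b['level']}>")
        else:
            out.append(f"<p>{text}</p>")
    return "\n".join(out)


def write_plain(blocks):
    """Toy writer: render the same block list as plain text."""
    return "\n\n".join(
        b["text"].upper() if b["t"] == "Header" else b["text"] for b in blocks
    )


def convert(text, writer, filters=()):
    """Parse, apply filters in order, then hand the result to a writer."""
    ast = read_simple(text)
    for apply_filter in filters:
        ast = apply_filter(ast)
    return writer(ast)
```

Calling convert("# Title\n\nBody", write_html, [shift_headers]) and convert("# Title\n\nBody", write_plain) reuses the same reader for two different outputs, which is the property the AST-based design is intended to provide.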
Practical examples illustrate the process's versatility. To convert a Markdown file to PDF, which typically involves LaTeX as an intermediate step, the command pandoc input.md -o output.pdf --pdf-engine=pdflatex parses the Markdown into an AST, generates LaTeX code while preserving headings and tables, and compiles it to PDF using the specified engine. Similarly, converting HTML to DOCX with pandoc input.html -o output.docx maintains the document's lists and sections in the resulting Word file, demonstrating how Pandoc bridges web and office formats without loss of core structure. For citation handling during these conversions, users can enable bibliography processing via options like --citeproc, as detailed in the relevant section.[5]
Citation and Bibliography Handling
Pandoc incorporates CiteProc, a built-in processor for handling citations and bibliographies, which is based on the Citation Style Language (CSL).[5] This feature enables the automatic formatting of inline citations and the generation of reference lists in various scholarly styles during document conversion.[11] CiteProc processes bibliographic data provided in external files, supporting formats such as BibTeX, BibLaTeX, CSL JSON, CSL YAML, or RIS, specified via the --bibliography command-line option or the bibliography metadata field in the input document.[5]
To use CiteProc, authors insert inline citations using Pandoc's dedicated syntax, such as [@citekey] for a basic reference or [@citekey, 23-25] for a specific page range, which Pandoc replaces with formatted text according to the chosen style.[5] Enabling the processor requires the --citeproc flag during conversion; without it, citations remain as raw markup.[5] Output styles, including APA, Chicago, and MLA, are defined by CSL stylesheets, which can be supplied via the --csl option or the csl metadata field (defaulting to Chicago author-date if omitted).[5] For example, converting a Markdown document with --citeproc --csl=apa.csl will produce APA-formatted citations and a bibliography section at the document's end.[5]
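A common preparatory step is to list which citation keys a document actually uses before running the conversion. The Python sketch below does this with a regular expression that approximates, but does not reproduce, Pandoc's own citation parser; the helper names citation_keys and citeproc_args are hypothetical, while the flags are those described above.

```python
import re

# Approximation: '@' not preceded by a word character, followed by a key that
# starts with a letter, digit, or underscore and may contain some punctuation.
CITE_KEY = re.compile(r"(?<![\w@])@([A-Za-z0-9_][\w:.#$%&+?<>~/-]*)")


def citation_keys(markdown_text):
    """Return the distinct citation keys in order of first appearance."""
    keys = []
    for match in CITE_KEY.finditer(markdown_text):
        # Punctuation is only meaningful inside a key, so trailing marks
        # such as a sentence-ending period are stripped.
        key = match.group(1).rstrip(":.#$%&+?<>~/-")
        if key and key not in keys:
            keys.append(key)
    return keys


def citeproc_args(bibliography, csl=None):
    """Build the citation-related pandoc arguments for a conversion."""
    args = ["--citeproc", "--bibliography", bibliography]
    if csl:
        args += ["--csl", csl]
    return args
```

The extracted keys could then be compared against the entries of the bibliography file to catch typos before Pandoc reports them; that comparison is left out here because it depends on the bibliography format in use.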
CiteProc offers several features to enhance flexibility and internationalization. It supports locale-specific formatting for 52 languages through CSL locale files, allowing adjustments for date formats, sorting rules, and terminology in non-English contexts.[11][12] Automatic numbering of references is handled when required by the style, such as in numerical citation schemes.[5] Additionally, it integrates with external tools like Zotero for bibliography management and the CSL editor for custom style creation and modification.[13][14]
Despite these capabilities, CiteProc has notable limitations. It depends entirely on external bibliography files and does not include built-in database management or storage for references, requiring users to maintain separate files or integrate with reference managers.[5] This external reliance can complicate workflows in environments without dedicated bibliography tools.
Supported Formats
Input Formats
Pandoc supports a diverse array of input formats, allowing it to ingest documents from numerous markup languages, word processors, and structured data sources. These formats are parsed into an intermediate abstract syntax tree (AST) representation, facilitating conversion to other outputs. The tool recognizes over 40 input formats as of version 3.1 in 2023, with ongoing expansions in subsequent releases through 2025.[5]
Core input formats include CommonMark Markdown, which supports extensions for advanced features such as footnotes, definition lists, tables, and fenced code blocks; these can be enabled or disabled by appending modifiers like +footnotes or -definition_lists to the format name (e.g., pandoc -f commonmark+footnotes). Other foundational formats are reStructuredText (rst), which handles directives, roles, and structural elements like sections, lists, and tables; HTML, parsing tags and attributes while preserving inline elements; LaTeX, which interprets mathematical expressions, environments, and macros but may simplify complex layouts; and MediaWiki markup, supporting wiki-style links, templates, and tables.[5]
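The extension syntax can be generated mechanically when formats are chosen at run time. The short Python sketch below composes a format string from a base format and lists of extensions to enable or disable; the helper name format_spec is hypothetical, and the name check is a conservative assumption about what format and extension names look like, not a list of what Pandoc accepts.

```python
import re

# Assumed shape of format and extension names: lowercase, digits, underscores.
_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def format_spec(base, enable=(), disable=()):
    """Compose a string such as 'commonmark+footnotes' for --from or --to."""
    for name in (base, *enable, *disable):
        if not _NAME.match(name):
            raise ValueError(f"unexpected format or extension name: {name!r}")
    enabled = "".join(f"+{name}" for name in enable)
    disabled = "".join(f"-{name}" for name in disable)
    return base + enabled + disabled
```

The result is intended to be passed as the value of --from or --to; whether a given extension is valid for a given format is still decided by Pandoc itself.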
Additional supported formats extend Pandoc's versatility to include Org-mode (org) for outline-based documents with emphasis on tasks and agendas; Textile for lightweight markup with simple syntax for blocks and attributes; DocBook XML for technical documentation with hierarchical tagging; Jupyter notebooks (.ipynb) as JSON-structured files containing code cells, markdown, and outputs; EPUB for reflowable e-books, extracting content from its XHTML components; plain text (treated as a basic markup variant); AsciiDoc, added in version 2.17 for handling attributes, includes, and admonitions; and binary formats like DOCX (Microsoft Word) and ODT (OpenDocument Text), which are unzipped and parsed via XML intermediaries without requiring external libraries.[5][15]
Further formats encompass DokuWiki, FictionBook2 (fb2), GitHub Flavored Markdown (gfm), Haddock markup, JATS XML for journal articles, JSON and native Pandoc AST for programmatic use, Muse, OPML for outlines, T2T (txt2tags), TWiki, TikiWiki, and Vimwiki; EPUB input covers both version 2 and version 3 files. Parsing for these emphasizes structural fidelity, such as preserving headings, lists, images, and metadata across formats, though some low-level styling (e.g., precise spacing in LaTeX) may be abstracted. Binary inputs like DOCX and ODT undergo conversion to Pandoc's AST via built-in readers, supporting elements like styles, headers, and embedded media.[5]
Output Formats
Pandoc supports a wide array of output formats, enabling the conversion of documents into various markup languages, word processor files, and presentation formats. These outputs are generated through Pandoc's core conversion engine, which translates the abstract syntax tree (AST) derived from input documents into the target format's structure. Primary document and presentation outputs include PDF, produced via LaTeX engines like pdflatex or alternative HTML-based engines such as WeasyPrint; DOCX for Microsoft Word, which preserves custom styles when a reference document is provided using the --reference-doc option; ODT for OpenDocument Text, supporting similar style inheritance and options like --link-images for embedded media; HTML5 for web-ready documents, allowing CSS styling via --css and math rendering with MathJax; EPUB for e-books, featuring options like --epub-cover-image for covers and --epub-embed-font for font inclusion; RTF for Rich Text Format, generating standalone files with basic formatting; and Beamer for LaTeX-based slide presentations, configurable with themes (e.g., via the theme variable) and aspect ratios like 16:9.[5]
Markup-based outputs encompass LaTeX for typesetting, which includes variables for document classes, geometry, and font families to control layout; reStructuredText for documentation, supporting table-of-contents generation and reference links; Markdown variants such as commonmark, gfm (GitHub Flavored Markdown), and others, with extensions for fenced code blocks and wrapping options; Textile for lightweight web markup; and DocBook for XML-based publishing, enabling section numbering and MathML for equations. These formats prioritize semantic structure over visual styling, though LaTeX and HTML outputs allow extensive customization through templates and metadata.[5]
Rendering enhancements across outputs include support for themes in Beamer and HTML via template variables, CSS integration for HTML and EPUB to apply visual styles, and preservation of document styles in binary formats like DOCX and ODT by referencing existing templates. Bibliographies are handled via the built-in citeproc processor, which integrates with the --citeproc option and requires a bibliography file (e.g., in BibTeX or CSL JSON) along with a CSL stylesheet for consistent citation rendering; this works in all output formats, appending a formatted bibliography section while replacing inline citations according to the chosen style, such as Chicago author-date.[5]
As of version 3.8.2.1, released in October 2025, Pandoc includes a new xml input and output format for representing the Pandoc AST in a human-readable XML structure, enhancing programmatic access and interoperability. These updates build on prior capabilities, focusing on precision in scholarly and publishing contexts without altering core rendering mechanisms.[15][16]
Extensions and Customization
Lua Filters and Plugins
Lua filters enable users to customize Pandoc's document conversion by applying Lua scripts that transform the abstract syntax tree (AST) between parsing and writing phases. Introduced in Pandoc version 2.0 on October 30, 2017, they run on a Lua interpreter (version 5.4 in current releases) embedded directly in the Pandoc executable, avoiding external dependencies and the serialization overhead of JSON filters.[17][18] Lua's selection stems from its lightweight design and straightforward embeddability in Haskell, Pandoc's core language, facilitating efficient script execution without additional installations.[19][18]
To use a Lua filter, specify it via the --lua-filter=script.lua option, where the script defines a table of functions keyed to AST element types, such as Header, Para, or Table. Each function receives an element and returns nil (or nothing) to leave it unchanged, a new element of the same type to replace it, or a list of elements to splice in its place; returning an empty list removes the element. Filters traverse the AST type-wise by default, or top-down on request, with global variables like FORMAT providing output-specific context for conditional logic.[18]
Practical examples include converting strongly emphasized text to small caps, as in a filter that replaces pandoc.Strong elements with pandoc.SmallCaps containing the original content, or centering images by wrapping them in a Div with CSS alignment attributes. For table reformatting, a filter can iterate over Table rows to adjust alignments or add borders via attributes, enhancing output for formats like HTML or LaTeX. More specialized applications appear in tools like pandoc-scholar, a Lua filter suite that augments academic documents with features such as automated bibliography sections and figure caption styling.[18][20]
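The small-caps transformation can also be expressed outside Lua. The sketch below is not a Lua filter: it swaps in the JSON filter mechanism (the --filter option), in which Pandoc passes the AST as JSON on standard input and reads the modified AST back from standard output. It is written in Python with only the standard library, and the names walk, strong_to_smallcaps, and run_filter are hypothetical. The convention that returning None leaves an element unchanged is chosen here to mirror the Lua behaviour and is an assumption of this sketch.

```python
import json
import sys


def walk(node, action):
    """Return a copy of `node` with `action` applied to every AST element, bottom-up.

    AST elements are dictionaries with a "t" (type) key. If `action` returns
    None the element is kept as is; if it returns a dictionary, that
    dictionary replaces the element.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            element = dict(node)
            if "c" in element:
                element["c"] = walk(element["c"], action)
            replacement = action(element)
            return element if replacement is None else replacement
        # Plain dictionaries (the document root, metadata maps) are containers.
        return {key: walk(value, action) for key, value in node.items()}
    return node


def strong_to_smallcaps(element):
    """Replace Strong with SmallCaps, keeping the wrapped inlines."""
    if element["t"] == "Strong":
        return {"t": "SmallCaps", "c": element["c"]}
    return None


def run_filter(stream_in, stream_out):
    """Read a JSON AST, transform it, and write the result."""
    document = json.load(stream_in)
    json.dump(walk(document, strong_to_smallcaps), stream_out)

# As a standalone filter script, the last step would be:
#     run_filter(sys.stdin, sys.stdout)
```

Saved as an executable script that ends with the run_filter call, this could be passed to Pandoc with --filter; an equivalent Lua filter avoids the JSON round trip, which is the overhead the Lua mechanism was designed to remove.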
Lua filters also support custom integrations, such as modifying elements for compatibility with pandoc-crossref to enable automatic numbering of figures and tables during conversion. Performance benefits are notable; for instance, a simple Lua filter processes the Pandoc manual in 1.03 seconds, outperforming equivalent JSON filters written in Haskell (1.36 seconds) and Python (1.40 seconds) on the same hardware.[21][18] However, filters are executed afresh in each Pandoc run, so any state a filter accumulates lasts only for that run and nothing persists across invocations. Common pitfalls include forgetting to return a modified element, in which case the intended change may not take effect, and relying on locale-sensitive patterns that may vary by system.[18] These scripts manipulate the AST, allowing targeted extensions without altering Pandoc's core behavior.[18]