
Pandoc

Pandoc is a free and open-source universal document converter that transforms files between a wide array of markup formats, including lightweight formats like Markdown and reStructuredText, more structured ones such as HTML, LaTeX, Microsoft Word (docx), and EPUB, and over 50 others for both input and output. First released in 2006, Pandoc was developed by John MacFarlane as a library and command-line tool. It supports advanced features such as embedded mathematics (rendered via MathJax or other engines), automatic citations and bibliographies using the Citation Style Language (CSL), footnotes, tables, and metadata handling. Its modular architecture allows customization through filters, templates, and extensions, making it a versatile foundation for publishing workflows, academic writing, and automated document processing. Pandoc is licensed under the GNU General Public License (GPL) and emphasizes extensibility and cross-platform compatibility, with ongoing development tracking evolving standards in document markup and rendering.

History and Development

Origins and Initial Release

Pandoc was developed by John MacFarlane, a professor of philosophy at the University of California, Berkeley, as a Haskell library and command-line tool designed to convert between various markup formats. The project originated around 2004 as a personal experiment while MacFarlane was learning Haskell, initially motivated by the need for a reliable Markdown parser to manage his lecture notes and academic documents. He sought to address the shortcomings of existing tools, which lacked the flexibility scholars need when moving between plain text, web-ready outputs, and typesetting systems without compromising document structure.

The initial public release occurred in 2006, distributed as a tarball on MacFarlane's website. This early version emphasized basic conversions among a small set of markup formats, parsing each document into an abstract syntax tree (AST) for format-agnostic processing. From its inception, Pandoc has been licensed under the GNU General Public License, version 2 or later, ensuring open-source accessibility and encouraging community contributions for academic and publishing workflows.

Evolution and Major Milestones

Pandoc's development has progressed steadily since its version 1.0 release in September 2008, which offered extended Markdown support as a core feature for versatile document conversion. Over the subsequent years, the project evolved through numerous updates, culminating in the stable release of version 3.8.2.1 on October 20, 2025, which included fixes for citation processing regressions and improved format compatibility, such as better handling of dynamic blocks and nested comments in Docx. This progression reflects ongoing refinements to the core parsing, writing, and filtering mechanisms, driven by user feedback and changes in markup standards.

Key milestones mark significant expansions in functionality. Version 2.0, released in October 2017, introduced built-in Lua filter support, enabling users to extend Pandoc's AST manipulation without external dependencies, a feature that greatly enhanced customization for complex document transformations. In October 2020, version 2.11 integrated native CiteProc for citation processing, allowing bibliographies and references to be handled directly within Pandoc and reducing reliance on separate tools. The 3.x release series, which began in 2023, further expanded format support, adding readers and writers for XML-based formats and initial Typst output, alongside a rewritten CommonMark parser for stricter compliance.

The project's growth has been fueled by community contributions through its GitHub repository at jgm/pandoc, where volunteers submit bug fixes, new extensions, and feature proposals, with thousands of issues addressed since inception. Primary maintenance is led by John MacFarlane, supported by a network of contributors who ensure cross-platform availability on Linux, macOS, and Windows via precompiled binaries.

Technical Foundation

Implementation in Haskell

Pandoc is implemented primarily in the Haskell programming language, whose strong static type system catches many errors at compile time. This helps keep parsing and conversion robust and rules out many of the runtime errors common in dynamically typed languages. Type safety has been instrumental in maintaining the tool's reliability over its development, as the compiler enforces consistency and prevents common mistakes during code changes.

The core of Pandoc is a Haskell library that handles the conversion logic, allowing it to be embedded in other applications for custom document processing, while the command-line executable is built directly from this library for standalone use. This dual structure enables developers to use Pandoc's functionality programmatically, for example in web services or scripts that require markup transformations.

Users can compile and install Pandoc from source using Haskell's package managers, such as Cabal or Stack; for instance, with Stack, one runs stack install pandoc-cli after setting up the toolchain, which handles dependencies and builds the executable. Alternatively, pre-built binaries are available for major platforms including Windows, macOS (via a package installer), and Linux (via tarball or DEB packages), providing a straightforward installation without requiring a Haskell toolchain.

Haskell's functional programming paradigm further shapes Pandoc's design by supporting pure functions for document transformations, which promotes modularity and predictability in handling complex markup structures and reduces the risk of side effects in conversions. This approach aligns with the tool's goal of reliable, composable operations across diverse formats.

Architecture and Core Components

Pandoc employs a modular architecture that converts documents between diverse formats through a pipeline of parsing, intermediate representation, and rendering stages. Input documents are processed by reader modules, which parse the source text into a universal abstract syntax tree (AST), the central intermediate representation that captures the semantic structure of the content rather than its visual styling. Writer modules then render this AST in the target format. Because every reader and writer targets the same AST, any of M input formats can be converted to any of N output formats using only M readers and N writers, rather than M × N dedicated converters.

The AST, defined in the pandoc-types package, structures a document as a Pandoc value comprising metadata (a Meta map for elements like title and authors) and a list of block-level elements, with inline elements nested within blocks to represent textual content semantically. Blocks encompass structural components such as paragraphs (Para), headers (Header), lists (OrderedList or BulletList), code blocks (CodeBlock), and tables (Table), while inlines include text strings (Str), emphasis (Emph), links (Link), and citations. This design prioritizes semantic preservation, allowing Pandoc to maintain document intent across formats, though it may discard format-specific styling details like precise font metrics. Users can inspect the AST directly using the pandoc -t native command, which outputs the parsed structure in a human-readable form.

Key components include the reader modules, which parse input formats by converting raw text into the AST while interpreting semantic cues like markup syntax; the writer modules, which traverse the AST to produce output tailored to the destination format; and the template system, which applies customizable layouts to standalone documents via variables for elements like headers, footers, and tables of contents. Templates are managed through the Text.Pandoc.Templates module and can be overridden with user-provided files, ensuring flexibility in output presentation without altering core conversion logic.

Error handling in Pandoc emphasizes graceful degradation: unsupported features in input or output formats are skipped or approximated, accompanied by warnings that alert users to potential information loss, such as unrenderable complex tables or embedded media. These warnings can be suppressed with flags like --quiet or treated as failures via --fail-if-warnings, and errors are propagated through a PandocError type for programmatic handling in library usage.

The system's modularity supports extensibility, permitting the development of custom readers and writers directly in Haskell via the Pandoc API, or in Lua for lightweight modifications such as filters that traverse and alter the AST between the parsing and rendering phases. The Lua route allows new formats or behaviors to be integrated without recompiling the core tool, while the Haskell route benefits from the language's type system for robust implementations.
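The reader, AST, and writer pipeline described above can be illustrated with a deliberately tiny model. The following Python sketch is not Pandoc's implementation (which is written in Haskell): the node classes only borrow names from the AST, and the "toy" input format, `read_toy`, `write_html`, `write_plain`, and `convert` are all hypothetical names invented for this example. It is meant only to show why a shared intermediate representation lets readers and writers be added independently.

```python
from dataclasses import dataclass, field
from html import escape


# Toy stand-ins for a few AST node types. The real definitions live in the
# Haskell pandoc-types package and are much richer than this.
@dataclass
class Str:
    text: str


@dataclass
class Emph:
    inlines: list


@dataclass
class Para:
    inlines: list


@dataclass
class Header:
    level: int
    inlines: list


@dataclass
class Doc:
    meta: dict = field(default_factory=dict)
    blocks: list = field(default_factory=list)


def read_toy(source):
    """Hypothetical reader: '# ' starts a level-1 header, other lines are paragraphs."""
    blocks = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# "):
            blocks.append(Header(1, [Str(line[2:])]))
        else:
            blocks.append(Para([Str(line)]))
    return Doc(blocks=blocks)


def inlines_to_html(inlines):
    """Render inline nodes as HTML, escaping literal text."""
    parts = []
    for node in inlines:
        if isinstance(node, Emph):
            parts.append("<em>" + inlines_to_html(node.inlines) + "</em>")
        else:
            parts.append(escape(node.text))
    return "".join(parts)


def inlines_to_plain(inlines):
    """Flatten inline nodes to plain text, dropping emphasis."""
    return "".join(
        inlines_to_plain(n.inlines) if isinstance(n, Emph) else n.text
        for n in inlines
    )


def write_html(doc):
    out = []
    for block in doc.blocks:
        body = inlines_to_html(block.inlines)
        tag = f"h{block.level}" if isinstance(block, Header) else "p"
        out.append(f"<{tag}>{body}</{tag}>")
    return "\n".join(out)


def write_plain(doc):
    out = []
    for block in doc.blocks:
        text = inlines_to_plain(block.inlines)
        out.append(text.upper() if isinstance(block, Header) else text)
    return "\n\n".join(out)


# M readers plus N writers cover M x N conversions through the shared AST.
READERS = {"toy": read_toy}
WRITERS = {"html": write_html, "plain": write_plain}


def convert(source, from_format, to_format):
    return WRITERS[to_format](READERS[from_format](source))
```

Adding a third writer to `WRITERS` in this sketch would make it available to every reader at once, which is the property the M × N argument relies on.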

Core Functionality

Document Conversion Process

Pandoc's document conversion process is initiated through its command-line interface, where users specify input files, output formats, and various options to transform documents between supported formats. The basic syntax follows the structure pandoc [options] [input-file]..., allowing multiple input files to be concatenated and processed together. If no input files are provided, Pandoc reads from standard input (stdin), and by default it writes to standard output (stdout); the -o option specifies an output file, as in pandoc -o output.html input.md. This design supports piping in Unix-like environments, enabling integration into scripts or workflows, for example cat input.md | pandoc -o output.pdf.

Several key options allow users to customize the conversion. The --from FORMAT and --to FORMAT flags explicitly specify the input and output formats, overriding Pandoc's automatic detection based on file extensions; for instance, --from markdown parses the input as Markdown, while --to html generates HTML output. The --template FILE option applies a custom template to the output, useful for tailoring document layouts in formats like HTML or LaTeX. Additionally, --metadata KEY=VAL sets metadata variables, such as title or author, which can influence rendering, as in pandoc --metadata title="My Document" input.md -o output.html. These options provide flexible control over the conversion without altering the core process.

The underlying process flow involves three main stages: parsing the input into an abstract syntax tree (AST), optionally applying filters to modify the AST, and generating the output from the AST. During parsing, Pandoc's readers convert the source document into a structured AST representation, capturing semantic elements regardless of the input format. Filters, if specified via the --filter option, can then transform this AST (for example by adding or removing elements) before writers render it into the target format, preserving structural elements like headings, lists, and tables across conversions. This AST-based approach ensures that the document's logical structure is maintained, though stylistic details like fonts or margins may vary depending on the output format's capabilities.

Practical examples illustrate the process's versatility. To convert a Markdown file to PDF, which typically involves LaTeX as an intermediate step, the command pandoc input.md -o output.pdf --pdf-engine=pdflatex parses the Markdown into an AST, generates LaTeX code while preserving headings and tables, and compiles it to PDF using the specified engine. Similarly, converting HTML to DOCX with pandoc input.html -o output.docx maintains the document's lists and sections in the resulting Word file, demonstrating how Pandoc bridges web and office formats without loss of core structure. For citation handling during these conversions, users can enable bibliography processing via options like --citeproc, as detailed in the relevant section.
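For scripted use, the options discussed above can be assembled programmatically. The following Python sketch builds an argument list from the flags named in this section (--from, --to, --metadata, --filter, --pdf-engine, and -o). The helper name `build_pandoc_command` and its parameters are assumptions made for this illustration and are not part of Pandoc; the function only constructs the command and does not run it.

```python
import shlex


def build_pandoc_command(inputs, output=None, from_format=None, to_format=None,
                         metadata=None, filters=(), pdf_engine=None):
    """Assemble an argv list for pandoc (hypothetical helper, not a Pandoc API)."""
    argv = ["pandoc"]
    if from_format:
        argv += ["--from", from_format]
    if to_format:
        argv += ["--to", to_format]
    # Each metadata entry becomes one --metadata KEY=VAL pair.
    for key, value in (metadata or {}).items():
        argv += ["--metadata", f"{key}={value}"]
    for script in filters:
        argv += ["--filter", script]
    if pdf_engine:
        argv.append(f"--pdf-engine={pdf_engine}")
    if output:
        argv += ["-o", output]
    # Input files come last; with none given, pandoc reads from stdin.
    argv += list(inputs)
    return argv


cmd = build_pandoc_command(
    ["input.md"],
    output="output.html",
    from_format="markdown",
    to_format="html",
    metadata={"title": "My Document"},
)
# shlex.join quotes arguments containing spaces for safe display in a shell.
print(shlex.join(cmd))
```

On a machine where Pandoc is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell-quoting problems with values such as titles that contain spaces.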

Citation and Bibliography Handling

Pandoc incorporates CiteProc, a built-in processor for handling citations and bibliographies that is based on the Citation Style Language (CSL). This feature enables the automatic formatting of inline citations and the generation of reference lists in various scholarly styles during document conversion. CiteProc processes bibliographic data provided in external files, supporting formats such as BibTeX, BibLaTeX, CSL JSON, CSL YAML, or RIS, specified via the --bibliography command-line option or the bibliography metadata field in the input document.

To use CiteProc, authors insert inline citations using Pandoc's dedicated syntax, such as [@citekey] for a basic reference or [@citekey, 23-25] for a specific page range, which Pandoc replaces with formatted text according to the chosen style. Enabling the processor requires the --citeproc flag during conversion; without it, citations remain as raw markup. Output styles, including APA, Chicago, and MLA, are defined by CSL stylesheets, which can be supplied via the --csl option or the csl metadata field (defaulting to Chicago author-date if omitted). For example, converting a Markdown document with --citeproc --csl=apa.csl will produce APA-formatted citations and a bibliography section at the document's end.

CiteProc offers several features that enhance flexibility. It supports locale-specific formatting for 52 languages through CSL locale files, allowing adjustments to date formats and other conventions in non-English contexts. Automatic numbering of references is handled when required by the style, as in numerical citation schemes. Additionally, it works alongside external reference managers for bibliography management and the CSL editor for custom style creation and modification.

Despite these capabilities, CiteProc has notable limitations. It depends entirely on external bibliography files and does not include built-in database management or storage for references, requiring users to maintain separate files or integrate with reference managers. This external reliance can complicate workflows in environments without dedicated bibliography tools.
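The bracketed citation syntax can be explored with a small script, for instance to list which keys a draft cites before running a conversion. The Python sketch below is a simplified approximation and not Pandoc's actual parser: the function name `find_citations` and both regular expressions are assumptions for illustration, and they ignore parts of the real syntax such as prefixes, suppressed authors, and citations outside brackets.

```python
import re

# A bracketed group that contains at least one "@", e.g. [@citekey, 23-25].
BRACKETED = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
# One citation item: "@key" optionally followed by ", locator".
ITEM = re.compile(r"@(?P<key>\w[\w:.-]*)(?:,\s*(?P<locator>[^;]+))?")


def find_citations(text):
    """Return (key, locator) pairs for bracketed citations found in text."""
    found = []
    for group in BRACKETED.finditer(text):
        # Several items in one bracket are separated by semicolons.
        for part in group.group(1).split(";"):
            match = ITEM.search(part)
            if match:
                locator = match.group("locator")
                found.append(
                    (match.group("key"), locator.strip() if locator else None)
                )
    return found
```

Because this is only a pattern match, it can report false positives (for example an e-mail address inside square brackets); the authoritative interpretation is whatever Pandoc itself produces when run with --citeproc.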

Supported Formats

Input Formats

Pandoc supports a diverse array of input formats, allowing it to ingest documents from numerous markup languages, word processors, and structured data sources. These formats are parsed into an intermediate abstract syntax tree (AST) representation, facilitating conversion to other outputs. The tool recognized over 40 input formats as of version 3.1 in 2023, with further additions in subsequent releases through 2025.

Core input formats include CommonMark Markdown, which supports extensions for advanced features such as footnotes, definition lists, tables, and fenced code blocks; these can be enabled or disabled by appending modifiers like +footnotes or -definition_lists to the format name (e.g., pandoc -f commonmark+footnotes). Other foundational formats are reStructuredText (rst), which handles directives, roles, and structural elements like sections, lists, and tables; HTML, parsed by tags and attributes while preserving inline elements; LaTeX, for which Pandoc interprets mathematical expressions, environments, and macros but may simplify complex layouts; and MediaWiki markup, supporting wiki-style links, templates, and tables.

Additional supported formats extend Pandoc's versatility to include Org-mode (org) for outline-based documents with emphasis on tasks and agendas; DocBook XML for technical documentation with hierarchical tagging; Jupyter notebooks (.ipynb) as JSON-structured files containing code cells, Markdown, and outputs; EPUB for reflowable e-books, with content extracted from its XHTML components; plain text (treated as a basic markup variant); and binary formats like DOCX (Microsoft Word) and ODT (OpenDocument Text), which are unzipped and parsed via XML intermediaries without requiring external libraries. Further formats include FictionBook2 (fb2), GitHub Flavored Markdown (gfm), JATS for journal articles, the native Pandoc AST for programmatic use, OPML for outlines, T2T (txt2tags), TikiWiki, and Vimwiki, among others.

Parsing for these formats emphasizes structural fidelity, such as preserving headings, lists, images, and metadata, though some low-level styling (e.g., precise spacing) may be abstracted away. Binary inputs like DOCX and ODT are converted to Pandoc's AST by built-in readers, which support elements like styles, headers, and embedded media.
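The format-plus-extensions notation used with -f and -t (a base format name followed by +name or -name modifiers) is simple enough to take apart in a few lines. The following Python sketch shows one way to split such a string; `parse_format_spec` is a hypothetical helper written for this example, and it does not check whether a given extension is actually valid for the named format, which only Pandoc itself can decide.

```python
import re


def parse_format_spec(spec):
    """Split e.g. 'commonmark+footnotes-definition_lists' into its parts.

    Returns (base_format, enabled_extensions, disabled_extensions).
    """
    match = re.fullmatch(r"([A-Za-z0-9_]+)((?:[+-][A-Za-z0-9_]+)*)", spec)
    if match is None:
        raise ValueError(f"not a valid format spec: {spec!r}")
    base, modifiers = match.groups()
    enabled, disabled = [], []
    # Each modifier is a sign followed by an extension name.
    for sign, name in re.findall(r"([+-])([A-Za-z0-9_]+)", modifiers):
        (enabled if sign == "+" else disabled).append(name)
    return base, enabled, disabled
```

A wrapper script might use a function like this to log which extensions a build enables, or to reject malformed strings before invoking Pandoc.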

Output Formats

Pandoc supports a wide array of output formats, enabling the conversion of documents into various markup languages and binary file formats. These outputs are generated through Pandoc's conversion pipeline, which translates the abstract syntax tree (AST) derived from input documents into the target format's structure.

Primary outputs include PDF, produced via engines like pdflatex or alternative HTML-based engines such as WeasyPrint; DOCX for Microsoft Word, which preserves custom styles when a reference document is provided using the --reference-doc option; ODT for OpenDocument Text, supporting similar style inheritance and options like --link-images to link to images rather than embed them; HTML5 for web-ready documents, allowing CSS styling via --css and math rendering with MathJax; EPUB for e-books, featuring options like --epub-cover-image for covers and --epub-embed-font for font inclusion; RTF, generated as standalone files with basic formatting; and Beamer for LaTeX-based slide presentations, configurable with themes (e.g., via the theme variable) and aspect ratios like 16:9.

Markup-based outputs encompass LaTeX for typesetting, which includes variables for document classes, geometry, and font families to control layout; reStructuredText for documentation, supporting table-of-contents generation and reference links; Markdown variants such as commonmark and gfm (GitHub Flavored Markdown), with extensions for fenced code blocks and wrapping options; and DocBook for XML-based publishing, with support for section numbering. These formats prioritize semantic structure over visual styling, though HTML and LaTeX outputs allow extensive customization through templates and metadata.

Rendering enhancements across outputs include support for themes in Beamer via template variables, CSS integration for HTML and EPUB to apply visual styles, and preservation of document styles in binary formats like DOCX and ODT by referencing existing templates. Bibliographies are handled by the built-in citeproc processor, which is enabled with the --citeproc option and requires a bibliography file (e.g., in BibTeX or CSL JSON) along with a CSL stylesheet for consistent rendering; this works in all output formats, appending a formatted references section while replacing inline citations according to the chosen style, such as author-date.

As of version 3.8.2.1, released in October 2025, Pandoc includes a new xml input and output format that represents the Pandoc AST in a human-readable XML structure, improving programmatic access. These updates build on prior capabilities without altering existing rendering mechanisms.

Extensions and Customization

Lua Filters and Plugins

Lua filters enable users to customize Pandoc's document conversion by applying Lua scripts that transform the abstract syntax tree (AST) between the parsing and writing phases. Introduced in Pandoc 2.0 on October 30, 2017, they run on a built-in Lua 5.4 interpreter embedded directly in the Pandoc executable, avoiding external dependencies and the serialization overhead of JSON filters. Lua was chosen for its lightweight design and straightforward embeddability in Haskell, Pandoc's implementation language, which allows efficient script execution without additional installations.

To use a Lua filter, specify it via the --lua-filter=script.lua option, where the script defines a table of functions keyed to AST element types, such as Header, Para, or Table. Each function receives an element and returns nil to leave it unchanged, a new Pandoc object to replace it, or a list of objects to splice in its place; returning an empty list removes the element. Filters traverse the AST in a type-wise or top-down manner, with global variables like FORMAT providing output-specific context for conditional logic.

Practical examples include converting strongly emphasized text to small caps, as in a filter that replaces pandoc.Strong elements with pandoc.SmallCaps containing the original content, or centering images by wrapping them in a Div with CSS alignment attributes. For table reformatting, a filter can iterate over Table rows to adjust alignments or add borders via attributes, enhancing output for formats like HTML or LaTeX. More specialized applications appear in tools like pandoc-scholar, a Lua filter suite that augments academic documents with features such as automated bibliography sections and figure caption styling. Lua filters also support custom integrations, such as modifying elements for compatibility with pandoc-crossref to enable automatic numbering of figures and tables during conversion.

Performance benefits are notable; for instance, a simple Lua filter processes the Pandoc manual in 1.03 seconds, outperforming equivalent Haskell (1.36 seconds) and Python (1.40 seconds) filter implementations on the same hardware. However, filters execute once per document in each Pandoc run and keep no persistent state across invocations, which restricts them to transformations that do not depend on earlier runs. Common pitfalls include forgetting to return a modified element (a nil return leaves the original in place, so in-place changes may be silently dropped) and relying on locale-sensitive patterns that may vary by system. Because these scripts operate only on the AST, they allow targeted extensions without altering Pandoc's core behavior.
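The return-value conventions described above (nothing to keep an element, an object to replace it, a list to splice or delete) can be mimicked on the JSON serialization of the AST, which is what external JSON filters such as the Python ones mentioned in the timing comparison operate on. The Python sketch below is an analogue for illustration, not a Lua filter and not Pandoc's own traversal code: `walk` and `strong_to_smallcaps` are hypothetical names, and the sample document, including its pandoc-api-version value, is an assumed hand-written input.

```python
def walk(node, action):
    """Rebuild a JSON AST fragment, applying `action` to each element bottom-up.

    `action` receives a {"t": ..., "c": ...} dict and returns None to keep it,
    a dict to replace it, or a list to splice in (an empty list deletes it).
    """
    if isinstance(node, list):
        result = []
        for item in node:
            replaced = walk(item, action)
            is_element = isinstance(item, dict) and "t" in item
            if is_element and isinstance(replaced, list):
                result.extend(replaced)  # splice the returned elements in place
            else:
                result.append(replaced)
        return result
    if isinstance(node, dict):
        if "t" in node:
            new = dict(node)  # shallow copy so the input is not mutated
            if "c" in node:
                new["c"] = walk(node["c"], action)
            outcome = action(new)
            return new if outcome is None else outcome
        return {key: walk(value, action) for key, value in node.items()}
    return node  # strings, numbers, and other leaves pass through


def strong_to_smallcaps(elem):
    """Replace Strong with SmallCaps, keeping the inline content."""
    if elem["t"] == "Strong":
        return {"t": "SmallCaps", "c": elem["c"]}
    return None  # leave every other element unchanged


# Assumed sample input: "Hello **world**" as a hand-written JSON AST.
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Strong", "c": [{"t": "Str", "c": "world"}]},
        ]}
    ],
}

filtered = dict(doc, blocks=walk(doc["blocks"], strong_to_smallcaps))
```

A real JSON filter would read the document from standard input with `json.load`, apply the transformation, and write the result to standard output for Pandoc to pick up via --filter; a Lua filter expresses the same idea without the serialization round trip.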

Integration with Other Tools

Pandoc integrates with text editors such as Emacs and Vim through dedicated plugins that enable previewing of document conversions while editing. For Emacs, the pandoc-mode package provides an interface to Pandoc's options, allowing users to export and preview documents directly within the editor. Similarly, the vim-pandoc plugin supports asynchronous Pandoc execution for generating previews in PDF or HTML while editing, enhancing productivity in Vim workflows.

In dynamic document authoring, Pandoc underpins tools like R Markdown and Quarto, enabling the creation of reproducible reports and publications. R Markdown uses Pandoc to convert R-flavored Markdown to various outputs, integrating code execution with formatted text for statistical analysis and visualization. Quarto, a multi-language publishing system built on Pandoc, extends this capability to Python, R, Julia, and Observable, allowing users to generate interactive documents, websites, and books from a single source.

For PDF generation, Pandoc relies on LaTeX as the primary engine, invoking tools like pdflatex, xelatex, or lualatex via the --pdf-engine option to produce high-quality typeset documents. This requires a LaTeX installation that includes packages such as amsmath and graphicx.

Pandoc enhances publishing pipelines when combined with Git for version control, where Markdown files serve as lightweight, diff-friendly sources that can be automatically converted to final formats during builds. For instance, GitHub Actions workflows can invoke Pandoc to generate multi-format outputs like HTML and PDF from repository commits, streamlining collaborative documentation processes. In Jupyter environments, Pandoc supports bidirectional conversion of .ipynb notebooks, exporting them to other formats while preserving code cells and outputs.

As a library, Pandoc can be embedded in custom applications via its Haskell API, which exposes modules for parsing inputs into an abstract syntax tree (AST) and rendering outputs, facilitating programmatic document manipulation in web services or scripts. In the 2025 ecosystem, third-party projects like the mcp-pandoc server expose Pandoc's functionality through the Model Context Protocol (MCP), allowing AI assistants and large language models to invoke format conversions in agentic workflows.
