Pandoc is an AST boundary, not a magic file converter

Cover image: a UC Berkeley profile photograph of John MacFarlane, Pandoc's original author. It fits because the project is best understood as a working writer-programmer's tool: a way to make scholarly writing, markup, citations, and publishing targets pass through one explicit document model.[9]

Pandoc is usually introduced as a universal document converter. That is accurate, but it encourages the wrong expectation. The project is not magic glue between every possible pair of file formats. Its durable architecture is more disciplined: a reader parses an input document into Pandoc's native abstract syntax tree, optional filters modify that tree, and a writer emits the target format.[1][2]

That pipeline is the reason Pandoc has lasted. It turns document conversion from a combinatorial set of pairwise translators into a shared intermediate contract. A Markdown reader, a DOCX writer, a LaTeX writer, a reveal.js slide writer, and a Lua filter do not each need to know the whole world. They need to understand where they touch the document model.[1][2][3]
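The arithmetic behind that claim can be sketched in a few lines. This is only an illustration: the format count below is an arbitrary assumption, not Pandoc's actual number of readers or writers, and the function names are made up for this example.

```python
def pairwise_converters(n_formats: int) -> int:
    """Directed translators needed if every ordered pair of formats gets its own."""
    return n_formats * (n_formats - 1)


def shared_model_components(n_readers: int, n_writers: int) -> int:
    """Components needed when every path goes through one intermediate model."""
    return n_readers + n_writers


# Illustrative count only (an assumption, not Pandoc's real format list).
N = 20
print(pairwise_converters(N))         # 380 pairwise translators
print(shared_model_components(N, N))  # 40 readers plus writers
```

Adding one more format to the pairwise design means writing a translator to and from every existing format. In the shared-model design it means writing at most one reader and one writer.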

As of 2026-07-05T01:35:33Z, the public GitHub API reported 45,219 stars, 3,913 forks, 1,062 open issues, main as the default branch, GPL-2.0 licensing, and a latest push timestamp of 2026-07-04T16:52:14Z for jgm/pandoc.[7] The release feed showed Pandoc 3.10 published on 2026-06-04, following 3.9.0.2 and 3.9.0.1 in March.[8] Those numbers matter only as maintenance context. The real adoption question is whether a team can live inside Pandoc's document boundary instead of asking it to preserve everything a proprietary desktop file ever implied.

The cover image is a UC Berkeley profile photograph of John MacFarlane, Pandoc's original author.[9] The choice is not decorative biography. MacFarlane's own tools page describes Pandoc as a general markup converter that he uses for lecture notes, letters, slides, and websites, alongside related projects such as CommonMark, djot, citeproc, texmath, and syntax-highlighting tools.[4] Pandoc's architecture makes more sense when read as a working author's infrastructure, not as a one-off command-line trick.

Readers and writers keep the format problem bounded

The cleanest Pandoc mental model is not "input file becomes output file." It is input -> reader -> AST -> writer -> output. The user guide describes this modular design directly: readers parse source formats into a native representation, and writers convert that representation into target formats.[1] That is why the supported-format list can be broad without requiring a separate converter for every pair.
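The reader, model, writer shape can be sketched with a deliberately tiny stand-in. Everything here is hypothetical: the two-element document model, the "markdownish" reader, and the function names are inventions for illustration and are far simpler than Pandoc's real AST or readers.

```python
import html
from typing import Callable, Dict, List, Tuple

# Toy document model: a list of (kind, text) blocks. NOT Pandoc's AST.
Block = Tuple[str, str]
Doc = List[Block]


def read_markdownish(source: str) -> Doc:
    """Toy reader: '# ' lines become headings, other non-blank lines paragraphs."""
    doc: Doc = []
    for line in source.splitlines():
        text = line.strip()
        if not text:
            continue  # blank lines are not represented in the model
        if text.startswith("# "):
            doc.append(("heading", text[2:].strip()))
        else:
            doc.append(("para", text))
    return doc


def write_html(doc: Doc) -> str:
    """Toy writer: emit one HTML element per block."""
    tags = {"heading": "h1", "para": "p"}
    return "\n".join(
        f"<{tags[kind]}>{html.escape(text)}</{tags[kind]}>" for kind, text in doc
    )


def write_plain(doc: Doc) -> str:
    """Toy writer: upper-case headings, blank line between blocks."""
    return "\n\n".join(text.upper() if kind == "heading" else text for kind, text in doc)


READERS: Dict[str, Callable[[str], Doc]] = {"markdownish": read_markdownish}
WRITERS: Dict[str, Callable[[Doc], str]] = {"html": write_html, "plain": write_plain}


def convert(source: str, source_format: str, target_format: str) -> str:
    """input -> reader -> model -> writer -> output."""
    return WRITERS[target_format](READERS[source_format](source))


print(convert("# Title\n\n\n   Body text", "markdownish", "html"))
```

Even in the toy, the model decides what survives. The extra blank lines and the leading indentation never reach the document model, so no writer can reproduce them, whichever target is chosen.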

For engineering teams, this distinction changes how adoption should be tested. A Markdown-to-DOCX workflow, a DOCX-to-Markdown cleanup pass, a LaTeX-to-HTML export, and an EPUB build are not equivalent just because the command is always pandoc. Each path asks a different reader and writer to agree through the intermediate representation. If the source depends on page geometry, hidden Word styles, layout-specific table behavior, or output-only features, the boundary may be visible immediately.[1]

Pandoc is candid about that limitation. Its intermediate representation is less expressive than many formats it converts between, so users should not expect perfect conversion between every source and every target. It aims to preserve structural document elements, not every formatting detail such as margins, and some complex elements may not fit neatly into the model.[1] That caveat is not a weakness to hand-wave away. It is the contract.

The practical rule is simple: choose a canonical source format before choosing output formats. If the source of truth is Markdown plus metadata, then PDF, HTML, EPUB, DOCX, and slides become generated artifacts. If the source of truth is a highly styled Word file, then Pandoc may still help extract structure, but it should not be treated as a layout-preserving clone of Word's rendering engine. The pipeline works best when structure matters more than desktop-publishing fidelity.

The AST is the loss budget

Pandoc's AST is where the real architecture sits. The filters documentation explains that Pandoc parses text into an intermediate representation, then writes that representation into the target format, with the AST format defined by Text.Pandoc.Definition in pandoc-types.[2] That means filters, templates, citation processing, and downstream tools are not operating on arbitrary string replacements. They are operating on document elements.
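To make "document elements" concrete, here is a sketch of what the JSON form of the AST looks like and how a tool can walk it. The JSON below is hand-written to resemble what `pandoc -t json` emits for a heading and a short paragraph; the exact API version number varies by release and should be treated as an assumption, as should the helper name `element_types`.

```python
import json

# Hand-written approximation of Pandoc's JSON AST output (assumed version number).
DOC_JSON = """
{
  "pandoc-api-version": [1, 23, 1],
  "meta": {},
  "blocks": [
    {"t": "Header", "c": [1, ["intro", [], []], [{"t": "Str", "c": "Intro"}]]},
    {"t": "Para", "c": [
      {"t": "Str", "c": "Hello"},
      {"t": "Space"},
      {"t": "Emph", "c": [{"t": "Str", "c": "world"}]}
    ]}
  ]
}
"""


def element_types(node):
    """Yield the "t" tag of every element found anywhere in the tree."""
    if isinstance(node, dict):
        if "t" in node:
            yield node["t"]
        for value in node.values():
            yield from element_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from element_types(item)


doc = json.loads(DOC_JSON)
types = list(element_types(doc["blocks"]))
print(types)
```

The point of the shape is that emphasis is an `Emph` element wrapping a `Str`, not a pair of asterisks in a string. A tool that rewrites emphasis never has to guess where the markup starts and ends.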

That choice gives teams an unusual amount of leverage. A filter can normalize headings, add attributes, rewrite links, wrap code blocks, insert warnings, transform figure captions, or generate cross-format behavior without editing each output writer separately.[2][3] Quarto's documentation is useful independent evidence here: Quarto exposes Pandoc filters as an extension mechanism and notes that citation processing and several Quarto features sit on that filter pathway.[6]

The same choice imposes discipline. Anything that cannot be represented cleanly in the AST becomes a conversion decision. Maybe it becomes raw target-specific content. Maybe it is dropped. Maybe it survives only for one output family. This is why a serious Pandoc workflow should include test fixtures: one document with footnotes, citations, tables, images, math, callouts, links, metadata, and the weirdest examples the organization actually uses. Run that through every target before declaring the toolchain solved.
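A fixture run of that kind might be scripted roughly as follows. This is a minimal sketch under stated assumptions: the fixture file name, the output directory, and the chosen target list are hypothetical, and the script only uses the `--from`, `--to`, `--standalone`, and `--output` options of the pandoc command line.

```python
import shutil
import subprocess
from pathlib import Path

# Writer name -> output file extension. The target list is an assumption;
# use whatever the team's real target matrix contains.
TARGETS = {"html": "html", "docx": "docx", "epub": "epub", "latex": "tex"}


def build_command(fixture, writer, out_dir):
    """Build the pandoc argument list for one fixture and one target writer."""
    out_path = Path(out_dir) / f"{Path(fixture).stem}.{TARGETS[writer]}"
    return [
        "pandoc", str(fixture),
        "--from", "markdown",
        "--to", writer,
        "--standalone",
        "--output", str(out_path),
    ]


def run_matrix(fixture, out_dir):
    """Run every target and return {writer: exit code}. Needs pandoc on PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    results = {}
    for writer in TARGETS:
        proc = subprocess.run(
            build_command(fixture, writer, out_dir),
            capture_output=True,
            text=True,
        )
        results[writer] = proc.returncode
    return results


# Example (hypothetical paths): run_matrix("fixture.md", "build/fixture")
```

An exit code of zero only shows that the conversion ran. Whether footnotes, tables, and math survived still has to be checked by reading or diffing the generated files.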

The strongest Pandoc deployments treat the AST as a loss budget. They ask which information must survive every target, which information can be target-specific, and which information belongs outside the source document in templates, metadata, CSL styles, build scripts, or CSS. That is a healthier adoption posture than treating Pandoc as a black box and blaming it later for faithfully enforcing a simpler document model than the source file implied.[1][2]

Filters are where publishing policy becomes code

Pandoc filters are often introduced as an advanced customization feature, but they are closer to the system's policy layer. The official filters documentation says a filter is a program that modifies the AST between reader and writer; traditional filters can be written in any language by consuming and producing the JSON representation.[2] The Lua filter system tightens the loop by letting filters run through Pandoc's embedded Lua environment while still working over the same document tree.[3]
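A traditional JSON filter can be sketched in plain Python with no third-party library. This is an illustrative skeleton, not an official template: the `walk` helper and the header-demoting rule are assumptions chosen to keep the example short. It relies on Pandoc passing the target format as the first command-line argument to a JSON filter, and it only reads standard input when such an argument is present.

```python
import json
import sys


def walk(node, action):
    """Rebuild the tree depth-first, letting action replace any element dict."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, action) for key, value in node.items()}
        if "t" in node:
            replaced = action(node)
            if replaced is not None:
                return replaced
        return node
    return node


def demote_headers(elem):
    """Example rule (hypothetical policy): push every heading down one level."""
    if elem["t"] == "Header":
        level, attr, inlines = elem["c"]
        return {"t": "Header", "c": [min(level + 1, 6), attr, inlines]}
    return None  # leave every other element unchanged


def main():
    doc = json.load(sys.stdin)          # AST as JSON from the reader side
    doc["blocks"] = walk(doc["blocks"], demote_headers)
    json.dump(doc, sys.stdout)          # modified AST for the writer side


# Pandoc invokes JSON filters with the target format as argv[1].
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved under a hypothetical name such as demote.py, a script like this would be attached with pandoc's `--filter` option. The same rule then applies whether the writer is HTML, DOCX, or LaTeX.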

That matters because most institutional writing systems need more than conversion. They need consistent heading IDs, link policy, citation behavior, figure handling, accessibility attributes, custom admonitions, internal style checks, and output-specific compromises. A filter can encode those rules once at the document-model boundary instead of scattering them through author instructions and post-export cleanup.[2][3][6]

This is also where Pandoc differs from a pile of format plug-ins. A filter is not only an automation step after conversion. It sits before the writer, so it can affect every supported target that understands the transformed structure. A team can make a rule like "all external links get a class," "all code blocks with this attribute get wrapped," or "all figures with missing alt text fail the build" and apply that rule before HTML, DOCX, EPUB, or LaTeX rendering diverges.
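Two of those rules can be sketched against the JSON form of the AST. The sketch makes several assumptions that should be read as such: the class name "external", the list of internal hosts, and the decision to treat an image with an empty description as "missing alt text" are all illustrative choices, and `apply_policy` is a made-up helper rather than part of any Pandoc API.

```python
from urllib.parse import urlparse

INTERNAL_HOSTS = ("example.org",)  # hypothetical: hosts the team treats as internal


def is_external(url):
    """True for http(s) links whose host is not in the internal list."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname not in INTERNAL_HOSTS


def apply_policy(node, errors):
    """Return a rewritten copy of node; append alt-text problems to errors."""
    if isinstance(node, list):
        return [apply_policy(item, errors) for item in node]
    if not isinstance(node, dict):
        return node
    node = {key: apply_policy(value, errors) for key, value in node.items()}
    tag = node.get("t")
    if tag == "Link":
        # Link content: [attr, inlines, [url, title]]
        (ident, classes, attrs), inlines, (url, title) = node["c"]
        if is_external(url) and "external" not in classes:
            classes = classes + ["external"]
        node["c"] = [[ident, classes, attrs], inlines, [url, title]]
    elif tag == "Image":
        # Image content: [attr, description inlines, [url, title]]
        _attr, description, (url, _title) = node["c"]
        if not description:
            errors.append(f"image without alt text: {url}")
    return node


# Tiny hand-built paragraph: one external link, one image with no description.
sample = [{"t": "Para", "c": [
    {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": "site"}], ["https://pandoc.org", ""]]},
    {"t": "Image", "c": [["", [], []], [], ["fig.png", ""]]},
]}]
problems = []
rewritten = apply_policy(sample, problems)
print(rewritten[0]["c"][0]["c"][0][1], problems)
```

A build script could exit with a non-zero status whenever the error list is not empty, which is what turns "missing alt text" from an author guideline into an enforced rule.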

The boundary is still real. Filters are not free to maintain. A large filter stack can become a private markup language with no user manual. Lua code can encode assumptions that only one maintainer understands. JSON filters in other languages add dependency and packaging concerns. The right pattern is to keep filters small, document their input and output expectations, and reserve them for rules that belong to the document model rather than one output skin.

Citations and Git show why Pandoc is publishing infrastructure

Pandoc's value becomes clearer when it is placed inside a publishing workflow instead of a single command. MacFarlane's tools page puts Pandoc next to citation, math, markup, and syntax-highlighting projects, which is a strong clue about the intended shape: writing should be plain enough to edit, structured enough to transform, and explicit enough to publish into several outputs.[4]

The Haskell Foundation's interview with MacFarlane gives the historical texture. He describes Pandoc as emerging from his own use of lightweight markup and Haskell, with the early project growing out of practical writing needs rather than a vendor product plan.[5] That origin still shows. Pandoc is opinionated in the way good Unix-style tools are opinionated: keep source text visible, keep the transformation path scriptable, and let other tools own version control, editing, review, and distribution.

Simon Fraser University's publishing workflow article is older but still useful independent context because it shows the same architecture applied outside one developer's habits. It connects Pandoc, Git, and Gitit into a workflow where text files remain versioned, editable, and publishable through different surfaces.[10] The article is not current release documentation, but the point has aged well: Pandoc becomes more valuable when it participates in a source-controlled editorial process rather than acting as a last-minute export button.

That is why Pandoc is especially attractive to scholars, standards groups, documentation teams, technical publishers, educators, and open-source projects. The artifacts can be regenerated. The style sheet can be changed. The bibliography can be rebuilt. The same source can feed web pages, PDF handouts, EPUBs, slides, or archival plain text. The organization gets a repeatable publishing path instead of a folder of final-final documents.

Where Pandoc fits

Pandoc is strongest when the source is mostly text, the important content is structural, and outputs are generated deliberately. It fits research papers, course notes, book drafts, policy documents, technical documentation, static sites, slide decks, internal handbooks, and publishing pipelines where Git, review, templates, and build scripts already make sense.[4][10]

It is weaker when the real requirement is pixel-perfect round-tripping, heavy visual page composition, proprietary desktop-layout behavior, or a source format whose meaning lives mostly in invisible styling state. In those cases, Pandoc can still be useful as an extraction or migration tool, but the team should expect review and cleanup. A faithful AST is not a faithful application renderer.[1][2]

The conservative rollout is straightforward. Pick one document class. Declare the source of truth. Write a small target matrix: HTML, PDF, DOCX, EPUB, slides, or whatever the team truly needs. Create a fixture document that includes the hard cases. Keep templates, CSL styles, filters, and build commands in version control. Pin a Pandoc version for release builds, then test upgrades against the fixture before changing the production pipeline.[1][3][8]
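The version pin in that rollout can be enforced with a small check at the start of a release build. This is a sketch under assumptions: it expects the first line of `pandoc --version` to begin with the program name followed by a dotted version, and the pinned value is only an example taken from the release mentioned earlier.

```python
import re

PINNED = (3, 10)  # example pin; use the version the team has actually tested


def parse_pandoc_version(version_output):
    """Extract the version as a tuple of ints from `pandoc --version` output."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def matches_pin(version_output, pinned=PINNED):
    """True only when the installed version equals the pinned one exactly."""
    return parse_pandoc_version(version_output) == tuple(pinned)


# In a build script the text would come from something like:
#   subprocess.run(["pandoc", "--version"], capture_output=True, text=True).stdout
print(matches_pin("pandoc 3.10\n"))  # True
print(matches_pin("pandoc 3.1\n"))   # False
```

Comparing tuples rather than strings or floats matters here: 3.10 and 3.1 are different releases, and a numeric comparison would treat them as equal.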

That is the architectural point. Pandoc is not valuable because it claims to speak many formats. It is valuable because it makes the middle of document conversion explicit. Readers admit source formats into a shared AST. Filters encode policy before targets diverge. Writers emit the deliverables. The limitation is the strength: once a team accepts the boundary, document publishing becomes less mysterious and more repeatable.

cronfeed.work