Parse PDFs. Power LLMs.

Your Model Is Only As Smart As What You Feed It.

Most PDF parsers were built before LLMs existed. They drop tables, scramble reading order, flatten structure into noise. PDF4LLM fixes that.

Powering document pipelines at Perplexity, Jenni, Moxx

The Gap in Every RAG Pipeline

A PDF isn't a Document.
It's a Rendering Instruction.

There's no concept of “heading”, “table”, or “reading order” inside a PDF file, only coordinates, fonts, and draw commands. Every parser has to reconstruct meaning from that. Most get it wrong. PDF4LLM resolves structure, reading sequence, table layout, and document hierarchy before a single token reaches your model.
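To make that concrete, here is a toy illustration of what "extraction" from a content stream looks like without any structural understanding. The stream fragment is hypothetical (real streams are binary and usually compressed), and the regex is deliberately naive: it recovers strings in draw order, with no idea what is a heading, a caption, or a footnote.

```python
import re

# Hypothetical PDF content-stream fragment: draw commands, not semantics.
stream = r"""
BT /F1 11.5 Tf
72.0 720.5 Td
(Large Language) Tj
0 -14.2 Td
(Models) Tj
ET
"""

# Naive extraction: grab every "(...) Tj" show-text operator in stream order.
# Nothing here says "heading" or "paragraph" -- only strings at coordinates.
texts = re.findall(r"\(([^)]*)\)\s*Tj", stream)
print(texts)  # ['Large Language', 'Models']
```

Everything beyond this, reading order, columns, tables, hierarchy, has to be reconstructed; that reconstruction is the parser's real job.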

Pipeline Architecture Visualization

Step 1 · RAW PDF
%PDF-1.7
BT /F1 11.5 Tf
72.0 720.5 Td
(Large Language) Tj
0 -14.2 Td
q 0.8 0 0 0.8
144 612 cm
/Im1 Do Q
ET
/F2 9 Tf
72 540 324 m S
(Method) Tj

Step 2 · PDF4LLM
Reading order resolved
Columns detected
Table structure mapped
Hierarchy extracted
Images located
Footnotes tagged

Step 3 · MARKDOWN
# Large Language Models
## Abstract
This study examines the impact of LLMs...
![Accuracy](fig1.png)
## Key Findings
| Method | Prec | Rec |
| :--- | :--- | :--- |
| Raw | 61% | 58% |
| Struct | 95% | 91% |

Step 4 · CHUNKS
Chunk · 001
Chunk · 002
Chunk · 003

Step 5 · EMBEDDINGS
0.031, −0.184, 0.220, 0.089
−0.147, 0.312, −0.056, 0.201
0.178, −0.093, 0.445, 0.012
−0.231, 0.067, 0.389, −0.114
0.052, −0.276, 0.133, 0.298

Step 6 · VECTOR STORE
ID: chunk_001
Source: report.pdf
Page: 1
Score: 0.982
Model: text-emb-3
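The last two stages of the pipeline reduce to vector search. A minimal sketch, using the sample 4-dimensional vectors from the diagram above (real embeddings have hundreds or thousands of dimensions, and the query vector here is hypothetical):

```python
import math

# Sample embeddings from the diagram above, keyed by chunk ID.
embeddings = {
    "chunk_001": [0.031, -0.184, 0.220, 0.089],
    "chunk_002": [-0.147, 0.312, -0.056, 0.201],
    "chunk_003": [0.178, -0.093, 0.445, 0.012],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

query = [0.05, -0.20, 0.30, 0.10]  # hypothetical query embedding

# Retrieval = return the chunk whose embedding is closest to the query.
best = max(embeddings, key=lambda k: cosine(embeddings[k], query))
print(best)  # chunk_001
```

A production vector store does the same comparison with an approximate nearest-neighbor index instead of a linear scan.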

Why You Need a Parser

There's a Hidden Cost to
Skipping the Parser

Skip the parser and your AI model rasterizes every page into images first, then bills you at vision token rates for content that was machine-readable text all along. Vision tokens can cost 10–20× more than text tokens. Feed enough documents and you're paying a premium for a problem that didn't need to exist.

Vision Language Model

~35s per page

Great for scans

Necessary for handwritten or scanned docs

Overkill for digital PDFs

Heavy GPU inference on simple text layers

Slow & Expensive

Reconstructs text visually instead of reading it

Cost at Scale

$14.40 per 1,000 pages

Vision token rates on machine-readable text

PDF4LLM

~0.17s per page

Perfect for born-digital PDFs

Extracts text directly from the source layer

Instant & precise

No rasterization, no vision overhead

Structured topology

Reconstructs reading order mathematically

Cost at Scale

~$0.06 per 1,000 pages

Text token rates. Calculated on Google Cloud Compute.
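The gap compounds at scale. A back-of-envelope check using the per-1,000-page figures quoted above (actual rates vary by provider and model):

```python
# Rates quoted above, in USD per 1,000 pages.
vision_per_1k = 14.40  # rasterized pages billed at vision token rates
text_per_1k = 0.06     # direct text extraction billed at text token rates

pages = 100_000  # hypothetical document corpus

vision_cost = vision_per_1k * pages / 1_000
text_cost = text_per_1k * pages / 1_000

print(f"vision: ${vision_cost:,.2f}  text: ${text_cost:,.2f}  "
      f"ratio: {vision_cost / text_cost:.0f}x")
```

At these rates, 100,000 pages cost $1,440 through a vision model versus $6 as extracted text, a 240× difference before you count latency.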

From Raw PDF to Structured Intelligence

PDF4LLM breaks a document into its components, identifies each one, and delivers structured data directly to your LLM. It does this with Graph Neural Networks trained on PDF internals rather than rendered images, giving you greater accuracy, CPU-only, at 10× the speed.

Scroll to begin

Supported Inputs

Supports a wide range of input formats: PDF, XPS, EPUB, CBZ, MOBI, FB2, SVG, TXT, images, and more. Office formats (DOCX, XLSX, PPTX) are available with a commercial license.

The Part Other Parsers Skip.
The Part That Actually Matters.

Multi-column layouts, sidebars, footnotes — in the right order.

Get the sequence wrong and your model reasons over noise, not content. PDF4LLM reconstructs the order a human would read, not the order the renderer drew it.
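Here is a deliberately simplified sketch of the idea, assuming text blocks arrive as (x, y, text) tuples with the origin at the top-left and a known column boundary. A real layout engine must detect the columns itself and handle spanning headings, sidebars, and footnotes, which is where most parsers fail.

```python
# Toy reading-order reconstruction for a two-column page.
# Blocks are (x0, y0, text); coordinates and boundary are illustrative.
blocks = [
    (320, 80, "right col, para 1"),
    (40, 200, "left col, para 2"),
    (40, 80, "left col, para 1"),
    (320, 200, "right col, para 2"),
]

PAGE_MID = 300  # assumed boundary between the two columns

def reading_key(block):
    x0, y0, _ = block
    column = 0 if x0 < PAGE_MID else 1
    # Left column top-to-bottom first, then the right column.
    return (column, y0)

ordered = [text for _, _, text in sorted(blocks, key=reading_key)]
print(ordered)
```

Sorting by raw draw order, or by y-coordinate alone, interleaves the columns, which is exactly the "noise, not content" failure mode described above.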

Reading Order

No GPU. No cloud dependency.
No tradeoff.

Every other parser that reaches this level of extraction accuracy requires a GPU and still takes seconds per page. PDF4LLM runs CPU-only. Same output quality. No infrastructure tax.

Millions of monthly PyPI downloads
Thousands of GitHub stars
Years of MuPDF engine heritage
0 GPUs required

The same baseline extraction quality. Whatever runtime you're already in.

Don't change parsers because you changed languages. One extraction quality. One commercial relationship.

PyMuPDF4LLM
PDF4LLM logomark
Python logo

PyMuPDF4LLM

The original. Built for Python's AI/ML ecosystem. Perfect for data scientists and LLM developers who want the best PDF extraction quality for RAG, fine-tuning, and everything in between.

Layout-aware content extraction
  • PDF → Markdown with layout
  • Page chunking for RAG
  • Table extraction
  • Image extraction per page
PDF4LLM (.NET)
PDF4LLM logomark
.NET logo

PDF4LLM (.NET)

Enterprise-grade PDF intelligence for .NET 8+. Built on the same MuPDF engine as PyMuPDF4LLM, but architected for C# and .NET developers. Get the same extraction quality without switching languages or parsers.

Built-in barcode parsing
  • PDF → Markdown with layout
  • Page chunking for RAG
  • Table extraction
  • Image extraction per page
PDF4LLM (JS)
PDF4LLM logomark
JavaScript logo

PDF4LLM (JS)

SOON

WASM-powered PDF-to-Markdown for Node.js. Perfect for client-side applications, serverless functions, and anyone in the JavaScript ecosystem who wants to integrate PDF4LLM's extraction quality without leaving JS.

Runs in the browser, no server needed
  • PDF → Markdown (WASM)
  • RAG-ready chunking with overlap
  • Serverless-first architecture
MuPDF WebViewer logo

Show and understand documents with one engine, powered by AI citation.

MuPDF WebViewer renders PDFs for your users. PDF4LLM extracts the same content for your AI pipeline. Because both run on the same MuPDF C core, the extraction preserves the exact coordinates of every block of text, so when your LLM returns an answer, you can locate the source passage directly in the viewer.
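A sketch of what coordinate-preserving metadata makes possible. The field names and shapes here are illustrative, not PDF4LLM's actual output schema: the point is that each chunk carries its source page and bounding box, so a cited chunk can be highlighted in the viewer.

```python
# Hypothetical chunk metadata: each chunk remembers where it came from.
chunks = [
    {"id": "chunk_001", "source": "report.pdf", "page": 1,
     "bbox": (72.0, 706.3, 396.0, 720.5),
     "text": "# Large Language Models ..."},
]

def locate(chunk_id):
    """Return (source, page, bbox) for a cited chunk, or None if unknown.

    The bbox can be passed to a viewer to highlight the source passage.
    """
    for c in chunks:
        if c["id"] == chunk_id:
            return c["source"], c["page"], c["bbox"]
    return None

print(locate("chunk_001"))
```

When the LLM's answer cites chunk_001, the viewer can jump to page 1 of report.pdf and draw a highlight at that rectangle, turning a citation into a verifiable source.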

Learn More

Start with clean documents.
Everything else gets easier.

One install. Your pipeline stays the same. Your model finally gets the input it deserves.

# pip install pymupdf4llm

import pymupdf4llm
md = pymupdf4llm.to_markdown("document.pdf")
# -> Clean markdown, tables intact, images extracted, ready for embedding
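From there, chunking is a plain-Python step. A minimal fixed-size chunker with overlap, a common pattern before embedding (sizes and overlap are illustrative; production chunkers usually split on headings or sentence boundaries instead of raw character counts):

```python
# Minimal fixed-size chunker with overlapping windows.
def chunk(text, size=400, overlap=50):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for to_markdown() output.
md = "# Large Language Models\n" + "This study examines the impact of LLMs. " * 30

pieces = chunk(md, size=200, overlap=40)
# Consecutive chunks share their last/first 40 characters, so no sentence
# is stranded at a chunk boundary without context.
assert pieces[0][160:] == pieces[1][:40]
print(len(pieces))
```

Each piece can then be embedded and stored alongside its source metadata, exactly as in the pipeline shown earlier.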