Welcome back, let’s dive into Chapter 48 of this delightful series!
For RAG, extracting information from documents is unavoidable. The quality of the final output largely depends on how effectively the content is extracted from the source.
I’ve talked about document parsing from different angles in the past (AI Exploration Journey: PDF Parsing and Document Intelligence). This post pulls together insights from a recent RAG survey and some of my earlier work to offer a clear and concise summary of how RAG systems parse and integrate four types of knowledge.

Structured Knowledge: When Data Plays by the Rules
Knowledge Graphs: Easy to Query, Great to Use, Hard to Integrate
Knowledge graphs map out entities and their relationships in a clean, connected way, making them ideal for machines to navigate and query.
RAG systems love structured sources like these—they’re precise and semantically rich. But the real challenge isn’t finding the data, it’s putting it to good use.
How can we extract meaningful subgraphs from massive knowledge graphs?
How can we align structured graph data with natural language?
As the graph grows, can the system still keep up?
A few promising solutions are starting to bridge that gap:
GRAG retrieves subgraphs from multiple documents to create more focused inputs.
KG-RAG uses a Chain of Explorations (CoE) algorithm to improve question answering over knowledge graphs.
GNN-RAG uses graph neural networks to retrieve and process information from KGs, performing a round of reasoning before the data even reaches the LLM.
The SURGE framework taps into knowledge graphs to generate more relevant, knowledge-aware dialogue, improving interaction quality.
In specialized domains, tools like SMART-SLIC, KARE, ToG2.0, and KAG (AI Innovations and Insights 23: KAG, AlphaMath, and Offloading) have shown just how powerful KGs can be as external knowledge sources, helping RAG systems become both more accurate and more efficient.
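To make the subgraph-extraction and alignment problem concrete, here's a minimal sketch (my own illustration, not how any of the frameworks above are implemented): pull a k-hop neighborhood around the entities mentioned in the question, then verbalize its triples into plain sentences the LLM can consume. The toy graph and helper names are invented for the example.

```python
import networkx as nx

# Toy knowledge graph; edges carry a "relation" label. (Illustrative data.)
kg = nx.DiGraph()
kg.add_edge("Marie Curie", "Physics", relation="won the Nobel Prize in")
kg.add_edge("Marie Curie", "Pierre Curie", relation="was married to")
kg.add_edge("Pierre Curie", "Physics", relation="won the Nobel Prize in")
kg.add_edge("Marie Curie", "Warsaw", relation="was born in")

def extract_subgraph(graph, seed_entities, hops=1):
    """Collect the k-hop neighborhood around entities found in the query."""
    nodes = set(seed_entities)
    frontier = set(seed_entities)
    for _ in range(hops):
        nxt = set()
        for n in frontier:
            nxt.update(graph.successors(n))
            nxt.update(graph.predecessors(n))
        nodes |= nxt
        frontier = nxt
    return graph.subgraph(nodes)

def verbalize(subgraph):
    """Turn (head, relation, tail) triples into sentences for the prompt."""
    return "\n".join(
        f"{h} {d['relation']} {t}." for h, t, d in subgraph.edges(data=True)
    )

sub = extract_subgraph(kg, ["Marie Curie"], hops=1)
print(verbalize(sub))  # Context block to prepend to the LLM prompt.
```

Real systems replace the toy graph with entity linking and learned retrieval, but the verbalization step, turning triples into text, is exactly where structured data gets aligned with natural language.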
Tables: Compact, Dense, and Difficult
Tables are a type of structured data too—but they’re very different from knowledge graphs. A few rows and columns can pack in an incredible amount of information. But getting machines to understand that information? That’s a whole different story.
Hidden relationships, inconsistent formatting, domain-specific quirks… tables often feel like a mix of structure and chaos. Luckily, there are tools built to handle exactly this kind of mess:
TableRAG (AI Innovations and Trends 09: Cursor Tool’s RAG Features, TableRAG, and Llama OCR) combines query expansion with schema and cell retrieval to figure out what really matters before passing anything to the language model.
TAG and Extreme-RAG go a step further by integrating Text-to-SQL capabilities, so the language model can essentially “speak database.”
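To show what “speaking database” looks like, here's a generic Text-to-SQL sketch (not TAG's or Extreme-RAG's actual pipeline): hand the LLM the table schema, let it write SQL, execute it, and feed the result back as grounded context. The `generate_sql` stub stands in for a real LLM call.

```python
import sqlite3

# Build a tiny in-memory table. (Illustrative data.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "Q1", 1.2e6), ("EMEA", "Q2", 1.5e6), ("APAC", "Q1", 0.9e6)],
)

SCHEMA = "sales(region TEXT, quarter TEXT, revenue REAL)"

def generate_sql(question: str, schema: str) -> str:
    """Placeholder for an LLM call that translates the question into SQL.
    A real system would prompt the model with the schema and the question."""
    # Hard-coded here so the sketch runs end to end.
    return "SELECT region, SUM(revenue) FROM sales GROUP BY region"

question = "What is total revenue per region?"
sql = generate_sql(question, SCHEMA)
rows = conn.execute(sql).fetchall()

# The rows become grounded context for the final answer-generation prompt.
print(f"Q: {question}\nSQL: {sql}\nResult: {rows}")
```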
Bottom line? If you can parse tables well, they’re an absolute goldmine of information.
Semi-Structured Data: HTML, JSON, and the Web’s Messy Middle
Semi-structured data is like the “middle child” of the data world—not fully structured, but not entirely unstructured either. It’s more flexible than a knowledge graph, yet more organized than a raw PDF. Think HTML pages, JSON files, XML, emails—formats that carry some structure, just not always in a consistent or complete way.
HTML, in particular, is everywhere. And every website has its own quirks. Sure, there’s structure—tags, attributes, elements—but there’s also a ton of unstructured text and images mixed in.
To parse HTML effectively, a range of open-source tools and libraries have been developed that turn HTML content into structured representations such as Document Object Model (DOM) trees. Widely used options include BeautifulSoup, htmlparser2, html5ever, MyHTML, and Fast HTML Parser.
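As a quick illustration, here's a minimal BeautifulSoup sketch that drops boilerplate tags and walks the DOM for headings and paragraphs; real pipelines layer site-specific rules on top of this.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <nav>Home | About</nav>
  <h1>RAG and HTML</h1>
  <p>HTML carries structure in its tags.</p>
  <script>console.log("noise");</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop elements that rarely carry useful content.
for tag in soup(["script", "style", "nav"]):
    tag.decompose()

# Walk the DOM and keep headings and paragraphs with their tag names,
# so downstream chunking can respect the document's structure.
for el in soup.find_all(["h1", "h2", "h3", "p"]):
    print(f"<{el.name}> {el.get_text(strip=True)}")
```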
In addition, RAG frameworks like HtmlRAG (AI Innovations and Insights 20: HtmlRAG, AFLOW, ChunkRAG, and MarkItDown) work with HTML rather than plain text, preserving semantic and structural information.
If we want RAG systems to actually understand web pages—and not just hallucinate their way through—HTML parsing is where it all starts.
Unstructured Knowledge: PDFs, Raw Text, and Organized Chaos
Now we're in the deep end. Unstructured data—free-form text, PDF documents, scanned reports—is everywhere.
PDFs especially are a nightmare: inconsistent layouts, embedded images, complex formatting. But they’re essential in fields like academia, law, and finance. So how do we make them RAG-ready?
We can use smarter OCR, layout analysis, and visual-linguistic fusion:
Levenshtein OCR and GTR combine visual and linguistic cues to improve recognition accuracy.
OmniParser and Doc-GCN focus on preserving structure.
ABINet uses bidirectional processing to boost OCR performance.
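These models go well beyond classic OCR, but the basic shape of the pipeline is easy to show with a plain pytesseract baseline: recognize words with their positions, then group them back into layout units. This is my own illustration; none of the models above work this way internally.

```python
from PIL import Image
import pytesseract  # pip install pytesseract; requires the Tesseract binary

# Baseline OCR with word-level positions; layout-aware models replace this
# step with learned detection and reading-order prediction.
image = Image.open("scanned_page.png")  # hypothetical input file
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

lines = {}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    # Group words by their (block, paragraph, line) ids to approximate layout.
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    lines.setdefault(key, []).append(word)

for key in sorted(lines):
    print(" ".join(lines[key]))
```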
At the same time, a wave of open-source tools is making it easier to convert PDFs into Markdown, a format far friendlier to LLMs. These tools? I’ve pretty much covered them all already!
GPTPDF (AI Innovations and Trends 05: iText2KG, Meta-Chunking, and gptpdf) uses visual models to parse tables, formulas, and other tricky layouts, then turns them into Markdown—fast and cheap enough to run at scale.
Marker (Demystifying PDF Parsing 02: Pipeline-Based Method) focuses on wiping out noisy elements while keeping the original formatting, which makes it a favourite for research papers and lab reports.
PDF-Extract-Kit (the model library used by MinerU) supports high-quality content extraction, including formula recognition and layout detection.
Zerox OCR (AI Innovations and Trends 10: LazyGraphRAG, Zerox, and Mindful-RAG) snapshots each page and feeds the images through GPT models to generate Markdown, handling complex document structures effectively.
MinerU (AI Innovations and Insights 29: EdgeRAG and MinerU) is a comprehensive solution that retains the original document structure, including titles and tables, and supports OCR for corrupted PDFs.
MarkItDown (AI Innovations and Insights 20: HtmlRAG, AFLOW, ChunkRAG, and MarkItDown) is a versatile tool that converts a wide range of file types, including PDFs, media, web data, and archives, into Markdown.
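As a taste of how lightweight these conversions can be, here's MarkItDown's basic Python usage, based on its documented API at the time of writing (the input filename is just an example):

```python
from markitdown import MarkItDown  # pip install markitdown

md = MarkItDown()
result = md.convert("report.pdf")  # hypothetical input file
print(result.text_content)  # Markdown, ready for chunking and embedding
```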
Multimodal Knowledge: Images, Audio, and Video Join the Party
Traditional RAG systems were designed for text-based data, so they struggle to handle and retrieve information from other formats like images, audio, or video. As a result, their responses can feel shallow or incomplete, especially when non-text content carries the most meaning.
To address these challenges, multimodal RAG systems have introduced foundational methods to integrate and retrieve across different modalities. The core idea is to align various modalities—text, image, audio, video—into a shared embedding space, enabling unified processing and retrieval. For instance,
CLIP aligns vision and language in a shared space (see the sketch after this list).
Wav2Vec 2.0 and CLAP focus on connecting audio with text.
In the video domain, models such as ViViT are designed to capture both spatial and temporal features.
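To see what a shared embedding space buys you in practice, here's a small sketch using a CLIP checkpoint through sentence-transformers (the model name and image file are assumptions for the example): images and text land in the same vector space, so one similarity function handles cross-modal retrieval.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# One CLIP model encodes both modalities into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode(Image.open("diagram.png"))  # hypothetical file
text_embs = model.encode([
    "a system architecture diagram",
    "a photo of a cat",
])

# Cosine similarity works across modalities because the space is shared.
scores = util.cos_sim(image_emb, text_embs)
print(scores)  # Higher score = better cross-modal match.
```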
These are the building blocks. And as these systems evolve, we'll see RAG applications that can pull insights from across documents, slides, and spoken content—all in one shot.
Final Thoughts
In practice, I currently find MinerU to be the best open-source tool for parsing PDFs.
Naturally, building your own document parser involves plenty of intricate details. But the payoff is worth it: greater control over the code, improved document security, and more trustworthy results.
I'll share some engineering insights in future articles when I get the chance.
We're moving beyond the era of plain-text language models. If we can teach machines to make sense of the diverse formats humans use to share knowledge, maybe they can help us make better sense of the world, too.