Welcome back, let’s dive into Chapter 52 of this insightful series!
Document Visual Question Answering (DocVQA) is about answering questions based on multi-modal documents that mix text, tables, and images — the kind you often see in reports, manuals, or handbooks. The task comes with three key challenges:
Multi-page reasoning — finding the right information across long documents that span several pages.
Cross-page references — connecting details from different parts of the document to get a complete answer.
Multi-modal content — understanding and combining information from various formats like text, visuals, and structured data.
SimpleDoc provides a simple retrieval-augmented approach that leverages the power of modern Visual Language Models (VLMs)—without the complexity of multi-agent setups.

As shown in Figure 1, SimpleDoc uses a two-step page retrieval process:
First, it relies on precomputed embeddings and page summaries to narrow the document down to the most relevant pages.
Then, during answer generation, a reasoning agent steps in to review the retrieved pages — deciding whether it has enough to answer the question, or if it needs to produce a new query and fetch more information.
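To make that first retrieval step concrete, here is a minimal sketch of dual-path page retrieval: embedding similarity narrows the candidates, then an LLM re-ranks them using the page summaries. All names here (the `rank_pages` helper, the array shapes) are illustrative assumptions, not SimpleDoc's actual code.

```python
import numpy as np

def retrieve_pages(query_emb, page_embs, page_summaries, query, llm, k=10, final_k=3):
    """Step 1: narrow candidates by embedding similarity.
    Step 2: re-rank the candidates with an LLM over their page summaries."""
    # Cosine similarity between the query and every page embedding.
    sims = page_embs @ query_emb / (
        np.linalg.norm(page_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    candidates = np.argsort(-sims)[:k]

    # Ask the LLM to pick the pages whose summaries best answer the query.
    prompt = "Question: {}\n\nPage summaries:\n{}".format(
        query,
        "\n".join(f"[{i}] {page_summaries[i]}" for i in candidates),
    )
    ranked = llm.rank_pages(prompt)  # hypothetical helper returning page indices
    return ranked[:final_k]
```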

Let's take a closer look at these two stages, illustrated in Figure 2.
First, it processes each page of a document offline, extracting visual embeddings and generating summaries with an LLM.
Then comes the online reasoning loop: it retrieves relevant pages using embeddings and summary-based re-ranking, and passes them to a memory-guided VLM agent that generates an answer — or, if needed, refines the query and tries again.
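A rough sketch of that online loop might look like the following, building on the retrieval function above. The agent interface (`step`, `refined_query`, `best_effort`) is an assumption made for illustration; the point is just the shape of the loop: retrieve, try to answer, otherwise refine the query and carry the notes forward.

```python
def answer_question(question, doc_index, vlm_agent, max_rounds=3):
    memory = []        # accumulated notes from previous rounds
    query = question   # the first retrieval query is the question itself
    for _ in range(max_rounds):
        pages = doc_index.retrieve(query)                 # embedding + summary re-ranking
        result = vlm_agent.step(question, pages, memory)  # hypothetical agent call
        if result.answer is not None:                     # agent is confident: stop
            return result.answer
        memory.append(result.notes)                       # remember what was learned
        query = result.refined_query                      # retry with a sharper query
    return vlm_agent.best_effort(question, memory)        # fall back after the budget
```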

As shown in Figure 3, in the first round, the agent retrieves Pages 6, 13, and 14 via embedding + summary filters, but they only cover the experimental setup and metrics—no alignment scores in sight. Spotting the gap, the agent refines the query to hunt for any section or table that compares scores across temperature settings. Then Page 7 surfaces with Table 3, revealing that temperature 0.1 achieves the highest alignment score of 85.9.
Thoughts and Insights
Instead of building a complex multi-agent system, SimpleDoc takes a much cleaner route: it combines dual-path retrieval (embedding + summary) with iterative memory. It’s not trying to impress with architectural novelty—it’s designed for what really matters right now: efficiency and deployability.
In text-based RAG, we're used to slicing documents into chunks. But multimodal documents—PDFs, slides, scanned pages—are naturally page-centric. That's a quiet signal: future RAG systems will need to rethink their granularity strategies when moving into multimodal territory.
That said, using “pages” as the atomic unit works most of the time—but not always. Long pages or dense visuals like charts and tables can break this assumption. A promising direction? Multi-scale fusion retrieval, combining page-level and patch-level signals to handle more nuanced content.
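One way such a fusion could work, sketched purely as an assumption (this is not part of SimpleDoc): score each page by mixing its whole-page embedding similarity with the best-matching patch on that page, such as a table or chart crop.

```python
import numpy as np

def fused_page_scores(query_emb, page_embs, patch_embs_per_page, alpha=0.6):
    # alpha weights page-level vs. patch-level evidence.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    scores = []
    for page_emb, patch_embs in zip(page_embs, patch_embs_per_page):
        page_score = cos(query_emb, page_emb)
        patch_score = max(cos(query_emb, p) for p in patch_embs)
        scores.append(alpha * page_score + (1 - alpha) * patch_score)
    return np.argsort(-np.array(scores))  # pages ranked by fused score
```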
Now let’s talk about retrieval. SimpleDoc uses iterative rounds of search + memory accumulation instead of a one-shot top-K approach. In theory, this adds precision. In practice, the jury’s still out. For open-ended or multi-hop questions, iterative query refinement may lead the system astray—especially if the first query takes a wrong turn. Once that happens, every subsequent step compounds the error.
Also, the memory system—while simple—is a blunt tool. It just keeps adding summaries and context into one growing block. There’s no trimming, no prioritization. Over time, this can cause memory bloat and interference, especially on long tasks, hurting output quality instead of helping.
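A lightweight fix would be to cap the memory and keep only the notes most relevant to the question, rather than letting the block grow without bound. The sketch below is illustrative only; SimpleDoc itself does no such trimming.

```python
def trim_memory(notes, question_emb, embed, budget=5):
    # Score each note by similarity to the question and keep the top `budget`.
    scored = sorted(notes, key=lambda n: -float(embed(n) @ question_emb))
    return scored[:budget]
```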