Architecture is the Paradigm: Evolving MRAG from Text-Centric to Intelligent Control — AI Innovations and Insights 45
Welcome back! We're now diving into Chapter 45 of this ongoing series.
Multimodal Retrieval-Augmented Generation (MRAG) takes RAG (AI Exploration Journey: RAG) to the next level by bringing in more than just text—it adds images, videos, and other types of data into the retrieval and generation processes.
MRAG has been gaining traction as a key direction in the evolution of RAG. I’ve touched on it several times in past posts (AI Exploration Journey: Multimodal RAG).
This time, we’re looking at a new survey that offers a fresh perspective. It introduces some interesting ideas about the architecture behind multimodal RAG. Let’s take a closer look.
MRAG 1.0: Pseudo-MRAG
As shown in Figure 1, MRAG 1.0—often called "pseudo-MRAG"—looks a lot like traditional RAG. It follows the familiar three-step setup: Document Parsing and Indexing, Retrieval, and Generation.
The core process hasn’t changed much, but there’s one key difference: how documents are parsed. Instead of treating every input the same way, MRAG 1.0 uses specialized models to turn non-text data into modality-specific captions. These captions are then stored alongside the text, ready to be used in later stages.
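To make that flow concrete, here is a minimal sketch of the caption-then-index step. Everything in it is illustrative rather than taken from the survey: `parse_document`, the `Chunk` structure, and the stand-in captioner are hypothetical names, and a real system would plug in an actual image-captioning model plus a text embedding index.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:
    text: str              # text content, or a caption standing in for non-text content
    source_modality: str   # "text", "image", ...

def parse_document(elements: list[dict],
                   caption_image: Callable[[bytes], str]) -> list[Chunk]:
    """MRAG 1.0-style parsing: every non-text element is flattened into a text
    caption before indexing, so the index itself stays text-only."""
    chunks: list[Chunk] = []
    for el in elements:
        if el["type"] == "text":
            chunks.append(Chunk(text=el["content"], source_modality="text"))
        elif el["type"] == "image":
            # This is where fine-grained visual detail gets lost -- the core
            # weakness the article attributes to MRAG 1.0.
            chunks.append(Chunk(text=caption_image(el["content"]),
                                source_modality="image"))
    return chunks

def fake_captioner(img_bytes: bytes) -> str:
    # Stand-in for a real image-captioning model.
    return "a bar chart comparing retrieval recall across systems"

doc = [
    {"type": "text", "content": "MRAG retrieval results improved over the baseline..."},
    {"type": "image", "content": b"<binary image data>"},
]
for chunk in parse_document(doc, fake_captioner):
    print(f"{chunk.source_modality}: {chunk.text}")
```

Once every element has been reduced to text like this, the rest of the pipeline is ordinary RAG, which is exactly why the article calls it pseudo-multimodality.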

Despite early promise, MRAG 1.0 faced several major limitations:
Cumbersome Document Parsing: Turning images and other modalities into text captions added system complexity and often lost important details—especially the fine-grained ones needed for accurate results.
Retrieval Bottleneck: The system struggled to achieve high retrieval recall. Chunking often broke up key phrases, and converting multimodal inputs to plain text caused further information loss, limiting what could be retrieved.
Challenges in Generation: Combining text, captions, and other inputs into coherent prompts was tricky. The system was highly sensitive to input quality, so any loss in earlier stages could easily lead to weak or irrelevant outputs.
In short, MRAG 1.0 ran into a performance ceiling. Its dependence on text-based representations and traditional retrieval methods exposed critical weaknesses in how it handled and generated from multimodal data. Future versions need smarter models, better information preservation, and tighter integration across the pipeline.
MRAG 2.0: A Shift to True Multimodality
Unlike the earlier version, which mainly relied on text, MRAG 2.0 fully embraces multimodal inputs and retains original data such as images and audio in its knowledge base. Thanks to powerful Multimodal Large Language Models (MLLMs), it can now generate responses directly from multimodal data, greatly reducing information loss.

As shown in Figure 2, MRAG 2.0 incorporates several key optimizations:
Smarter Captioning: Instead of juggling multiple models for different data types, MRAG 2.0 uses one or more unified MLLMs to generate captions across modalities, simplifying parsing and improving consistency.
Multimodal Retrieval: The retrieval system now supports both multimodal queries and outputs. It combines text-based search with direct access to raw multimodal content, making results more accurate and comprehensive (a minimal retrieval sketch follows this list).
Better Generation: The generation module has been upgraded to handle multimodal inputs directly. This means it can craft responses using original images, text, and more—leading to more accurate and context-aware answers, especially for complex, multimodal questions.
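As a rough illustration of the retrieval change, here is a sketch of an index where raw images sit next to text chunks and both are searched through one shared embedding space. The `embed` function below is a stand-in that produces random unit vectors, not a real encoder; a production system would use a CLIP-style multimodal encoder, and none of the class or function names come from the survey.

```python
import numpy as np

def embed(item, modality: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a CLIP-style encoder that maps text *and* images into one
    shared vector space. A real system would call an actual multimodal encoder."""
    rng = np.random.default_rng(abs(hash((modality, str(item)))) % 2**32)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class MultimodalIndex:
    """MRAG 2.0-style index: raw images and text chunks live side by side, so
    retrieval can return original multimodal content, not just captions."""
    def __init__(self):
        self.items, self.vectors = [], []

    def add(self, content, modality: str):
        self.items.append((content, modality))
        self.vectors.append(embed(content, modality))

    def search(self, query: str, k: int = 3):
        q = embed(query, "text")
        sims = np.stack(self.vectors) @ q        # cosine similarity (unit vectors)
        top = np.argsort(-sims)[:k]
        return [(self.items[i], float(sims[i])) for i in top]

index = MultimodalIndex()
index.add("Text passage describing retrieval recall results.", "text")
index.add(b"<raw chart image bytes>", "image")   # stored as-is, no caption step
for (content, modality), score in index.search("How did retrieval recall change?"):
    print(f"{modality:5s} score={score:+.2f}")
```

The important structural point is the last `add` call: the image is indexed in its original form, so the generator can later be handed the raw picture instead of a lossy caption.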
But there are new challenges ahead:
Using multimodal inputs can sometimes weaken the clarity of text-only queries.
Multimodal retrieval still lags behind its text-based counterpart, which can hold back the overall accuracy of the system.
Diverse data formats add complexity to generation—figuring out how to organize and present this variety is an ongoing challenge.
MRAG 3.0: A Smarter, More Complete Multimodal System
As illustrated in Figure 3, MRAG 3.0 marks a major leap forward from earlier versions, introducing a smarter architecture and broader functionality.

Here are the key upgrades:
Smarter Parsing with Visual Context: Instead of just extracting text, MRAG 3.0 keeps screenshots of document pages during parsing, reducing information loss and improving retrieval accuracy.
True End-to-End Multimodality: Earlier versions mainly focused on multimodal inputs and storage. MRAG 3.0 takes it further by also supporting multimodal outputs—text, images, videos—all integrated in a single response.
Expanded Scenarios: The system now handles more diverse scenarios, such as Retrieval-Augmented QA, Visual Question Answering (VQA), Multimodal Generation, and Fusion Outputs.
Multimodal Search Planning Module: To improve how the system retrieves information, this module adds two powerful tools (a minimal sketch of both follows this list):
Retrieval Classification: Dynamically decides whether to search using text, image, or not at all—avoiding wasted or harmful lookups.
Query Reformulation: Rewrites the user's query using visual cues or earlier results to make searches more precise, especially in complex, multi-step scenarios.
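Here is one way such a planning step could be wired up, with the MLLM hidden behind a plain callable. The routing labels, prompt wording, and function names are my assumptions for illustration; the survey describes the two capabilities, not this exact interface.

```python
from typing import Callable, Optional

def plan_retrieval(query: str, image: Optional[bytes], history: list[str],
                   ask_mllm: Callable[[str], str]) -> dict:
    """Two-stage planning:
    1) Retrieval classification: decide whether to search at all, and via which
       modality, so trivial or self-contained questions skip retrieval entirely.
    2) Query reformulation: rewrite the query using visual cues and earlier
       results so the search step receives a sharper, standalone input."""
    # --- 1. Retrieval classification ---------------------------------------
    decision = ask_mllm(
        "Answer with one of: NONE, TEXT_SEARCH, IMAGE_SEARCH.\n"
        f"Question: {query}\nImage attached: {image is not None}"
    ).strip()
    if decision == "NONE":
        return {"retrieve": False, "query": query}

    # --- 2. Query reformulation --------------------------------------------
    rewritten = ask_mllm(
        "Rewrite this question as a standalone search query, folding in any "
        f"visual context and earlier findings.\nQuestion: {query}\n"
        f"Earlier findings: {history}"
    ).strip()
    return {"retrieve": True, "mode": decision, "query": rewritten}

def fake_mllm(prompt: str) -> str:
    # Stand-in for a multimodal LLM call.
    if "Answer with one of" in prompt:
        return "TEXT_SEARCH"
    return "retrieval recall of MRAG 2.0 versus MRAG 1.0 on chart-heavy documents"

print(plan_retrieval("How much better is it?", image=None,
                     history=["MRAG 2.0 keeps raw images in its knowledge base"],
                     ask_mllm=fake_mllm))
```

The design choice worth noticing is the early return: by deciding up front that no retrieval is needed, the system avoids the wasted or harmful lookups mentioned above.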
Thoughts and Insights
Overall, MRAG 1.0 is still fundamentally a text-centric system—what it offers is pseudo-multimodality wrapped in a thin layer of surface integration. MRAG 2.0 marked a shift from "text-driven" to "semantic alignment," but its core approach still relies heavily on modal concatenation. It lacks deeper cognitive-level integration.
MRAG 3.0, on the other hand, signals an important turning point. It introduces the idea that systems shouldn't just retrieve passively—they need to decide whether to retrieve at all. This reflects a broader move from static retrieval pipelines to dynamic reasoning chains. In this new paradigm, retrievers start acting more like agents.
The evolution of the MRAG architecture clearly mirrors a larger trend in AI: the shift from tool-like systems to autonomous agents. With MRAG 3.0, we’re stepping into the early stages of intelligent planning and multimodal interaction. But there’s still a long way to go.
At its core, MRAG 3.0 still depends on projecting different modalities—images, text, video—into a shared embedding space for retrieval and generation. But this rests on a strong assumption: that all modalities can be compressed into semantically equivalent vector representations. That raises real questions—are we risking semantic drift? Are we amplifying ambiguities between modalities?
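To make that assumption tangible, here is a tiny hand-crafted example; the vectors and their labelled "axes" are invented for illustration and are not the output of any real encoder. The point is simply that once very different content is squeezed into a few shared dimensions, semantically distinct items can end up nearly indistinguishable to the retriever.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-crafted toy vectors standing in for a shared embedding space.
# Invented "axes": [downward-trend, finance, chart-like, weather]
img_stock_chart   = np.array([0.9, 0.8, 0.9, 0.0])  # image: falling stock chart
img_rainfall_plot = np.array([0.9, 0.0, 0.9, 0.8])  # image: declining rainfall plot
text_query        = np.array([0.9, 0.2, 0.9, 0.1])  # text: "show me the downward trend"

print("stock chart  :", round(cos(text_query, img_stock_chart), 3))   # ~0.92
print("rainfall plot:", round(cos(text_query, img_rainfall_plot), 3)) # ~0.88
# Both images score almost equally well because the shared space compresses
# "what the trend is about" into the same few directions -- exactly the kind
# of cross-modal ambiguity questioned above.
```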