MinerU-Diffusion: A New Path Beyond Autoregressive OCR — AI Innovations and Insights 131

May 08, 2026

The uncomfortable truth is that some OCR systems look smarter than they are because language helps them fill in the blanks. But when the page stops being predictable, real visual reading becomes much harder to fake.

Where Autoregressive OCR Starts to Break Down

Most existing OCR and vision-language models (VLMs) rely heavily on autoregressive decoding: they generate text tokens sequentially, one at a time, from left to right.

Figure 1: AR-based OCR decodes tokens left to right, causing latency, error propagation, and reliance on language priors when semantics are disrupted. MinerU-Diffusion reframes OCR as inverse rendering and uses block-wise masked diffusion to refine tokens in parallel under visual conditioning, with a tunable speed–accuracy trade-off. **Image by author**.

While this approach works well for standard text generation tasks, it’s far from ideal for document OCR. Here’s why:

Speed Issues: Documents, especially lengthy ones filled with tables, formulas, and complex layouts, require generating many tokens. Decoding each token sequentially leads to significant latency, slowing the entire recognition process.
Error Propagation: Autoregressive methods are highly sensitive to early mistakes. A single recognition error can distort the context for subsequent tokens, causing a cascade of inaccuracies that build upon one another.
Over-Reliance on Language Priors: AR models often lean heavily on linguistic cues and semantic coherence, so they may “guess” the text rather than actually perceive it. The Semantic Shuffle benchmark exposes this: when the semantic structure is disrupted or ambiguous, AR performance typically drops sharply.
OCR as Inverse Rendering: Fundamentally, document OCR is better thought of as “inverse rendering.” The goal is to reconstruct structured information (like text, layouts, tables, and equations) from a two-dimensional image. The correct interpretation primarily depends on visual evidence and spatial arrangements. Forcing a strict left-to-right serialization is merely an "implementation artifact" for representation convenience, rather than a fundamental property of how documents are actually structured.
A Strong Fit for Diffusion: Unlike open-ended text generation (like chatting with ChatGPT), OCR is a near-deterministic task with limited semantic ambiguity. This makes OCR a strong candidate for masked diffusion, where masked tokens can be predicted in parallel conditioned on the image and partially observed sequence, producing a tunable speed–accuracy trade-off.

Figure 2: Overview of the document OCR inverse rendering process via different decoding methods. The model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. [Source].

Given these considerations, document OCR systems would greatly benefit from decoding strategies that are parallelized, globally consistent, and strongly grounded in visual features. Rather than forcing OCR into the sequential patterns of autoregressive language generation, it’s more natural to employ methods designed specifically to exploit visual structure.

MinerU-Diffusion: From Left-to-Right OCR to Parallel Visual Decoding

MinerU-Diffusion uses diffusion-based decoding instead of the traditional autoregressive method, enabling the model to simultaneously confirm or correct multiple tokens through visual context. This approach boosts processing speed, reduces error propagation, and decreases reliance on linguistic context for guessing content.

Figure 3: (a) The confidence threshold controls decoding parallelism in MinerU-Diffusion. Compared to MinerU2.5, this method achieves up to 3.26× speedup. (b) MinerU-Diffusion maintains a strong accuracy–efficiency trade-off, achieving 2.12× speedup with 99.9% and 3.01× speedup with 98.8% relative accuracy. (c) Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked, enabling parallel generation with global consistency, in contrast to autoregressive left-to-right decoding. [Source].

The method can be understood through four practical components.

1. Unified Output Format

Text, layout annotations, table symbols, and formula indicators are all represented as a unified sequence of tokens.

For document parsing, the model outputs a structured sequence rather than only plain text; task-specific prompts can still produce plain text, LaTeX, or table markup.

2. Diffusion-Based Decoding Replacing Autoregression

During training, tokens are randomly masked, prompting the model to predict these masked elements based on the surrounding context and visual evidence from the document image.
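The masking step can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the `<mask>` token name, the `mask_ratio` parameter, and the `mask_sequence` helper are all assumptions made for the example.

```python
import random

MASK = "<mask>"  # hypothetical mask token name, not taken from the paper


def mask_sequence(tokens, mask_ratio, rng):
    """Randomly replace a fraction of tokens with MASK.

    Returns the corrupted sequence and the sorted indices that were masked.
    """
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_idx:
        corrupted[i] = MASK
    return corrupted, masked_idx


rng = random.Random(0)
target = ["<table>", "<tr>", "<td>", "42", "</td>", "</tr>", "</table>"]
corrupted, masked_idx = mask_sequence(target, mask_ratio=0.5, rng=rng)

# Only the masked positions carry a training label; the model would predict
# them from the visible tokens plus the image and prompt (not modeled here).
labels = {i: target[i] for i in masked_idx}
print(corrupted)
print(labels)
```

The point of the sketch is the shape of the objective: the loss is defined only on the masked positions, while the unmasked tokens act as context.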

At inference, the model starts from masked positions and reconstructs them over multiple iterative rounds, using already decoded context and visual features. Uncertain tokens are progressively revealed and corrected rather than generated strictly from left to right.

3. Block-wise Diffusion

Diffusing across an entire document sequence can be slow and unstable, so sequences are divided into smaller blocks:

Within blocks: Diffusion is parallelized, and context is considered bidirectionally.
Between blocks: A coarse, front-to-back dependency helps preserve sequence coherence and reduce long-range drift.
System Efficiency: The causal (front-to-back) structure across blocks naturally enables efficient KV-caching during inference, reducing memory and computation costs compared to full-attention diffusion models.

This design maintains fast parallel decoding while mitigating position drift and error accumulation common in lengthy documents.
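The attention pattern behind this design is easy to write down. The sketch below builds the mask for the text tokens only, assuming fixed-size blocks; the function name and the tiny sizes are illustrative, and image and prompt tokens are left out.

```python
def block_attention_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Same block: visible in both directions.
            # Earlier block: visible. Later block: hidden.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
```

Each row shows what one position can see. Positions in the same block share an identical row, which is why the keys and values of completed blocks can be cached: later blocks never change what earlier blocks attended to.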

4. Confidence-Driven Dynamic Decoding + Two-Stage Training

During inference, tokens with high confidence are confirmed first, while low-confidence tokens undergo further iterative correction. Confidence thresholds balance decoding speed and accuracy.
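A toy version of this loop shows how the threshold changes the number of decoding steps. Everything here is an assumption for illustration: the predictor is a stand-in that always proposes the correct token and grows more confident as more tokens are revealed, and the fallback that commits the single most confident token when nothing clears the threshold is one simple way to guarantee progress, not necessarily the one used by MinerU-Diffusion.

```python
MASK = None  # placeholder for a still-masked position


def decode_block(predict, block_len, threshold, max_steps=32):
    """Confirm tokens whose confidence clears the threshold, step by step."""
    tokens = [MASK] * block_len
    steps = 0
    while MASK in tokens and steps < max_steps:
        preds = predict(tokens)  # {position: (token, confidence)} for masked slots
        accepted = [p for p, (_, conf) in preds.items() if conf >= threshold]
        if not accepted:
            # Nothing is confident enough: commit the best guess to keep moving.
            accepted = [max(preds, key=lambda p: preds[p][1])]
        for p in accepted:
            tokens[p] = preds[p][0]
        steps += 1
    return tokens, steps


def make_toy_predictor(truth, base_conf):
    """Stand-in model: confidence rises by 0.1 per already revealed token."""
    def predict(tokens):
        revealed = sum(t is not MASK for t in tokens)
        return {
            i: (truth[i], min(1.0, base_conf[i] + 0.1 * revealed))
            for i, t in enumerate(tokens)
            if t is MASK
        }
    return predict


truth = list("E=mc^2")
predict = make_toy_predictor(truth, [0.95, 0.85, 0.65, 0.62, 0.45, 0.55])
fast_tokens, fast_steps = decode_block(predict, len(truth), threshold=0.7)
careful_tokens, careful_steps = decode_block(predict, len(truth), threshold=0.9)
print(fast_steps, careful_steps)
```

Because the toy predictor is never wrong, this only illustrates the speed side of the trade-off. With a real model, a lower threshold also raises the chance of committing a wrong token early.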

Figure 4: Training of MinerU-Diffusion. Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks. [Source].

After multimodal initialization, training happens in two stages: initial broad-scale training provides general capabilities, followed by an uncertainty-driven refinement. The model automatically mines challenging examples (like complex tables or ambiguous boundaries) by measuring its own inference consistency, focusing its learning on the hardest cases to enhance robustness.
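One way to picture the mining step: decode each sample several times and flag the samples whose outputs disagree. The post does not spell out how consistency is measured, so the exact-match agreement score, the `max_consistency` cutoff, and the sample ids below are all assumptions.

```python
from collections import Counter


def consistency(outputs):
    """Fraction of decoding runs that agree with the most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)


def mine_hard_examples(runs_by_sample, max_consistency=0.7):
    """Keep the ids of samples whose repeated decodes disagree too often."""
    return [
        sample_id
        for sample_id, outputs in runs_by_sample.items()
        if consistency(outputs) <= max_consistency
    ]


# Hypothetical decoding runs: a clean paragraph and an ambiguous table cell.
runs_by_sample = {
    "page_1": ["a b c", "a b c", "a b c", "a b c"],
    "table_7": ["<td>1</td>", "<td>l</td>", "<td>1</td>", "<td>I</td>"],
}
hard = mine_hard_examples(runs_by_sample)
print(hard)
```

A softer score, such as average pairwise edit similarity, would probably work better on long outputs where exact matches are rare.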

In short, MinerU-Diffusion treats document OCR as the inverse problem of reconstructing structured text from images. It uses block-wise diffusion to refine tokens in parallel, and it relies on confidence-driven scheduling and hard-case training to improve decoding speed, stability, and reliability.

Evaluation

Document Parsing Evaluation

MinerU-Diffusion’s full-page document parsing is evaluated on OmniDocBench v1.5, using metrics such as text edit distance, formula correctness (CDM), table extraction quality (TEDS), and reading order.
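For readers unfamiliar with the text metric, here is a minimal sketch of a normalized edit distance. It is the standard Levenshtein distance divided by the longer string's length; that normalization is an assumption and may differ from what OmniDocBench implements.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Scale the distance to [0, 1]; lower is better."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


print(edit_distance("kitten", "sitting"))
print(normalized_edit_distance("MinerU", "MinerU"))
```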

MinerU-Diffusion achieves an overall score of 88.94 without ground-truth layouts. With ground-truth layouts, the score rises to 93.37, a gain of 4.43 points that brings it very close to strong autoregressive OCR systems. In other words, once layout errors are removed, its recognition quality is highly competitive.

Efficiency Evaluation

Figure 6: Threshold sensitivity analysis of TPF, TPS, and accuracy. TPF denotes tokens per forward, and TPS refers to throughput measured on an NVIDIA H200 GPU with a batch size of 1. [Source].

Efficiency was tested by adjusting the confidence thresholds, which determine how many tokens the model finalizes in a single decoding step. Lower thresholds led to faster decoding speeds, while higher thresholds improved stability.

MinerU-Diffusion achieved up to a 3.26× decoding speedup over MinerU2.5 (Figure 3), maintaining a clear advantage in speed even at high accuracy levels.
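In practice, the trade-off becomes a selection problem: choose the fastest setting that still meets an accuracy floor. The sketch below uses the two operating points quoted in Figure 3(b); the helper name and the accuracy floors are made up for illustration, and the thresholds behind each point are not given in the post.

```python
def pick_operating_point(points, min_relative_accuracy):
    """Return the fastest (speedup, relative_accuracy) pair meeting the floor."""
    eligible = [p for p in points if p[1] >= min_relative_accuracy]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p[0])


# (speedup over MinerU2.5, relative accuracy in %), as quoted in Figure 3(b).
reported = [(2.12, 99.9), (3.01, 98.8)]

strict = pick_operating_point(reported, min_relative_accuracy=99.5)
relaxed = pick_operating_point(reported, min_relative_accuracy=98.0)
print(strict, relaxed)
```

A pipeline that cannot tolerate more than a 0.5% relative drop would land on the 2.12× point, while one that accepts a 2% drop can take the 3.01× point.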

Thoughts

At its core, MinerU-Diffusion transforms OCR decoding from sequential token-by-token generation into a visually-driven, block-wise diffusion process: tokens are refined in parallel within each block, while blocks retain a coarse front-to-back dependency.

Coupled with uncertainty-driven curriculum training, this is a change at the level of the decoding paradigm, not merely a swap of the underlying model backbone.

But I have a concern.

Block boundaries could introduce new sources of subtle errors. MinerU-Diffusion mitigates this by allowing tokens to attend causally to preceding blocks, but those tokens remain strictly cut off from future blocks. Structures like headers, footers, table cells, or formulas spanning line breaks might still be disrupted if they fall near these boundaries. Such systematic fragmentation might not clearly surface through averaged evaluation metrics.

