<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Exploration Journey]]></title><description><![CDATA[A passionate AI researcher who enjoys delving into underlying principles and writing in-depth articles.]]></description><link>https://aiexpjourney.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!5JuB!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c45940b-721b-4f3b-ad25-2541b18dcd88_599x599.png</url><title>AI Exploration Journey</title><link>https://aiexpjourney.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 10 May 2026 12:45:09 GMT</lastBuildDate><atom:link href="https://aiexpjourney.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Florian June]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aiexpjourney@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aiexpjourney@substack.com]]></itunes:email><itunes:name><![CDATA[Florian]]></itunes:name></itunes:owner><itunes:author><![CDATA[Florian]]></itunes:author><googleplay:owner><![CDATA[aiexpjourney@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aiexpjourney@substack.com]]></googleplay:email><googleplay:author><![CDATA[Florian]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[MinerU-Diffusion: A New Path Beyond Autoregressive OCR — AI Innovations and Insights 131]]></title><description><![CDATA[The uncomfortable truth is that some OCR systems look smarter than they are because language helps them fill in the blanks. 
But when the page stops being predictable, real visual reading becomes much harder to fake.]]></description><link>https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond</guid><pubDate>Fri, 08 May 2026 01:38:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Uyw3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>The uncomfortable truth is that some OCR systems look smarter than they are because language helps them fill in the blanks.</strong> But when the page stops being predictable, real visual reading becomes much harder to fake.</p><h2>Where Autoregressive OCR Starts to Break Down</h2><p>Most existing OCR and vision-language models (VLMs) rely heavily on autoregressive decoding, meaning they generate text tokens sequentially, one by one, from left to right. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uyw3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uyw3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 424w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 848w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uyw3!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png" width="1200" height="759.065934065934" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:4272528,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Uyw3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 424w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 848w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: AR-based OCR decodes tokens left to right, causing latency, error propagation, and reliance on language priors when semantics are disrupted. MinerU-Diffusion reframes OCR as inverse rendering and uses block-wise masked diffusion to refine tokens in parallel under visual conditioning, with a tunable speed&#8211;accuracy trade-off. <strong>Image by author</strong>.</figcaption></figure></div><p>While this approach works well for standard text generation tasks, it&#8217;s far from ideal for document OCR. Here&#8217;s why:</p><ol><li><p><strong>Speed Issues:</strong> Documents, especially lengthy ones filled with tables, formulas, and complex layouts, require generating many tokens. 
Decoding each token sequentially leads to significant latency, slowing the entire recognition process.</p></li><li><p><strong>Error Propagation:</strong> Autoregressive methods are highly sensitive to early mistakes. A single recognition error can distort the context for subsequent tokens, causing a cascade of inaccuracies that build upon one another.</p></li><li><p><strong>Over-Reliance on Language Priors:</strong> On the Semantic Shuffle benchmark, AR models often lean heavily on linguistic cues and semantic coherence. This means they may &#8220;guess&#8221; rather than genuinely perceive the actual text. When the semantic structure is disrupted or ambiguous, AR performance typically drops dramatically.</p></li><li><p><strong>OCR as Inverse Rendering:</strong> Fundamentally, document OCR is better thought of as &#8220;inverse rendering&#8221;: the goal is to reconstruct structured information (text, layouts, tables, and equations) from a two-dimensional image. The correct interpretation depends primarily on visual evidence and spatial arrangement. Forcing a strict left-to-right serialization is merely an &#8220;implementation artifact&#8221; of representational convenience, not a fundamental property of how documents are actually structured.</p></li><li><p><strong>A Strong Fit for Diffusion:</strong> Unlike open-ended text generation (like chatting with ChatGPT), OCR is a near-deterministic task with limited semantic ambiguity. 
This makes OCR a strong candidate for masked diffusion, where masked tokens can be predicted in parallel conditioned on the image and partially observed sequence, producing a tunable speed&#8211;accuracy trade-off.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AlNT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AlNT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 424w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 848w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1272w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AlNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png" width="1456" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148795,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AlNT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 424w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 848w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1272w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Overview of the document OCR inverse rendering process via different decoding methods. The model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>Given these considerations, document OCR systems would greatly benefit from decoding strategies that are parallelized, globally consistent, and strongly grounded in visual features. 
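As a contrast with left-to-right decoding, here is a minimal sketch of confidence-driven parallel masked decoding; the `score` function fakes per-position predictions and confidences and is purely illustrative, not MinerU-Diffusion's actual model:

```python
import random

# Toy confidence-driven masked decoding (illustrative only). A real system
# would get predictions and confidences from a transformer conditioned on
# the page image and the partially observed token sequence.
TARGET = ["T", "a", "b", "l", "e", " ", "1"]

def score(seq):
    rng = random.Random(0)  # deterministic fake confidences
    return {i: (TARGET[i], rng.uniform(0.5, 1.0))
            for i, tok in enumerate(seq) if tok is None}

def decode_masked(length, threshold=0.7, max_rounds=10):
    seq = [None] * length  # start fully masked
    for _ in range(max_rounds):
        preds = score(seq)
        if not preds:      # nothing left to decode
            break
        # Confirm every position whose confidence clears the threshold...
        accepted = {i: tok for i, (tok, conf) in preds.items() if conf >= threshold}
        if not accepted:   # ...but always commit at least the single best guess
            i, (tok, _) = max(preds.items(), key=lambda kv: kv[1][1])
            accepted = {i: tok}
        for i, tok in accepted.items():
            seq[i] = tok   # many tokens can be confirmed in one round
    return "".join(seq)

print(decode_masked(len(TARGET)))  # → "Table 1"
```

The threshold acts as the speed-accuracy dial: a lower value confirms more tokens per round (faster, riskier), while a higher value spends extra rounds correcting uncertain positions.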
Rather than forcing OCR into the sequential patterns of autoregressive language generation, it&#8217;s more natural to employ methods designed specifically to exploit visual structure.</p><h2>MinerU-Diffusion: From Left-to-Right OCR to Parallel Visual Decoding</h2><p>MinerU-Diffusion uses diffusion-based decoding instead of the traditional autoregressive method, enabling the model to simultaneously confirm or correct multiple tokens through visual context. This approach boosts processing speed, reduces error propagation, and decreases reliance on linguistic context for guessing content.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Xcp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Xcp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 424w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 848w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png" width="1456" height="1103" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1103,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:530923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Xcp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 424w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 848w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: (a) The confidence threshold controls decoding parallelism in MinerU-Diffusion. Compared to MinerU2.5, this method achieves up to 3.26&#215; speedup. (b) MinerU-Diffusion maintains a strong accuracy&#8211;efficiency trade-off, achieving 2.12&#215; speedup with 99.9% and 3.01&#215; speedup with 98.8% relative accuracy. 
(c) Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked, enabling parallel generation with global consistency, in contrast to autoregressive left-to-right decoding. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>The method can be understood through four practical components.</p><h4>1. Unified Output Format</h4><p>Text, layout annotations, table symbols, and formula indicators are all represented as a unified sequence of tokens.</p><p>For document parsing, the model outputs a structured sequence rather than only plain text; task-specific prompts can still produce plain text, LaTeX, or table markup.</p><h4>2. Diffusion-Based Decoding Replacing Autoregression</h4><p>During training, tokens are randomly masked, prompting the model to predict these masked elements based on the surrounding context and visual evidence from the document image.</p><p>At inference, the model progressively reconstructs masked positions using already decoded context and visual features. Over multiple iterative rounds, uncertain tokens are <strong>revealed and</strong> corrected in parallel rather than generated one token at a time from left to right.</p><h4>3. Block-wise Diffusion</h4><p>Diffusing across an entire document sequence can be slow and unstable, so sequences are divided into smaller blocks:</p><ul><li><p><strong>Within blocks</strong>: Diffusion is parallelized, and context is considered bidirectionally.</p></li><li><p><strong>Between blocks</strong>: A coarse, front-to-back dependency helps preserve sequence coherence and reduce long-range drift. 
</p></li><li><p><strong>System Efficiency</strong>: The causal (front-to-back) structure across blocks naturally enables <strong>efficient KV-caching</strong> during inference, reducing memory and computation costs compared to full-attention diffusion models.</p></li></ul><p>This design maintains fast parallel decoding while mitigating position drift and error accumulation common in lengthy documents.</p><h4>4. Confidence-Driven Dynamic Decoding + Two-Stage Training</h4><p>During inference, tokens with high confidence are confirmed first, while low-confidence tokens undergo further iterative correction. Confidence thresholds balance decoding speed and accuracy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vqsu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vqsu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 424w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 848w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png" width="1456" height="493" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150475,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vqsu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 424w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1272w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: Training of MinerU-Diffusion. 
Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>After multimodal initialization, training happens in two stages: initial broad-scale training provides general capabilities, followed by an <strong>uncertainty-driven</strong> refinement. The model automatically mines challenging examples (like complex tables or ambiguous boundaries) <strong>by measuring its own inference consistency</strong>, focusing its learning on the hardest cases to enhance robustness.</p><p>In short, MinerU-Diffusion treats document OCR as the inverse problem of reconstructing structured text from images, leveraging block-wise diffusion to refine tokens in parallel, and employing confidence-driven scheduling and challenging-case training to boost decoding speed, stability, and reliability.</p><h2>Evaluation</h2><h4>Document Parsing Evaluation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!82wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!82wd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 424w, 
https://substackcdn.com/image/fetch/$s_!82wd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 848w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1272w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!82wd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png" width="1456" height="585" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:585,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!82wd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 424w, https://substackcdn.com/image/fetch/$s_!82wd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 848w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1272w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: Comprehensive evaluation of document parsing on OmniDocBench v1.5. &#8593; denotes higher is better, &#8595; denotes lower is better. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>MinerU-Diffusion&#8217;s capability in full-page document parsing is evaluated using OmniDocBench v1.5, measuring its performance through various metrics such as text edit distance, formula correctness (CDM), table extraction quality (TEDS), and reading order.</p><p>The results showed that MinerU-Diffusion achieved an overall score of <strong>88.94</strong> without using ground-truth layouts. When provided with ground-truth layouts, the score improved significantly to <strong>93.37</strong>, coming very close to the performance of strong autoregressive OCR systems. 
This mainly shows that once layout errors are removed, its recognition quality is highly competitive.</p><h4>Efficiency Evaluation</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fyib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fyib!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 424w, https://substackcdn.com/image/fetch/$s_!fyib!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 848w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1272w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fyib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png" width="1456" height="345" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fyib!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 424w, https://substackcdn.com/image/fetch/$s_!fyib!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 848w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1272w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 6: Threshold sensitivity analysis of TPF, TPS, and accuracy. 
TPF denotes tokens per forward, and TPS refers to throughput measured on an NVIDIA H200 GPU with a batch size of 1. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>Efficiency was tested by adjusting the confidence thresholds, which determine how many tokens the model finalizes in a single decoding step. Lower thresholds led to faster decoding speeds, while higher thresholds improved stability. </p><p>MinerU-Diffusion achieved up to a <strong>3.2&#215; decoding speedup</strong>, maintaining a clear advantage in speed even at high accuracy levels.</p><h2>Thoughts</h2><p>At its core, MinerU-Diffusion transforms OCR decoding from sequential token-by-token generation into a visually-driven, block-wise diffusion process: tokens are refined in parallel within each block, while blocks retain a coarse front-to-back dependency.</p><p>Coupled with uncertainty-driven curriculum training, this shift represents a fundamental change at the decoding paradigm level, not merely swapping out the underlying model backbone.</p><p><strong>But I have a concern.</strong> </p><p>Block boundaries could introduce new sources of subtle errors. <strong>While </strong>MinerU-Diffusion<strong> mitigates this by allowing tokens to causally attend to preceding blocks, they are strictly cut off from future blocks.</strong> Structures like headers, footers, table cells, or formulas spanning line breaks might still be disrupted if they fall near these boundaries. 
Such systemic fragmentation might not clearly surface through averaged evaluation metrics.</p><div><hr></div><p>Reference: </p><ul><li><p>Paper: <a href="https://arxiv.org/pdf/2603.22458v1">MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</a>.</p></li><li><p>Code: <a href="https://github.com/opendatalab/MinerU-Diffusion">https://github.com/opendatalab/MinerU-Diffusion</a>.</p></li></ul><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading AI Exploration Journey! Feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p></p>]]></content:encoded></item><item><title><![CDATA[MultiDocFusion: From Flat Chunks to Hierarchy-Aware RAG — AI Innovations and Insights 130]]></title><description><![CDATA[For RAG, a lot of us begin with the same almost invisible premise: chunk the document, embed the chunks, retrieve the top matches, and the rest will take care of itself.]]></description><link>https://aiexpjourney.substack.com/p/multidocfusion-from-flat-chunks-to</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/multidocfusion-from-flat-chunks-to</guid><pubDate>Mon, 04 May 2026 02:04:51 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!cKD1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For RAG, a lot of us begin with the same almost invisible premise: chunk the document, embed the chunks, retrieve the top matches, and the rest will take care of itself. </p><p>That story holds up right until a long industrial PDF arrives on your desk and makes one thing obvious: documents were never meant to be understood as confetti.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Hidden Pitfalls of Naive RAG Chunking</h2><p>Many RAG systems retrieve relevant chunks from documents and then ask an LLM to answer from that retrieved context.</p><p>However, there&#8217;s a catch: to make documents searchable, especially longer ones, they usually have to be split into smaller pieces. Traditional methods for splitting these documents can be overly simplistic, such as slicing by a fixed number of words or strictly relying on semantic meaning.</p><p>In practice, this approach often leads to problems. 
Real-world documents aren&#8217;t simply chunks of plain text; they often include:</p><ul><li><p>Multiple levels of headings and subheadings</p></li><li><p>Tables, figures, and other layout-heavy elements</p></li><li><p>Content spanning multiple pages</p></li><li><p>Scanned PDFs that require OCR processing</p></li></ul><p>When documents are just &#8220;split by length,&#8221; it&#8217;s akin to randomly chopping up a textbook: headings end up isolated from their sections, paragraphs get separated from accompanying tables, and critical context can get lost altogether. </p><p>This fragmentation can make retrieval return disjointed evidence, which can reduce answer quality. The problem becomes especially pronounced when dealing with industrial documents, financial reports, legal contracts, and scanned materials.</p><h2>Core Idea</h2><p>To make RAG over long documents more faithful, three kinds of signals need to be modeled together:</p><ul><li><p>Visual layout (what the document looks like)</p></li><li><p>Textual content (what the document says)</p></li><li><p>Structural hierarchy (how the document is organized)</p></li></ul><p>As shown in Figure 1, MultiDocFusion combines visual layout, text, and hierarchy; the name &#8220;Fusion&#8221; refers to that integration, while &#8220;MultiDoc&#8221; also reflects support for diverse document formats and corpus-level multi-document RAG scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ux6L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ux6L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 424w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 848w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1272w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ux6L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png" width="728" height="421.8426966292135" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:722,&quot;width&quot;:1246,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:178052,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194862047?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ux6L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 424w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 848w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1272w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: The pipeline for MultiDocFusion. The figure illustrates the step-by-step process for handling a long industrial document. (a) DP extracts layout structures; (b) OCR recognizes and annotates text; (c) DSHP-LLM constructs a hierarchical tree from identified section headers and general nodes; (d) DFS-based Grouping constructs coherent hierarchical chunks for retrieval tasks. The color-coded blocks represent document elements: yellow for Root and Title, red for Section Headers, and green for general nodes (tables, figures, and text blocks). 
[<a href="https://arxiv.org/pdf/2604.12352v1">Source</a>].</figcaption></figure></div><h4>An Intuitive Analogy</h4><p>Traditional fixed-length methods are more like slicing a book at fixed token boundaries.</p><p>MultiDocFusion, on the other hand, acts like an intelligent reader: it dynamically reconstructs the document's structural tree (like building its own table of contents on the fly) and carefully groups content that naturally belongs together.</p><p>This makes it particularly effective for handling long, complex documents that reflect real-world complexities.</p><h2>MultiDocFusion: How Does It Actually Work?</h2><p>Think of it as a four-step process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cKD1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cKD1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 424w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 848w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 1272w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cKD1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png" width="1200" height="787.0879120879121" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:955,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1176450,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194862047?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cKD1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 424w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 848w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cKD1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: MultiDocFusion in 4 steps. Image by author.</figcaption></figure></div><h4>Step 1: Understanding the Document Layout (Document Parsing)</h4>
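<p>As a rough sketch of what this stage produces (per Figure 1, layout detection followed by OCR annotation), the parser's output can be pictured as a list of typed, positioned blocks. The class and field names below are illustrative assumptions, not MultiDocFusion's actual schema:</p>

```python
# Hypothetical sketch: the kind of structure a document-parsing (DP) stage
# might emit for one page -- typed blocks with positions. Field names are
# illustrative assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class LayoutBlock:
    kind: str        # e.g. "title", "section_header", "text", "table", "figure"
    bbox: tuple      # (x0, y0, x1, y1) in page coordinates
    page: int
    text: str = ""   # filled in later by the OCR stage

page_blocks = [
    LayoutBlock("section_header", (50, 40, 550, 70), page=1),
    LayoutBlock("text", (50, 80, 550, 300), page=1),
    LayoutBlock("table", (50, 310, 550, 500), page=1),
]

# Because downstream stages see typed regions rather than raw text, a table
# is never accidentally merged into the paragraph above it.
headers = [b for b in page_blocks if b.kind == "section_header"]
```

<p>The point of the sketch is that structure is preserved from the very first step, before any chunking happens.</p>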
      <p>
          <a href="https://aiexpjourney.substack.com/p/multidocfusion-from-flat-chunks-to">
              Read more
          </a>
      </p>
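<p>To make the hierarchy-aware grouping of Figure 1 concrete, here is a minimal, hypothetical sketch: a subtree of the section tree becomes one chunk when it fits a token budget, and is otherwise split along its children. The node layout and budget are assumptions for illustration, not the paper's exact algorithm:</p>

```python
# Hypothetical sketch of hierarchy-aware chunking over a section tree
# (cf. Figure 1): keep a subtree together when it fits the budget,
# otherwise recurse into its children. Simplification: a split node's
# own header text is not re-attached to the resulting chunks.
def subtree_tokens(node):
    return node["tokens"] + sum(subtree_tokens(c) for c in node.get("children", []))

def collect_ids(node):
    ids = [node["id"]]
    for c in node.get("children", []):
        ids += collect_ids(c)
    return ids

def group_sections(node, budget):
    if subtree_tokens(node) <= budget:
        return [collect_ids(node)]
    chunks = []
    for c in node.get("children", []):
        chunks += group_sections(c, budget)
    return chunks

doc = {"id": "root", "tokens": 0, "children": [
    {"id": "sec1", "tokens": 20, "children": [
        {"id": "p1", "tokens": 60}, {"id": "p2", "tokens": 50}]},
    {"id": "sec2", "tokens": 20, "children": [{"id": "p3", "tokens": 40}]},
]}
chunks = group_sections(doc, budget=100)
```

<p>Unlike fixed-length slicing, the small section (sec2) stays whole with its paragraph, and only the oversized section is split, which is the behavior the article argues for.</p>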
   ]]></content:encoded></item><item><title><![CDATA[Trace2Skill: Distilling Agent Experience into Transferable Skills — AI Innovations and Insights 129]]></title><description><![CDATA[In a previous article (Agent Skills: Distill Tasks into Discoverable, Reusable Capabilities), I explained what agent skills are.]]></description><link>https://aiexpjourney.substack.com/p/trace2skill-distilling-agent-experience</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/trace2skill-distilling-agent-experience</guid><pubDate>Thu, 30 Apr 2026 02:50:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6KRW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd271851-fea3-4580-9c68-91c1528ec09e_1356x498.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a previous article (<a href="https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable">Agent Skills: Distill Tasks into Discoverable, Reusable Capabilities</a>), I explained what agent skills are. </p><p>Today, the focus shifts to a new method for acquiring those skills.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why Agents Need a Master Manual, Not a Pile of Sticky Notes </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4wiG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4wiG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4wiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png" width="728" height="485.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1291266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194611304?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4wiG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4wiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: An example of Skill. 
Image by author.</figcaption></figure></div><p>A skill is more than just a single file; it's a structured knowledge directory containing a root SKILL.md (the instruction manual) alongside auxiliary resources like executable scripts and reference files.</p><p>As agents tackle more specialized domains, the demand for highly specialized, comprehensive skill documentation grows, but human authoring simply cannot keep up.</p><p>There are also two major challenges:</p><p>First, asking a large language model to generate skills purely from its parametric knowledge often misses the real pain points, operational nuances, and common failure modes of a target domain. The resulting skills might look plausible on paper, but they rarely provide meaningful help in actual tasks.</p><p>Second, many existing online update methods based on trajectory data tend to fragment knowledge and overfit to local experience.</p><ul><li><p>Fragmentation: Many approaches extract a lesson from each trajectory, ending up with a heap of disconnected skills that are hard to search and apply.</p></li><li><p>Sequential updates: Updating skills after every new trajectory is like trying to revise a guide without ever fully understanding the domain. It&#8217;s easy for a single trajectory to skew the knowledge.</p></li></ul>
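<p>As a quick, concrete picture of that directory layout, the sketch below builds such a skill on disk. Only the root SKILL.md comes from the description above; the skill name and auxiliary file names are hypothetical:</p>

```python
# Sketch of the skill layout described above: a root SKILL.md plus
# auxiliary resources. Only SKILL.md is from the source; the skill name
# and the other file names are illustrative assumptions.
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp()) / "pdf-table-extraction"  # hypothetical name
(root / "scripts").mkdir(parents=True)
(root / "references").mkdir()
(root / "SKILL.md").write_text(
    "# PDF Table Extraction\n"
    "When to use: ...\n"
    "Steps: ...\n"
    "Common failure modes: ...\n"
)
(root / "scripts" / "extract.py").write_text("# executable helper\n")
(root / "references" / "notes.md").write_text("# reference material\n")

files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
```

<p>The SKILL.md acts as the entry point an agent reads first, while scripts and references are loaded only when the skill is actually invoked.</p>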
      <p>
          <a href="https://aiexpjourney.substack.com/p/trace2skill-distilling-agent-experience">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MinerU2.5-Pro: Teaching a 1.2B Model to Parse Like a Giant — AI Innovations and Insights 128]]></title><description><![CDATA[In my earlier articles, I already covered the core idea behind MinerU (From Big Picture to Details: MinerU 2.5 Redefines Document Parsing &#8212; AI Innovations and Insights 77).]]></description><link>https://aiexpjourney.substack.com/p/mineru25-pro-teaching-a-12b-model</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mineru25-pro-teaching-a-12b-model</guid><pubDate>Sun, 26 Apr 2026 01:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!E_t9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16f2cd1e-0714-4246-b92b-1561c75c5f7e_1424x646.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my earlier articles, I already covered the core idea behind MinerU (<a href="https://aiexpjourney.substack.com/p/from-big-picture-to-details-mineru">From Big Picture to Details: MinerU 2.5 Redefines Document Parsing &#8212; AI Innovations and Insights 77</a>). </p><p>In this post, let&#8217;s take a look at its latest progress.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Key Constraint in Document Parsing</h2><p>A recurring pattern has emerged: state-of-the-art models with different architectures and parameter sizes often make almost identical mistakes on challenging real-world PDF samples. </p><p>Simply switching to a larger model is unlikely to solve the problem. Instead, the bottleneck lies in a shared set of data-related challenges that limit progress across the board.</p><p>As a result, the key constraint on further advances in document parsing may no longer be model architecture, but the training data itself.</p><p>This data bottleneck has two main dimensions:</p><ul><li><p>First, insufficient coverage. Training datasets contain plenty of common page layouts, but long-tail scenarios such as complex nested tables, dense mathematical formulas, and <strong>unconventional multi-column layouts</strong> remain underrepresented.</p></li><li><p>Second, there is an annotation quality paradox: many of the most informative samples, especially the difficult ones, are exactly the cases where automatic annotation is least reliable. Because of this, simply scaling raw data volume is insufficient: without better sampling and more reliable labels, it can amplify both distribution bias and annotation noise.</p></li></ul><p>Given these challenges, an important question arises: if the architecture remains unchanged, can systematic data engineering and staged training still deliver significant gains, and can a more discriminative evaluation protocol reveal those gains more faithfully? 
</p><p><strong>More provocatively: can a 1.2B-parameter model rely purely on data to outperform giants with over 200x more parameters?</strong></p><h2>MinerU2.5-Pro: Data Engine</h2>
      <p>
          <a href="https://aiexpjourney.substack.com/p/mineru25-pro-teaching-a-12b-model">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Agent Skills: Distill Tasks into Discoverable, Reusable Capabilities — AI Innovations and Insights 127]]></title><description><![CDATA[Building Agents is like training a new colleague who&#8217;s brilliant but forgetful.]]></description><link>https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable</guid><pubDate>Thu, 23 Apr 2026 01:04:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fBjM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0305ca1-e38d-4189-8940-c87f5e5b77ea_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building Agents is like training a new colleague who&#8217;s brilliant but forgetful. </p><p>They know what to do, just not always in the right order or at the right time.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why Skills are Needed?</h2><p>Over the past two years, discussions around Agents have often focused on bigger models, longer context windows, and more tools. 
</p><p>But the real challenge in <strong>production-level</strong> Agents isn&#8217;t whether the model can do something; it&#8217;s whether it can reliably complete a task the same way every time.</p><p>As an Agent&#8217;s tasks grow longer, environments become more complex, and branches multiply, language models tend to drift. Steps get skipped, sequences get jumbled, tools are used inconsistently, and stopping conditions vary. </p><p>These challenges are known as the procedural burden of agency.</p><p>Before diving into Skills, let&#8217;s take a look at where they fit within an Agent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!deG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!deG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 424w, https://substackcdn.com/image/fetch/$s_!deG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 848w, https://substackcdn.com/image/fetch/$s_!deG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!deG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!deG_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png" width="1200" height="864.5604395604396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1049,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1006422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194257779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!deG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 424w, https://substackcdn.com/image/fetch/$s_!deG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 848w, https://substackcdn.com/image/fetch/$s_!deG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1272w, 
https://substackcdn.com/image/fetch/$s_!deG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: Externalization as the organizing principle of LLM agent design. Upper panel: The arc of human cognitive externalization from thought through language, writing, printing, to digital computation. 
Middle panel: The corresponding externalization arc for LLM agents, from weights through three externalization dimensions&#8212;Memory (externalized state), Skills (externalized expertise), and Protocols (externalized interaction)&#8212;to the Harness that unifies them. Lower panel: A literature landscape mapping representative works onto three capability layers&#8212;Weights, Context, and Harness&#8212;illustrating how research threads have progressively migrated outward. The parallel between the two arcs encodes a recursive claim: LLM agents achieve reliable agency by externalizing cognitive burdens along the same representational dimensions that have driven human cognitive history. [<a href="https://arxiv.org/pdf/2604.08224v1">Source</a>].</figcaption></figure></div><h2>Skills</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q6l2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q6l2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 424w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 848w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q6l2!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png" width="1200" height="1183.5164835164835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:442684,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194257779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q6l2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 424w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 848w, 
https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1272w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: An example of Skills. 
Image by author.</figcaption></figure></div><p>Skills were created as an externalization response to this procedural burden.</p><p>A skill is neither just a tool nor merely a prompt file; it is an externalized form of procedural expertise. Tools answer the question &#8220;what actions can be performed,&#8221; protocols answer &#8220;how these actions are described and invoked,&#8221; and a Skill answers &#8220;how this type of task should be handled as a whole.&#8221;</p><p>A truly reusable Skill has at least three layers: the operational flow, decision heuristics, and governing constraints. In other words, it not only tells an Agent what to do first and what to do next, but also guides which path to take when there are branches, and defines which boundaries must never be crossed.</p>
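<p>As a purely hypothetical sketch (the task, script, and file names are invented), a SKILL.md that encodes these three layers might look like:</p><pre><code># Skill: Summarize incident reports

## Operational flow
1. Gather the latest reports with scripts/fetch_reports.py
2. Draft a summary following references/summary_template.md
3. Cross-check every figure against the source reports

## Decision heuristics
- If a report is missing fields, flag it instead of guessing.

## Governing constraints
- Never publish a summary that skips the cross-check step.
</code></pre>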
      <p>
          <a href="https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MARCH: A Multi-Agent Framework for Factual RAG and Hallucination Reduction — AI Innovations and Insights 126]]></title><description><![CDATA[Have you ever run into this problem when working on a RAG project: even with retrieved evidence at hand, the model still produces answers that seem correct but are actually wrong?]]></description><link>https://aiexpjourney.substack.com/p/march-a-multi-agent-framework-for</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/march-a-multi-agent-framework-for</guid><pubDate>Sat, 18 Apr 2026 04:53:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!04-G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60594711-cd53-4513-ae6f-837110f5b801_1342x642.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever run into this problem when working on a RAG project: even with retrieved evidence at hand, the model still produces answers that seem correct but are actually wrong? And the usual self-checking or verification methods often get misled by the model&#8217;s own output.</p><p>This post offers some interesting ideas.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Motivation </h2><p>Hallucination remains a critical bottleneck for LLMs, undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. </p><p>There are three core challenges:</p><ul><li><p>First, even with access to documents, models can misreport numbers, mix up timelines, or make reasoning jumps that ignore the evidence. In fields like finance, law, and healthcare, these mistakes can have serious consequences.</p></li><li><p>Second, existing training signals are too coarse. Traditional supervised fine-tuning may teach a model to &#8220;sound convincing,&#8221; but not to ensure every fact is accurate. Standard reinforcement learning methods, including RLHF, usually provide a single score for the final answer. Even newer approaches like RL with Verifiable Reward (RLVR) are bottlenecked by the scarcity of expert annotations and the limited reasoning ceilings of external verifiers.</p></li><li><p>Third, current verifiers suffer from confirmation bias <strong>due to "information leakage"</strong>. When a judge sees the question, the source documents, and the model&#8217;s answer at the same time, there&#8217;s a strong tendency to justify the original response rather than independently verify it. 
This is a fundamental flaw in many hallucination detection methods.</p></li></ul><h2>MARCH: Three Agents, One Goal for Tighter Grounding and Fewer Hallucinations</h2><p>MARCH is a framework that separates answer generation from answer verification and intentionally creates <strong>information asymmetry</strong>.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/march-a-multi-agent-framework-for">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MOCR: From Text-Only OCR to Parse-Anything Intelligence — AI Innovations and Insights 125]]></title><description><![CDATA[If you work with documents long enough, you start noticing how often the most important information isn&#8217;t in the paragraphs.]]></description><link>https://aiexpjourney.substack.com/p/mocr-from-text-only-ocr-to-parse</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mocr-from-text-only-ocr-to-parse</guid><pubDate>Tue, 14 Apr 2026 04:48:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I573!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you work with documents long enough, you start noticing how often the most important information isn&#8217;t in the paragraphs. It hides in charts, tables, diagrams, and all the parts traditional OCR still treats like decorative noise.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why Traditional OCR Is No Longer Enough</h2><p>Traditional OCR and many document parsing pipelines are still largely text-centric: they recover text well, but usually do not structurally parse information-dense graphics.</p><p>When confronted with charts, diagrams, icons, or UI components, they usually treat these elements as mere image fragments and store them as pixel blocks. As a result, a significant portion of structural and semantic information within documents is lost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I573!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I573!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 424w, https://substackcdn.com/image/fetch/$s_!I573!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 848w, 
https://substackcdn.com/image/fetch/$s_!I573!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1272w, https://substackcdn.com/image/fetch/$s_!I573!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I573!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png" width="1200" height="547.2527472527472" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:664,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:482524,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/193134730?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I573!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 424w, 
https://substackcdn.com/image/fetch/$s_!I573!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 848w, https://substackcdn.com/image/fetch/$s_!I573!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1272w, https://substackcdn.com/image/fetch/$s_!I573!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Comparison between traditional text-only OCR and MOCR paradigms. Traditional OCR treats graphics as pixels and often discards them, while MOCR parses graphics into structured code (e.g. SVG), enabling faithful reconstruction and broader downstream applications. [<a href="https://arxiv.org/pdf/2603.13032v2">Source</a>].</figcaption></figure></div><p>Figure 1 captures the main shift: traditional OCR keeps graphics as raster regions, while Multimodal OCR represents text and eligible visual symbols in a unified serialized format, using plain text, table markup, LaTeX, or SVG depending on the element type. </p><h2>MOCR: Not Just Reading Text, but Reconstructing Graphics</h2><p>MOCR, short for Multimodal OCR, aims to parse anything found in documents. </p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/mocr-from-text-only-ocr-to-parse">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[PaperBanana: Surpassing Nano-Banana, Let the Paper Draw Itself — AI Innovations and Insights 124]]></title><description><![CDATA[Current AI can read papers, write code, and even suggest ideas, but many researchers still end up wrestling with figures by hand.]]></description><link>https://aiexpjourney.substack.com/p/paperbanana-surpassing-nano-banana</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/paperbanana-surpassing-nano-banana</guid><pubDate>Fri, 10 Apr 2026 02:27:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!x8Xs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ee9d90-6241-4ca2-a72c-01134381c16a_1916x804.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Current AI can read papers, write code, and even suggest ideas, but many researchers still end up wrestling with figures by hand. That mismatch says a lot about what is still missing in the research workflow.</p><p>This post might offer an insightful idea.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Motivation </h2><p>There are two common approaches to creating <strong>methodology diagrams and statistical plots</strong> for research papers today, and neither of them quite gets the job done. </p><ul><li><p>The first is code-based drawing, using tools like TikZ, Python-PPTX, or SVG. This approach works well for structured diagrams and precise layouts. But it starts to fall short when dealing with the kinds of visuals that show up in modern papers, things like intricate icons, custom shapes, and polished, detail-heavy designs. Flexibility becomes a real limitation.</p></li><li><p>The second approach is to rely on image generation models. These can produce visually appealing results with very little effort. The problem is consistency. It is hard to guarantee that the output meets the standards of academic work, both in terms of factual accuracy and visual conventions. 
A figure might look good at first glance, but still require careful checking to ensure <strong>logical faithfulness and academic readability</strong>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yW7A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yW7A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 424w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 848w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1272w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yW7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png" width="726.7890625" height="423.16336862664474" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:708,&quot;width&quot;:1216,&quot;resizeWidth&quot;:726.7890625,&quot;bytes&quot;:955850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/192685509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yW7A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 424w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 848w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1272w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Examples of methodology diagrams and statistical plots generated by PaperBanana, which show the potential of automating the generation of academic illustrations. [<a href="https://arxiv.org/pdf/2601.23265v2">Source</a>].</figcaption></figure></div><p>In practice, this means researchers often end up spending a significant amount of time manually fixing figures anyway, even after using these tools.</p><h2>PaperBanana: Read, Plan, Stylize, and Sketch Science</h2><p>PaperBanana is an agentic, reference-driven framework that tries to bridge this gap by synthesizing aesthetic guidelines from curated academic examples.</p>
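<p>To make the idea of an agentic, reference-driven pipeline concrete, here is a minimal Python sketch. The stage names follow the heading above (read, plan, stylize, sketch), but every function, data shape, and the string-based "generator" are illustrative stand-ins of mine, not PaperBanana's actual API.</p>

```python
# Hypothetical sketch of a PaperBanana-style agentic pipeline.
# All helpers below are illustrative stubs, not the paper's real interface.

def read(method_text: str) -> list[str]:
    # "Read": extract the components the figure must depict.
    return [s.strip() for s in method_text.split("->")]

def plan(components: list[str]) -> dict:
    # "Plan": lay the components out as a left-to-right flow.
    return {"layout": "horizontal", "nodes": components,
            "edges": list(zip(components, components[1:]))}

def stylize(layout: dict, references: list[dict]) -> dict:
    # "Stylize": merge aesthetic guidelines distilled from curated
    # academic reference figures (palette, fonts, spacing).
    guide = {k: v for ref in references for k, v in ref.items()}
    return {**layout, "style": guide}

def sketch(spec: dict) -> str:
    # "Sketch": hand the structured spec to an image generator
    # (here just a string, so the hand-off stays visible).
    nodes = " -> ".join(spec["nodes"])
    return f"figure[{spec['style'].get('palette', 'default')}]: {nodes}"

refs = [{"palette": "muted"}, {"font": "sans-serif"}]
spec = stylize(plan(read("Encoder -> Router -> Decoder")), refs)
print(sketch(spec))  # figure[muted]: Encoder -> Router -> Decoder
```

<p>The point of the sketch is the division of labor: content extraction, layout planning, and reference-driven styling stay separate, so the final generation step receives a checked, structured spec rather than a raw prompt.</p>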
      <p>
          <a href="https://aiexpjourney.substack.com/p/paperbanana-surpassing-nano-banana">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[KohakuRAG: Climb the Document Tree, Find the Right Evidence — AI Innovations and Insights 123]]></title><description><![CDATA[If you&#8217;ve ever built a RAG pipeline at work, you probably know the feeling: retrieval looks fine in demos, then quietly falls apart on real questions.]]></description><link>https://aiexpjourney.substack.com/p/kohakurag-climb-the-document-tree</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/kohakurag-climb-the-document-tree</guid><pubDate>Mon, 06 Apr 2026 09:21:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aqjS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever built a RAG pipeline at work, you probably know the feeling: retrieval looks fine in demos, then quietly falls apart on real questions.</p><p>This post is interesting because it introduces a method that tackles that gap in a very practical, engineering-first way.</p><h2>The Problems of Standard RAG</h2><p>Standard RAG is often not reliable enough for demanding tasks like the WattBot 2025 Challenge, which demands &#177;0.1% numeric precision and exact source attribution. In practice, three issues show up again and again:</p><ul><li><p>Flat chunking breaks semantic boundaries and document structure, which makes precise source tracking much harder and weakens confidence in citation quality.</p></li><li><p>A single query can easily miss relevant evidence when the wording does not line up, especially when different terms are used for the same idea.</p></li><li><p>A single generation pass is inherently unstable. Both the answer and its citations can vary from run to run, and in some cases the system, wary of hallucinating, abstains unnecessarily (returning a blank) even though the evidence is sitting right there in the context.</p></li></ul><h2>KohakuRAG: From Flat Chunks to Citation-Ready Trees</h2><p>At its core, KohakuRAG is a lightweight RAG framework built on a four-level hierarchical document index: document &#8594; section &#8594; paragraph &#8594; sentence. 
</p><p>It preserves document structure through bottom-up embedding aggregation, improves retrieval coverage with LLM-based query planning plus cross-query reranking, and stabilizes final outputs with abstention-aware ensemble voting during generation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aqjS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aqjS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 424w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 848w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1272w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aqjS!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png" width="1200" height="538.8888888888889" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:582,&quot;width&quot;:1296,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:188418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/192494795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aqjS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 424w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 848w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1272w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Overview of KohakuRAG. Left (Hierarchical Indexing): Documents are parsed into tree structures with sections, paragraphs (Para), and sentences (S). Sentence embeddings are computed and aggregated bottom-up to parent levels, then stored in a Vector DB. Center (Multi-Query Retrieval): Given a question, the Query Planner (LLM) generates multiple related queries, each retrieving Top-K results that are merged via Cross-Query Reranking. Right (Ensemble Inference): Context and question are sent to the LLM for m independent runs; blank responses are filtered (X), and majority voting produces the final answer. [<a href="https://arxiv.org/pdf/2603.07612v1">Source</a>].</figcaption></figure></div><h4>1. Offline hierarchical document indexing</h4><p>The first step is document parsing.</p>
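<p>Two of the ideas above, bottom-up embedding aggregation over the document tree and abstention-aware ensemble voting, can be sketched in a few lines of Python. Everything here is a stand-in: the "embeddings" are toy character statistics, and the real system's embedding model, vector store, and query planner are not shown.</p>

```python
# Toy sketch of two KohakuRAG ideas (illustrative stubs only).

def embed_sentence(text: str) -> list[float]:
    # Stub embedding: character statistics instead of a real model.
    return [len(text) / 100.0, text.count(" ") / 10.0]

def mean(vectors: list[list[float]]) -> list[float]:
    # Simple mean pooling used here as the bottom-up aggregator.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def index_tree(doc: dict) -> dict:
    # Bottom-up aggregation: sentence -> paragraph -> section -> document,
    # so every level of the tree gets a searchable vector.
    for section in doc["sections"]:
        for para in section["paragraphs"]:
            para["vec"] = mean([embed_sentence(s) for s in para["sentences"]])
        section["vec"] = mean([p["vec"] for p in section["paragraphs"]])
    doc["vec"] = mean([s["vec"] for s in doc["sections"]])
    return doc

def ensemble_answer(runs: list[str]) -> str:
    # Abstention-aware voting: drop blank generations, then majority-vote.
    votes = [r for r in runs if r.strip()]
    return max(set(votes), key=votes.count) if votes else ""

doc = {"sections": [{"paragraphs": [{"sentences": ["Solar output rose.",
                                                   "Wind stayed flat."]}]}]}
index_tree(doc)
print(ensemble_answer(["42%", "", "42%", "41%"]))  # 42%
```

<p>Because parents inherit aggregated vectors from their children, retrieval can match at any granularity and still walk back down to an exact sentence for citation, and the voting step discards unnecessary blanks before settling on an answer.</p>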
      <p>
          <a href="https://aiexpjourney.substack.com/p/kohakurag-climb-the-document-tree">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Three Paradigms Shaping Modern OCR — AI Innovations and Insights 122]]></title><description><![CDATA[Anyone who has worked with PDFs, forms, or scanned reports knows this feeling: extracting text is easy until structure starts to matter.]]></description><link>https://aiexpjourney.substack.com/p/the-three-paradigms-shaping-modern</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/the-three-paradigms-shaping-modern</guid><pubDate>Fri, 03 Apr 2026 11:04:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QPKT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anyone who has worked with PDFs, forms, or scanned reports knows this feeling: extracting text is easy until structure starts to matter. </p><p>The real shift is this: OCR is no longer just a reading task, but a document understanding problem.</p><h2>Modern OCR Is About Understanding Documents</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QPKT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QPKT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QPKT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1647586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/192090936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QPKT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: OCR Beyond Text: The Rise of Document Intelligence. Image by author.</figcaption></figure></div><p>In the past, OCR usually meant one thing: recognizing text from an image.</p><p>Today, the scope of OCR has expanded far beyond basic character recognition. Real-world document processing rarely stops at reading text. It often involves layout analysis, table parsing, chart understanding, question answering, and extracting key information from complex pages. Documents are no longer just lines of text. 
They contain structure, hierarchy, and meaning that must be understood as a whole.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8_g5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8_g5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 424w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 848w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1272w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8_g5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png" width="1456" height="579" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:348405,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/179540724?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8_g5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 424w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 848w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1272w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Rapid growth of document parsing methods since June 2025. [<a href="https://arxiv.org/pdf/2511.10390v2">Source</a>].</figcaption></figure></div><p>Because of this shift, OCR has gradually evolved from a character recognition tool into something much closer to a document intelligence system. Instead of only asking what characters are on the page, modern OCR systems are expected to understand how the page is organized and what the content actually represents.</p><h2>Three Paradigms of Modern OCR</h2><p>Current OCR paradigms can roughly be grouped into three categories:</p><ul><li><p>Pipeline OCR systems</p></li><li><p>End-to-end OCR models</p></li><li><p>General vision-language models
</p></li></ul><p>These three approaches reflect different stages in the evolution of OCR, and each comes with its own set of trade-offs.</p><h2>Traditional Pipeline OCR: Strong in Control, Prone to Fragmentation</h2><p>The most classic design is the pipeline OCR system. A typical workflow starts with layout detection, followed by element-level recognition, and finally a rule-based step that assembles everything into the final output.</p>
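<p>As a toy illustration of this paradigm (my own stand-in functions, not any specific system's code), the three stages can be wired together like this, with stubs in place of real detectors and recognizers:</p>

```python
# Minimal pipeline-OCR sketch: layout detection -> element-level
# recognition -> rule-based assembly. Each stage is an illustrative stub.

def detect_layout(page: dict) -> list[dict]:
    # Stage 1: layout detection yields typed regions; sorting by the
    # vertical position stands in for reading-order recovery.
    return sorted(page["regions"], key=lambda r: r["y"])

def recognize(region: dict) -> str:
    # Stage 2: element-level recognition, specialized per region type.
    if region["type"] == "table":
        return "| " + " | ".join(region["cells"]) + " |"
    return region["text"]

def assemble(regions: list[dict]) -> str:
    # Stage 3: rule-based assembly into one document string.
    parts = []
    for r in regions:
        body = recognize(r)
        parts.append(f"## {body}" if r["type"] == "heading" else body)
    return "\n\n".join(parts)

page = {"regions": [
    {"type": "heading", "y": 0, "text": "Quarterly Report"},
    {"type": "paragraph", "y": 1, "text": "Revenue grew modestly."},
    {"type": "table", "y": 2, "cells": ["Q1", "Q2"]},
]}
print(assemble(detect_layout(page)))
```

<p>Each stage sees only its own slice of the page and communicates through a narrow interface, which is exactly where the control of this design comes from, and also where its fragmentation weakness begins.</p>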
      <p>
          <a href="https://aiexpjourney.substack.com/p/the-three-paradigms-shaping-modern">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[HyperRAG: From Broken Triples to Complete Relational Reasoning — AI Innovations and Insights 121]]></title><description><![CDATA[If you&#8217;ve ever built a RAG system that looked smart in demos but got strangely lost on multi-hop questions, have you considered this idea: For multi-hop RAG failures, the bottleneck may lie not only in retrieval quality, but also in how complex facts are represented before retrieval begins.]]></description><link>https://aiexpjourney.substack.com/p/hyperrag-from-broken-triples-to-complete</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/hyperrag-from-broken-triples-to-complete</guid><pubDate>Sun, 29 Mar 2026 09:52:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iOgR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59e026f-da40-47a3-b825-9044fe1a98e8_820x986.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever built a RAG system that looked smart in demos but got strangely lost on multi-hop questions, have you considered this idea: For multi-hop RAG failures, the bottleneck may lie not only in retrieval quality, but also in how complex facts are represented before retrieval begins.</p><h2>The Bottleneck of Combining RAG with Knowledge Graphs</h2><p>Most GraphRAG systems today are built on top of binary knowledge graphs. Knowledge is broken down into simple triples: head entity, relation, tail entity. This representation is widely used, but its simplicity comes at a cost.</p><p>Two structural limitations stand out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jkMG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jkMG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 424w, https://substackcdn.com/image/fetch/$s_!jkMG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 848w, https://substackcdn.com/image/fetch/$s_!jkMG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jkMG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jkMG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png" width="578" height="611.9132653061224" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:784,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:253250,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/191479634?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jkMG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 424w, https://substackcdn.com/image/fetch/$s_!jkMG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 848w, 
https://substackcdn.com/image/fetch/$s_!jkMG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 1272w, https://substackcdn.com/image/fetch/$s_!jkMG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74bb7088-7e81-4f3b-b21c-504ef58dc747_784x830.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Structural Comparison of (a) Knowledge Graphs and (b) Hypergraphs. 
For a given question &#119902;, (a) requires 3-hop reasoning over binary facts, while (b) enables single-hop inference via an &#119899;-ary relational fact, yielding a more compact and expressive multi-entity representation. [<a href="https://arxiv.org/pdf/2602.14470v1">Source</a>].</figcaption></figure></div><p>First is semantic fragmentation. Many real-world facts involve multiple entities and roles within a single relational event. When those facts are decomposed into isolated binary triples, part of the original semantic structure is lost.</p><p>Figure 1 uses an n-ary fact roughly of the form: &#8220;Bruce Seth Green, Sam Weisman, Sam Pillsbury, and Eric Laneuville directed TV 101 in English in California.&#8221;</p><p>In a binary graph, this single holistic fact must be decomposed into multiple pairwise triples, which breaks apart its original semantic unity. The connection among directors, show, language, and location is no longer represented as one unified event (what researchers call an <strong>n-ary relation</strong>). It becomes a set of separate statements that the system must later stitch back together.</p><p>Second is path explosion. Because meaning is scattered across multiple edges, the system has to rely on multi-hop reasoning to reconstruct the original context. This typically means deeper traversals across the graph. As the graph grows, the search space over possible paths expands quickly. Computation becomes heavier, and small errors in earlier hops can propagate forward, compounding the problem.</p><h2>The Dual-Engine Architecture of HyperRAG</h2><p>To exploit this expressive topology without getting lost in the noise, HyperRAG is introduced. Let's dive into how it works.</p>
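To make the contrast concrete, here is a minimal Python sketch (illustrative only, not HyperRAG's actual data model or API) of the same fact stored as binary triples versus a single n-ary hyperedge:

```python
# The n-ary fact from Figure 1, first flattened into binary triples
# (the usual knowledge-graph encoding), then kept whole as one hyperedge.
# Relation names like "filmed_in" are made up for illustration.

triples = [
    ("Bruce Seth Green", "directed", "TV 101"),
    ("Sam Weisman", "directed", "TV 101"),
    ("Sam Pillsbury", "directed", "TV 101"),
    ("Eric Laneuville", "directed", "TV 101"),
    ("TV 101", "language", "English"),
    ("TV 101", "filmed_in", "California"),
]

# One hyperedge: every participant and its role live in a single fact,
# so no multi-hop stitching is needed to reassemble the event.
hyperedge = {
    "relation": "directed",
    "directors": ["Bruce Seth Green", "Sam Weisman",
                  "Sam Pillsbury", "Eric Laneuville"],
    "work": "TV 101",
    "language": "English",
    "location": "California",
}

def answer_from_triples(work):
    """'Who directed TV 101, and where?' needs two separate graph lookups."""
    directors = [h for h, r, t in triples if r == "directed" and t == work]
    location = next(t for h, r, t in triples
                    if h == work and r == "filmed_in")
    return directors, location

def answer_from_hyperedge(edge):
    """The same question is a single-hop read of one fact."""
    return edge["directors"], edge["location"]

assert answer_from_triples("TV 101") == answer_from_hyperedge(hyperedge)
```

Note how the triple store scatters the event across six edges, while the hyperedge keeps it as one retrievable unit.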
      <p>
          <a href="https://aiexpjourney.substack.com/p/hyperrag-from-broken-triples-to-complete">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MegaRAG: Stop Chunking Blindly, Start Reading Like a Human — AI Innovations and Insights 120]]></title><description><![CDATA[If you have ever built a RAG system for real documents, you probably know the feeling: the model finds the right page, but still misses the point.]]></description><link>https://aiexpjourney.substack.com/p/megarag-stop-chunking-blindly-start</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/megarag-stop-chunking-blindly-start</guid><pubDate>Wed, 25 Mar 2026 00:44:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gOnd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31ab93d-4474-4c03-abcb-f90fa83c1859_1224x790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have ever built a RAG system for real documents, you probably know the feeling: the model finds the right page, but still misses the point.</p><p>There is a question: What if the system could build a document-level view, instead of reasoning over isolated chunks alone?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Problems of Current RAG and GraphRAG Systems </h2><p>Many existing RAG and GraphRAG systems share a few structural weaknesses. They often miss the big picture, struggle with deeper reasoning, and tend to treat documents as pure text. Charts, diagrams, tables, and layout are usually processed in isolation, if at all. </p><p>There are five underlying reasons behind this.</p><h4>Long Documents are Inherently Difficult to Handle</h4><p>Even though modern multimodal models can process both images and text, they are still constrained by context window limits. When dealing with entire books, lengthy reports, or full lecture decks, true global understanding becomes elusive. </p><p>This limitation directly affects a model&#8217;s ability to grasp high-level concepts and the overall structure of the content. Without a holistic view, deeper reasoning suffers.</p><h4>Most Existing GraphRAG Approaches Remain Text-Centric</h4><p>Methods such as GraphRAG and LightRAG excel at extracting entities and relations from text, building graphs, and performing multi-hop reasoning. </p><p>However, these approaches are fundamentally unimodal. They completely overlook visual elements like charts, maps, and diagrams. In multimodal documents, these visual components often carry essential information. When they are ignored or weakly integrated, important signals are simply lost.</p><h4>Chunking Fragments the Knowledge Graph</h4><p>A common practice is to split documents into many smaller chunks, then extract entities and relations from each piece independently. 
</p><p>While this improves efficiency, it breaks cross-page and cross-section connections. Furthermore, naive, isolated extraction often fails to capture subtle cross-modal relationships (like text-to-figure links) even within the same chunk.</p><p>The resulting knowledge graph becomes fragmented, missing relationships that only emerge when viewing the document as a whole.</p><h4>Manual Construction of Multimodal Knowledge Graphs Does not Scale</h4><p>Some earlier work relied on hand-crafted multimodal knowledge graphs for question answering. While effective in controlled settings, this approach is expensive and difficult to scale. </p><p>Automatically building multimodal knowledge graphs and integrating them into RAG pipelines remains an open challenge that has yet to be fully addressed.</p><h4>Fragmented Retrieval Spaces</h4><p>Many retrieval pipelines still struggle with mixed-modality retrieval, which is why MegaRAG uses a unified embedding model for pages, entities, and relations.</p><h2>MegaRAG: Build a Living Multimodal Graph for Smarter Document QA</h2><p>MegaRAG takes a different route. </p><p>Instead of treating a multimodal document as scattered text snippets, it uses an off-the-shelf document parsing tool (such as MinerU, <a href="https://aiexpjourney.substack.com/p/from-big-picture-to-details-mineru">From Big Picture to Details: MinerU 2.5 Redefines Document Parsing &#8212; AI Innovations and Insights 77</a>) to first extract text, figures, and tables, and then organizes the entire document into a single multimodal knowledge graph.</p><p>Then, when answering a question, it retrieves evidence from two places at once: the structured knowledge inside the graph and the original visual pages. Finally, it generates the answer in two stages.</p><p>The full pipeline can be broken down into four steps.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/megarag-stop-chunking-blindly-start">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MoDora: From Broken OCR Chunks to a Living Document Tree — AI Innovations and Insights 119]]></title><description><![CDATA[If you have ever built a document QA or RAG pipeline, you probably know the feeling: OCR gives you words, but not meaning.]]></description><link>https://aiexpjourney.substack.com/p/modora-from-broken-ocr-chunks-to</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/modora-from-broken-ocr-chunks-to</guid><pubDate>Fri, 20 Mar 2026 04:13:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hRot!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9670a4b4-cb35-40fe-9696-ac4097c57a56_1810x790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have ever built a document QA or RAG pipeline, you probably know the feeling: OCR gives you words, but not meaning. </p><p>A simple question comes to mind: what if a document is meant to be understood as a <strong>hierarchical tree</strong> rather than a flat pile of chunks?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Current Problems in Semi-Structured Document Analysis</h2><p>In practice, semi-structured documents show up everywhere, yet most existing techniques struggle to support natural language queries over them. These documents usually contain tables, charts, hierarchical headings, and body text. Their layouts tend to be irregular. </p><p>An analysis of one million real documents shows that over 77% include at least one table, chart, or section heading, <strong>with 61% containing tables and 40% containing charts</strong>, making the challenge both common and meaningful.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xsbk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xsbk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 424w, https://substackcdn.com/image/fetch/$s_!Xsbk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 848w, 
https://substackcdn.com/image/fetch/$s_!Xsbk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 1272w, https://substackcdn.com/image/fetch/$s_!Xsbk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xsbk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png" width="630" height="532.514506769826" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1034,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:266721,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/190488008?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xsbk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 424w, 
https://substackcdn.com/image/fetch/$s_!Xsbk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 848w, https://substackcdn.com/image/fetch/$s_!Xsbk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 1272w, https://substackcdn.com/image/fetch/$s_!Xsbk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59e1a39-e034-404b-87c3-070ac4069bf8_1034x874.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Example Semi-structured Document Layouts. A document is modeled as a set of content objects {CO1, CO2, . . . , CO&#119951;}, where each CO&#119946; is formed by one or more tightly coupled elements, collectively capturing the document&#8217;s text and structure information. [<a href="https://arxiv.org/pdf/2602.23061v2">Source</a>].</figcaption></figure></div><p>More specifically, there are three major pain points:</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/modora-from-broken-ocr-chunks-to">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[AgenticOCR: Turning OCR into an Evidence-Seeking Agent — AI Innovations and Insights 118]]></title><description><![CDATA[If you work on document intelligence, you probably know the strange frustration of having "good OCR" and still getting bad answers.]]></description><link>https://aiexpjourney.substack.com/p/agenticocr-turning-ocr-into-an-evidence</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/agenticocr-turning-ocr-into-an-evidence</guid><pubDate>Mon, 16 Mar 2026 02:33:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!S_v6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cf5b3d-0723-44bf-bcbf-a5564083bbb5_1986x810.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you work on document intelligence, you probably know the strange frustration of having "good OCR" and still getting bad answers. </p><p>There is a question: what if the real problem is not reading everything, but reading the right thing at the right time?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Main Bottleneck in RAG</h2><p>On one hand, general OCR and document parsing systems are already quite advanced. Many models reach high accuracy on standard document benchmarks. On the other hand, <strong>real downstream scenarios, especially RAG setups</strong> for financial reports, technical manuals, and research papers, rarely need a full readout of an entire document. What is actually missing is the ability to extract only the portions that matter for the current question.</p><p><strong>A critical bottleneck</strong> in many current visual RAG systems is rigid page-level chunking and retrieval. This level of granularity is too coarse. Each page carries a large amount of irrelevant material such as headers, footers, decorative elements, and unrelated paragraphs. All of it is delivered to the generation model. That noise dilutes attention, and the limited visual token budget forces the system to compress high resolution pages. Important details, including tables, small text, rotated sections, and formulas, are more likely to be lost. The risk of hallucination rises as a result. Recent studies also show that multimodal models struggle to pinpoint evidence precisely in these scenarios and are easily distracted by irrelevant visual signals.</p><p>A better approach would let the model <strong>read a document the same way a person does</strong>. 
First look at the layout, then locate the relevant region, zoom or rotate if needed, and finally extract only the evidence that answers the question rather than feeding in the entire page.</p><h2>Overview of AgenticOCR</h2><p>Figure 1 uses a concrete example to show how AgenticOCR helps a visual RAG system answer a question.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/agenticocr-turning-ocr-into-an-evidence">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[DeepRead: From Fragmented Retrieval to Structure-Aware Agentic Reading — AI Innovations and Insights 117]]></title><description><![CDATA[If you work with long PDFs, reports, or academic papers, you probably know the frustration of seeing an AI system &#8220;search&#8221; a document without really understanding its structure.]]></description><link>https://aiexpjourney.substack.com/p/deepread-from-fragmented-retrieval</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/deepread-from-fragmented-retrieval</guid><pubDate>Thu, 12 Mar 2026 11:21:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uBnQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2057b47e-6818-4074-b238-c80e7509d8a8_1812x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you work with long PDFs, reports, or academic papers, you probably know the frustration of seeing an AI system &#8220;search&#8221; a document without really understanding its structure. </p><p>So there is a very human question: what if models could navigate a document more like we do?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Limitations of Existing Agentic Search Systems</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rJa_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rJa_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 424w, https://substackcdn.com/image/fetch/$s_!rJa_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 848w, https://substackcdn.com/image/fetch/$s_!rJa_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 1272w, https://substackcdn.com/image/fetch/$s_!rJa_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rJa_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png" width="974" height="556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:974,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:258191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/189872536?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rJa_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 424w, https://substackcdn.com/image/fetch/$s_!rJa_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 848w, https://substackcdn.com/image/fetch/$s_!rJa_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 1272w, https://substackcdn.com/image/fetch/$s_!rJa_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965d22e5-1e81-4e33-bb31-e8b2b1ff4627_974x556.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: A Comparison of Search-o1-style Agentic Search and DeepRead on a Toy Case. [<a href="https://arxiv.org/pdf/2602.05014v3">Source</a>].</figcaption></figure></div><p>Existing agentic search systems can actively look through long documents, but they still treat everything as flat chunks. They rarely make use of headings, hierarchy, or reading order. 
As a result, they miss information, repeat searches, and suffer from severe <strong>context fragmentation</strong>.</p><p>When people read long documents, the usual approach is to locate the right section first and <strong>read the surrounding content contiguously</strong>, instead of guessing keywords again and again. Many current agentic search systems still lack this ability.</p><p>At the same time, modern OCR and document parsing tools already do a decent job recovering headings, lists, and reading order <strong>into structured formats like Markdown</strong>. These structural elements are available today and should not be ignored.</p><p>This led to an idea: give the model explicit access to document structure so it can reason in a more human way, starting with locating the right part of the document and then reading through it.</p><h2>DeepRead: Not More Rounds of Search, But Setting the Document Upright First</h2><p>The idea behind DeepRead is straightforward: restore the document into a form that looks closer to how a person would read it. </p>
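The locate-then-read idea can be sketched in a few lines of Python (a toy illustration under stated assumptions, not DeepRead's actual interface): parse the Markdown that document parsers already produce into a heading-to-section map, find the relevant section once, and return its full contiguous body instead of scattered chunks.

```python
# A minimal "locate, then read contiguously" sketch over parsed Markdown.
# The function name and two-level heading assumption are illustrative.

def split_sections(markdown: str):
    """Map each '## ' heading to the contiguous text under it."""
    sections, title, body = {}, None, []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if title is not None:
                sections[title] = "\n".join(body).strip()
            title, body = line[3:].strip(), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections[title] = "\n".join(body).strip()
    return sections

doc = """## Introduction
Background material.

## Evaluation
We report accuracy on two benchmarks.
"""

sections = split_sections(doc)
# Locate the right section once, then read all of it -- no repeated
# keyword guessing over flat chunks.
assert sections["Evaluation"] == "We report accuracy on two benchmarks."
```

The point is not the parsing itself but the access pattern: structure lets the model jump to a section and read it whole, the way a person would.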
      <p>
          <a href="https://aiexpjourney.substack.com/p/deepread-from-fragmented-retrieval">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RAG Can Read Text, VDR Learns to Read Documents — AI Innovations and Insights 116]]></title><description><![CDATA[Most of us have felt this frustration at work: the answer is definitely in the document, but somehow search still can&#8217;t find it.]]></description><link>https://aiexpjourney.substack.com/p/rag-can-read-text-vdr-learns-to-read</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/rag-can-read-text-vdr-learns-to-read</guid><pubDate>Sun, 08 Mar 2026 08:50:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uK0D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f81e7d-1c07-4fe6-ba7b-a5a9e6d55fb8_1906x1148.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most of us have felt this frustration at work: the answer is definitely in the document, but somehow search still can&#8217;t find it. </p><p>That gap between &#8220;the file exists&#8221; and &#8220;the information is usable&#8221; is exactly why Visual Document Retrieval feels so important right now.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why Visual Document Retrieval (VDR) Is Essential in the MLLM Era</h2><p>Visual Document Retrieval (VDR) focuses on visually rich documents rather than natural-image retrieval. It targets PDFs, academic papers, reports, invoices, and other structured pages. These visual documents differ from natural images in three fundamental ways.</p><ul><li><p>First, they carry far denser information. Text, layout, and charts work together to convey meaning. </p></li><li><p>Second, the retrieval granularity is much finer. People often need a specific fact inside a table, a particular sentence in a paragraph, or information tied to a location on the page. 
</p></li><li><p>Third, the goal leans toward precise information access, question answering, and evidence reasoning rather than concept matching like deciding whether two images look similar.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TYsF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TYsF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 424w, https://substackcdn.com/image/fetch/$s_!TYsF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 848w, https://substackcdn.com/image/fetch/$s_!TYsF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 1272w, https://substackcdn.com/image/fetch/$s_!TYsF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TYsF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png" width="1072" height="450" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1072,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:298179,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/189614916?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TYsF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 424w, https://substackcdn.com/image/fetch/$s_!TYsF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 848w, https://substackcdn.com/image/fetch/$s_!TYsF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 1272w, https://substackcdn.com/image/fetch/$s_!TYsF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8754b4ad-4ba2-4377-8f3d-c48a16d4b6c7_1072x450.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Comparison of retrieval of natural image (left) and visual document (right), the focus of this survey. [<a href="https://arxiv.org/pdf/2602.19961v1">Source</a>].</figcaption></figure></div><p>Because visual documents encode meaning through text, layout, and graphics together, OCR-plus-text retrieval often misses important signals. Modern VDR increasingly benefits from models that preserve visual structure directly from document images. </p><h2>Three Paradigms of Visual Document Retrieval</h2>
      <p>
          <a href="https://aiexpjourney.substack.com/p/rag-can-read-text-vdr-learns-to-read">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Why RAG Struggles in Agent Scenarios — AI Innovations and Insights 115]]></title><description><![CDATA[Building agents taught me an odd truth: adding memory often makes behavior worse before it gets better.]]></description><link>https://aiexpjourney.substack.com/p/why-rag-struggles-in-agent-scenarios</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/why-rag-struggles-in-agent-scenarios</guid><pubDate>Mon, 02 Mar 2026 01:06:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RWdE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d31bff-8ae7-46fb-b51a-e1e681d49a00_2452x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building agents taught me an odd truth: adding memory often makes behavior worse before it gets better. Because once everything is &#8220;relevant,&#8221; the model drowns in duplicates. </p><p>Is there a way to keep memory useful without turning your context window into a landfill?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div></div></div><h2>Why RAG Doesn&#8217;t Fit Well in Agent Use Cases</h2><p>Classic RAG is built for large, heterogeneous corpora where retrieved passages are relatively diverse, and a mainstream failure mode is pulling in irrelevant or redundant content.</p><p>In contrast, agent memory is a bounded, coherent dialogue stream that contains many near-duplicate spans.</p><p>So using RAG in agent scenarios often leads to two problems:</p><ul><li><p>In highly correlated dialogue streams, fixed top-k similarity retrieval can <strong>collapse into near-duplicate chunks</strong>, repeatedly returning redundant context.</p></li><li><p>Post-retrieval pruning can <strong>delete temporally linked prerequisites and fragment evidence chains</strong>, often hurting multi-hop and temporal QA.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0rEw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0rEw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 424w, 
https://substackcdn.com/image/fetch/$s_!0rEw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 848w, https://substackcdn.com/image/fetch/$s_!0rEw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 1272w, https://substackcdn.com/image/fetch/$s_!0rEw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0rEw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png" width="602" height="642.3960720130933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1304,&quot;width&quot;:1222,&quot;resizeWidth&quot;:602,&quot;bytes&quot;:707792,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/187692155?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!0rEw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 424w, https://substackcdn.com/image/fetch/$s_!0rEw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 848w, https://substackcdn.com/image/fetch/$s_!0rEw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 1272w, https://substackcdn.com/image/fetch/$s_!0rEw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a4e37b-3aac-4d5f-b506-8ef8865f836b_1222x1304.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1. From similarity top-k to structured retrieval for agent memory. Agent memory forms a coherent and highly correlated stream, where many spans are near duplicates; similarity top-k retrieval can therefore collapse and retrieve redundant chunks. xMemory organises memories into a hierarchy of intact units and performs structure-aware retrieval to produce a shorter but more answer-sufficient context. [<a href="https://arxiv.org/pdf/2602.02007v1">Source</a>].</figcaption></figure></div><h2>xMemory: Build a Hierarchical Memory That Cuts Tokens Without Losing the Plot</h2><p>So here comes an intuition: Instead of searching your entire inbox for the top-20 emails that match a keyword, you first group emails into threads, extract the durable facts each thread contains, then pull <em>a few threads</em> and only open individual emails when you&#8217;re still unsure.</p>
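<p>The thread-first intuition above can be turned into a tiny sketch. To be clear, this is an illustrative toy, not xMemory&#8217;s actual implementation: the Thread class, the retrieve function, and the Jaccard word-overlap score are all stand-ins for learned thread summaries and embedding similarity.</p>

```python
from dataclasses import dataclass, field

def jaccard(a: str, b: str) -> float:
    # Crude lexical overlap, standing in for embedding similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

@dataclass
class Thread:
    summary: str                           # durable facts distilled from the thread
    messages: list = field(default_factory=list)

def retrieve(threads, query, top_threads=1, drill_threshold=0.5):
    # Rank intact threads first; open individual messages only when the
    # summary alone looks insufficient for the query.
    ranked = sorted(threads, key=lambda t: jaccard(t.summary, query), reverse=True)
    context = []
    for t in ranked[:top_threads]:
        context.append(t.summary)
        if jaccard(t.summary, query) < drill_threshold and t.messages:
            context.append(max(t.messages, key=lambda m: jaccard(m, query)))
    return context
```

<p>The design choice to notice: a flat top-k retriever would happily return five near-identical deadline messages, while this structure returns one summary plus, at most, one supporting message per thread.</p>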
      <p>
          <a href="https://aiexpjourney.substack.com/p/why-rag-struggles-in-agent-scenarios">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[DMAP: From Flat RAG to a Living Document Map — AI Innovations and Insights 114]]></title><description><![CDATA[When I was working on the RAG project, I often found myself wondering: since documents in a knowledge base usually have a clear structure, isn&#8217;t it problematic that most mainstream chunking and retrieval methods just flatten that structure?]]></description><link>https://aiexpjourney.substack.com/p/dmap-from-flat-rag-to-a-living-document</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/dmap-from-flat-rag-to-a-living-document</guid><pubDate>Wed, 25 Feb 2026 10:01:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UgO9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4016ed0a-583a-40e3-b65d-f15f743992af_2358x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I was working on the RAG project, I often found myself wondering: since documents in a knowledge base usually have a clear structure, isn&#8217;t it problematic that most mainstream chunking and retrieval methods just flatten that structure? </p><p>Is there a more elegant way to preserve and leverage the document hierarchy?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div></div></div><h2>Motivation</h2><p>Most existing approaches to multimodal document question answering, especially chunk-based RAG, lose a critical piece of the puzzle when they slice documents into chunks and run vector retrieval. The original structure of the document, which was designed for human understanding, gets stripped away. This loss directly impacts both retrieval accuracy and reasoning quality.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gfWM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gfWM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 424w, https://substackcdn.com/image/fetch/$s_!gfWM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 848w, https://substackcdn.com/image/fetch/$s_!gfWM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gfWM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gfWM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png" width="1068" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1068,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171907,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/187479109?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gfWM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 424w, https://substackcdn.com/image/fetch/$s_!gfWM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 848w, 
https://substackcdn.com/image/fetch/$s_!gfWM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 1272w, https://substackcdn.com/image/fetch/$s_!gfWM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a8c9b9-f198-49ff-81f8-858b07680f87_1068x594.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: The utilities of document knowledge. 
[<a href="https://arxiv.org/pdf/2601.18203v2">Source</a>].</figcaption></figure></div><p>There are three core challenges:</p><ol><li><p>Multimodal documents are inherently complex and heterogeneous. Text, images, tables, and charts all coexist, and their relationships often hold the key to deeper reasoning. Modeling these connections, however, is far from straightforward.</p></li><li><p>Conventional chunk-based retrieval flattens the document. Hierarchical and contextual cues get lost, which breaks causal chains, causes inconsistencies in parallel reasoning, and fails to resolve references like &#8220;Table X&#8221; or &#8220;Page X&#8221;.</p></li><li><p>Recent systems (e.g., MDocAgent) leverage LLMs/LVLMs together with whole-page layout to refine retrieval results; more generally, LLMs can also enhance retrieval via query rewriting or re-ranking. However, these approaches still do not explicitly model the document&#8217;s native hierarchical structure &#8212; the very thing humans rely on to locate, comprehend, and reason through information.</p></li></ol><h2>DMAP: Two Agents and One Structural Map</h2><p>The DMAP workflow can be summed up in two simple steps:</p><ul><li><p>First, turn the document into a structural map: M = SSUA(D)</p></li><li><p>Then, answer questions using that map: A = RRA(q | M)</p></li></ul><p>At the center of this process are two key agents: SSUA and RRA.</p>
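<p>As a deliberately simplified illustration of that two-step shape, the sketch below builds a heading tree as a stand-in for the structural map M and answers by descending it. The names build_map and answer and the keyword-overlap scoring are illustrative assumptions; the paper&#8217;s SSUA and RRA are LLM-driven agents, not these few lines.</p>

```python
def build_map(sections):
    # SSUA-style step, M = SSUA(D): nest (level, heading, text) tuples into
    # a tree that preserves the document's native hierarchy.
    root = {"heading": "ROOT", "text": "", "children": []}
    stack = [(0, root)]
    for level, heading, text in sections:
        node = {"heading": heading, "text": text, "children": []}
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root

def answer(query, node):
    # RRA-style step, A = RRA(q | M): descend toward the most relevant
    # branch instead of scoring flat chunks.
    words = set(query.lower().split())
    best, best_score = None, 0
    for child in node["children"]:
        score = len(words & set((child["heading"] + " " + child["text"]).lower().split()))
        if score > best_score:
            best, best_score = child, score
    return node["text"] if best is None else answer(query, best)
```

<p>The point of the structure-first pass is visible even in this toy: a reference like &#8220;the Results section&#8221; resolves by walking headings, which a flattened chunk store cannot do.</p>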
      <p>
          <a href="https://aiexpjourney.substack.com/p/dmap-from-flat-rag-to-a-living-document">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Scientific Image Synthesis: From Pretty Pictures to Correct Science — AI Innovations and Insights 113 ]]></title><description><![CDATA[I came in as a fan of text-to-image models, just hoping for cleaner figures for slides.]]></description><link>https://aiexpjourney.substack.com/p/scientific-image-synthesis-from-pretty</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/scientific-image-synthesis-from-pretty</guid><pubDate>Fri, 20 Feb 2026 04:10:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NGyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30b764e5-9470-4e54-992c-24c383912146_1588x824.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I came in as a fan of text-to-image models, just hoping for cleaner figures for slides. There&#8217;s a moment every researcher knows: you spot a tiny inconsistency in a figure and suddenly nothing else is trustworthy. </p><p>This post is about building systems that make that moment far less common.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div></div></div><h2>Why Scientific Diagrams Are So Much Harder Than Natural Images</h2><p>If you ask a model to draw a cat, the worst outcome is that it looks a bit off.</p><p>But if you ask it to draw a circuit, a missing capacitor or a slightly wrong angle can lead to a complete breakdown in logic.</p><p>This is what&#8217;s known as visual&#8211;logic divergence: the image might appear fine at first glance, but it&#8217;s scientifically incorrect. That makes it unreliable for any kind of downstream reasoning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21_Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21_Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 424w, https://substackcdn.com/image/fetch/$s_!21_Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 848w, 
https://substackcdn.com/image/fetch/$s_!21_Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 1272w, https://substackcdn.com/image/fetch/$s_!21_Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!21_Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png" width="1456" height="477" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:477,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:428979,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/187162808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!21_Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 424w, 
https://substackcdn.com/image/fetch/$s_!21_Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 848w, https://substackcdn.com/image/fetch/$s_!21_Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 1272w, https://substackcdn.com/image/fetch/$s_!21_Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcdee613-083a-4880-98b9-7704d8cafe9f_1532x502.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Qualitative Error Taxonomy. Failures are categorized into five modes ranging from low-level visual artifacts to high-level semantic hallucinations. The specific errors are annotated in red. [<a href="https://arxiv.org/pdf/2601.17027v1">Source</a>].</figcaption></figure></div><p>A key reason is that scientific diagrams are governed by strict geometric, physical, or relational constraints, which text-to-image models often fail to follow precisely.</p><h2>Two Paths: One Optimizes for Expressiveness, the Other for Precision</h2><p>Scientific image generation can be divided into two main approaches:</p><ul><li><p><strong>Pixel-based</strong> methods translate text directly into pixels. They tend to produce richer visuals, but often struggle to reliably meet structural constraints. </p><ul><li><p>Next-generation closed-source models (like Nanobanana-Pro and GPT-Image-1.5) set a high bar, yet still struggle with dense data errors compared to code-based approaches; </p></li><li><p>Open-source models (like HunyuanImage-3.0 and Qwen-Image) can look acceptable in some cases, but suffer from frequent text blur and semantic/structural misalignment, with substantially lower structural-correctness metrics than proprietary models. </p></li></ul></li><li><p><strong>Code-driven</strong> or programmatic methods, on the other hand, first generate executable code that explicitly defines every element, using Python to specify coordinates, angles, and layout, and then render the image deterministically. This route offers much tighter control over structure, though it may come at the cost of visual richness. The upside is precision. 
The downside is that the visuals can feel rigid or lack stylistic variety.</p></li></ul><p>While there is a fundamental <strong>precision&#8211;expressiveness trade-off</strong>, this study proposes a 'Spiral Co-evolution' hypothesis in which the two routes reinforce each other: code-based reasoning can transfer to pixel-based models, while pixel-based diversity enriches training data and feeds back into code-based reasoning and LMM learning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tU5L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tU5L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 424w, https://substackcdn.com/image/fetch/$s_!tU5L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 848w, https://substackcdn.com/image/fetch/$s_!tU5L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 1272w, https://substackcdn.com/image/fetch/$s_!tU5L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tU5L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png" width="1390" height="426" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:1390,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:230870,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/187162808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tU5L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 424w, https://substackcdn.com/image/fetch/$s_!tU5L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 848w, https://substackcdn.com/image/fetch/$s_!tU5L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 1272w, https://substackcdn.com/image/fetch/$s_!tU5L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F201e1ca8-8595-43a5-89db-002fbe8ba172_1390x426.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Precision vs. Expressiveness Trade-off. Left (a): When plotting the function y = x ln x, pixel-based models produce visually smooth but mathematically inaccurate plots, while code-based methods ensure exactness via execution. Right (b): Conversely, for physical scenarios like a spring system, pixel-based models offer richer visual expressiveness, whereas code-based outputs remain schematic. [<a href="https://arxiv.org/pdf/2601.17027v1">Source</a>].</figcaption></figure></div><p>Figure 2 illustrates this nicely. 
For tasks that demand strict accuracy, like plotting function curves, code-based approaches are far more stable. But for visually complex scenes where detailed rendering matters, pixel-based models often produce more compelling results.</p><h2>ImgCoder: A Three-Step Workflow That Puts Logic First</h2>
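<p>To make the code-driven route concrete before looking at the workflow, here is a minimal, dependency-free sketch (my illustration, not ImgCoder's actual code): compute exact coordinates for y = x ln x in Python, then render them deterministically as an SVG polyline.</p>

```python
import math

def curve_points(f, x0, x1, n):
    """Sample f at n + 1 evenly spaced x values: exact coordinates, no visual guessing."""
    xs = [x0 + (x1 - x0) * i / n for i in range(n + 1)]
    return [(x, f(x)) for x in xs]

def to_svg_polyline(points, width=400, height=300):
    """Deterministically render sampled points as an SVG polyline string."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    sx = width / (max(xs) - min(xs))
    sy = height / (max(ys) - min(ys))
    # y axis is flipped because SVG's origin is the top-left corner
    coords = " ".join(
        f"{(x - min(xs)) * sx:.1f},{height - (y - min(ys)) * sy:.1f}" for x, y in points
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
            f'<polyline fill="none" stroke="black" points="{coords}"/></svg>')

# Figure 2(a)'s example: y = x ln x (defined for x > 0)
pts = curve_points(lambda x: x * math.log(x), 0.1, 3.0, 60)
svg = to_svg_polyline(pts)
```

<p>Because every coordinate is computed before rendering, the plot cannot drift from the function, which is exactly the precision the code-driven path buys, at the cost of expressive styling.</p>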
      <p>
          <a href="https://aiexpjourney.substack.com/p/scientific-image-synthesis-from-pretty">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[QuCo-RAG: Count What You Know, Retrieve What You Don’t — AI Innovations and Insights 112]]></title><description><![CDATA[It&#8217;s common for LLMs to produce &#8220;confidently wrong&#8221; answers.]]></description><link>https://aiexpjourney.substack.com/p/quco-rag-count-what-you-know-retrieve</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/quco-rag-count-what-you-know-retrieve</guid><pubDate>Mon, 16 Feb 2026 02:07:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_Y2x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s common for LLMs to produce &#8220;confidently wrong&#8221; answers.</p><p>This post is about a neat idea: deciding when to retrieve evidence by looking outside the model&#8217;s self-confidence to <strong>objective pre-training corpus statistics</strong>.</p><h2>Motivation</h2><p>The trigger signals used by most dynamic RAG systems to decide whether to retrieve are often unreliable. That&#8217;s because they typically rely on internal uncertainty measures like logits, probabilities, entropy, or attention weights. But LLMs aren&#8217;t well-calibrated. They can sound confident even when they&#8217;re completely wrong, which means these internal signals don&#8217;t always line up with actual correctness.</p><p>As shown in Figure 1, the DRAGIN example highlights this failure. A model might flag a <strong>token from the question</strong> (part of a movie title), like 'Il', as highly uncertain, while treating a completely hallucinated entity <strong>(like the wrong director name)</strong> as low uncertainty, giving false confidence to something that isn&#8217;t real.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9lQ8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9lQ8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 424w, 
https://substackcdn.com/image/fetch/$s_!9lQ8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 848w, https://substackcdn.com/image/fetch/$s_!9lQ8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 1272w, https://substackcdn.com/image/fetch/$s_!9lQ8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9lQ8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png" width="1030" height="844" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193877,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/186862077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9lQ8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 424w, https://substackcdn.com/image/fetch/$s_!9lQ8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 848w, https://substackcdn.com/image/fetch/$s_!9lQ8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 1272w, https://substackcdn.com/image/fetch/$s_!9lQ8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c78f68-4e42-44d0-93e1-3e4165c7c06d_1030x844.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Comparison of retrieval triggering mechanisms. (a) DRAGIN relies on model-internal signals, incorrectly assigning high uncertainty to &#8220;Il&#8221; (a token from the question) while showing low uncertainty on the hallucinated director name. (b) QuCo-RAG correctly detects the hallucination through zero entity co-occurrence in the pre-training corpus. [<a href="https://arxiv.org/pdf/2512.19134v1">Source</a>].</figcaption></figure></div><h2>QuCo-RAG: Replacing &#8220;Confidence&#8221; with &#8220;Corpus-Based Evidence&#8221;</h2><p>QuCo-RAG (Quantifying Uncertainty via Pre-training Corpus for Dynamic RAG) grounds the retrieval decision in hard evidence: it replaces subjective model confidence with measurable <strong>corpus statistics (entity frequency and entity-pair co-occurrence)</strong> drawn from the pre-training corpus. 
It enables Infini-gram to provide millisecond-level hallucination detection and dynamic retrieval triggering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Y2x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_Y2x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 424w, https://substackcdn.com/image/fetch/$s_!_Y2x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 848w, https://substackcdn.com/image/fetch/$s_!_Y2x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 1272w, https://substackcdn.com/image/fetch/$s_!_Y2x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_Y2x!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png" width="1200" height="607.4175824175824" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b93267cc-462c-4262-8611-3d0528eda210_1868x946.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:332655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/186862077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_Y2x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 424w, https://substackcdn.com/image/fetch/$s_!_Y2x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 848w, https://substackcdn.com/image/fetch/$s_!_Y2x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 1272w, https://substackcdn.com/image/fetch/$s_!_Y2x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb93267cc-462c-4262-8611-3d0528eda210_1868x946.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Overview of QuCo-RAG Framework. [<a href="https://arxiv.org/pdf/2512.19134v1">Source</a>].</figcaption></figure></div><p>As shown in Figure 2, QuCo-RAG introduces a two-stage detection strategy to decide when retrieval should be triggered: </p><ul><li><p>Before generation begins, it checks whether the input question contains rare entities, specifically, whether the <strong>average frequency of the entities</strong> falls below <strong>&#964;_entity</strong> (default 1000; the paper reports stability across a wide range, 1000&#8211;10^7). 
This is done using Infini-gram (a suffix array-based engine that supports millisecond-latency queries over trillion-token corpora) to look up entity frequency.</p></li><li><p>During generation, it monitors whether the <strong>co-occurrence</strong> of the entity pair being generated is supported by the corpus. If two entities have never co-occurred within the defined window (approximately 1000 tokens in experiments, with &#964;_cooc = 1 by default), QuCo-RAG (using Infini-gram) detects this zero co-occurrence in real time. This helps catch hallucinated facts as they emerge.</p></li></ul><p><strong>In short: detect rare entities ahead of time, block hallucinations in real time.</strong></p><p>Implementation-wise, QuCo-RAG proceeds as follows:</p>
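<p>The two trigger rules above can be sketched as follows. This is a toy illustration, not the paper's code: the count functions are hypothetical stand-ins for Infini-gram frequency and co-occurrence queries, and the statistics at the bottom are invented for the example.</p>

```python
# Toy sketch of QuCo-RAG's two-stage trigger (illustration only).
# The TAU defaults follow the values quoted in the text above.

TAU_ENTITY = 1000  # rare-entity threshold on average corpus frequency
TAU_COOC = 1       # trigger when co-occurrence count is below this (i.e., zero)

def should_retrieve_before(question_entities, corpus_count):
    """Stage 1 (pre-generation): retrieve if the question's entities are rare on average."""
    if not question_entities:
        return False
    avg = sum(corpus_count(e) for e in question_entities) / len(question_entities)
    return avg < TAU_ENTITY

def should_retrieve_during(entity_pair, corpus_cooccurrence):
    """Stage 2 (during generation): retrieve when a generated entity pair never co-occurs."""
    a, b = entity_pair
    return corpus_cooccurrence(a, b) < TAU_COOC

# Invented corpus statistics, purely for demonstration
_counts = {"Il Postino": 12, "Michael Radford": 40000}

def fake_count(entity):
    return _counts.get(entity, 0)

def fake_cooc(a, b):
    return 0 if "Wrong Director" in (a, b) else 5
```

<p>In the real system both lookups go through Infini-gram, which answers such count queries with millisecond latency over trillion-token corpora, so the checks add little overhead to decoding.</p>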
      <p>
          <a href="https://aiexpjourney.substack.com/p/quco-rag-count-what-you-know-retrieve">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>