This article is the 20th in this thought-provoking series.
Today, we will explore four fascinating topics in AI:
HtmlRAG: From Text Fragments to a Global View
AFLOW: A Master Prospector in the Desert of Workflow Search Space
ChunkRAG: Eagle-Eyed Reader for Precise Knowledge Extraction
MarkItDown: A Tool for Converting Files to Markdown
HtmlRAG: From Text Fragments to a Global View
Open-source code: https://github.com/plageon/HtmlRAG
Vivid Description
HtmlRAG is like reading a book where you can see both the text and its chapter structure and layout, rather than just disconnected words (as in traditional RAG).
Overview
Current RAG systems convert HTML to plain text before processing, losing valuable structural information.
Thus, an intuitive idea arises: Could using HTML format directly in RAG systems better preserve document information?
HtmlRAG leverages HTML format instead of plain text in RAG systems to preserve semantic and structural information.
Because HTML documents are much longer than their plain-text counterparts, HtmlRAG shortens them with progressive pruning. The four steps shown in Figure 2 are: HTML Cleaning, Block Tree Construction, Text-Embedding-Based Block Pruning, and Generative Fine-Grained Block Pruning.
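To make the embedding-based pruning step concrete, here is a minimal sketch. It is not the authors' implementation: the block granularity, the embedding model name, and the keep_top_k parameter are illustrative choices, and the real system builds a proper block tree and adds a generative fine-grained pruning stage on top.

```python
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import numpy as np

def prune_html_blocks(html: str, query: str, keep_top_k: int = 5) -> str:
    """Keep only the HTML blocks most relevant to the query, preserving structure."""
    soup = BeautifulSoup(html, "html.parser")

    # Treat non-nested <p>/<li>/<table> tags as "blocks"; HtmlRAG builds a richer block tree.
    candidates = soup.find_all(["p", "li", "table"])
    blocks = [t for t in candidates
              if t.get_text(strip=True) and not t.find_parent(["p", "li", "table"])]
    if not blocks:
        return str(soup)

    # Score each block by embedding similarity to the query (model choice is illustrative).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_vec = model.encode([query])[0]
    block_vecs = model.encode([b.get_text(" ", strip=True) for b in blocks])
    scores = block_vecs @ query_vec / (
        np.linalg.norm(block_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)

    # Remove the least relevant blocks but keep the surrounding HTML skeleton.
    keep = set(np.argsort(scores)[::-1][:keep_top_k])
    for i, block in enumerate(blocks):
        if i not in keep:
            block.decompose()
    return str(soup)
```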
Commentary
HtmlRAG uses HTML as a knowledge carrier in RAG systems, leveraging its structure and using pruning algorithms to optimize context length.
However, most documents are not stored in HTML format, requiring conversion tools that can slow down processing.
AFLOW: A Master Prospector in the Desert of Workflow Search Space
Open-source code: https://github.com/geekan/MetaGPT
Vivid Description
AFLOW is a master prospector who, in a vast desert (the workflow search space), uses advanced tools (Monte Carlo Tree Search (MCTS) and operators) to keep digging until it finds the most deeply buried gold (efficient solutions).
Overview
Building agentic workflows for LLMs currently requires significant human effort, which limits their scalability and generalizability.
AFLOW reformulates workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. It is an automated framework that efficiently explores this space using Monte Carlo Tree Search (MCTS), iteratively optimizing workflows through code modification, tree-structured experience, and execution feedback.
AFLOW's core concept is to model workflows as a sequence of interconnected LLM-invoking nodes, where nodes represent LLM operations, and edges define the logic, dependencies, and flow between these operations. Operators are combinations of node operations that define logical relationships and common task patterns between nodes.
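To illustrate what a code-represented workflow can look like, here is a hypothetical sketch. The Node and Operator classes and the call_llm helper are my own illustrative stand-ins, not AFLOW's actual API; edges simply become ordinary control flow in code.

```python
from dataclasses import dataclass
from typing import Callable, List

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real client here."""
    raise NotImplementedError

@dataclass
class Node:
    name: str
    prompt_template: str  # the flexible prompt parameter that the search can tune

    def run(self, **inputs) -> str:
        return call_llm(self.prompt_template.format(**inputs))

class Operator:
    """An operator bundles several node invocations into a common task pattern."""
    def __init__(self, nodes: List[Node], combine: Callable[[List[str]], str]):
        self.nodes = nodes
        self.combine = combine

    def run(self, **inputs) -> str:
        return self.combine([node.run(**inputs) for node in self.nodes])

# Two nodes: generate a draft, then review it.
generate = Node("generate", "Solve the problem step by step: {question}")
review = Node("review", "Check this solution and fix any errors: {draft}")

# An ensemble-style operator: sample several drafts and merge them with another LLM call.
ensemble = Operator(
    nodes=[generate, generate, generate],
    combine=lambda drafts: call_llm("Merge these drafts into one answer:\n" + "\n---\n".join(drafts)),
)

# Edges are expressed as plain Python control flow: one node's output feeds the next.
def workflow(question: str) -> str:
    draft = ensemble.run(question=question)
    return review.run(draft=draft)
```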
As shown in Figure 4, AFLOW performs an MCTS-based search within a space defined by nodes with flexible prompt parameters, a given operator set, and code-represented edges.
Using a specialized MCTS variant for workflow optimization, AFLOW iteratively cycles through four steps: Soft Mixed Probability Selection, LLM-Based Expansion, Execution Evaluation, and Experience Backpropagation. This process continues until it reaches the maximum iterations or meets convergence criteria.
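The loop below sketches those four steps under simplifying assumptions: the softmax-style selection, the llm_modify_workflow and evaluate placeholders, and the patience-based stopping rule are stand-ins for the paper's components rather than its exact algorithm.

```python
import math
import random

def soft_mixed_selection(candidates):
    """Pick a parent workflow with probability increasing in its score (selection step)."""
    weights = [math.exp(c["score"]) for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

def llm_modify_workflow(workflow_code: str, feedback: str) -> str:
    """Placeholder: ask an LLM to edit the workflow code given execution feedback (expansion)."""
    raise NotImplementedError

def evaluate(workflow_code: str):
    """Placeholder: run the workflow on a validation set and return (score, feedback)."""
    raise NotImplementedError

def optimize(seed_workflow: str, max_iters: int = 20, patience: int = 5) -> str:
    tree = [{"code": seed_workflow, "score": 0.0, "feedback": ""}]
    best, stall = tree[0], 0
    for _ in range(max_iters):
        parent = soft_mixed_selection(tree)                                   # 1. selection
        child_code = llm_modify_workflow(parent["code"], parent["feedback"])  # 2. expansion
        score, feedback = evaluate(child_code)                                # 3. evaluation
        child = {"code": child_code, "score": score, "feedback": feedback}
        tree.append(child)                                                    # 4. store experience
        if score > best["score"]:
            best, stall = child, 0
        else:
            stall += 1
        if stall >= patience:  # stop after `patience` rounds without improvement
            break
    return best["code"]
```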
Commentary
AFLOW's reformulation of workflow optimization as a search problem over code-represented workflows is a genuinely innovative idea.
But I have the following concerns:
Although operators improve search efficiency, they must be designed in advance, and their applicability may be limited in complex or novel tasks.
AFLOW's search process terminates when certain conditions are met (such as no improvement for n rounds), but this may result in high-potential paths being missed.
ChunkRAG: Eagle-Eyed Reader for Precise Knowledge Extraction
Vivid Description
ChunkRAG is like a sharp-eyed reader who first breaks down long articles into small paragraphs, then applies expert judgment to pick out the most relevant passages, capturing all key points while avoiding irrelevant content.
Overview
Traditional RAG systems can produce inaccurate content by retrieving irrelevant information. Current document-level filtering fails to remove less relevant content within documents.
Consider a query asking "What is the capital of France?" Without proper filtering, the system might include unnecessary facts about other French cities, leading to incorrect or verbose responses (Figure 5, Left).
ChunkRAG introduces a novel LLM-based chunk filtering framework that enhances the precision and factuality of generated content through semantic chunking and relevance scoring.
The ChunkRAG framework operates in three main stages: Semantic Chunking, Hybrid Retrieval and Filtering, and Controlled Response Generation.
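Here is a minimal sketch of the first two stages: semantic chunking followed by LLM-based chunk filtering. The similarity threshold, the embedding model, and the score_chunk_with_llm helper are illustrative assumptions, not the paper's exact settings, and the hybrid retrieval and controlled generation stages are omitted.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunking(sentences, threshold: float = 0.6):
    """Start a new chunk whenever consecutive sentences drift apart semantically."""
    if not sentences:
        return []
    vecs = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = vecs[i] @ vecs[i - 1] / (
            np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[i - 1]) + 1e-9)
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

def score_chunk_with_llm(query: str, chunk: str) -> float:
    """Placeholder: ask an LLM to rate the chunk's relevance to the query on a 0-1 scale."""
    raise NotImplementedError

def filter_chunks(query: str, chunks, min_score: float = 0.7):
    """Keep only chunks the LLM judges relevant; the survivors feed response generation."""
    return [c for c in chunks if score_chunk_with_llm(query, c) >= min_score]
```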
Commentary
In summary, ChunkRAG integrates several advanced RAG technologies—semantic chunking, query rewriting, self-reflection, and hybrid retrieval strategies—to enhance performance.
In my view:
ChunkRAG’s semantic chunking method, which we discussed earlier, has a limitation: it performs poorly on sentences that have weak semantic similarity but strong logical connections, particularly in complex structures.
In the future, the self-reflection mechanism is expected to become a crucial element in quality control for complex task content generation.
MarkItDown: A Tool for Converting Files to Markdown
Overview
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.). It supports PDF, Images, PowerPoint, Word, and more.
It has recently become very popular.
As usual, let's take a look at how it converts PDFs and images to Markdown:
For PDFs, the call chain eventually reaches the PdfConverter class, which calls pdfminer.high_level.extract_text(...). For images, it eventually reaches the ImageConverter class, which extracts metadata and can call a multimodal LLM to obtain a caption/description.
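A quick usage sketch: the calls below follow the project's README at the time of writing and may change as the project evolves; the file names and the model name are placeholders.

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.pdf")   # PDFs are routed to PdfConverter, which uses pdfminer
print(result.text_content)

# For image captions, an LLM client can be supplied (client and model name are illustrative):
# from openai import OpenAI
# md = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
# print(md.convert("figure.png").text_content)
```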
Commentary
The approach is quite straightforward, though the project still seems to be under active development. I'm looking forward to seeing how it evolves.
Finally, if you’re interested in the series, feel free to check out my other articles.