AI Innovations and Trends 03: LightRAG, Docling, DRIFT, and More

Nov 11, 2024

This is the third article in the series. Today we look at four recent advances in AI:

LightRAG: Simple and Fast RAG
Docling: Efficient PDF Parsing with Layout Analysis and Structure Recognition
The DRIFT Search Feature of GraphRAG
Long-Context LLMs Meet RAG

LightRAG: Simple and Fast RAG

Open source code: https://github.com/HKUDS/LightRAG

Overview

LightRAG is a Retrieval-Augmented Generation (RAG) framework that enhances language model responses by integrating graph structures into text indexing and retrieval, improving contextual relevance and response coherence.

Figure 1: Overall architecture of LightRAG. Source: LightRAG.

LightRAG employs a dual-level retrieval system, using both low-level and high-level data retrieval to better capture detailed and abstract information, thus addressing complex user queries comprehensively.

The model's graph-based approach enables efficient retrieval of related entities and relationships, significantly reducing response time while maintaining information richness and contextual understanding.
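To make the dual-level idea concrete, here is a minimal toy sketch in Python. It is not LightRAG's actual API: the `Relation` type, the `dual_level_retrieve` function, the tiny example graph, and the plain keyword-overlap matching are all assumptions made for illustration. Low-level keywords are matched against entities, high-level keywords against relations, and the result is expanded by one hop.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set, Tuple


class Relation(NamedTuple):
    source: str
    target: str
    themes: FrozenSet[str]  # high-level (thematic) keywords attached to the edge


def dual_level_retrieve(
    low_keywords: Set[str],
    high_keywords: Set[str],
    entities: Dict[str, Set[str]],
    relations: List[Relation],
) -> Tuple[List[str], List[Relation]]:
    # Low-level retrieval: entities whose keywords overlap the specific query terms.
    low_hits = {name for name, kws in entities.items() if kws & low_keywords}
    # High-level retrieval: relations whose themes overlap the abstract query terms.
    high_hits = [r for r in relations if r.themes & high_keywords]
    # One-hop expansion: also keep relations that touch a matched entity.
    neighbors = [
        r for r in relations
        if r not in high_hits and (r.source in low_hits or r.target in low_hits)
    ]
    found_relations = high_hits + neighbors
    found_entities = set(low_hits)
    for r in found_relations:
        found_entities.update((r.source, r.target))
    return sorted(found_entities), found_relations


# Hypothetical toy graph: entity name -> low-level keywords.
entities = {
    "EV": {"electric", "vehicle"},
    "Battery": {"battery", "lithium"},
    "Air quality": {"air", "pollution"},
    "Transit": {"bus", "metro"},
    "Football": {"football"},
}
relations = [
    Relation("EV", "Battery", frozenset({"technology"})),
    Relation("EV", "Air quality", frozenset({"environment", "health"})),
    Relation("Transit", "Air quality", frozenset({"environment"})),
    Relation("Football", "Transit", frozenset({"events"})),
]

found_entities, found_relations = dual_level_retrieve(
    {"battery"}, {"environment"}, entities, relations
)
print(found_entities)  # ['Air quality', 'Battery', 'EV', 'Transit']
```

The specific term "battery" pulls in one entity and its neighborhood, the abstract term "environment" pulls in two thematic relations, and the unrelated "Football" node is left out.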

Additionally, LightRAG is designed to adapt quickly to new data through an incremental update mechanism, ensuring that the model remains relevant in dynamic, evolving data environments.
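The incremental update can be pictured as a set union on the graph: entities and relations extracted from a new document are merged into the existing index, and nothing already indexed is rebuilt. The sketch below is a simplified illustration under that assumption; `GraphIndex`, `merge`, and the example data are hypothetical names, not LightRAG code.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation label, target)


@dataclass
class GraphIndex:
    entities: Dict[str, Set[str]] = field(default_factory=dict)  # name -> descriptions
    edges: Set[Edge] = field(default_factory=set)

    def merge(self, new_entities: Dict[str, str], new_edges: Set[Edge]) -> None:
        """Union new nodes and edges into the index; existing content is untouched."""
        for name, description in new_entities.items():
            self.entities.setdefault(name, set()).add(description)
        self.edges |= new_edges


index = GraphIndex()
index.merge(
    {"EV": "electric vehicle", "Battery": "lithium-ion pack"},
    {("EV", "uses", "Battery")},
)
# A later document arrives: only its own entities and edges are processed.
index.merge(
    {"EV": "zero tailpipe emissions", "Air quality": "urban pollution level"},
    {("EV", "improves", "Air quality"), ("EV", "uses", "Battery")},
)
print(len(index.entities), len(index.edges))  # 3 2
```

The repeated edge is deduplicated and the existing "EV" node simply gains a second description, so the cost of an update depends on the new document rather than on the size of the whole index.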

LightRAG demonstrates superior performance over NaiveRAG in terms of comprehensiveness, empowerment, and diversity, as shown in a case study on indigenous perspectives in corporate mergers.

Figure 2: Case Study: Comparison Between LightRAG and the Baseline NaiveRAG. Source: LightRAG.

LightRAG delves deeper into key themes like cultural significance, collaboration, and legal frameworks, offering specific examples to support its insights. NaiveRAG, while informative, lacks this depth. The dual-level retrieval in LightRAG enables a thorough exploration of entities and relationships, effectively capturing the topic’s complexities and broader themes.

Comments and Insights: LightRAG vs. GraphRAG

Since both use a graph structure, what is the relationship between LightRAG and this year's popular GraphRAG?

Technical Inheritance: LightRAG can be seen as a refinement of the GraphRAG approach, although it is a separate project. Both use graph structures to represent entities and relationships, enhancing contextual understanding in information retrieval.
Retrieval Strategy: GraphRAG uses community traversal, retrieving information by traversing each community individually. This approach works well for broad queries but has a high computational cost. LightRAG, on the other hand, adopts a dual-level retrieval strategy, consisting of low-level and high-level retrieval, to handle both detailed and abstract information, making it more efficient for complex queries.
Computational Efficiency: GraphRAG handles dynamic data updates inefficiently, because it must rebuild the entire community structure, which is computationally expensive. LightRAG reduces this overhead significantly through an incremental update mechanism that allows rapid adaptation without full re-indexing.
Response Diversity and Comprehensiveness: GraphRAG primarily focuses on global information, while LightRAG’s dual-level retrieval mechanism enhances response diversity and comprehensiveness by covering both detailed and higher-level thematic information, as shown in Figure 3.
Figure 3: Case Study: Comparison Between LightRAG and the Baseline Method GraphRAG. Source: LightRAG.

Docling: Efficient PDF Parsing with Layout Analysis and Structure Recognition

Open source code: https://github.com/DS4SD/docling

Overview

Docling implements a linear document processing pipeline.

Figure 4: Sketch of Docling’s default processing pipeline. The inner part of the model pipeline is easily customizable and extensible. Source: Docling.

Each document is first parsed by a PDF backend, which extracts text content and coordinates from each page and generates bitmap images to support subsequent operations. Next, the model pipeline independently applies AI models on each page to extract content features such as layout and table structures. Finally, results from all pages are aggregated, and a post-processing stage adds metadata, detects language, infers reading order, and assembles a document object that can be serialized into JSON or Markdown.
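The three stages described above (backend parsing, per-page models, aggregation and serialization) can be sketched as a small linear pipeline. This is a toy illustration only and does not use Docling's API: `Page`, `backend_parse`, `layout_model`, `table_model`, `assemble`, and `convert` are hypothetical names, and the "models" are trivial placeholders rather than AI models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    number: int
    cells: List[str]  # text cells produced by the (stand-in) PDF backend
    predictions: Dict[str, List[str]] = field(default_factory=dict)


PageModel = Callable[[Page], None]  # a model annotates one page in place


def backend_parse(raw_pages: List[List[str]]) -> List[Page]:
    """Stage 1: stand-in for the PDF backend (text cells per page)."""
    return [Page(number=i + 1, cells=cells) for i, cells in enumerate(raw_pages)]


def layout_model(page: Page) -> None:
    """Placeholder layout model: first cell is a title, the rest is body text."""
    page.predictions["layout"] = [
        "title" if i == 0 else "text" for i in range(len(page.cells))
    ]


def table_model(page: Page) -> None:
    """Placeholder table model: cells containing '|' are treated as table rows."""
    page.predictions["tables"] = [c for c in page.cells if "|" in c]


def assemble(pages: List[Page]) -> str:
    """Stage 3: aggregate pages in order and serialize to Markdown-like text."""
    lines = []
    for page in sorted(pages, key=lambda p: p.number):
        for label, cell in zip(page.predictions["layout"], page.cells):
            lines.append(f"# {cell}" if label == "title" else cell)
    return "\n".join(lines)


def convert(raw_pages: List[List[str]], models: List[PageModel]) -> str:
    pages = backend_parse(raw_pages)
    for page in pages:  # Stage 2: models are applied to each page independently
        for model in models:
            model(page)
    return assemble(pages)


markdown = convert(
    [["Intro", "Some text."], ["Results", "a | b"]],
    [layout_model, table_model],
)
print(markdown)
```

Because the models are just a list passed to `convert`, swapping one out or adding another does not affect the parsing or assembly stages, which mirrors the customizable inner part of the pipeline shown in Figure 4.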

Layout Analysis Model

The layout analysis model in Docling is an object detector that predicts the bounding boxes and classes of elements on a page image. It is based on the RT-DETR architecture, retrained on the DocLayNet dataset, and processes page images at 72 dpi with sub-second latency on a single CPU.

Table Structure Recognition

Table objects detected on a page are passed to TableFormer, a vision-transformer model for table structure analysis. It identifies the logical row and column structure of each table and categorizes cells as headers or body. Processing a table typically takes 2-6 seconds on a CPU.

OCR

Docling includes optional OCR support, mainly for scanned PDFs or embedded bitmaps. It uses EasyOCR on page images rendered at 216 dpi to capture fine details, though this can be slow (around 30 seconds per page on CPU).

Evaluation

Measurements were taken on two hardware systems: a MacBook Pro M3 Max and a server running Ubuntu 20.04 LTS with an Intel Xeon E5-2690 CPU. The thread budget was fixed at 4 (default) and 16 (equal to the core count of the test hardware), with results shown in Figure 5.

Figure 5: Runtime characteristics of Docling with the standard model pipeline and settings, on the test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. Source: Docling.

Comments and Insights

When designing our own document parsing tool, Docling's pipeline is a useful reference. Some of its key models could be replaced with smaller models that we train ourselves.

Docling's approach to multi-threading is also worth studying.

The DRIFT Search Feature of GraphRAG

DRIFT Search (Dynamic Reasoning and Inference with Flexible Traversal) is an extension of GraphRAG, enhancing local search efficiency by incorporating community information. It combines global and local search methods, generating more detailed responses.

DRIFT offers a three-phase process:

An initial comparison of the query with top community reports for broad answers and guiding questions;
Refined local searches for intermediate answers, enriching context;
A hierarchical, relevance-ranked output of questions and answers.
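The three phases can be sketched as a small toy program. This is only an illustration, not GraphRAG's implementation: `QANode`, `drift_search`, the `primer` and `local_search` callables (stand-ins for LLM-backed steps), and the word-overlap relevance score are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class QANode:
    question: str
    answer: str
    score: float  # relevance to the original query, used for ranking
    children: List["QANode"] = field(default_factory=list)


def word_overlap(a: str, b: str) -> float:
    """Crude stand-in for semantic similarity (Jaccard overlap of words)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


# Each callable returns (answer, follow_up_questions).
Primer = Callable[[str, List[str]], Tuple[str, List[str]]]
LocalSearch = Callable[[str], Tuple[str, List[str]]]


def drift_search(query: str, reports: List[str], primer: Primer,
                 local_search: LocalSearch, top_k: int = 2,
                 max_depth: int = 2) -> QANode:
    # Phase A (primer): compare the query with the top-K community reports.
    top = sorted(reports, key=lambda r: word_overlap(query, r), reverse=True)[:top_k]
    answer, follow_ups = primer(query, top)
    root = QANode(query, answer, 1.0)

    # Phase B (follow-up): refine each follow-up question with local search.
    frontier = [(root, q) for q in follow_ups]
    for _ in range(max_depth):
        next_frontier = []
        for parent, question in frontier:
            ans, more = local_search(question)
            node = QANode(question, ans, word_overlap(query, f"{question} {ans}"))
            parent.children.append(node)
            next_frontier.extend((node, q) for q in more)
        frontier = next_frontier

    # Phase C (output hierarchy): rank each level by relevance.
    def rank(node: QANode) -> None:
        node.children.sort(key=lambda n: n.score, reverse=True)
        for child in node.children:
            rank(child)

    rank(root)
    return root
```

To try it, pass any two functions with the signatures above, for example a `primer` that returns a canned broad answer plus two follow-up questions and a `local_search` that returns a short answer with no further questions. The returned `QANode` tree is the question-and-answer hierarchy.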

Figure 6: An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user’s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step. C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive. Source: DRIFT Search.

This process integrates community insights, enhancing local search and ensuring the final results are both extensive and specific.

In a benchmark against local search, DRIFT outperformed local search 78% of the time on comprehensiveness and 81% of the time on diversity.

Comments and Insights

Balancing global search and local search in GraphRAG has always been a challenge. By incorporating community information, DRIFT Search broadens the starting point for local searches, making responses more diverse and relevant.

DRIFT shows promise in scenarios requiring layered and detailed responses and may drive RAG systems toward more adaptive architectures.

Long-Context LLMs Meet RAG

My previous articles covered long-context LLMs and RAG. Here, let's look at a new study on how the two interact.

In long-context LLMs, as the number of retrieved passages increases, the model's performance initially improves but then begins to decline. This phenomenon is mainly attributed to the disruptive effect of retrieved "hard negatives."

Figure 7: Evaluating the impact of hard negatives on long-context LLMs. (a) The retriever performance on NQ dataset: e5 > contriever > BM25. (b)(c)(d) For each query, a single golden passage (containing the correct answer) is combined with varying numbers of hard negative passages retrieved by different methods: e5, Contriever, BM25, and random sampling. The LLMs are then tasked with answering the query based on this context. This setup allows us to assess the robustness of LLMs to hard negatives and the influence of retriever characteristics on their overall impact. Source: Long-Context LLMs Meet RAG.
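The experimental setup in the Figure 7 caption (one golden passage mixed with a varying number of hard negatives) can be sketched as follows. This is a minimal illustration, not the study's code: `build_context`, the example passages, and the choice to shuffle passage positions with a fixed seed are all assumptions.

```python
import random
from typing import List


def build_context(golden: str, hard_negatives: List[str],
                  num_negatives: int, seed: int = 0) -> List[str]:
    """Mix one golden passage with the first `num_negatives` hard negatives."""
    passages = [golden] + hard_negatives[:num_negatives]
    # Assumption: passage order is randomized so position does not leak the answer.
    random.Random(seed).shuffle(passages)
    return passages


# Hypothetical example: the negatives are on-topic but do not answer the query
# "What is the capital of France?".
golden = "Paris is the capital of France."
negatives = [
    "Lyon is the third-largest city in France.",
    "France is a country in Western Europe.",
    "Marseille is a major French port city.",
]

for n in (0, 1, 3):
    context = build_context(golden, negatives, n)
    print(n, len(context), golden in context)
```

Sweeping `num_negatives` while keeping the golden passage fixed isolates the effect of the distractors: any drop in answer accuracy comes from the added passages, not from missing evidence.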

"Hard negatives" refer to retrieved passages or documents that are highly similar or contextually relevant to the query but do not contain the correct answer or relevant information. These passages can be misleading for the model because they appear closely related to the topic or query but, in fact, provide incorrect or irrelevant information.

Strategies

Through a series of experiments, this study analyzes the causes of this phenomenon and proposes several strategies to improve the robustness and output quality of long-context LLMs in RAG tasks:

Retrieval Reordering: Motivated by the “lost-in-the-middle” phenomenon, this strategy places the highest-scoring retrieved documents at the beginning and end of the input sequence. With key information in these positions, the model can focus on relevant content more effectively, which reduces the impact of hard negatives.
Implicit Robustness Fine-tuning: Since LLMs do not explicitly acquire the ability to handle noise during standard training, the study introduces a fine-tuning method using retrieved content with noise (including potentially irrelevant information). This method enables the model to exhibit greater robustness when encountering “hard negatives.”
Explicit Relevance Fine-tuning: While implicit fine-tuning enhances robustness, it does not explicitly train the model to identify relevant documents. Therefore, the study proposes adding an intermediate reasoning step, allowing the model to first analyze and identify relevant information before generating the final output, thus improving accuracy in detecting and utilizing relevant content within the retrieved context.
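The retrieval reordering strategy can be sketched in a few lines. Assuming the passages arrive sorted by descending retrieval score, one simple way to put the strongest passages at both ends is to alternate them between the front and the back of the sequence. The function name and this exact interleaving are my own illustration and may differ in detail from the study's procedure.

```python
from typing import List


def reorder_for_long_context(passages_by_score: List[str]) -> List[str]:
    """Place top-ranked passages at the start and end, weakest in the middle.

    `passages_by_score` must be sorted from highest to lowest retrieval score.
    """
    front, back = [], []
    for rank, passage in enumerate(passages_by_score):
        # Ranks 1, 3, 5, ... go to the front; ranks 2, 4, 6, ... go to the back.
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 has the highest score
print(reorder_for_long_context(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The two best passages (`d1`, `d2`) end up at the edges of the prompt, while the lowest-scoring ones, which are the most likely to be hard negatives, sit in the middle where the model pays the least attention.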

The study also suggests two directions for future work: automatically optimizing the order of retrieved passages, and fine-tuning LLMs with more granular, multi-step reasoning chains to make them more effective in RAG tasks.

Finally, if you’re interested in the series, feel free to check out my other articles.

AI Exploration Journey
