Advanced RAG 01: Problems of Naive RAG
Retrieval-Augmented Generation (RAG) is the process of improving large language models (LLMs) by integrating additional information from external knowledge sources. This allows the LLMs to produce more precise and context-aware responses, while also mitigating hallucinations.
RAG has emerged as the most popular architecture in LLM-based systems since 2023, and many products rely heavily on it for their core functionality. Therefore, optimizing RAG to make retrieval faster and results more accurate has become a crucial issue.
This series of articles will focus on introducing advanced RAG techniques to enhance the quality of RAG generation.
Naive RAG Review
A typical workflow of naive RAG is illustrated in Figure 1.
As shown in Figure 1, RAG mainly consists of the following steps:
Indexing: The indexing process is a crucial initial step performed offline. It begins with cleaning and extracting the raw data, converting various file formats such as PDF, HTML and Word into standardized plain text. To accommodate the context constraints of the language model, these texts are divided into smaller and more manageable chunks, a process known as chunking. These chunks are then transformed into vector representations using embedding models. Finally, an index is created to store these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search capabilities.
Retrieval: The user query is used to retrieve relevant context from external knowledge sources. To accomplish this, the user query is processed by an encoding model, which generates semantically related embeddings. Then, a similarity search is conducted on a vector database to retrieve the top k closest data objects.
Generation: The user query and the retrieved context are filled into a prompt template, and the resulting augmented prompt is fed into the LLM to generate the final answer.
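The three steps above can be sketched in a few dozen lines of Python. This is a minimal illustration only: the bag-of-words "embedding", fixed-size chunking, and prompt template are toy stand-ins for a real embedding model, chunking strategy, and vector database.

```python
# Minimal sketch of the naive RAG workflow: offline indexing, then
# retrieval and prompt assembly. All components are toy placeholders.
import math
from collections import Counter

def chunk(text: str, size: int = 50) -> list[str]:
    """Split plain text into fixed-size word chunks (one-size-fits-all)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system uses an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Similarity search: return the top-k chunks closest to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Fill query and retrieved context into a simple prompt template."""
    context_block = "\n".join(contexts)
    return f"Answer using only the context below.\nContext:\n{context_block}\nQuestion: {query}"

# Offline indexing: chunk the corpus and store (chunk, embedding) pairs.
corpus = "RAG retrieves external knowledge. LLMs generate answers. Chunking splits documents."
index = [(c, embed(c)) for c in chunk(corpus, size=5)]

# Online: retrieve top-k chunks, assemble the augmented prompt for the LLM.
query = "What does RAG retrieve?"
prompt = build_prompt(query, retrieve(query, index))
```

In a production system, the `embed` and `retrieve` stubs would be replaced by an embedding model and a vector database, but the data flow is the same.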
Problems with Naive RAG
As shown in Figure 2, Naive RAG has problems (the red dashed boxes) in all three steps mentioned above, and there is ample room for optimization.
Indexing
Information extraction is incomplete: useful information in images and tables within unstructured files such as PDFs is not effectively processed.
The chunking process uses a "one-size-fits-all" strategy instead of selecting optimal strategies based on the characteristics of different file types. This results in chunks containing incomplete semantic information. It also fails to consider important details, such as existing headings in the text.
The indexing structure is not sufficiently optimized, leading to inefficient retrieval functionality.
The embedding model’s semantic representation capability is weak.
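To make the chunking problem concrete, here is a sketch of a heading-aware splitter that keeps each section intact, in contrast with the fixed-size split above. The Markdown-style "#" heading convention is an assumption for illustration; real documents would need format-specific parsing.

```python
# Heading-aware chunking sketch: group lines under their nearest heading
# so each chunk is a complete section, preserving semantic boundaries.
# Assumes Markdown-style "#" headings for simplicity.
def chunk_by_heading(text: str) -> list[str]:
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:  # a new section begins
            chunks.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Intro\nRAG overview.\n# Details\nIndexing and retrieval.\nGeneration."
sections = chunk_by_heading(doc)
# Each chunk now starts with its heading instead of cutting mid-section.
```

A fixed-size splitter applied to the same document could cut a section in half and separate a heading from its body; the heading-aware version avoids that at the cost of variable chunk sizes.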
Retrieval
The relevance of the recalled contexts is inadequate and the accuracy is low.
The low recall rate prevents the retrieval of all relevant passages, thereby hindering the ability of LLMs to generate comprehensive answers.
The query may be inaccurate or the semantic representation capability of the embedding model may be weak, resulting in the inability to retrieve valuable information.
The retrieval algorithm is limited because it does not incorporate different types of retrieval methods or algorithms, such as combining keyword, semantic, and vector retrieval.
Information redundancy occurs when multiple retrieved contexts contain similar information, leading to repetitive content in the generated answers.
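The last two problems suggest a direction for improvement: fuse multiple retrieval signals and filter near-duplicate contexts. The sketch below combines a keyword score with a vector-similarity score and then drops redundant results. The fusion weight and deduplication threshold are illustrative assumptions, not a specific library's API.

```python
# Hybrid retrieval sketch: weighted fusion of keyword and vector scores,
# followed by near-duplicate filtering of the retrieved contexts.
import math
from collections import Counter

def vec(text: str) -> Counter:
    """Toy bag-of-words vector; stands in for a real embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_retrieve(query, docs, k=3, alpha=0.5, dedup_threshold=0.9):
    # Rank by a weighted sum of the two signals (alpha balances them).
    scored = sorted(
        docs,
        key=lambda d: alpha * keyword_score(query, d)
        + (1 - alpha) * cosine(vec(query), vec(d)),
        reverse=True,
    )
    # Deduplicate: skip any context nearly identical to one already kept.
    kept = []
    for d in scored:
        if all(cosine(vec(d), vec(x)) < dedup_threshold for x in kept):
            kept.append(d)
        if len(kept) == k:
            break
    return kept

docs = [
    "RAG combines retrieval and generation.",
    "RAG combines retrieval and generation.",  # exact duplicate
    "Vector search finds semantically similar chunks.",
]
results = hybrid_retrieve("retrieval and generation", docs, k=2)
```

In practice the keyword side would typically be BM25 and the vector side a dense embedding index, but the fusion-then-deduplicate structure is the same.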
Generation
The retrieved context may not be effectively integrated with the current generation task, resulting in inconsistent outputs.
Over-reliance on the retrieved information during generation is risky: the output may simply repeat the retrieved content without adding any value.
The LLM may generate incorrect, irrelevant, harmful, or biased responses.
Note that the causes of these problems can be multifaceted. For instance, if the final response given to the user contains irrelevant content, the fault may not lie solely with the LLM. The underlying cause could be imprecise text extraction from the PDF, the embedding model's inability to accurately capture semantics, and so on.
Conclusion
This article introduces the problems that exist in Naive RAG.
The next part of this series will provide measures or solutions to mitigate these problems and enhance the effectiveness of RAG.
If you’re interested in RAG technologies, feel free to check out my other articles.
Lastly, if there are any errors or omissions in this article, please kindly point them out.