Retrieval Augmented Generation (RAG) [1] was initially proposed in 2020 as an end-to-end approach that combined a pre-trained retriever with a pre-trained generator. At that time, its main goal was to improve performance through model fine-tuning.
The release of ChatGPT in December 2022 marked a significant turning point for RAG. Since then, RAG has focused more on leveraging the reasoning capabilities of large language models (LLMs) to achieve better generation results by incorporating external knowledge.
RAG technology eliminates the need for developers to retrain the entire large-scale model for every specific task. Instead, they can simply connect relevant knowledge bases to provide additional input to the model, enhancing the accuracy of the answers.
This article provides a brief introduction to the concept, purpose, and characteristics of RAG.
What is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation (RAG) [1] is the process of enhancing large language models (LLMs) by incorporating additional information from external knowledge sources. This enables the LLMs to generate more accurate and context-aware answers, while also reducing hallucinations.
When answering questions or generating text, RAG first retrieves relevant information from existing knowledge bases or large document collections. The LLM then generates the answer by incorporating this retrieved information, which improves the quality of the response compared with relying on the LLM alone.
A typical workflow of RAG is illustrated in Figure 1.
As shown in Figure 1, RAG mainly consists of the following steps:
Indexing: The indexing process is a crucial initial step performed offline. It begins with cleaning and extracting the raw data, converting various file formats such as PDF, HTML and Word into standardized plain text. To accommodate the context constraints of the language model, these texts are divided into smaller and more manageable chunks, a process known as chunking. These chunks are then transformed into vector representations using embedding models. Finally, an index is created to store these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search capabilities.
Retrieval: The user query is used to retrieve relevant context from external knowledge sources. To accomplish this, the user query is processed by an encoding model, which maps it into the same embedding space as the indexed chunks. Then, a similarity search is conducted on the vector database to retrieve the top-k closest data objects.
Generation: The user query and the retrieved context are filled into a prompt template, and the resulting augmented prompt is passed to the LLM to produce the final answer. A minimal code sketch of these three steps is given below.
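To make the workflow concrete, here is a minimal, self-contained sketch of the three steps. It uses a toy in-memory vector index, and the functions embed_text and generate_answer are placeholders standing in for a real embedding model and a real LLM call; they are illustrative assumptions, not a specific library API.

```python
import numpy as np

# --- Placeholder components: assumptions for illustration, not a real library ---
def embed_text(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; maps text to a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=384)
    return vec / np.linalg.norm(vec)

def generate_answer(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an API request)."""
    return f"[LLM answer generated from a prompt of {len(prompt)} characters]"

# --- 1. Indexing (offline): clean text is chunked, embedded, and stored ---
def chunk(text: str, size: int = 200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

documents = [
    "RAG first retrieves relevant chunks from a knowledge base, "
    "then feeds them to the LLM together with the user query.",
]
chunks = [c for doc in documents for c in chunk(doc)]
index = np.stack([embed_text(c) for c in chunks])  # one row per chunk embedding

# --- 2. Retrieval (online): embed the query and find the top-k closest chunks ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed_text(query)
    scores = index @ q                     # cosine similarity (vectors are unit-normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]

# --- 3. Generation: fill a prompt template and pass the augmented prompt to the LLM ---
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return generate_answer(prompt)

print(rag_answer("What does RAG do before generation?"))
```

In a production system, the toy embedding function would be replaced by a trained embedding model, the numpy array by a vector database, and generate_answer by an actual LLM call; the overall index-retrieve-generate structure stays the same.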
Why do we need RAG?
Why do we still need RAG when we already have LLMs? The reason is simple: LLMs alone cannot solve the problems that RAG addresses. These problems include:
Model hallucination problem: Text generation in LLMs is based on probability. Without sufficient factual grounding, the model may produce content that sounds plausible and authoritative but is factually unreliable.
Timeliness problem: The larger the parameter size of an LLM, the higher the training cost and the longer the training time. As a result, time-sensitive data may not be included in training in time, leaving the model unable to directly answer time-sensitive questions.
Data security problem: Generic LLMs do not have access to enterprise-internal or user-private data. To use LLMs while keeping data secure, a good solution is to store the data locally and perform all retrieval and data computation locally, so that the cloud LLM only serves to summarize the retrieved information.
Answer constraint problem: RAG provides more control over LLM generation. For instance, when a question involves multiple knowledge points, the clues retrieved through RAG can be used to constrain the boundaries of the LLM's answer, as illustrated by the prompt sketch after this list.
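To make the last point concrete, below is an illustrative sketch of what such a constrained prompt template might look like; the exact wording is an assumption rather than a fixed standard.

```python
# Illustrative constrained prompt template; the wording is an assumption,
# not a fixed standard. The retrieved clues are inserted as {context}, and
# the instructions explicitly limit the LLM to that context.
CONSTRAINED_PROMPT = (
    "You are given reference context retrieved from a knowledge base.\n"
    "Answer the question using ONLY the context below. If the context does not\n"
    "contain the answer, reply \"I don't know\" instead of guessing.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)
```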
What are the characteristics of RAG?
RAG possesses the following characteristics, which enable it to effectively address the mentioned issues:
(1) Scalability: RAG reduces model size and training costs, and facilitates rapid knowledge expansion.
(2) Accuracy: The model provides answers based on facts, minimizing the occurrence of hallucinations.
(3) Controllability: RAG allows for knowledge updates and customization.
(4) Explainability: Relevant information retrieved serves as a reference for the model’s predictions.
(5) Versatility: RAG can be fine-tuned and customized for various tasks such as question answering, summarization, and dialogue.
Conclusion
Figuratively speaking, RAG can be likened to an open-book exam for an LLM. Just as in an open-book exam, where students may bring reference materials and consult them to find relevant information, the LLM is allowed to consult external knowledge when answering questions.
This article only provides an introduction to the basic knowledge of RAG. Many advanced RAG techniques will be introduced in the future.
Additionally, if you’re interested in RAG, feel free to check out my other articles.
Lastly, if there are any errors or omissions in this article, please kindly bring them to my attention.
References
[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401, 2020.