Enhancing RAG for Reliable Responses: A Dive into SFR-RAG's Fight Against Hallucinations
Existing RAG frameworks face several limitations. Most general-purpose LLMs are not equipped to handle conflicting contextual information or to provide reliable citations. When the retrieved context is insufficient or unclear, they tend to fall back on their pre-trained parametric knowledge, producing hallucinated responses.
SFR-RAG is a novel model optimized to minimize hallucination and generate responses that are faithfully grounded in the retrieved context.
SFR-RAG
The SFR-RAG model is a 9-billion-parameter LLM instruction-tuned with a strong focus on context-grounded generation and hallucination minimization. It is designed to tackle common challenges in RAG frameworks, such as conflicting information or gaps in retrieved knowledge. Additionally, it offers function-calling capabilities, allowing it to interact with external tools dynamically, and is trained to cite appropriate sources reliably.
The framework is evaluated using ContextualBench, a benchmarking suite specifically designed for RAG systems, ensuring reproducibility and consistency across multiple datasets such as HotpotQA, TriviaQA, and TruthfulQA.
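To make the evaluation setup concrete, the sketch below shows what a ContextualBench-style loop over contextual QA pairs could look like, scoring exact match after standard answer normalization. The record layout and the answer_with_context() function are hypothetical stand-ins, not the actual ContextualBench API.

```python
# Minimal sketch of a contextual QA evaluation loop (exact match metric).
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (standard QA normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def answer_with_context(question: str, passages: list[str]) -> str:
    """Placeholder for a call to the RAG model under evaluation."""
    return passages[0] if passages else ""  # trivial stub for illustration

# Toy records in the shape a contextual QA benchmark typically provides.
records = [
    {"question": "Who wrote Hamlet?",
     "passages": ["William Shakespeare"],
     "answers": ["William Shakespeare", "Shakespeare"]},
]

hits = sum(exact_match(answer_with_context(r["question"], r["passages"]), r["answers"])
           for r in records)
print(f"Exact match: {hits / len(records):.2%}")
```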
The SFR-RAG model is built upon a specialized chat template and a fine-tuning process designed to ensure high fidelity in its responses by making full use of the retrieved contextual information.
SFR-RAG Chat Template:
The model introduces a unique chat structure that goes beyond the usual System, User, and Assistant roles by adding Thought and Observation roles.
Thought represents the model’s internal reasoning and decision-making process.
Observation holds the external information retrieved, helping to separate retrieved context from internal logic, which is crucial for avoiding confusion during response generation.
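To illustrate how these extra roles separate retrieved evidence from reasoning, here is a sketch of a multi-hop conversation laid out as a message list. The role names mirror the paper, but the exact token format of the SFR-RAG chat template is not reproduced, and the documents and entities are invented for the example.

```python
# Illustrative conversation using the Thought and Observation roles described above.
conversation = [
    {"role": "system", "content": "Answer using only the provided context and cite sources."},
    {"role": "user", "content": "When was the company that acquired XYZ Labs founded?"},
    {"role": "thought", "content": "I need the acquirer's name first, then its founding year."},
    {"role": "observation", "content": "[Doc 1] XYZ Labs was acquired by Acme Corp in 2021."},
    {"role": "thought", "content": "Acme Corp is the acquirer; now find its founding year."},
    {"role": "observation", "content": "[Doc 2] Acme Corp was founded in 1998."},
    {"role": "assistant", "content": "Acme Corp, which acquired XYZ Labs, was founded in 1998 [Doc 2]."},
]
```

Keeping retrieved passages in Observation turns and internal reasoning in Thought turns is what lets the model be trained to ground its final answer only in the former.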
Fine-Tuning Process:
The model is trained with a focus on real-world retrieval tasks, ensuring it can handle complex contexts while minimizing hallucinations. Key aspects include:
Contextual Comprehension: It extracts relevant information from long contexts, distinguishes conflicting data, and refrains from generating answers when there is insufficient context.
Hallucination Reduction: The model is fine-tuned to avoid generating information that is not grounded in the retrieved context, ensuring high factual accuracy.
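One way to picture this training objective is through the shape of the supervised samples: an answerable case where the target cites the context, and an unanswerable case where the target is an explicit refusal. The records below are illustrative only, not the paper's actual training data.

```python
# Hypothetical fine-tuning samples: one grounded answer, one abstention when context is insufficient.
samples = [
    {
        "context": ["[1] The Eiffel Tower was completed in 1889."],
        "question": "When was the Eiffel Tower completed?",
        "target": "The Eiffel Tower was completed in 1889 [1].",
    },
    {
        "context": ["[1] The Eiffel Tower is located in Paris."],
        "question": "When was the Eiffel Tower completed?",
        "target": "The provided context does not contain the completion date, so I cannot answer.",
    },
]
```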
Agentic and Function-Calling Abilities:
SFR-RAG is capable of proactive search and function-calling, allowing it to retrieve additional information or use external tools dynamically during the response generation process. This feature is particularly useful for tasks requiring multi-hop reasoning across different contexts.
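In the spirit of that agentic behavior, the sketch below shows a minimal retrieve-then-answer loop: the model either requests a search or produces a final answer, and tool results are fed back as observations. Both search_tool() and model_step() are hypothetical placeholders, not SFR-RAG's real API.

```python
# Minimal sketch of an agentic function-calling loop.
def search_tool(query: str) -> str:
    """Stand-in retriever; in practice this would query a search index or vector store."""
    return f"(retrieved passage for: {query})"

def model_step(messages: list[dict]) -> dict:
    """Stand-in for one model call; returns either a tool call or a final answer."""
    has_observation = any(m["role"] == "observation" for m in messages)
    if not has_observation:
        return {"type": "tool_call", "tool": "search",
                "arguments": {"query": messages[-1]["content"]}}
    return {"type": "answer", "content": "Answer grounded in the retrieved passage."}

def run_agent(question: str, max_steps: int = 4) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model_step(messages)
        if step["type"] == "tool_call":
            # Execute the requested tool and feed the result back as an observation turn.
            result = search_tool(step["arguments"]["query"])
            messages.append({"role": "observation", "content": result})
        else:
            return step["content"]
    return "No answer within the step budget."

print(run_agent("Which city hosts the headquarters of the company that acquired XYZ Labs?"))
```

In a multi-hop setting, the model would typically issue several such search calls, accumulating observations before committing to a cited answer.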
Evaluation
Compared to other solutions like Command-R+, SFR-RAG stands out by achieving state-of-the-art performance with significantly fewer parameters. While other models like GPT-4o excel in standard tasks, SFR-RAG's specialized tuning for RAG frameworks allows it to outperform them in key contextual benchmarks like 2WikiHopQA and HotpotQA.
The model also underwent additional testing with FaithEval, a suite designed to assess resilience to changes in context. It remained faithful to the provided context even when facts were altered or removed, showcasing the model's ability to resist hallucination and remain focused on factual, relevant information.
Conclusion and Insights
This article explored SFR-RAG, a 9-billion-parameter LLM fine-tuned for RAG. Its ability to remain faithful to context, avoid hallucination, and cite reliable sources marks a significant advancement in generative AI.
In my opinion, one of the most compelling aspects is its compact size compared to competitors, achieving better results with fewer parameters.
That said, some challenges remain.
One minor shortcoming is the absence of a detailed case study. While the model’s strong performance is demonstrated through standardized benchmarks, there are no real-world application examples, such as its deployment in domains like healthcare, legal, or financial industries. These fields often require processing complex contextual information, and a case study could further showcase SFR-RAG's practical value in commercial and applied settings.
While SFR-RAG performs well on several benchmarks, the paper does not examine its performance on open-domain, non-contextual tasks, so it is unclear how the specialized tuning affects general-purpose capabilities.
Finally, scaling the model up could further improve reasoning and contextual comprehension over larger and more complex contexts, which would be an interesting direction for future work.