Re-ranking plays a crucial role in the Retrieval Augmented Generation (RAG) process. In a naive RAG approach, a large number of contexts may be retrieved, but not all of them are necessarily relevant to the question. Re-ranking allows for the reordering and filtering of documents, placing the relevant ones at the forefront, thereby enhancing the effectiveness of RAG.
This article introduces RAG’s re-ranking technique and demonstrates how to incorporate re-ranking functionality using two methods.
Introduction to Re-ranking
As shown in Figure 1, the task of re-ranking is like an intelligent filter. When the retriever retrieves multiple contexts from the indexed collection, these contexts may have different relevance to the user’s query. Some contexts may be very relevant (highlighted in red boxes in Figure 1), while others may only be slightly related or even unrelated (highlighted in green and blue boxes in Figure 1).
The task of re-ranking is to evaluate the relevance of these contexts and prioritize the ones that are most likely to provide accurate and relevant answers. This allows the LLM to prioritize these top-ranked contexts when generating answers, thereby improving the accuracy and quality of the response.
In simpler terms, re-ranking is like helping you choose the most relevant references from a pile of study materials during an open-book exam, so that you can answer the questions more efficiently and accurately.
The re-ranking methods described in this article can be mainly divided into the following two types:
Re-ranking models: These models consider the interaction features between documents and queries to evaluate their relevance more accurately.
LLM: The emergence of LLM has opened up new possibilities for re-ranking. By thoroughly understanding the entire document and query, it is possible to capture semantic information more comprehensively.
Using a Re-ranking Model as Reranker
Unlike an embedding model, a re-ranking model takes the query and a context together as input and directly outputs a similarity score instead of an embedding. It is important to note that re-ranking models are typically optimized with a cross-entropy loss, so the relevance scores are not limited to a specific range and can even be negative.
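For a concrete picture of this interface, here is a minimal sketch that scores query-passage pairs with bge-reranker-base through the FlagEmbedding library (the query and passages here are illustrative):

from FlagEmbedding import FlagReranker

# Load the open-source re-ranker; use_fp16 speeds up inference slightly
reranker = FlagReranker("BAAI/bge-reranker-base", use_fp16=True)

query = "Can you provide a concise description of the TinyLlama model?"
passages = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens.",
    "The sliding window method follows the idea of bubble sort.",
]

# Each (query, passage) pair is scored jointly by the model.
# The scores are unbounded logits, so negative values are normal.
scores = reranker.compute_score([[query, p] for p in passages])
print(scores)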
Currently, there are not many re-ranking models available. One option is Cohere's online model, which can be accessed through an API. There are also open-source models such as bge-reranker-base and bge-reranker-large, among others.
Figure 2 shows the evaluation results using the Hit Rate and Mean Reciprocal Rank (MRR) metrics:
From this evaluation result, we can see:
Regardless of the embedding model used, re-ranking demonstrates a higher hit rate and MRR, indicating the significant impact of re-ranking.
Currently, the best re-ranking model is Cohere's, but it is a paid service. The open-source bge-reranker-large model has capability similar to Cohere's.
The combination of embedding model and re-ranking model can also have an impact, so developers may need to experiment with different combinations in practice.
In this article, the bge-reranker-base model will be used.
Environment Configuration
Import the relevant libraries, and set the environment and global variables:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_index.schema import QueryBundle
dir_path = "YOUR_DIR_PATH"
The directory contains only one PDF file, the paper “TinyLlama: An Open-Source Small Language Model”:
(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf
Using LlamaIndex to build a simple retriever
# Load the document from the directory and build a vector index
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
# Retrieve the top 3 most similar nodes
retriever = index.as_retriever(similarity_top_k=3)
Basic Retrieval
query = "Can you provide a concise description of the TinyLlama model?"
nodes = retriever.retrieve(query)

for node in nodes:
    print('----------------------------------------------------')
    display_source_node(node, source_length=500)
The display_source_node function is adapted from the llama_index source code. The original function was designed for Jupyter notebook, so it has been modified as follows:
from llama_index.schema import ImageNode, MetadataMode, NodeWithScore
from llama_index.utils import truncate_text
def display_source_node(
    source_node: NodeWithScore,
    source_length: int = 100,
    show_source_metadata: bool = False,
    metadata_mode: MetadataMode = MetadataMode.NONE,
) -> None:
    """Display source node"""
    source_text_fmt = truncate_text(
        source_node.node.get_content(metadata_mode=metadata_mode).strip(), source_length
    )
    text_md = (
        f"Node ID: {source_node.node.node_id} \n"
        f"Score: {source_node.score} \n"
        f"Text: {source_text_fmt} \n"
    )
    if show_source_metadata:
        text_md += f"Metadata: {source_node.node.metadata} \n"
    if isinstance(source_node.node, ImageNode):
        text_md += "Image:"

    print(text_md)
    # display(Markdown(text_md))
    # if isinstance(source_node.node, ImageNode) and source_node.node.image is not None:
    #     display_image(source_node.node.image)
The results of basic retrieval are as follows, representing the top 3 nodes before re-ranking:
----------------------------------------------------
Node ID: 438b9d91-cd5a-44a8-939e-3ecd77648662
Score: 0.8706055408845863
Text: 4 Conclusion
In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote
transparency in the open-source LLM pre-training community, we have released all relevant infor-
mation, including our pre-training code, all intermediate model checkpoints, and the details of our
data processing steps. With its compact architecture and promising performance, TinyLlama can
enable end-user applications on mobile devices, and serve as a lightweight platform for testing a
w...
----------------------------------------------------
Node ID: ca4db90f-5c6e-47d5-a544-05a9a1d09bc6
Score: 0.8624531691777889
Text: TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang∗Guangtao Zeng∗Tianduo Wang Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
{peiyuan_zhang, tianduo_wang, luwei}@sutd.edu.sg
guangtao_zeng@mymail.sutd.edu.sg
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contr...
----------------------------------------------------
Node ID: e2d97411-8dc0-40a3-9539-a860d1741d4f
Score: 0.8346160605298356
Text: Although these works show a clear preference on large models, the potential of training smaller
models with larger dataset remains under-explored. Instead of training compute-optimal language
models, Touvron et al. (2023a) highlight the importance of the inference budget, instead of focusing
solely on training compute-optimal language models. Inference-optimal language models aim for
optimal performance within specific inference constraints This is achieved by training models with
more tokens...
Re-ranking
To re-rank the nodes above, use the bge-reranker-base model:
print('------------------------------------------------------------------------------------------------')
print('Start reranking...')

reranker = FlagEmbeddingReranker(
    top_n=3,
    model="BAAI/bge-reranker-base",
)

query_bundle = QueryBundle(query_str=query)
ranked_nodes = reranker._postprocess_nodes(nodes, query_bundle=query_bundle)

for ranked_node in ranked_nodes:
    print('----------------------------------------------------')
    display_source_node(ranked_node, source_length=500)
The results after re-ranking are as follows:
------------------------------------------------------------------------------------------------
Start reranking...
----------------------------------------------------
Node ID: ca4db90f-5c6e-47d5-a544-05a9a1d09bc6
Score: -1.584416151046753
Text: TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang∗Guangtao Zeng∗Tianduo Wang Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
{peiyuan_zhang, tianduo_wang, luwei}@sutd.edu.sg
guangtao_zeng@mymail.sutd.edu.sg
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contr...
----------------------------------------------------
Node ID: e2d97411-8dc0-40a3-9539-a860d1741d4f
Score: -1.7028117179870605
Text: Although these works show a clear preference on large models, the potential of training smaller
models with larger dataset remains under-explored. Instead of training compute-optimal language
models, Touvron et al. (2023a) highlight the importance of the inference budget, instead of focusing
solely on training compute-optimal language models. Inference-optimal language models aim for
optimal performance within specific inference constraints This is achieved by training models with
more tokens...
----------------------------------------------------
Node ID: 438b9d91-cd5a-44a8-939e-3ecd77648662
Score: -2.904750347137451
Text: 4 Conclusion
In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote
transparency in the open-source LLM pre-training community, we have released all relevant infor-
mation, including our pre-training code, all intermediate model checkpoints, and the details of our
data processing steps. With its compact architecture and promising performance, TinyLlama can
enable end-user applications on mobile devices, and serve as a lightweight platform for testing a
w...
It is evident that after re-ranking, the node with ID ca4db90f-5c6e-47d5-a544-05a9a1d09bc6 has moved up from second to first place, which means that the most relevant context is now ranked first.
Using LLM as Reranker
Existing LLM-based re-ranking methods can be roughly divided into three categories: fine-tuning an LLM on re-ranking tasks, prompting an LLM to re-rank, and using an LLM for data augmentation during training.
Prompting an LLM to re-rank has a lower cost. Below is a demonstration using RankGPT, which has been integrated into LlamaIndex.
The idea of RankGPT is to perform zero-shot listwise passage re-ranking with an LLM (such as ChatGPT or GPT-4). It applies a permutation generation approach and a sliding-window strategy to re-rank passages efficiently.
As shown in Figure 3, the paper presents three feasible methods.
The first two methods are conventional: a score is given to each document, and then all passages are sorted by that score.
The third method, permutation generation, is proposed in the paper. Instead of relying on an external score, the model performs end-to-end sorting of the passages. In other words, it directly uses the LLM's semantic understanding to rank all candidate passages by relevance.
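For intuition, a permutation-generation prompt can be sketched roughly as follows. This is a simplified illustration, not the exact prompt used by RankGPT or LlamaIndex, and build_rank_prompt is a hypothetical helper:

# A simplified, hypothetical sketch of a permutation-generation prompt.
# The LLM sees numbered passages and answers with an ordering such as
# "[2] > [1] > [3]", which is then parsed back into a ranking.
def build_rank_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"The following are {len(passages)} passages, each indicated by a number identifier.\n"
        f"{numbered}\n"
        f"Rank the passages based on their relevance to the query: {query}\n"
        "Answer only with the ranking, e.g. [2] > [1] > [3]."
    )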
However, the number of candidate documents is typically very large, while the LLM's input length is limited, so it is often impossible to feed in all the passages at once.
Thus, as shown in Figure 4, a sliding-window method is introduced, following the idea of bubble sort: each step sorts only the first 4 texts, then the window is moved and the subsequent 4 texts are sorted. After iterating through all the texts, we obtain the top-performing ones.
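As a rough sketch of this idea (the window size, step size, and rank_window helper are illustrative assumptions, not the actual RankGPT implementation):

# A rough sketch of sliding-window re-ranking. Overlapping windows are
# ranked one at a time so that the most relevant passages "bubble up"
# to the front of the list, as in a pass of bubble sort.
# rank_window stands in for an LLM call that returns the passages of a
# window sorted by relevance to the query.
def sliding_window_rerank(query, passages, rank_window, window=4, step=2):
    passages = list(passages)
    start = max(len(passages) - window, 0)
    while True:
        passages[start:start + window] = rank_window(query, passages[start:start + window])
        if start == 0:
            return passages
        start = max(start - step, 0)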
Please note that to use RankGPT, you need a newer version of LlamaIndex. The version I had previously installed (0.9.29) does not include the code required for RankGPT, so I created a new conda environment with LlamaIndex version 0.9.45.post1.
The code is simple: building on the code from the previous section, just set RankGPT as the reranker.
from llama_index.postprocessor import RankGPTRerank
from llama_index.llms import OpenAI

reranker = RankGPTRerank(
    top_n=3,
    llm=OpenAI(model="gpt-3.5-turbo-16k"),
    # verbose=True,
)
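Applying it to the nodes retrieved earlier mirrors the previous section; here is a short sketch using postprocess_nodes, the public counterpart of the _postprocess_nodes call used above:

# Re-rank the previously retrieved nodes with RankGPT
query_bundle = QueryBundle(query_str=query)
ranked_nodes = reranker.postprocess_nodes(nodes, query_bundle=query_bundle)

for ranked_node in ranked_nodes:
    print('----------------------------------------------------')
    display_source_node(ranked_node, source_length=500)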
The overall results are as follows:
(llamaindex_new) Florian:~ Florian$ python /Users/Florian/Documents/rerank.py
----------------------------------------------------
Node ID: 20de8234-a668-442d-8495-d39b156b44bb
Score: 0.8703492815379594
Text: 4 Conclusion
In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote
transparency in the open-source LLM pre-training community, we have released all relevant infor-
mation, including our pre-training code, all intermediate model checkpoints, and the details of our
data processing steps. With its compact architecture and promising performance, TinyLlama can
enable end-user applications on mobile devices, and serve as a lightweight platform for testing a
w...
----------------------------------------------------
Node ID: 47ba3955-c6f8-4f28-a3db-f3222b3a09cd
Score: 0.8621633467539512
Text: TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang∗Guangtao Zeng∗Tianduo Wang Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
{peiyuan_zhang, tianduo_wang, luwei}@sutd.edu.sg
guangtao_zeng@mymail.sutd.edu.sg
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contr...
----------------------------------------------------
Node ID: 17cd9896-473c-47e0-8419-16b4ac615a59
Score: 0.8343984516104476
Text: Although these works show a clear preference on large models, the potential of training smaller
models with larger dataset remains under-explored. Instead of training compute-optimal language
models, Touvron et al. (2023a) highlight the importance of the inference budget, instead of focusing
solely on training compute-optimal language models. Inference-optimal language models aim for
optimal performance within specific inference constraints This is achieved by training models with
more tokens...
------------------------------------------------------------------------------------------------
Start reranking...
----------------------------------------------------
Node ID: 47ba3955-c6f8-4f28-a3db-f3222b3a09cd
Score: 0.8621633467539512
Text: TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang∗Guangtao Zeng∗Tianduo Wang Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
{peiyuan_zhang, tianduo_wang, luwei}@sutd.edu.sg
guangtao_zeng@mymail.sutd.edu.sg
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contr...
----------------------------------------------------
Node ID: 17cd9896-473c-47e0-8419-16b4ac615a59
Score: 0.8343984516104476
Text: Although these works show a clear preference on large models, the potential of training smaller
models with larger dataset remains under-explored. Instead of training compute-optimal language
models, Touvron et al. (2023a) highlight the importance of the inference budget, instead of focusing
solely on training compute-optimal language models. Inference-optimal language models aim for
optimal performance within specific inference constraints This is achieved by training models with
more tokens...
----------------------------------------------------
Node ID: 20de8234-a668-442d-8495-d39b156b44bb
Score: 0.8703492815379594
Text: 4 Conclusion
In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote
transparency in the open-source LLM pre-training community, we have released all relevant infor-
mation, including our pre-training code, all intermediate model checkpoints, and the details of our
data processing steps. With its compact architecture and promising performance, TinyLlama can
enable end-user applications on mobile devices, and serve as a lightweight platform for testing a
w...
Note that because an LLM is used, the scores are not recalculated after re-ranking: each node keeps its original retrieval score and only the order changes. Of course, this is not crucial.
From the results, we can see that after re-ranking, the top result is the correct text containing the answer, consistent with the result obtained earlier with the re-ranking model.
Evaluation
We can use the evaluation method described in the previous article of this series; the specific process is detailed there. The modified code is as follows:
reranker = FlagEmbeddingReranker(
    top_n=3,
    model="BAAI/bge-reranker-base",
    use_fp16=False,
)

# Or use an LLM as the reranker:
# from llama_index.postprocessor import RankGPTRerank
# from llama_index.llms import OpenAI
# reranker = RankGPTRerank(
#     top_n=3,
#     llm=OpenAI(model="gpt-3.5-turbo-16k"),
#     # verbose=True,
# )

# Add the reranker to the query engine
query_engine = index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[reranker],
)
# query_engine = index.as_query_engine()  # original query_engine
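As a quick sanity check, the engine can then be queried end to end, for example:

# Retrieval, re-ranking, and answer generation in one call
response = query_engine.query(
    "Can you provide a concise description of the TinyLlama model?"
)
print(response)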
Interested readers can test it out.
Conclusion
Overall, this article introduces the principles and two mainstream methods of re-ranking.
Among them, the method of using a re-ranking model is lightweight and has less overhead.
On the other hand, the LLM-based method performs well on multiple benchmarks but is more expensive. Moreover, it performs well only with ChatGPT and GPT-4; its performance is not as good with open-source models such as FLAN-T5 and Vicuna-13B.
Therefore, in practical projects, a specific trade-off is required.
Additionally, if you’re interested in RAG, feel free to check out my other articles.
Finally, if you have any questions, please feel free to point them out in the comments section.