At Literal AI, we see a lot of engineers implementing RAG applications, and techniques evolve rapidly. The first requirement when building a RAG application is to observe and debug the intermediate steps of the process. The challenge that follows is to evaluate the retrieval strategy.
RAG (Retrieval Augmented Generation) combines large language models with external knowledge retrieval. This approach allows AI systems to generate responses that are both coherent and grounded in accurate, up-to-date information. A key use case is in customer support, where chatbots can access company knowledge bases to provide precise answers. RAG addresses the limitations of traditional language models by augmenting AI-generated content with factual, retrieved information, thereby enhancing accuracy and reliability.
This blog series is composed of three parts:
Part 1: Observing the RAG Process: Step by Step
Part 2: Advanced RAG techniques
Part 3: Evaluating RAG applications
Part 1 - Observing the RAG Process: Step by Step
In this first part, we go over the building blocks of RAG and reference specialized frameworks for each step.
RAG works in three phases: ingestion, update, and inference.
Ingestion: This phase involves collecting and integrating data from various sources into a knowledge base. The data is pre-processed, embedded, and stored for efficient retrieval. Two great open-source frameworks for data ingestion and RAG are unstructured.io and LlamaIndex (see the short ingestion sketch after these three phase descriptions).
Update: In this phase, the knowledge base is regularly updated with new information to ensure that the system remains current and accurate. This may involve re-embedding documents and refreshing the search index.
Inference: During this phase, the system processes user queries by retrieving relevant information from the knowledge base and generating responses using a large language model. This involves rephrasing the query, embedding it, retrieving relevant documents, reranking them, and finally synthesizing a coherent response.
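As mentioned above, LlamaIndex is a popular choice for the ingestion phase. Here is a minimal ingestion sketch, assuming a local data/ folder of documents and an embedding API key configured in the environment; exact import paths vary across LlamaIndex versions.

```python
# Minimal ingestion sketch with LlamaIndex (import paths differ across versions;
# assumes a local "data/" folder and an embedding API key in the environment).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load and pre-process raw files (PDF, HTML, text, ...) into Document objects.
documents = SimpleDirectoryReader("data").load_data()

# Chunk, embed, and store the documents in an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Persist the index so the inference phase can retrieve from it later.
index.storage_context.persist(persist_dir="./storage")
```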
Let's dive deeper into the inference phase and each of its steps.
1. Rephrasing (Query Expansion)
Rephrasing, or query expansion, is the initial step in the RAG inference process. It involves reformulating the user's query to improve retrieval relevance. This can range from simple synonym addition to complex semantic analysis. This step is usually done by an LLM.
For example, the query "What are the effects of climate change on polar bears?" might be expanded to include "global warming impact on Arctic wildlife" or "sea ice loss consequences for Ursus maritimus".
Query expansion offers two key benefits:
It broadens the search scope, increasing the chances of finding relevant documents that don't exactly match the original query.
It addresses the vocabulary mismatch problem by incorporating synonyms, related terms, and domain-specific jargon.
These improvements are particularly valuable for handling ambiguous natural language queries or those lacking specific technical terms.
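A minimal sketch of LLM-based query expansion, assuming the OpenAI Python SDK; the model name and prompt wording are illustrative, not prescriptive:

```python
# Query expansion via an LLM (model name and system prompt are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_query(query: str) -> list[str]:
    """Ask the LLM for alternative formulations of the user query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's question as 3 alternative search queries, "
                           "one per line, using synonyms and domain-specific terms.",
            },
            {"role": "user", "content": query},
        ],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

expanded = expand_query("What are the effects of climate change on polar bears?")
# e.g. ["global warming impact on Arctic wildlife",
#       "sea ice loss consequences for Ursus maritimus", ...]
```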
2. Embedding
Once we have our expanded query, the next step is creating embeddings. Embedding is the process of mapping textual data to dense vector representations in a high-dimensional space. These vectors capture semantic and syntactic properties of the text, enabling efficient computation and comparison.
Key aspects:
Dimension: Typically ranges from 300 to 1024, balancing expressiveness and computational efficiency.
Models: Utilize transformer-based architectures like BERT, RoBERTa, or MPNet.
Training: Often employ self-supervised techniques like masked language modeling or contrastive learning.
The power of embeddings lies in their ability to capture semantic similarity. For example, the embeddings for "automobile" and "car" would be very close in the vector space, even though they are different words. These embedding models are provided by AI model platforms such as OpenAI, Mistral AI, or HuggingFace. This property is crucial for the next step in the RAG process: search.
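As a sketch, here is how embeddings can be computed locally with the sentence-transformers library; the model name is just one common choice, and hosted platforms like OpenAI or Mistral AI expose equivalent embedding endpoints:

```python
# Embedding texts with sentence-transformers (model choice is illustrative).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

texts = ["automobile", "car", "banana"]
embeddings = model.encode(texts)

print(embeddings.shape)  # (3, 384): one dense vector per text
# In this vector space, "automobile" and "car" end up much closer to each other
# than either is to "banana".
```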
3. Search (Retrieve)
With our query represented as an embedding, we can now perform a search to find the most relevant information. RAG's search process differs from traditional keyword-based searches, utilizing vector or hybrid searches to find documents with embeddings similar to our query embedding.
Common similarity metrics for semantic search include cosine similarity, dot product, and Euclidean distance. Cosine similarity calculates the cosine of the angle between two vectors in the embedding space: the closer this value is to 1, the more similar the vectors (and thus, the texts they represent) are considered to be.
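The cosine similarity computation itself is straightforward; a small sketch with toy vectors:

```python
# Cosine similarity computed from its definition (toy vectors for illustration).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||); values closer to 1 mean more similar texts."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.3, 0.5])
doc_vec = np.array([0.2, 0.25, 0.55])
print(cosine_similarity(query_vec, doc_vec))  # close to 1: very similar directions
```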
However, as knowledge bases can be massive, containing millions or even billions of documents, we need efficient algorithms to perform this search quickly. This is where Approximate Nearest Neighbour (ANN) search algorithms come into play.
These algorithms trade a small amount of accuracy for significant speed improvements, making it possible to search through vast document collections in milliseconds.
The output of this search step is typically a list of the top-k most similar documents to our query, along with their similarity scores. These documents will then move on to the next stage of the RAG process: reranking.
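A sketch of top-k retrieval with FAISS, one widely used ANN library; the index type, parameters, and random vectors below are purely illustrative:

```python
# Top-k approximate nearest neighbour search with FAISS (parameters are illustrative).
import faiss
import numpy as np

d = 384  # embedding dimension
doc_embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for real document vectors
faiss.normalize_L2(doc_embeddings)  # on unit vectors, L2 ranking matches cosine similarity

index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph index: approximate but very fast
index.add(doc_embeddings)

query_embedding = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query_embedding)

k = 5
distances, doc_ids = index.search(query_embedding, k)  # top-k most similar documents
print(doc_ids[0], distances[0])
```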
4. Reranking
While the initial search based on embedding similarity is powerful, it may not always surface the most relevant documents for our specific task. This is where reranking comes in. Reranking is a crucial step that fine-tunes our search results, ensuring that the most pertinent information is prioritized.
Reranking techniques often employ more computationally intensive methods than the initial search, as they only need to process a small subset of documents (the top-k results from the previous step).
The reranking step significantly improves the quality of the retrieved information. It helps to filter out false positives from the initial search and ensures that the documents passed to the final generation step are truly relevant and informative.
The Cohere reranker is an excellent off-the-shelf tool for this step.
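A sketch of how it can be called through the Cohere Python SDK; the model identifier and response fields reflect one version of the SDK and may differ, so treat this as an outline rather than a definitive integration:

```python
# Reranking retrieved candidates with the Cohere SDK (model name is illustrative;
# assumes the Cohere API key is configured, e.g. via the CO_API_KEY environment variable).
import cohere

co = cohere.Client()

query = "What are the effects of climate change on polar bears?"
candidates = [
    "Sea ice decline reduces hunting grounds for polar bears.",
    "Polar bears are the largest land carnivores.",
    "Arctic shipping routes are opening earlier each year.",
]

results = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidates,
    top_n=2,
)
for r in results.results:
    print(r.index, r.relevance_score)  # indices into `candidates`, most relevant first
```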
5. LLM Generation
The final RAG step uses a Large Language Model (LLM), such as GPT-4 or models from Anthropic or Mistral AI, to generate a response based on:
The original user query
The relevant snippets from top-ranked retrieved documents
A prompt template binding these variables, for example: "Based on {retrieved_context}, answer {user_question}".
Key aspects of LLM generation in the context of RAG:
Information Integration: Combining retrieved facts with model knowledge
Contextual Understanding: Interpreting the query with the retrieved context
Coherent Synthesis: Smoothly integrating information without verbatim copying
Handling Contradictions: Reconciling or acknowledging discrepancies in retrieved data
One of the key challenges in this step is striking the right balance between utilizing the retrieved information and leveraging the model's own knowledge. Too much reliance on retrieval can lead to rigid, fact-recitation responses, while too little can result in the model "hallucinating" or generating plausible-sounding but potentially incorrect information.
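Putting it together, a minimal generation sketch that binds the retrieved snippets and the user question into a prompt and calls an LLM; the model name, template wording, and temperature are illustrative choices:

```python
# Final generation step: ground the LLM's answer in the retrieved context
# (model name, prompt template, and temperature are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(user_question: str, retrieved_snippets: list[str]) -> str:
    retrieved_context = "\n\n".join(retrieved_snippets)
    prompt = (
        f"Based on the following context:\n{retrieved_context}\n\n"
        f"Answer the question: {user_question}\n"
        "If the context does not contain the answer, say so instead of guessing."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature keeps the answer close to the retrieved facts
    )
    return response.choices[0].message.content
```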
Literal AI enables developers and data scientists to log and visualize intermediate steps of RAG applications, and to evaluate them individually (embed, search, rerank, generate) or the retrieval task as a whole.