Introduction
Retrieval-Augmented Generation (RAG) applications are becoming increasingly important for leveraging the power of large language models (LLMs) with your own data. This blog post guides you through building and incrementally improving a RAG application using Langchain, a powerful framework for developing LLM-powered applications, and FutureAGI SDK for robust evaluation and observability. We'll start with a basic RAG setup and progressively enhance it, analyzing the performance at each stage to understand the impact of our improvements.
Follow along with our comprehensive cookbook for a hands-on experience: https://docs.futureagi.com/cookbook/cookbook5/How-to-build-and-incrementally-improve-RAG-applications-in-Langchain
Tools for Building RAG
For this tutorial, we'll be using these key packages:
Langchain and its extensions (core, community, experimental)
OpenAI's GPT-4o-mini for LLM responses
OpenAI's text-embedding-3-large for vector embeddings
Supporting libraries: beautifulsoup4, chromadb, and FutureAGI SDK
Configuring Future AGI SDK for Evaluation and Observability
Evaluation and observability are crucial for building robust and reliable RAG applications. We'll use the FutureAGI SDK to automatically track our experiments, evaluate performance, and gain insights into our RAG pipeline.
First, configure the FutureAGI SDK with your API and Secret keys:
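A minimal configuration sketch is shown below. It assumes the SDK exposes EvalClient, register, and LangChainInstrumentor as described in this section; the exact import paths, credential names, and register arguments may differ by SDK version, so treat this as illustrative rather than the definitive setup.

```python
import os

# Hedged sketch: import paths may vary across FutureAGI SDK versions.
from fi.evals import EvalClient                      # evaluation client
from fi_instrumentation import register              # trace provider registration
from traceai_langchain import LangChainInstrumentor  # Langchain auto-instrumentation

# API credentials for the FutureAGI platform (set these in your environment).
os.environ["FI_API_KEY"] = "your-api-key"
os.environ["FI_SECRET_KEY"] = "your-secret-key"

# Initialize the evaluation client used later for the RAG evaluation metrics.
evaluator = EvalClient(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

# Register a trace provider so experiment runs are captured on the platform.
# The project name here is a placeholder of our own choosing.
trace_provider = register(project_name="rag-improvement-experiments")

# Automatically instrument Langchain components to emit traces to FutureAGI.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```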
This code initializes the EvalClient to communicate with the FutureAGI evaluation platform. It also registers a trace provider using register to capture experiment data. The LangChainInstrumentor().instrument(tracer_provider=trace_provider) line is key: it automatically instruments Langchain components so they send tracing data to FutureAGI.
Why is Observability Critical for RAG Applications?
Observability provides crucial insights into how your RAG pipeline is performing at each step. Without it, debugging becomes nearly impossible, as you can't identify why retrieval or generation is failing. Good observability pinpoints exactly where improvements are needed, transforming RAG development from trial-and-error into a data-driven process that ensures reliability in production.
For more information, see our documentation.
Viewing Experiment Results in FutureAGI
Once you run your RAG application with the instrumented components, you can access the FutureAGI platform to visualize and analyze your experiment data. The platform provides a dashboard with various metrics and insights into your RAG pipeline's performance.

A sample dashboard view in FutureAGI showing experiment results and key metrics for your RAG application.
The dashboard allows you to monitor performance, identify bottlenecks, and track the impact of improvements you make to your RAG pipeline.
Sample Questionnaire Dataset
To evaluate our RAG application, we'll use a sample questionnaire dataset containing queries and their corresponding target contexts. This dataset will help us assess the quality of our RAG system.
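Loading the dataset is a one-liner with pandas; the DataFrame name rag_data below is our own choice, while the file name and column names come from the dataset described next.

```python
import pandas as pd

# Load the questionnaire dataset used to evaluate the RAG pipelines.
rag_data = pd.read_csv("Ragdata.csv")

# Quick sanity check on the expected columns.
print(rag_data[["Query_ID", "Query_Text", "Target_Context", "Category"]].head())
```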
This loads a CSV file named Ragdata.csv into a Pandas DataFrame. The dataset should have columns like Query_ID, Query_Text, Target_Context, and Category. Here's a preview of the dataset structure:
| Query_ID | Query_Text | Target_Context | Category |
|---|---|---|---|
| 1 | What are the key differences between the transformer architecture in 'Attention is All You Need' and the bidirectional approach used in BERT? | Attention is All You Need; BERT | Technical Comparison |
| 2 | Explain the positional encoding mechanism in the original transformer paper and why it was necessary. | Attention is All You Need | Technical Understanding |
This dataset will be used to test our RAG applications and evaluate their performance using FutureAGI's evaluation metrics.
Recursive Splitter and Basic Retrieval
Let's start by building a basic RAG application using Langchain's RecursiveCharacterTextSplitter for chunking and ChromaDB as our vector store. We will retrieve information from Wikipedia pages about Transformer models, BERT, and GPT.
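A representative sketch might look like the following. The chunk size, overlap, retrieval k, prompt wording, and specific Wikipedia URLs are illustrative assumptions on our part, not values prescribed by the cookbook.

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Illustrative source pages; swap in whichever articles you want to index.
urls = [
    "https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)",
    "https://en.wikipedia.org/wiki/BERT_(language_model)",
    "https://en.wikipedia.org/wiki/Generative_pre-trained_transformer",
]
docs = WebBaseLoader(urls).load()

# Split the pages into overlapping text chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)

# Embed the chunks and store them in ChromaDB.
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Basic RAG chain: retrieve context, then answer with GPT-4o-mini.
openai_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | openai_llm
    | StrOutputParser()
)
```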
In this code:
We load documents from Wikipedia URLs using WebBaseLoader.
RecursiveCharacterTextSplitter is used to split the documents into smaller chunks of text.
ChromaDB is initialized to store the vector embeddings of these chunks.
A basic RAG chain is defined: rag_chain retrieves relevant documents using the retriever and then uses openai_llm to generate an answer based on the retrieved context.
Evaluating the Basic RAG Application
Now, let's use our questionnaire dataset to test this basic RAG application and evaluate its performance.
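One way to run the test loop is sketched below. The result column names (question, context, response) and the DataFrame name recursive_df are our own conventions, reused in the evaluation snippets later.

```python
results = []
for _, row in rag_data.iterrows():
    # Keep the retrieved documents so we can score retrieval quality later.
    retrieved_docs = retriever.invoke(row["Query_Text"])
    answer = rag_chain.invoke(row["Query_Text"])
    results.append(
        {
            "Query_ID": row["Query_ID"],
            "question": row["Query_Text"],
            "context": "\n\n".join(d.page_content for d in retrieved_docs),
            "response": answer,
        }
    )

# Persist the run so it can be evaluated and compared against later pipelines.
recursive_df = pd.DataFrame(results)
recursive_df.to_csv("rag_evaluation_results.csv", index=False)
```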
This code iterates through each question in our dataset, retrieves relevant documents using our RAG chain, generates a response, and stores the results in a DataFrame. The DataFrame is then saved to a CSV file named rag_evaluation_results.csv.
Evaluating RAG Performance with Future AGI SDK
To quantitatively evaluate our RAG application, we'll use Future AGI SDK's evaluation metrics. We'll focus on:
ContextRelevance: How relevant is the retrieved context to the query?
ContextRetrieval: How well does the RAG system retrieve the necessary context?
Groundedness: Is the generated answer grounded in the retrieved context?
Here are the functions to perform these evaluations using FutureAGI SDK:
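The helpers might be structured roughly as follows. This is a hedged sketch: the evaluator names ContextRelevance, ContextRetrieval, and Groundedness come from the SDK, but the exact TestCase fields, evaluate() signature, and result parsing shown here are assumptions to verify against the FutureAGI SDK documentation.

```python
# Assumed import paths; check the FutureAGI SDK docs for your version.
from fi.testcases import TestCase
from fi.evals.templates import ContextRelevance, ContextRetrieval, Groundedness

def evaluate_metric(df, template, question_col, context_col, response_col, out_col):
    """Run one FutureAGI evaluator over every row and append a score column."""
    scores = []
    for _, row in df.iterrows():
        test_case = TestCase(
            input=row[question_col],
            context=row[context_col],
            output=row[response_col],
        )
        # 'evaluator' is the EvalClient created during SDK configuration.
        result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
        scores.append(result.eval_results[0].metrics[0].value)
    df[out_col] = scores
    return df

def evaluate_context_relevance(df, question_col, context_col, response_col):
    return evaluate_metric(df, ContextRelevance(), question_col, context_col, response_col, "context_relevance")

def evaluate_context_retrieval(df, question_col, context_col, response_col):
    return evaluate_metric(df, ContextRetrieval(), question_col, context_col, response_col, "context_retrieval")

def evaluate_groundedness(df, question_col, context_col, response_col):
    return evaluate_metric(df, Groundedness(), question_col, context_col, response_col, "Groundedness")
```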
These functions utilize the ContextRelevance, ContextRetrieval, and Groundedness evaluators from the FutureAGI SDK. They take a DataFrame, a question column, a context column, and a response column as input and return DataFrames with evaluation metrics.
Let's run these evaluations on the results from our basic RAG application:
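Applying the helpers to the basic pipeline's results could look like this, with the column names following the sketches above:

```python
# Score the recursive-splitting run on all three metrics.
recursive_df = evaluate_context_relevance(recursive_df, "question", "context", "response")
recursive_df = evaluate_context_retrieval(recursive_df, "question", "context", "response")
recursive_df = evaluate_groundedness(recursive_df, "question", "context", "response")
```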
This will add context_relevance, context_retrieval, and Groundedness columns to our recursive_df DataFrame, containing the evaluation scores for each query.
Semantic Chunker and Basic Embedding Retrieval
Observing the evaluation results, we might notice areas for improvement, particularly in context retrieval. Let's try using SemanticChunker from Langchain's experimental text splitters to improve our chunking strategy. SemanticChunker aims to create chunks that are semantically coherent, potentially leading to better retrieval.
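One way to swap in the semantic splitter, reusing the docs, embeddings, prompt, format_docs, and LLM from the earlier sketch, might look like this (the collection name and retrieval k are our own illustrative choices):

```python
from langchain_experimental.text_splitter import SemanticChunker

# Semantic chunking: split where embedding similarity between sentences drops.
semantic_splitter = SemanticChunker(embeddings)
semantic_splits = semantic_splitter.split_documents(docs)

# Rebuild the vector store and retriever on the semantically coherent chunks.
semantic_vectorstore = Chroma.from_documents(
    semantic_splits, embedding=embeddings, collection_name="semantic_chunks"
)
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 4})

# Same RAG chain as before, just with the new retriever.
semantic_rag_chain = (
    {"context": semantic_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | openai_llm
    | StrOutputParser()
)
```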
Here, we replace RecursiveCharacterTextSplitter with SemanticChunker. We initialize SemanticChunker with our embeddings model and use it to chunk our documents. The rest of the RAG pipeline remains similar to the basic setup.
Let's evaluate the performance of the Semantic Chunking approach using the same evaluation functions:
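A sketch of that evaluation run, mirroring the earlier loop (the DataFrame name semantic_df is our own convention):

```python
# Build a results DataFrame for the semantic-chunking pipeline, then score it.
semantic_results = []
for _, row in rag_data.iterrows():
    docs_for_q = semantic_retriever.invoke(row["Query_Text"])
    semantic_results.append(
        {
            "question": row["Query_Text"],
            "context": "\n\n".join(d.page_content for d in docs_for_q),
            "response": semantic_rag_chain.invoke(row["Query_Text"]),
        }
    )

semantic_df = pd.DataFrame(semantic_results)
semantic_df = evaluate_context_relevance(semantic_df, "question", "context", "response")
semantic_df = evaluate_context_retrieval(semantic_df, "question", "context", "response")
semantic_df = evaluate_groundedness(semantic_df, "question", "context", "response")
```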
Chain of Thought Retrieval Logic for Enhanced Groundedness
To further improve our RAG application, especially in terms of groundedness, we can implement a Chain of Thought (CoT) retrieval logic. Instead of directly retrieving context based on the initial query, we'll first generate sub-questions related to the main query and then retrieve context for each sub-question. This can help retrieve more relevant and focused context, potentially leading to better grounded answers.
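A simplified sketch of this logic is shown below. The prompt wording, the choice to reuse the semantic retriever, and writing full_chain as a plain function rather than a pure LCEL chain are our own assumptions for illustration.

```python
# Prompt the LLM to decompose the user question into focused sub-questions.
subq_prompt = ChatPromptTemplate.from_template(
    "Break the following question into 2-4 short sub-questions, one per line:\n\n{question}"
)

def parse_subqs(text: str) -> list[str]:
    """Turn the LLM's line-separated output into a clean list of sub-questions."""
    return [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

# Chain: prompt -> LLM -> plain text -> list of sub-questions.
subq_chain = subq_prompt | openai_llm | StrOutputParser() | parse_subqs

# Answer prompt that accepts the combined context from all sub-question retrievals.
qa_system_prompt = ChatPromptTemplate.from_template(
    "Use the contexts gathered for each sub-question to answer the original question.\n\n"
    "Contexts:\n{context}\n\nQuestion: {question}"
)

def retrieve_for_subqs(question: str) -> str:
    """Generate sub-questions, retrieve context for each, and merge the results."""
    sub_questions = subq_chain.invoke({"question": question})
    contexts = []
    for sq in sub_questions:
        docs_for_sq = semantic_retriever.invoke(sq)
        contexts.append(f"Sub-question: {sq}\n" + format_docs(docs_for_sq))
    return "\n\n".join(contexts)

def full_chain(question: str) -> dict:
    """Full pipeline: sub-question generation -> per-sub-question retrieval -> final answer."""
    context = retrieve_for_subqs(question)
    answer = (qa_system_prompt | openai_llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    return {"context": context, "response": answer}
```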
In this code:
We define a subq_prompt to instruct the LLM to generate sub-questions.
The parse_subqs function parses the LLM output to extract a list of sub-questions.
subq_chain combines the prompt, LLM, and parser into a chain for sub-question generation.
qa_system_prompt is modified to handle the multiple contexts retrieved for the sub-questions.
full_chain orchestrates the entire process: generate sub-questions, retrieve context for each sub-question, and then answer the original question using all retrieved contexts.
Finally, let's evaluate the Chain of Thought RAG application using our evaluation functions:
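The run could be scored the same way as the other two approaches; the DataFrame name analysis_df matches the one referenced below.

```python
# Run the Chain of Thought pipeline over the questionnaire and score it.
cot_results = []
for _, row in rag_data.iterrows():
    out = full_chain(row["Query_Text"])
    cot_results.append(
        {"question": row["Query_Text"], "context": out["context"], "response": out["response"]}
    )

analysis_df = pd.DataFrame(cot_results)
analysis_df = evaluate_context_relevance(analysis_df, "question", "context", "response")
analysis_df = evaluate_context_retrieval(analysis_df, "question", "context", "response")
analysis_df = evaluate_groundedness(analysis_df, "question", "context", "response")
```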
This will evaluate the Chain of Thought RAG application and add the evaluation metrics to the analysis_df DataFrame.
Results Analysis
Now, let's analyze the evaluation results by plotting the average scores for each metric across the three RAG approaches.
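A simple way to produce such a comparison plot, assuming the three scored DataFrames from the sketches above and using matplotlib:

```python
import matplotlib.pyplot as plt

metrics = ["context_relevance", "context_retrieval", "Groundedness"]
approaches = {
    "Recursive": recursive_df,
    "Semantic": semantic_df,
    "SubQ (CoT)": analysis_df,
}

# Average each metric per approach.
averages = {name: [df[m].mean() for m in metrics] for name, df in approaches.items()}

# Grouped bar chart: one group per metric, one bar per approach.
x = range(len(metrics))
width = 0.25
fig, ax = plt.subplots(figsize=(8, 4))
for i, (name, vals) in enumerate(averages.items()):
    ax.bar([xi + i * width for xi in x], vals, width, label=name)
ax.set_xticks([xi + width for xi in x])
ax.set_xticklabels(metrics)
ax.set_ylabel("Average score")
ax.legend()
plt.tight_layout()
plt.show()
```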

The comparison of the three different RAG approaches reveals the following:
Context Relevance:
All three approaches show similar context relevance scores, ranging from 0.44 to 0.48.
Semantic chunking slightly outperforms the other two, with an average score of 0.48.
Context Retrieval:
The Chain of Thought (SubQ) approach significantly outperforms the other two in context retrieval, achieving an average score of 0.92.
Semantic chunking comes in second with a score of 0.86, while recursive splitting scores the lowest at 0.80.
Groundedness:
The Chain of Thought approach also shows the highest groundedness score at 0.31.
Semantic chunking is second with a score of 0.28, and recursive splitting performs the poorest with a score of 0.15.
Observability Benefits: Throughout our experiments, we used FutureAGI's observability tools to track, measure, and compare performance metrics. This observability was crucial in identifying which approach performed best for different aspects of RAG. It enabled us to make data-driven decisions when refining our implementation and helped pinpoint specific areas for improvement in each approach.

FutureAGI observability dashboard showing LLM tracing metrics with performance data over 30 days. The interface displays Primary Average and Traffic trends, with evaluation metrics for different RAG components and their context relevance scores.
Key Takeaway: The Chain of Thought (SubQ) approach demonstrates the best overall performance, particularly in context retrieval and groundedness. While it shows a slight trade-off in context relevance compared to semantic chunking, the improvements in retrieval and groundedness are significant.
Best Practices and Recommendations
Based on our experiments and analysis, here are some best practices and recommendations for building and improving RAG applications:
When to use each approach:
Chain of Thought (SubQ): Ideal for complex queries that require integrating information from multiple parts of the document or when groundedness is a top priority. It excels in retrieving highly relevant context and producing well-grounded answers.
Semantic chunking: A good balance for general-purpose RAG applications. It provides improved context retrieval over basic recursive splitting and can be faster and less resource-intensive than Chain of Thought. Use it when speed and a moderate level of groundedness are important.
Recursive splitting: Suitable as a baseline or for simple RAG applications where query complexity is low, and computational efficiency is paramount. However, it may not be optimal for production systems requiring high accuracy and groundedness.
Performance considerations:
SubQ approach: Involves more API calls due to sub-question generation and retrieval, potentially increasing latency.
Semantic chunking: Introduces a moderate computational overhead for semantic analysis during chunking.
Recursive splitting: Is the most computationally efficient in terms of chunking and retrieval.
Cost considerations:
SubQ approach: May incur higher API costs due to multiple LLM calls for sub-question generation and potentially more document retrievals.
Consider implementing caching mechanisms for frequently asked questions and sub-questions to reduce API calls and costs for all approaches; a minimal sketch follows below.
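A lightweight way to prototype such caching is an in-memory memo keyed on the query text, shown here with functools.lru_cache purely as an illustration; a production system would likely want a persistent or semantic cache instead.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_answer(question: str) -> str:
    """Memoize full RAG answers so repeated questions skip retrieval and LLM calls."""
    return rag_chain.invoke(question)
```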
Future Improvements
There are several avenues for further improvement of our RAG application:
Hybrid Approach:
Combine semantic chunking with Chain of Thought for complex queries. Use semantic chunking for initial document segmentation and then apply Chain of Thought retrieval on semantically chunked data.
Implement adaptive approach selection based on query complexity. For simple queries, use semantic chunking or recursive splitting; for complex queries, switch to Chain of Thought.
Optimization Opportunities:
Implement caching for sub-questions and their retrieved contexts in the Chain of Thought approach to reduce redundant computations and API calls.
Fine-tune chunk sizes and overlap parameters for both RecursiveCharacterTextSplitter and SemanticChunker to optimize retrieval performance.
Experiment with different embedding models, potentially using task-specific embedding models for better semantic representation and retrieval.
Additional Evaluations:
Incorporate response time measurements to evaluate the latency of each approach.
Include cost per query metrics to assess the economic efficiency of different methods.
Measure memory usage for each approach to understand resource requirements, especially for large-scale RAG applications.
Conclusion
This blog post demonstrated a step-by-step process for building and incrementally improving a RAG application using Langchain and FutureAGI SDK. We started with a basic RAG setup using recursive splitting, enhanced it with semantic chunking, and further improved it with a Chain of Thought retrieval logic. Through quantitative evaluations using FutureAGI SDK, we analyzed the performance of each approach and identified the Chain of Thought method as the most effective in terms of context retrieval and groundedness.
By understanding the trade-offs and benefits of different RAG techniques, and by leveraging evaluation and observability tools like FutureAGI, you can build robust and high-performing RAG applications tailored to your specific needs. We encourage you to experiment with these approaches and continue to iterate and improve your RAG pipelines for optimal results.