Build and Improve Your RAG Application with Langchain and Observability


Introduction

Retrieval-Augmented Generation (RAG) applications are becoming increasingly important for leveraging the power of large language models (LLMs) with your own data. This blog post guides you through building and incrementally improving a RAG application using Langchain, a powerful framework for developing LLM-powered applications, and FutureAGI SDK for robust evaluation and observability. We'll start with a basic RAG setup and progressively enhance it, analyzing the performance at each stage to understand the impact of our improvements.

Follow along with our comprehensive cookbook for a hands-on experience: https://docs.futureagi.com/cookbook/cookbook5/How-to-build-and-incrementally-improve-RAG-applications-in-Langchain

Tools for Building RAG

For this tutorial, we'll be using these key packages:

  • Langchain and its extensions (core, community, experimental)

  • OpenAI's GPT-4o-mini for LLM responses

  • OpenAI's text-embedding-3-large for vector embeddings

  • Supporting libraries: beautifulsoup4, chromadb, and FutureAGI SDK
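If you're starting from a clean environment, the Langchain stack and supporting libraries can be installed roughly as follows (the package names below are the commonly published ones; check the FutureAGI documentation for the exact SDK install command and any pinned versions):

pip install langchain langchain-core langchain-community langchain-experimental \
    langchain-openai beautifulsoup4 chromadb pandas
# plus the FutureAGI SDK -- see the FutureAGI docs for its install command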

Configuring Future AGI SDK for Evaluation and Observability

Evaluation and observability are crucial for building robust and reliable RAG applications. We'll use the FutureAGI SDK to automatically track our experiments, evaluate performance, and gain insights into our RAG pipeline.

First, configure the FutureAGI SDK with your API and Secret keys:

from getpass import getpass
from fi.evals import EvalClient
import os
from fi.integrations.otel import register, LangChainInstrumentor
from fi.integrations.otel.types import (
    ProjectTypes,
    EvalConfig,
    EvalName,
    EvalSpanKind,
    EvalTag,
    EvalTagType,
)
from itertools import product
from fi.evals.config import get_default_config

os.environ["FI_API_KEY"] = getpass("Enter your FI API key: ")
os.environ["FI_SECRET_KEY"] = getpass("Enter your FI API secret: ")

evaluator = EvalClient(
    fi_base_url="https://api.futureagi.com",
)
eval_tags = [
    EvalTag(
        type=tag_type,
        value=span_kind,
        eval_name=eval_name,
        config=get_default_config(eval_name),
    )
    for tag_type, span_kind, eval_name in product(
        EvalTagType, EvalSpanKind, [EvalName.CONTEXT_ADHERENCE, EvalName.PROMPT_PERPLEXITY]
    )
]
trace_provider = register(
  project_type=ProjectTypes.EXPERIMENT,
  project_name="RAG-Cookbook",
  project_version_name="v1",
  eval_tags=eval_tags
)

LangChainInstrumentor().instrument(tracer_provider=trace_provider)

This code initializes the EvalClient to communicate with the FutureAGI evaluation platform. It also registers a trace provider using register to capture experiment data. The LangChainInstrumentor().instrument(tracer_provider=trace_provider) line is key – it automatically instruments Langchain components to send tracing data to FutureAGI.

Why is Observability Critical for RAG Applications?

Observability provides crucial insights into how your RAG pipeline is performing at each step. Without it, debugging becomes nearly impossible, as you can't identify why retrieval or generation is failing. Good observability pinpoints exactly where improvements are needed, transforming RAG development from trial-and-error into a data-driven process that ensures reliability in production.

For more information, see the FutureAGI documentation.

Viewing Experiment Results in FutureAGI

Once you run your RAG application with the instrumented components, you can access the FutureAGI platform to visualize and analyze your experiment data. The platform provides a dashboard with various metrics and insights into your RAG pipeline's performance.

A sample dashboard view in FutureAGI showing experiment results and key metrics for your RAG application.

The dashboard allows you to monitor performance, identify bottlenecks, and track the impact of improvements you make to your RAG pipeline.

Sample Questionnaire Dataset

To evaluate our RAG application, we'll use a sample questionnaire dataset containing queries and their corresponding target contexts. This dataset will help us assess the quality of our RAG system.

import pandas as pd

dataset = pd.read_csv("Ragdata.csv")
pd.set_option('display.max_colwidth', None)
dataset.head(2)

This loads a CSV file named Ragdata.csv into a Pandas DataFrame. The dataset should have columns like Query_ID, Query_Text, Target_Context, and Category. Here's a preview of the dataset structure:

| Query_ID | Query_Text | Target_Context | Category |
| --- | --- | --- | --- |
| 1 | What are the key differences between the transformer architecture in 'Attention is All You Need' and the bidirectional approach used in BERT? | Attention is All You Need; BERT | Technical Comparison |
| 2 | Explain the positional encoding mechanism in the original transformer paper and why it was necessary. | Attention is All You Need | Technical Understanding |

This dataset will be used to test our RAG applications and evaluate their performance using FutureAGI's evaluation metrics.
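If you don't have Ragdata.csv on hand, a minimal stand-in with the same schema can be constructed directly; the two rows below simply mirror the preview above:

import pandas as pd

dataset = pd.DataFrame([
    {
        "Query_ID": 1,
        "Query_Text": "What are the key differences between the transformer architecture "
                      "in 'Attention is All You Need' and the bidirectional approach used in BERT?",
        "Target_Context": "Attention is All You Need; BERT",
        "Category": "Technical Comparison",
    },
    {
        "Query_ID": 2,
        "Query_Text": "Explain the positional encoding mechanism in the original transformer "
                      "paper and why it was necessary.",
        "Target_Context": "Attention is All You Need",
        "Category": "Technical Understanding",
    },
])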

Recursive Splitter and Basic Retrieval

Let's start by building a basic RAG application using Langchain's RecursiveCharacterTextSplitter for chunking and ChromaDB as our vector store. We will retrieve information from Wikipedia pages about Transformer models, BERT, and GPT.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize the LLM and embedding model
llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Load the data from the web URLs
docs = []
urls = ['https://en.wikipedia.org/wiki/Attention_Is_All_You_Need',
        'https://en.wikipedia.org/wiki/BERT_(language_model)',
        'https://en.wikipedia.org/wiki/Generative_pre-trained_transformer']
for url in urls:
    loader = WebBaseLoader(url)
    docs.extend(loader.load())

def openai_llm(question, context):
    formatted_prompt = f"Question: {question}\n\nContext: {context}"
    messages = [{'role': 'user', 'content': formatted_prompt}]
    response = llm.invoke(messages)
    return response.content

def rag_chain(question):
    retrieved_docs = retriever.invoke(question)
    formatted_context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    return openai_llm(question, formatted_context)

def get_important_facts(question):
    return rag_chain(question)

# Split the loaded documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create embeddings and vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings, persist_directory="chroma_db")

# Define the RAG setup
retriever = vectorstore.as_retriever()

In this code:

  • We load documents from Wikipedia URLs using WebBaseLoader.

  • RecursiveCharacterTextSplitter is used to split the documents into smaller chunks of text.

  • ChromaDB is initialized to store the vector embeddings of these chunks.

  • A basic RAG chain is defined: rag_chain retrieves relevant documents using the retriever and then uses openai_llm to generate an answer based on the retrieved context.
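Before running the full questionnaire, a quick smoke test of the chain can be run with a single question (the question below is illustrative):

# Quick sanity check of the basic RAG chain
answer = get_important_facts("What problem does self-attention solve in the transformer architecture?")
print(answer)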

Evaluating the Basic RAG Application

Now, let's use our questionnaire dataset to test this basic RAG application and evaluate its performance.

import pandas as pd
import time

# Create a list to store results
results = []

# Loop through each query in the dataset
for idx, question in enumerate(dataset['Query_Text']):
    try:
        # Retrieve relevant documents
        retrieved_docs = retriever.invoke(question)

        # Format context
        formatted_context = "\n\n".join([doc.page_content for doc in retrieved_docs])

        # Get LLM response
        response = openai_llm(question, formatted_context)

        # Store results
        results.append({
            "query_id": idx + 1,
            "question": question,
            "context": formatted_context,
            "chunks_list": [doc.page_content for doc in retrieved_docs],  # List storage
            "response": response
        })

        # Optional: Add delay to avoid rate limits
        time.sleep(1)

        print(f"Processed query {idx+1}/{len(dataset)}")

    except Exception as e:
        print(f"Error processing query {idx+1}: {str(e)}")
        results.append({
            "query_id": idx + 1,
            "question": question,
            "context": "Error",
            "response": f"Error: {str(e)}"
        })

# Create DataFrame from results
recursive_df = pd.DataFrame(results)

# Add additional metadata columns if needed
recursive_df['context_length'] = recursive_df['context'].apply(lambda x: len(x.split()))
recursive_df['response_length'] = recursive_df['response'].apply(lambda x: len(x.split()))

# Save to CSV
recursive_df.to_csv('rag_evaluation_results.csv', index=False)

This code iterates through each question in our dataset, retrieves relevant documents using our RAG chain, generates a response, and stores the results in a DataFrame. The DataFrame is then saved to a CSV file named rag_evaluation_results.csv.

Evaluating RAG Performance with Future AGI SDK

To quantitatively evaluate our RAG application, we'll use Future AGI SDK's evaluation metrics. We'll focus on:

  • ContextRelevance: How relevant is the retrieved context to the query?

  • ContextRetrieval: How well does the RAG system retrieve the necessary context?

  • Groundedness: Is the generated answer grounded in the retrieved context?

Here are the functions to perform these evaluations using FutureAGI SDK:

from fi.evals import ContextRelevance, ContextRetrieval, Groundedness
from fi.testcases import TestCase
import pandas as pd
import time

def evaluate_context_relevance(df, question_col, context_col, model="gpt-4o-mini"):
    """
    Evaluate context relevance for each row in the dataframe
    """
    agentic_context_eval = ContextRelevance(config={"model": model, "check_internet": True})
    results = []

    for _, row in df.iterrows():
        try:
            test_case = TestCase(
                input=row[question_col],
                context=row[context_col]
            )
            result = evaluator.evaluate(eval_templates=[agentic_context_eval], inputs=[test_case])
            time.sleep(2)  # Rate limiting
            results.append({'context_relevance': result.eval_results[0].metrics[0].value})
        except Exception as e:
            print(f"Error in context relevance evaluation: {e}")
            results.append({'context_relevance': 'Error'})

    return pd.DataFrame(results)

def evaluate_context_retrieval(df, question_col, context_col, response_col, model="gpt-4o-mini"):
    """
    Evaluate context retrieval for each row in the dataframe
    """
    agentic_retrieval_eval = ContextRetrieval(config={
        "model": model,
        "check_internet": True,
        "criteria": "Check if the Context retrieved is relevant and accurate to the query and the response generated isn't incorrect"
    })
    results = []

    for _, row in df.iterrows():
        try:
            test_case = TestCase(
                input=row[question_col],
                context=row[context_col],
                output=row[response_col]
            )
            result = evaluator.evaluate(eval_templates=[agentic_retrieval_eval], inputs=[test_case])
            time.sleep(2)  # Rate limiting
            results.append({'context_retrieval': result.eval_results[0].metrics[0].value})
        except Exception as e:
            print(f"Error in context retrieval evaluation: {e}")
            results.append({'context_retrieval': 'Error'})

    return pd.DataFrame(results)

def evaluate_groundedness(df, question_col, context_col, response_col, model="gpt-4o-mini"):
    """
    Evaluate groundedness for each row in the dataframe
    """
    agentic_groundedness_eval = Groundedness(config={"model": model, "check_internet": True})
    results = []

    for _, row in df.iterrows():
        try:
            test_case = TestCase(
                input=row[question_col],
                context=row[context_col],
                response=row[response_col]
            )
            result = evaluator.evaluate(eval_templates=[agentic_groundedness_eval], inputs=[test_case])
            time.sleep(2)  # Rate limiting
            results.append({'Groundedness': result.eval_results[0].metrics[0].value})
        except Exception as e:
            print(f"Error in groundedness evaluation: {e}")
            results.append({'Groundedness': 'Error'})

    return pd.DataFrame(results)

def run_all_evaluations(df, question_col, context_col, response_col, model="gpt-4o-mini"):
    """
    Run all three evaluations and combine results
    """
    relevance_results = evaluate_context_relevance(df, question_col, context_col, model)
    retrieval_results = evaluate_context_retrieval(df, question_col, context_col, response_col, model)
    groundedness_results = evaluate_groundedness(df, question_col, context_col, response_col, model)

    # Combine all results with original dataframe
    return pd.concat([df, relevance_results, retrieval_results, groundedness_results], axis=1)

These functions utilize ContextRelevance, ContextRetrieval, and Groundedness evaluators from FutureAGI SDK. They take a DataFrame, question column, context column, and response column as input and return DataFrames with evaluation metrics.

Let's run these evaluations on the results from our basic RAG application:

recursive_df = run_all_evaluations(
    recursive_df,
    question_col='question',
    context_col='context',
    response_col='response'
)

This will add context_relevance, context_retrieval, and Groundedness columns to our recursive_df DataFrame, containing the evaluation scores for each query.
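To summarize the baseline with a single number per metric, you can average the scores, coercing any 'Error' rows to NaN first. A minimal sketch:

metric_cols = ['context_relevance', 'context_retrieval', 'Groundedness']

baseline_scores = (
    recursive_df[metric_cols]
    .apply(pd.to_numeric, errors='coerce')  # 'Error' strings become NaN
    .mean()
)
print(baseline_scores)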

Semantic Chunker and Basic Embedding Retrieval

Observing the evaluation results, we might notice areas for improvement, particularly in context retrieval. Let's try using SemanticChunker from Langchain's experimental text splitters to improve our chunking strategy. SemanticChunker aims to create chunks that are semantically coherent, potentially leading to better retrieval.

from langchain_experimental.text_splitter import SemanticChunker

# Split on embedding-similarity breakpoints instead of fixed character counts
semantic_chunker = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in docs])

# Rebuild the vector store in a separate directory so the new chunks
# don't mix with the index built from the recursive splits
vectorstore = Chroma.from_documents(documents=semantic_chunks, embedding=embeddings, persist_directory="chroma_db_semantic")

retriever = vectorstore.as_retriever()

Here, we replace RecursiveCharacterTextSplitter with SemanticChunker. We initialize SemanticChunker with our embeddings model and use it to chunk our documents. The rest of the RAG pipeline remains similar to the basic setup.
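A quick way to see how the two strategies differ is to compare chunk counts and sizes; this rough diagnostic (not part of the formal evaluation) assumes the splits and semantic_chunks variables from above:

def chunk_stats(chunks, label):
    lengths = [len(c.page_content) for c in chunks]
    print(f"{label}: {len(chunks)} chunks, "
          f"avg {sum(lengths) / len(lengths):.0f} chars, "
          f"max {max(lengths)} chars")

chunk_stats(splits, "Recursive splitter")
chunk_stats(semantic_chunks, "Semantic chunker")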

Let's evaluate the semantic chunking approach with the same evaluation functions. Re-run the query loop from the previous section against the new retriever, collect the outputs into a results_df DataFrame (with question, context, and response columns), and then run:

results_df = run_all_evaluations(
    results_df,
    question_col='question',
    context_col='context',
    response_col='response'
)

Chain of Thought Retrieval Logic for Enhanced Groundedness

To further improve our RAG application, especially in terms of groundedness, we can implement a Chain of Thought (CoT) retrieval logic. Instead of directly retrieving context based on the initial query, we'll first generate sub-questions related to the main query and then retrieve context for each sub-question. This can help retrieve more relevant and focused context, potentially leading to better grounded answers.

from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from typing import List

# Sub-question generation prompt
subq_prompt = PromptTemplate.from_template(
    "Break down this question into 2-3 sub-questions needed to answer it. "
    "Focus on specific topics, details, and related subtopics.\n"
    "Question: {input}\n"
    "Format: Bullet points with 'SUBQ:' prefix"
)

# Sub-question parser (extract a clean list from the LLM output)
def parse_subqs(message) -> List[str]:
    return [line.split("SUBQ:")[1].strip()
            for line in message.content.split("\n")
            if "SUBQ:" in line]

# Chain to generate and parse sub-questions
subq_chain = subq_prompt | llm | RunnableLambda(parse_subqs)

# QA prompt modified to handle multiple contexts
qa_system_prompt = PromptTemplate.from_template(
    "Answer using ALL context below. Connect information between contexts.\n"
    "CONTEXTS:\n{contexts}\n\n"
    "Question: {input}\n"
    "Final Answer:"
)

# Full chain: generate sub-questions, retrieve context for each, then answer
full_chain = (
    RunnablePassthrough.assign(
        subqs=lambda x: subq_chain.invoke(x["input"])
    )
    .assign(
        contexts=lambda x: "\n\n".join([
            doc.page_content
            for q in x["subqs"]
            for doc in retriever.invoke(q)
        ])
    )
    .assign(
        answer=qa_system_prompt | llm
    )
)

In this code:

  • We define a subq_prompt to instruct the LLM to generate sub-questions.

  • parse_subqs function parses the LLM output to extract a list of sub-questions.

  • subq_chain combines the prompt, LLM, and parser to create a chain for sub-question generation.

  • qa_system_prompt is modified to handle multiple contexts retrieved for sub-questions.

  • full_chain orchestrates the entire process: generate sub-questions, retrieve context for each sub-question, and then answer the original question using all retrieved contexts.
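To build the analysis_df used in the evaluation below, run full_chain over the questionnaire and collect the sub-questions, combined contexts, and final answers. A minimal sketch (column names chosen to match the evaluation call that follows):

cot_results = []
for question in dataset['Query_Text']:
    try:
        out = full_chain.invoke({"input": question})
        cot_results.append({
            "original_question": question,
            "sub_questions": out["subqs"],
            "retrieved_contexts": out["contexts"],
            "final_answer": out["answer"].content,  # the answer is a chat message
        })
    except Exception as e:
        print(f"Error processing question: {e}")

analysis_df = pd.DataFrame(cot_results)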

Finally, let's evaluate the Chain of Thought RAG application using our evaluation functions:

analysis_df = run_all_evaluations(
    analysis_df,
    question_col='original_question',
    context_col='retrieved_contexts',
    response_col='final_answer'
)

This will evaluate the Chain of Thought RAG application and add the evaluation metrics to the analysis_df DataFrame.

Results Analysis

Now, let's analyze the evaluation results by plotting the average scores for each metric across the three RAG approaches.
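A grouped bar chart of the per-approach averages makes the comparison easy to read. A minimal matplotlib sketch, assuming the three evaluated DataFrames from the previous sections (recursive_df, results_df, and analysis_df):

import matplotlib.pyplot as plt

metric_cols = ['context_relevance', 'context_retrieval', 'Groundedness']

def avg_scores(df, label):
    scores = df[metric_cols].apply(pd.to_numeric, errors='coerce').mean()
    scores.name = label
    return scores

summary = pd.concat([
    avg_scores(recursive_df, "Recursive splitter"),
    avg_scores(results_df, "Semantic chunker"),
    avg_scores(analysis_df, "Chain of Thought (SubQ)"),
], axis=1)

summary.T.plot(kind="bar", figsize=(10, 5), rot=0)
plt.ylabel("Average score")
plt.title("RAG approach comparison")
plt.tight_layout()
plt.show()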



The comparison of the three different RAG approaches reveals the following:

  1. Context Relevance:

    • All three approaches show similar context relevance scores, ranging from 0.44 to 0.48.

    • Semantic chunking slightly outperforms the other two, with an average score of 0.48.

  2. Context Retrieval:

    • The Chain of Thought (SubQ) approach significantly outperforms the other two in context retrieval, achieving an average score of 0.92.

    • Semantic chunking comes in second with a score of 0.86, while recursive splitting scores the lowest at 0.80.

  3. Groundedness:

    • The Chain of Thought approach also shows the highest groundedness score at 0.31.

    • Semantic chunking is second with a score of 0.28, and recursive splitting performs the poorest with a score of 0.15.

Observability Benefits: Throughout our experiments, we used FutureAGI's observability tools to track, measure, and compare performance metrics. This observability was crucial in identifying which approach performed best for different aspects of RAG. It enabled us to make data-driven decisions when refining our implementation and helped pinpoint specific areas for improvement in each approach.

FutureAGI observability dashboard showing LLM tracing metrics with performance data over 30 days. The interface displays Primary Average and Traffic trends, with evaluation metrics for different RAG components and their context relevance scores.

Key Takeaway: The Chain of Thought (SubQ) approach demonstrates the best overall performance, particularly in context retrieval and groundedness. While it shows a slight trade-off in context relevance compared to semantic chunking, the improvements in retrieval and groundedness are significant.

Best Practices and Recommendations

Based on our experiments and analysis, here are some best practices and recommendations for building and improving RAG applications:

  1. When to use each approach:

    • Chain of Thought (SubQ): Ideal for complex queries that require integrating information from multiple parts of the document or when groundedness is a top priority. It excels in retrieving highly relevant context and producing well-grounded answers.

    • Semantic chunking: A good balance for general-purpose RAG applications. It provides improved context retrieval over basic recursive splitting and can be faster and less resource-intensive than Chain of Thought. Use it when speed and a moderate level of groundedness are important.

    • Recursive splitting: Suitable as a baseline or for simple RAG applications where query complexity is low, and computational efficiency is paramount. However, it may not be optimal for production systems requiring high accuracy and groundedness.


  2. Performance considerations:

    • SubQ approach: Involves more API calls due to sub-question generation and retrieval, potentially increasing latency.

    • Semantic chunking: Introduces a moderate computational overhead for semantic analysis during chunking.

    • Recursive splitting: Is the most computationally efficient in terms of chunking and retrieval.


  3. Cost considerations:

    • SubQ approach: May incur higher API costs due to multiple LLM calls for sub-question generation and potentially more document retrievals.

    • Consider implementing caching mechanisms for frequently asked questions and sub-questions to reduce API calls and costs for all approaches.
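Caching is simple to prototype by memoizing retrieval on the query string, so repeated questions and sub-questions reuse earlier results instead of triggering new embedding and vector-store calls. An in-memory sketch (swap in a persistent cache for production):

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> tuple:
    # Returned as a tuple so results are hashable and safely cached
    return tuple(doc.page_content for doc in retriever.invoke(query))

def cached_context(queries) -> str:
    # Build a combined context string from a list of (sub-)questions
    return "\n\n".join(chunk for q in queries for chunk in cached_retrieve(q))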

Future Improvements

There are several avenues for further improvement of our RAG application:

  1. Hybrid Approach:

    • Combine semantic chunking with Chain of Thought for complex queries. Use semantic chunking for initial document segmentation and then apply Chain of Thought retrieval on semantically chunked data.

    • Implement adaptive approach selection based on query complexity. For simple queries, use semantic chunking or recursive splitting; for complex queries, switch to Chain of Thought.


  2. Optimization Opportunities:

    • Implement caching for sub-questions and their retrieved contexts in the Chain of Thought approach to reduce redundant computations and API calls.

    • Fine-tune chunk sizes and overlap parameters for both RecursiveCharacterTextSplitter and SemanticChunker to optimize retrieval performance.

    • Experiment with different embedding models, potentially using task-specific embedding models for better semantic representation and retrieval.


  3. Additional Evaluations:

    • Incorporate response time measurements to evaluate the latency of each approach.

    • Include cost per query metrics to assess the economic efficiency of different methods.

    • Measure memory usage for each approach to understand resource requirements, especially for large-scale RAG applications.

Conclusion

This blog post demonstrated a step-by-step process for building and incrementally improving a RAG application using Langchain and FutureAGI SDK. We started with a basic RAG setup using recursive splitting, enhanced it with semantic chunking, and further improved it with a Chain of Thought retrieval logic. Through quantitative evaluations using FutureAGI SDK, we analyzed the performance of each approach and identified the Chain of Thought method as the most effective in terms of context retrieval and groundedness.

By understanding the trade-offs and benefits of different RAG techniques, and by leveraging evaluation and observability tools like FutureAGI, you can build robust and high-performing RAG applications tailored to your specific needs. We encourage you to experiment with these approaches and continue to iterate and improve your RAG pipelines for optimal results.
