Introduction
Customer support has become a crucial operational function for companies. They spend significant amounts of money building out support teams and operations to provide the best experience possible. However, with the rise of large language models (LLMs), companies have started shifting towards chatbots. These chatbots can replace other support tools and alleviate the burden on human support staff.
Problem Statement: Delivering Exceptional Customer Support with AI
As customer expectations soar, companies are increasingly turning to AI-powered chatbots to handle inquiries efficiently and at scale. Large Language Models (LLMs) emerge as a promising solution, equipped to comprehend intricate queries and generate human-like responses. However, selecting the optimal LLM for customer support presents a complex challenge. Several factors, including response accuracy, tone, cost-effectiveness, and latency, play pivotal roles in ensuring a positive customer experience.
By selecting the right LLM, businesses can significantly enhance their customer support capabilities. This leads to increased customer satisfaction, improved brand loyalty, and operational efficiencies that translate into cost savings. Moreover, aligning the LLM's performance with customer engagement strategies can result in more meaningful interactions, fostering stronger relationships with customers.
How can businesses confidently make an informed decision and choose the best LLM to power their customer support initiatives?
Solution
Future AGI's Observability Platform: Streamlining LLM Benchmarking
To address this challenge, Future AGI offers a powerful observability platform designed to simplify the process of benchmarking and evaluating LLMs. Our platform provides comprehensive tracing, automated metric collection, and intuitive dashboards, enabling teams to objectively compare different models and identify the optimal choice for their specific needs.
This case study details a 3-day experiment conducted by Future AGI's team to benchmark three leading LLMs – Mistral Large, Claude Sonnet 3.5, and GPT-4o – for a customer support chatbot application. We aim to demonstrate how Future AGI's platform streamlines the benchmarking process, providing actionable insights to inform model selection.
Setting the Stage: Evaluation Criteria and Experiment Design
Before diving into testing, we defined key evaluation criteria relevant to customer support interactions. These included:
Response Accuracy: Does the model understand the user's query and provide a relevant and helpful answer?
Politeness & Tone (Response Tone): Does the model communicate in a professional, empathetic, and customer-centric tone, even in response to negative or frustrated inquiries? We specifically focused on metrics like neutrality, joy, and the minimization of negative emotions like anger and annoyance.
Consistency: Does the model maintain a consistent brand voice and adhere to company policies across different interactions?
Latency: How quickly does the model generate a response?
Cost: What is the average cost per interaction for each model?
Content Moderation: Does the model avoid generating inappropriate or harmful responses?
Data Privacy Compliance: Does the model respect user data privacy and adhere to regulations like GDPR and CCPA?
Cultural Sensitivity: Is the model aware of and sensitive to cultural nuances in customer interactions?
Completeness: Does the model provide complete and comprehensive answers, addressing all aspects of the user's query?
Bias Detection: Does the model exhibit any biases in its responses?
To simulate a real-world customer support scenario, we created a series of test queries representing common customer support issues for a fictitious technology company, "Hooli." These queries spanned various categories, including billing, account access, complaints, product information, technical support, and shipping. We also intentionally included queries with different tones (neutral, angry, casual, etc.) to assess the models' ability to adapt their communication style.
Below is a sample of the dataset we used for benchmarking.
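Structurally, each test case pairs a query with its category and intended tone; the rows below are an illustrative sketch, not the actual dataset:

```python
# Representative structure of the test queries (illustrative values, not the real dataset).
test_queries = [
    {
        "category": "billing",
        "tone": "angry",
        "query": "I was charged twice for my Hooli subscription this month. Fix this now!",
    },
    {
        "category": "account_access",
        "tone": "neutral",
        "query": "I can't log in to my Hooli account after resetting my password.",
    },
    {
        "category": "shipping",
        "tone": "casual",
        "query": "Hey, any idea when my Hooli order is going to arrive?",
    },
]
```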

The prompt used is given below:
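In essence, it asks the model to act as a Hooli customer support agent and to answer accurately, completely, and politely. The version below is a representative sketch rather than the verbatim prompt:

```python
# Representative sketch of the support-agent system prompt (not the verbatim prompt used).
SYSTEM_PROMPT = """You are a customer support agent for Hooli, a technology company.
Answer the customer's question accurately and completely, in a polite, empathetic,
and professional tone, even if the customer is frustrated or angry. Follow company
policy, never request or expose sensitive personal data, and stay respectful of
cultural differences."""
```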
Experiment Execution: Tracing and Evaluating LLM Responses with Future AGI
We leveraged Future AGI's platform to instrument our benchmarking script. This involved:
1. Platform Initialization: Registering our project within Future AGI and defining evaluation tags corresponding to our chosen metrics. This is done using the `register` function from the `fi.integrations.otel` library. Here's a snippet showing how we initialized the Future AGI tracer and defined evaluation tags:
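The sketch below shows the shape of that initialization. The import path for `register` follows the library named above, while the type module, enum members, and keyword arguments are assumptions and may differ slightly from Future AGI's current SDK:

```python
from fi.integrations.otel import register
# The exact module path for these types is an assumption; check Future AGI's docs.
from fi.integrations.otel.types import EvalName, EvalSpanKind, EvalTag, EvalTagType

# Evaluations Future AGI should run automatically on every traced LLM response.
eval_tags = [
    EvalTag(
        eval_name=EvalName.TONE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Response_Tone",
    ),
    EvalTag(
        eval_name=EvalName.CONTENT_MODERATION,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Content_Moderation",
    ),
    EvalTag(
        eval_name=EvalName.DATA_PRIVACY_COMPLIANCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Data_Privacy_Compliance",
    ),
    EvalTag(
        eval_name=EvalName.CULTURAL_SENSITIVITY,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Cultural_Sensitivity",
    ),
]

# Register the benchmarking project with Future AGI and obtain a tracer provider.
trace_provider = register(
    project_name="hooli-customer-support-benchmark",  # hypothetical project name
    project_version_name="GPT-4o",  # changed for each model's run
    eval_tags=eval_tags,
)
```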
These `EvalTag` definitions instruct Future AGI to automatically evaluate specific aspects of the LLM responses, such as `TONE`, `CONTENT_MODERATION`, `DATA_PRIVACY_COMPLIANCE`, and `CULTURAL_SENSITIVITY` in this example, using custom names for easy identification in the dashboard.
2. Instrumentation: Using Future AGI's OpenTelemetry integrations for OpenAI, Anthropic, and Mistral AI. This instrumentation automatically traces each LLM call, capturing key data points like request parameters, responses, latency, and cost. The following code snippet demonstrates how we instrumented the OpenAI library:
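A minimal sketch of that step, assuming the instrumentor follows the usual OpenTelemetry `instrument()` pattern (the exact class and module names are assumptions):

```python
# Assumed instrumentor import; the exact class/module names may differ.
from fi.integrations.otel.instrumentation.openai import OpenAIInstrumentor

# Patch the OpenAI client so every chat-completion call emits a trace
# (request parameters, response, latency, cost) to the provider registered above.
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
```

The Anthropic and Mistral AI clients were instrumented the same way with their respective instrumentor classes.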
3. Benchmark Execution: Running the benchmark script three times, once for each model (Mistral, Claude, and GPT-4o). For each run, we manually updated the `project_version_name` in the script to accurately reflect the model being tested (e.g., "Mistral Large", "Claude Sonnet", "GPT-4o"). The core of the benchmark execution is handled by the `run_benchmark` function:
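A simplified sketch of that function, assuming the `query_gpt`-style helpers each take the system prompt and a user query and return the model's reply:

```python
import time


def run_benchmark(query_fn, test_queries):
    """Run every test query through one model's query function.

    query_fn is one of query_gpt, query_claude, or query_mistral; each call is
    traced automatically thanks to the instrumentation set up earlier.
    """
    results = []
    for item in test_queries:
        start = time.time()
        reply = query_fn(SYSTEM_PROMPT, item["query"])
        results.append(
            {
                "category": item["category"],
                "tone": item["tone"],
                "query": item["query"],
                "response": reply,
                "latency_s": round(time.time() - start, 2),
            }
        )
    return results


# One run per model, updating project_version_name in register() between runs, e.g.:
# results = run_benchmark(query_gpt, test_queries)
```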
Notice how the `query_gpt`, `query_claude`, and `query_mistral` functions, which interact with the respective LLM APIs, are automatically traced due to the instrumentation set up earlier.
4. Automated Evaluation: As the script executed queries against each LLM, Future AGI's platform automatically recorded traces and performed evaluations based on the defined tags, covering a wide range of metrics crucial for customer support quality and responsible AI.
Analysing Results: Actionable Insights from Future AGI's Dashboard
After running the benchmarks for each model, we turned to Future AGI's dashboard to analyse the results. The dashboard provided a centralized view of all runs, allowing for easy comparison of models across both system and evaluation metrics.

Looking at the dashboard, we can now observe a more comprehensive set of evaluation metrics beyond just neutrality and joy in "Response_Tone". Specifically, we can see metrics related to negative emotions like anger, annoyance, and confusion, as well as crucial aspects like Content Moderation, Data Privacy Compliance, and Cultural Sensitivity.
Here's a summary of some key metric averages observed in the dashboard:

We can also drill into the generated traces to get a clearer picture of where a model is not performing well.

From these metrics, we can observe:
Negative Emotions: Claude Sonnet exhibits significantly higher average annoyance and confusion scores compared to Mistral Large and GPT-4o, suggesting it might generate responses perceived as more annoying or confusing by customers. Anger levels are consistent across all models in this evaluation.
Content Moderation: All models perform well in content moderation, consistently scoring above 95%.
Data Privacy Compliance: Mistral Large and Claude Sonnet show slightly better average Data Privacy Compliance scores compared to GPT-4o in this specific benchmark.
Cultural Sensitivity: Claude Sonnet has a slightly higher average Cultural Sensitivity score.
System Metrics: GPT-4o has the lowest average cost, while Claude Sonnet exhibits the lowest latency. Mistral Large has the highest latency, with a cost between Claude's and GPT-4o's.
Key Findings and Model Selection using "Choose Winner"
Based on the comprehensive data collected and visualized in Future AGI's dashboard, we leveraged the "Choose Winner" feature to objectively determine the best model for our customer support chatbot. After configuring our priorities within the "Choose Winner" interface (as shown in the dashboard), Claude Sonnet was identified as the winner.

While GPT-4o demonstrated the lowest cost and comparable performance on some tone metrics (such as anger and confusion), and Mistral Large performed well on the negative-emotion metrics, Claude Sonnet's slightly better cultural sensitivity and lower latency, combined with the weights assigned to the other factors in the "Choose Winner" configuration, led to its selection as the top-performing model in this benchmark. Claude Sonnet's higher annoyance and confusion scores are areas to investigate further, potentially by refining prompts or model configurations.
Benefits of Future AGI's Platform for LLM Benchmarking
This 3-day experiment highlighted the significant benefits of using Future AGI's observability platform for LLM benchmarking:
Radically Simplified Experiment Setup & Execution:
Time Saving: Instrumenting several LLM libraries (OpenAI, Anthropic, Mistral) took just a few lines of code (`register`, `instrument`), cutting setup time from hours or days (had we built custom tracing and evaluation logic) to minutes.
Effort Saving: Running the benchmark against different models only required changing the `project_version_name` and the target model function, greatly simplifying comparative testing.
Completely Automated Metric Collection:
Time Savings: The platform automatically computed and aggregated more than 12 different metrics per run (such as Avg. Cost, Avg. Latency, and several detailed Evals covering tone components, Content Moderation, Data Privacy, Cultural Sensitivity, Completeness, and Bias). Manually gathering, processing, and scoring these, particularly the subjective Evals, for every test case would normally require hours of analysis per run. Future AGI produced these results as soon as each run completed.
Data Richness: Surfaced detailed data points such as Avg. Cost and Avg. Latency (ranging from 4.8 sec for Claude to 5.7 sec for Mistral).
Comprehensive & Unified Observability:
Holistic View: Displayed both key System Metrics (Cost, Latency) and detailed Evaluation Metrics (e.g., Claude's 47.62% annoyance vs. Mistral's 14.29%) in one integrated dashboard. This avoids optimizing in one dimension (such as cost) while sub-optimizing others (such as user-perceived quality).
Analysis Depth: Enabled drilling down into individual interactions through traces (as shown in the trace view above) to identify the underlying reasons behind low metric scores.
Quick, Actionable Insights:
Speed to Insight: Visual dashboards enabled at-a-glance comparison of model performance across all metrics, shortening analysis time from potentially hours (parsing logs and spreadsheets) to minutes.
Accelerated Decision Making: The "Choose Winner" capability reduced the complicated, multi-metric evaluation to a simple, ranked result (#1 Claude Sonnet, #2 GPT-4o, #3 Mistral Large) in seconds, with user-specified priorities.
Objective, Customizable Model Selection:
Data-Driven Decision: Replaced subjective selection with an objective ranking based on weighted priorities (adjustable via sliders). This ensured that the selection of Claude Sonnet reflected exactly the predetermined weighting of the different criteria (latency, cultural sensitivity, etc.), even though it was neither the cheapest model nor the one with the lowest annoyance score.
Better Alignment: Ensures the ultimate model selection directly promotes certain business objectives (e.g., user experience through low latency over lowest cost).
Real-time Monitoring and Continuous Evaluation with Observe
Beyond benchmarking experiments, Future AGI's platform also offers Observe, a powerful feature for monitoring LLM-powered applications in real-time production environments. While the benchmarking experiment focused on evaluating models in a controlled setting, Observe extends these capabilities to provide continuous insights into application performance and LLM behaviour as users interact with the live system.
Just as we defined evaluation metrics for our benchmark, these same Evals can be seamlessly integrated into the Observe feature. This means that as your customer support chatbot (or any LLM application) handles real user queries, Future AGI's platform continuously traces these interactions and applies the defined evaluations in real-time.
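In practice this reuses the same setup as the benchmark: register a project for the production deployment, pass in the same `eval_tags`, and keep the instrumentation in place. A minimal sketch (the project name below is hypothetical):

```python
from fi.integrations.otel import register

# Reuse the eval_tags defined for the benchmark so production traffic is
# scored on the same criteria (tone, content moderation, privacy, etc.).
observe_provider = register(
    project_name="hooli-support-chatbot-prod",  # hypothetical production project
    eval_tags=eval_tags,
)
```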
Benefits of Real-time Monitoring with Observe and Evals:
Performance Tracking in Production: Monitor key metrics like latency, cost, and success rates of your LLM application as it handles live traffic.
Real-time Quality Assurance: Continuously evaluate response tone, content moderation, data privacy compliance, and other critical aspects of LLM outputs in real-world scenarios.
Early Detection of Issues: Identify performance degradations, unexpected behaviour, or emerging issues (like shifts in tone or drops in compliance) as they happen, enabling proactive intervention.
Data-Driven Iteration and Improvement: Gain continuous feedback on your LLM application's performance to inform prompt engineering, model fine-tuning, and ongoing optimization efforts.
Ensuring Consistent Quality: Maintain a high standard of quality and reliability for your LLM application over time by constantly monitoring and evaluating its behaviour.

A sample of Future AGI's Observe dashboard used to monitor the deployed app
By leveraging Observe with integrated Evals, Future AGI provides a complete observability solution, moving beyond initial model selection to ensure the ongoing health, performance, and responsible operation of your LLM-powered customer support chatbot in a dynamic, real-world environment. This continuous feedback loop is essential for building and maintaining truly effective and trustworthy AI applications.
Conclusion
Data-Driven LLM Selection for Superior Customer Support – Claude Sonnet Emerges as the Winner
Choosing the right LLM for customer support is a critical decision that impacts customer satisfaction and operational efficiency. Future AGI's observability platform empowers businesses to move beyond subjective assessments and adopt a data-driven approach to LLM benchmarking. By providing comprehensive tracing, automated evaluations, and intuitive dashboards, culminating in the objective "Choose Winner" feature, Future AGI simplifies the process of comparing models, identifying optimal choices, and ultimately delivering superior AI-powered customer support experiences. This case study demonstrates how Future AGI transforms LLM evaluation from a complex, manual task into a streamlined, insightful, and ultimately more effective process, leading us to confidently select Claude Sonnet as the winning LLM for our customer support chatbot application.
