Introduction
Customer support has become a crucial operational function for companies. They spend significant amounts of money building out support teams and operations to provide the best experience possible. However, with the rise of large language models (LLMs), companies have started shifting towards chatbots. These chatbots can replace other support tools and alleviate the burden on human support staff.
Problem Statement: Delivering Exceptional Customer Support with AI
As customer expectations soar, companies are increasingly turning to AI-powered chatbots to handle inquiries efficiently and at scale. Large Language Models (LLMs) emerge as a promising solution, equipped to comprehend intricate queries and generate human-like responses. However, selecting the optimal LLM for customer support presents a complex challenge. Several factors, including response accuracy, tone, cost-effectiveness, and latency, play pivotal roles in ensuring a positive customer experience.
By selecting the right LLM, businesses can significantly enhance their customer support capabilities. This leads to increased customer satisfaction, improved brand loyalty, and operational efficiencies that translate into cost savings. Moreover, aligning the LLM's performance with customer engagement strategies can result in more meaningful interactions, fostering stronger relationships with customers.
How can businesses confidently make an informed decision and choose the best LLM to power their customer support initiatives?
Solution
Future AGI's Observability Platform: Streamlining LLM Benchmarking
To address this challenge, Future AGI offers a powerful observability platform designed to simplify the process of benchmarking and evaluating LLMs. Our platform provides comprehensive tracing, automated metric collection, and intuitive dashboards, enabling teams to objectively compare different models and identify the optimal choice for their specific needs.
This case study details a 3-day experiment conducted by Future AGI's team to benchmark three leading LLMs – Mistral Large, Claude Sonnet 3.5, and GPT-4o – for a customer support chatbot application. We aim to demonstrate how Future AGI's platform streamlines the benchmarking process, providing actionable insights to inform model selection.
Setting the Stage: Evaluation Criteria and Experiment Design
Before diving into testing, we defined key evaluation criteria relevant to customer support interactions. These included:
Response Accuracy: Does the model understand the user's query and provide a relevant and helpful answer?
Politeness & Tone (Response Tone): Does the model communicate in a professional, empathetic, and customer-centric tone, even in response to negative or frustrated inquiries? We specifically focused on metrics like neutrality, joy, and the minimization of negative emotions like anger and annoyance.
Consistency: Does the model maintain a consistent brand voice and adhere to company policies across different interactions?
Latency: How quickly does the model generate a response?
Cost: What is the average cost per interaction for each model?
Content Moderation: Does the model avoid generating inappropriate or harmful responses?
Data Privacy Compliance: Does the model respect user data privacy and adhere to regulations like GDPR and CCPA?
Cultural Sensitivity: Is the model aware of and sensitive to cultural nuances in customer interactions?
Completeness: Does the model provide complete and comprehensive answers, addressing all aspects of the user's query?
Bias Detection: Does the model exhibit any biases in its responses?
To simulate a real-world customer support scenario, we created a series of test queries representing common customer support issues for a fictitious technology company, "Hooli." These queries spanned various categories, including billing, account access, complaints, product information, technical support, and shipping. We also intentionally included queries with different tones (neutral, angry, casual, etc.) to assess the models' ability to adapt their communication style.
Below is a sample of the dataset we used for benchmarking.
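Structurally, each test case pairs a query with its category and intended tone; the rows below are an illustrative sketch, not the actual dataset:

```python
# Representative structure of the test queries (illustrative values, not the real dataset).
test_queries = [
    {
        "category": "billing",
        "tone": "angry",
        "query": "I was charged twice for my Hooli subscription this month. Fix this now!",
    },
    {
        "category": "account_access",
        "tone": "neutral",
        "query": "I can't log in to my Hooli account after resetting my password.",
    },
    {
        "category": "shipping",
        "tone": "casual",
        "query": "Hey, any idea when my Hooli order is going to arrive?",
    },
]
```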

The prompt used is given below:
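In essence, it asks the model to act as a Hooli customer support agent and to answer accurately, completely, and politely. The version below is a representative sketch rather than the verbatim prompt:

```python
# Representative sketch of the support-agent system prompt (not the verbatim prompt used).
SYSTEM_PROMPT = """You are a customer support agent for Hooli, a technology company.
Answer the customer's question accurately and completely, in a polite, empathetic,
and professional tone, even if the customer is frustrated or angry. Follow company
policy, never request or expose sensitive personal data, and stay respectful of
cultural differences."""
```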
Experiment Execution: Tracing and Evaluating LLM Responses with Future AGI
We leveraged Future AGI's platform to instrument our benchmarking script. This involved:
1. Platform Initialization: Registering our project within Future AGI and defining evaluation tags corresponding to our chosen metrics. This is done using the `register` function from the `fi.integrations.otel` library. Here's a snippet showing how we initialized the Future AGI tracer and defined evaluation tags:
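The sketch below shows the shape of that initialization. The import path for `register` follows the library named above, while the type module, enum members, and keyword arguments are assumptions and may differ slightly from Future AGI's current SDK:

```python
from fi.integrations.otel import register
# The exact module path for these types is an assumption; check Future AGI's docs.
from fi.integrations.otel.types import EvalName, EvalSpanKind, EvalTag, EvalTagType

# Evaluations Future AGI should run automatically on every traced LLM response.
eval_tags = [
    EvalTag(
        eval_name=EvalName.TONE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Response_Tone",
    ),
    EvalTag(
        eval_name=EvalName.CONTENT_MODERATION,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Content_Moderation",
    ),
    EvalTag(
        eval_name=EvalName.DATA_PRIVACY_COMPLIANCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Data_Privacy_Compliance",
    ),
    EvalTag(
        eval_name=EvalName.CULTURAL_SENSITIVITY,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        custom_eval_name="Cultural_Sensitivity",
    ),
]

# Register the benchmarking project with Future AGI and obtain a tracer provider.
trace_provider = register(
    project_name="hooli-customer-support-benchmark",  # hypothetical project name
    project_version_name="GPT-4o",  # changed for each model's run
    eval_tags=eval_tags,
)
```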
These `EvalTag` definitions instruct Future AGI to automatically evaluate specific aspects of the LLM responses, such as `TONE`, `CONTENT_MODERATION`, `DATA_PRIVACY_COMPLIANCE`, and `CULTURAL_SENSITIVITY` in this example, using custom names for easy identification in the dashboard.
2. Instrumentation: Using Future AGI's OpenTelemetry integrations for OpenAI, Anthropic, and Mistral AI. This instrumentation automatically traces each LLM call, capturing key data points like request parameters, responses, latency, and cost. The following code snippet demonstrates how we instrumented the OpenAI library:
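A minimal sketch of that step, assuming the instrumentor follows the usual OpenTelemetry `instrument()` pattern (the exact class and module names are assumptions):

```python
# Assumed instrumentor import; the exact class/module names may differ.
from fi.integrations.otel.instrumentation.openai import OpenAIInstrumentor

# Patch the OpenAI client so every chat-completion call emits a trace
# (request parameters, response, latency, cost) to the provider registered above.
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
```

The Anthropic and Mistral AI clients were instrumented the same way with their respective instrumentor classes.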
3. Benchmark Execution: Running the benchmark script three times, once for each model (Mistral, Claude, and GPT-4o). For each run, we manually updated the `project_version_name` in the script to accurately reflect the model being tested (e.g., "Mistral Large", "Claude Sonnet", "GPT-4o"). The core of the benchmark execution is handled by the `run_benchmark` function:
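A simplified sketch of that function, assuming the `query_gpt`-style helpers each take the system prompt and a user query and return the model's reply:

```python
import time


def run_benchmark(query_fn, test_queries):
    """Run every test query through one model's query function.

    query_fn is one of query_gpt, query_claude, or query_mistral; each call is
    traced automatically thanks to the instrumentation set up earlier.
    """
    results = []
    for item in test_queries:
        start = time.time()
        reply = query_fn(SYSTEM_PROMPT, item["query"])
        results.append(
            {
                "category": item["category"],
                "tone": item["tone"],
                "query": item["query"],
                "response": reply,
                "latency_s": round(time.time() - start, 2),
            }
        )
    return results


# One run per model, updating project_version_name in register() between runs, e.g.:
# results = run_benchmark(query_gpt, test_queries)
```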
Notice how the `query_gpt`, `query_claude`, and `query_mistral` functions, which interact with the respective LLM APIs, are automatically traced due to the instrumentation set up earlier.
4. Automated Evaluation: As the script executed queries against each LLM, Future AGI's platform automatically recorded traces and performed evaluations based on the defined tags, covering a wide range of metrics crucial for customer support quality and responsible AI.
Analysing Results: Actionable Insights from Future AGI's Dashboard
After running the benchmarks for each model, we turned to Future AGI's dashboard to analyse the results. The dashboard provided a centralized view of all runs, allowing for easy comparison of models across both system and evaluation metrics.

Looking at the dashboard, we can now observe a more comprehensive set of evaluation metrics beyond just neutrality and joy in "Response_Tone". Specifically, we can see metrics related to negative emotions like anger, annoyance, and confusion, as well as crucial aspects like Content Moderation, Data Privacy Compliance, and Cultural Sensitivity.
Here's a summary of some key metric averages observed in the dashboard:

We can also drill into the generated traces to get a clearer picture of where a model is not performing well.

From these metrics, we can observe:
Negative Emotions: Claude Sonnet exhibits significantly higher average annoyance and confusion scores compared to Mistral Large and GPT-4o, suggesting it might generate responses perceived as more annoying or confusing by customers. Anger levels are consistent across all models in this evaluation.
Content Moderation: All models perform well in content moderation, consistently scoring above 95%.
Data Privacy Compliance: Mistral Large and Claude Sonnet show slightly better average Data Privacy Compliance scores compared to GPT-4o in this specific benchmark.
Cultural Sensitivity: Claude Sonnet has a slightly higher average Cultural Sensitivity score.
System Metrics: GPT-4o has the lowest average cost, while Claude Sonnet exhibits the lowest latency. Mistral Large has the highest latency, with a cost between Claude's and GPT-4o's.
Key Findings and Model Selection using "Choose Winner"
Based on the comprehensive data collected and visualized in Future AGI's dashboard, we leveraged the "Choose Winner" feature to objectively determine the best model for our customer support chatbot. After configuring our priorities within the "Choose Winner" interface (as shown in the dashboard), Claude Sonnet was identified as the winner.

While GPT-4o demonstrated the lowest cost and comparable performance on some tone metrics (such as anger and confusion), and Mistral Large performed well on the negative-emotion metrics, Claude Sonnet's slightly better cultural sensitivity and lower latency, combined with the weights assigned to the other factors in the "Choose Winner" configuration, led to its selection as the top-performing model in this benchmark. Claude Sonnet's higher annoyance and confusion scores are areas to investigate further, potentially by refining prompts or model configurations.
Benefits of Future AGI's Platform for LLM Benchmarking
This 3-day experiment highlighted the significant benefits of using Future AGI's observability platform for LLM benchmarking:
Radically Simplified Experiment Setup & Execution:
Time Saving: Instrumenting several LLM libraries (OpenAI, Anthropic, Mistral) took just a few lines of code (`register`, `instrument`), cutting setup time from hours or days (had we built custom tracing and evaluation logic) to minutes.
Effort Saving: Running the benchmark against different models only required changing the `project_version_name` and the target model function, greatly simplifying comparative testing.
Completely Automated Metric Collection:
Time Savings: The platform automatically computed and aggregated more than 12 different metrics per run (such as Avg. Cost, Avg. Latency, and several detailed Evals covering tone components, Content Moderation, Data Privacy, Cultural Sensitivity, Completeness, and Bias). Manually gathering, processing, and scoring these, particularly the subjective Evals, for every test case would normally require hours of analysis per run. Future AGI produced these results as soon as each run completed.
Data Richness: Surfaced detailed data points such as Avg. Cost and Avg. Latency (ranging from 4.8 sec for Claude to 5.7 sec for Mistral).
Comprehensive & Unified Observability:
Holistic View: Displayed both key System Metrics (Cost, Latency) and detailed Evaluation Metrics (e.g., Claude's 47.62% annoyance vs. Mistral's 14.29%) in one integrated dashboard. This avoids optimizing in one dimension (such as cost) while sub-optimizing others (such as user-perceived quality).
Analysis Depth: Enabled drilling down into individual interactions through traces (as shown in the trace view above) to identify the underlying reasons behind low metric scores.
Quick, Actionable Insights:
Speed to Insight: Visual dashboards enabled at-a-glance comparison of model performance across all metrics, shortening analysis time from potentially hours (parsing logs and spreadsheets) to minutes.
Accelerated Decision Making: The "Choose Winner" capability reduced the complicated, multi-metric evaluation to a simple, ranked result (#1 Claude Sonnet, #2 GPT-4o, #3 Mistral Large) in seconds, with user-specified priorities.
Objective, Customizable Model Selection:
Data-Driven Decision: Replaced subjective selection with an objective ranking based on weighted priorities (adjustable via sliders). This ensured that the selection of Claude Sonnet reflected exactly the predetermined weighting of the different criteria (latency, cultural sensitivity, etc.), even though it was neither the cheapest model nor the one with the lowest annoyance score.
Better Alignment: Ensures the ultimate model selection directly promotes certain business objectives (e.g., user experience through low latency over lowest cost).
Real-time Monitoring and Continuous Evaluation with Observe
Beyond benchmarking experiments, Future AGI's platform also offers Observe, a powerful feature for monitoring LLM-powered applications in real-time production environments. While the benchmarking experiment focused on evaluating models in a controlled setting, Observe extends these capabilities to provide continuous insights into application performance and LLM behaviour as users interact with the live system.
Just as we defined evaluation metrics for our benchmark, these same Evals can be seamlessly integrated into the Observe feature. This means that as your customer support chatbot (or any LLM application) handles real user queries, Future AGI's platform continuously traces these interactions and applies the defined evaluations in real-time.
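In practice this reuses the same setup as the benchmark: register a project for the production deployment, pass in the same `eval_tags`, and keep the instrumentation in place. A minimal sketch (the project name below is hypothetical):

```python
from fi.integrations.otel import register

# Reuse the eval_tags defined for the benchmark so production traffic is
# scored on the same criteria (tone, content moderation, privacy, etc.).
observe_provider = register(
    project_name="hooli-support-chatbot-prod",  # hypothetical production project
    eval_tags=eval_tags,
)
```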
Benefits of Real-time Monitoring with Observe and Evals:
Performance Tracking in Production: Monitor key metrics like latency, cost, and success rates of your LLM application as it handles live traffic.
Real-time Quality Assurance: Continuously evaluate response tone, content moderation, data privacy compliance, and other critical aspects of LLM outputs in real-world scenarios.
Early Detection of Issues: Identify performance degradations, unexpected behaviour, or emerging issues (like shifts in tone or drops in compliance) as they happen, enabling proactive intervention.
Data-Driven Iteration and Improvement: Gain continuous feedback on your LLM application's performance to inform prompt engineering, model fine-tuning, and ongoing optimization efforts.
Ensuring Consistent Quality: Maintain a high standard of quality and reliability for your LLM application over time by constantly monitoring and evaluating its behaviour.

A sample of Future AGI's Observe dashboard used to monitor the deployed app
By leveraging Observe with integrated Evals, Future AGI provides a complete observability solution, moving beyond initial model selection to ensure the ongoing health, performance, and responsible operation of your LLM-powered customer support chatbot in a dynamic, real-world environment. This continuous feedback loop is essential for building and maintaining truly effective and trustworthy AI applications.
Conclusion
Data-Driven LLM Selection for Superior Customer Support – Claude Sonnet Emerges as the Winner
Choosing the right LLM for customer support is a critical decision that impacts customer satisfaction and operational efficiency. Future AGI's observability platform empowers businesses to move beyond subjective assessments and adopt a data-driven approach to LLM benchmarking. By providing comprehensive tracing, automated evaluations, and intuitive dashboards, culminating in the objective "Choose Winner" feature, Future AGI simplifies the process of comparing models, identifying optimal choices, and ultimately delivering superior AI-powered customer support experiences. This case study demonstrates how Future AGI transforms LLM evaluation from a complex, manual task into a streamlined, insightful, and ultimately more effective process, leading us to confidently select Claude Sonnet as the winning LLM for our customer support chatbot application.
