Mastering Evaluation for AI Agents

N.V.J.K Kartik

Jan 7, 2025

Introduction: Mastering Evaluation for AI Agents

As artificial intelligence grows at an exponential pace, AI agents are rapidly becoming prevalent across the landscape. These autonomous systems are reshaping the human-computer interaction loop by enabling more natural and intuitive ways of completing tasks, combining the power of large language models with external tools to accomplish diverse goals. This blog post provides a comprehensive guide to mastering AI agent evaluation, complete with practical examples using our evaluation cookbook.

What are AI Agents and Why Evaluate Them?

AI agents are sophisticated software programs designed to perceive their environment and take autonomous actions to achieve specific goals. These can range from simple chatbots and virtual assistants to complex systems that make critical decisions in areas like healthcare, finance, and autonomous vehicles.

Evaluating AI agents is crucial for several reasons:

  • Ensuring reliability and performance in real-world scenarios

  • Identifying potential biases and limitations

  • Maintaining safety and ethical standards

As these systems become more complex and autonomous, proper evaluation becomes not just a necessity but a critical imperative for ensuring their responsible deployment.

Key Evaluation Concepts

Function Calling Assessment

Think of function calling as checking whether an AI agent can use its tools correctly. Just as a skilled craftsperson must know when and how to use different tools, an AI agent must demonstrate proficiency in utilizing its available functions and APIs (see the illustrative record after this list). The key aspects we are going to evaluate are:

  • Accuracy in function selection

  • Proper parameter handling
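
For a concrete picture, here is a minimal sketch of what a single function-calling record might look like; the tool name and parameters are purely hypothetical and not part of any real dataset:

# A hypothetical record pairing a user query with the function call the agent produced.
# The "get_weather" tool and its arguments are invented for illustration only.
record = {
    "input": "What's the weather in Paris tomorrow?",
    "function_calling": {
        "name": "get_weather",                                # did the agent select the right tool?
        "arguments": {"city": "Paris", "date": "tomorrow"},   # were the parameters filled in correctly?
    },
}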

Prompt Adherence

Prompt adherence measures how well an AI agent follows instructions and stays within given parameters. This is similar to evaluating how well a student follows exam instructions or how accurately an employee carries out assigned tasks.

We assess:

  • Compliance with given instructions

  • Consistency in responses

  • Ability to handle complex or multi-step prompts

Tone, Toxicity, and Context Relevance

These aspects focus on the qualitative elements of AI agent responses:

  • Tone: Ensuring communications are appropriate for the context

  • Toxicity: Monitoring for harmful or inappropriate content

  • Context Relevance: Verifying that responses align with the given situation

Practical Guide: Evaluating AI Agents with Future AGI SDK

Future AGI’s SDK makes it straightforward to evaluate AI agents automatically and efficiently.

Here’s a sample of how to run an automated evaluation of the function-calling capabilities of an AI agent system.

First, we load a dataset containing three columns:

  • 'input': Contains the user queries or prompts

  • 'function_calling': Contains the function calls made by the agent and their parameters

  • 'output': Contains the actual responses from the agent

This dataset structure allows us to systematically evaluate how well the agent interprets commands and executes appropriate functions.

import pandas as pd

dataset = pd.read_csv("functiondata.csv")
pd.set_option('display.max_colwidth', None) # This line ensures that we can see the full content of each column
dataset.head()
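
For reference, here is a toy sketch of the expected structure; the rows below are invented for illustration and will not match the actual contents of functiondata.csv:

# Invented rows illustrating the three-column layout described above
example = pd.DataFrame({
    "input": [
        "Tell me a joke about cats",
        "Give me a vegan lasagna recipe",
    ],
    "function_calling": [
        '{"name": "tell_joke", "arguments": {"topic": "cats"}}',
        '{"name": "get_recipe", "arguments": {"dish": "vegan lasagna"}}',
    ],
    "output": [
        "Why did the cat sit on the computer? To keep an eye on the mouse!",
        "Here is a simple vegan lasagna recipe...",
    ],
})
example.head()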

Setting Up Future AGI Evaluation Client

from getpass import getpass
from fi.evals import EvalClient

evaluator = EvalClient(
    getpass("Enter your Future AGI API key: ")
)

You can get the API key and secret key from your Future AGI account.
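
If you prefer not to type the key interactively every time, one common alternative is to read it from an environment variable; the variable name FUTURE_AGI_API_KEY below is just an example, not an SDK convention:

import os

# Assumes you exported the key beforehand, e.g. in your shell:
#   export FUTURE_AGI_API_KEY="..."
api_key = os.environ.get("FUTURE_AGI_API_KEY")
if api_key is None:
    api_key = getpass("Enter your Future AGI API key: ")

evaluator = EvalClient(api_key)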

Evaluation of Agent’s Function Calling Capabilities

After setting up the evaluation client with your API key, we can initialize our function calling evaluation module. This specialized module helps assess how well our AI agent handles function calls and parameter passing. Let's look at how we implement this evaluation:

from fi.evals import LLMFunctionCalling
from fi.testcase import TestCase

agentic_function_eval = LLMFunctionCalling(config={"model": "gpt-4o-mini"})

results_1 = []
for index, row in dataset.iterrows():
    test_case_1 = TestCase(
        input=row['input'],
        output=row['function_calling']
    )
    result_1 = evaluator.evaluate(eval_templates=[agentic_function_eval], inputs=[test_case_1])
    option_1 = result_1.eval_results[0].data[0]
    results_1.append(option_1)
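
To make the outcomes easier to scan, one simple option is to attach the collected labels back onto the dataset; this sketch assumes results_1 holds exactly one verdict per row, as produced by the loop above:

# Attach one function-calling verdict per row for side-by-side inspection
function_calling_report = dataset.copy()
function_calling_report["function_calling_eval"] = results_1
function_calling_report[["input", "function_calling", "function_calling_eval"]]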

Let's analyze the evaluation results from our function calling assessment. Our AI agent performed well in most cases, correctly handling multiple function calls and gracefully acknowledging its limitations. However, we identified a critical issue in the second test case, where the agent produced toxic content, highlighting the importance of implementing proper content filtering.

Evaluating Agent’s Toxicity and Prompt Adherence Capabilities

Now let's implement the toxicity and prompt adherence evaluation using Future AGI's SDK. Here's how we can set up these evaluations:

# Evaluating Prompt Adherence
from fi.evals import InstructionAdherence

agentic_instruction_eval = InstructionAdherence(config={"model": "gpt-4o-mini"})

results_2 = []
for index, row in dataset.iterrows():
    test_case_2 = TestCase(
        input=row['input'],
        output=row['output']
    )
    result_2 = evaluator.evaluate(eval_templates=[agentic_instruction_eval], inputs=[test_case_2])
    option_2 = result_2.eval_results[0]

    result_dict = {
        'value': option_2.metrics[0].value,
        'reason': option_2.reason,
    }
    results_2.append(result_dict)
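
To review the adherence scores alongside their explanations, you can load the collected dictionaries into a DataFrame; a small convenience step, assuming results_2 is populated as in the loop above and that the 'value' field is a sortable score:

# One row per test case: the adherence value plus the evaluator's reasoning
adherence_report = pd.DataFrame(results_2)
adherence_report["input"] = dataset["input"].values
adherence_report.sort_values("value").head()  # lowest-adherence cases first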

Next, let's run the toxicity evaluation using Future AGI's SDK. As the results will show, while most responses maintained appropriate tone and content, there was a concerning instance of toxic language in the vegan lasagna response. This highlights the importance of implementing robust content filtering and toxicity detection in AI systems.

from fi.evals import Toxicity
agentic_toxicity_eval = Toxicity(config={"model": "gpt-4o-mini"})
results_4 = []
for index, row in dataset.iterrows():
    test_case_4 = TestCase(
        input=row['output'],
    )
    result_4 = evaluator.evaluate(eval_templates=[agentic_toxicity_eval], inputs=[test_case_4])
    option_4 = result_4.eval_results[0]
    results_dict = {
        'toxicity': option_4.data[0],
    }
    results_4.append(results_dict)
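
Before interpreting the results, it helps to line each toxicity verdict up with the output it refers to; the exact label format returned in the data field may vary, so this sketch only displays it for inspection:

# Pair each toxicity verdict with the agent output that was evaluated
toxicity_report = pd.DataFrame(results_4)
toxicity_report["output"] = dataset["output"].values
toxicity_report  # inspect which rows were flagged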

Let’s take a look at the second row in the table: it fails both the toxicity and prompt adherence evaluations. The output in that row isn’t something the agent should be writing for the user. Let’s perform some additional evaluation tests to make sure the other data points are up to the mark for the agent.

Evaluating Agent’s Context Relevance and Tone

Let’s implement the Context Relevance and Tone evaluations to make sure that the agent’s behavior is relevant to the user’s requirements.

from fi.evals import Tone

# Initialize tone evaluator
agentic_tone_eval = Tone(config={"model": "gpt-4o-mini"})
results = []

# Evaluate tone for each output
for index, row in dataset.iterrows():
    test_case = TestCase(input=row['output'])
    result = evaluator.evaluate(eval_templates=[agentic_tone_eval], inputs=[test_case])
    results.append({'tone': result.eval_results[0].data or 'N/A'})
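
A quick way to summarise the tone results is to count how often each label appears; the conversion to string below is a defensive step, since the raw data field may come back as a list rather than a single label:

# Distribution of detected tones across all agent outputs
tone_report = pd.DataFrame(results)
tone_report["output"] = dataset["output"].values
tone_report["tone"] = tone_report["tone"].astype(str)  # labels may arrive as lists
tone_report["tone"].value_counts()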

We can see that the outputs are mostly neutral, which is what we require from our agent when speaking to the user: there shouldn’t be a bias unless instructed. For example, in the first row the user asked for a joke, which naturally changed the agent’s tone in the final output to accommodate the user’s instructions.

Now let’s implement the context relevance evaluation for our agent to check if the final output is what the user is looking for.

from fi.evals import ContextRelevance
agentic_context_eval = ContextRelevance(config={"model": "gpt-4o-mini", "check_internet": False})
results_5 = []
for index, row in dataset.iterrows():
    test_case_5 = TestCase(
        input=row['input'],
        context=row['output']
    )
    result_5 = evaluator.evaluate(eval_templates=[agentic_context_eval], inputs=[test_case_5])
    option_5 = result_5.eval_results[0]
    results_dict = {
        'context': option_5.metrics[0].value,
    }
    results_5.append(results_dict)
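
As before, pairing each relevance score with its query makes the weak spots easy to find; sorting in ascending order assumes that higher values mean more relevant, which may differ depending on how the metric is reported:

# Pair each context-relevance score with the query it was computed for
context_report = pd.DataFrame(results_5)
context_report["input"] = dataset["input"].values
context_report.sort_values("context").head()  # least relevant responses first (see assumption above)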

Here we find another anomaly, in the fourth data row, where the agent wasn’t able to fulfill the user’s request despite maintaining the expected tone and behavior. This tells us that we have to improve the agent’s capabilities for queries related to the population of a country. This issue wouldn’t have been caught by the other evaluations, which is why context relevance is another necessary evaluation for our use case.

Conclusion

Mastering AI agent evaluation is crucial for developing reliable and effective AI systems. By understanding and implementing proper evaluation techniques, we can build more trustworthy and capable AI agents that better serve their intended purposes.

We encourage you to explore our evaluation cookbook and start implementing these practices in your own AI development journey. Remember, thorough evaluation isn't just about finding flaws—it's about building better, more reliable AI systems for the future.

🔗 Ready to start? Access our evaluation cookbook here.
