
Evaluate Retell

Voice & Realtime

Score Retell responses against 70+ purpose-built evaluators — Groundedness, Context Relevance, Prompt Injection, Toxicity, function-call accuracy, and your own custom templates.


Prerequisites

Before you start

  • A working Retell app — local or already in production.
  • A free Future AGI account with FI_API_KEY and FI_SECRET_KEY.
  • Python 3.9+ / Node 18+ / Java 17+, depending on which SDK you're installing.
  • Trace input/output payloads (or a dataset) ready to score.
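
If you keep the keys in environment variables, read them once at startup instead of hard-coding them. A minimal sketch:

import os

# Assumes FI_API_KEY and FI_SECRET_KEY are exported in your shell or CI
FI_API_KEY = os.environ["FI_API_KEY"]
FI_SECRET_KEY = os.environ["FI_SECRET_KEY"]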

Install

pip install futureagi

Evaluate recipe

from fi.evals import EvalClient
from fi.evals.templates import (
    ContextRelevance, Groundedness, PromptInjection, Toxicity
)

client = EvalClient(api_key="<FI_API_KEY>", secret_key="<FI_SECRET_KEY>")

# Reuse the trace input/output from your Retell run
user_query = "..."         # caller utterance from the trace input (placeholder)
retell_response = "..."    # agent reply from the trace output (placeholder)
retrieved_context = "..."  # chunks your RAG step retrieved (placeholder)

result = client.evaluate(
    eval_templates=[ContextRelevance(), Groundedness(), PromptInjection(), Toxicity()],
    inputs=[{
        "input": user_query,
        "output": retell_response,
        "context": retrieved_context,
    }],
)

print(result.eval_results)
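
The same call scales to a batch: `inputs` is already a list, so you can score a whole dataset of Retell turns in one request. A minimal sketch, assuming each row carries the same three keys (the rows here are illustrative):

rows = [
    {"input": "What are your opening hours?",
     "output": "We're open 9 to 5, Monday to Friday.",
     "context": "Hours: 9am-5pm, Mon-Fri."},
    {"input": "Cancel my order.",
     "output": "Done, order #123 is cancelled.",
     "context": "Order #123: status active."},
]

result = client.evaluate(
    eval_templates=[Groundedness()],
    inputs=rows,
)

print(result.eval_results)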

What Future AGI captures

Evaluate fields you'll see in the dashboard

  • Run any of the 70+ Future AGI evaluator templates against trace input/output

  • Score in real time on production spans, in CI on a dataset, or as a guardrail before a response is sent (see the sketch after this list)

  • Custom evaluators via the builder API — heuristic, LLM-as-judge, or fine-tuned Turing models

  • Eval results land back on the originating trace as searchable attributes
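
As a concrete example of the guardrail pattern, run the safety templates on a candidate response before Retell speaks it. A minimal sketch, reusing the `client` from the recipe above; `is_flagged` is a hypothetical helper you should map to however your SDK version exposes per-template scores on `eval_results`:

from fi.evals.templates import PromptInjection, Toxicity

def is_flagged(eval_result) -> bool:
    # Hypothetical helper: adapt to how your SDK version exposes
    # a verdict (e.g. a fail label or a score below your threshold).
    ...

def safe_to_send(user_query: str, candidate_response: str) -> bool:
    """Run safety evals before the response reaches the caller."""
    result = client.evaluate(
        eval_templates=[PromptInjection(), Toxicity()],
        inputs=[{"input": user_query, "output": candidate_response}],
    )
    return not any(is_flagged(r) for r in result.eval_results)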

Common gotchas

Read these before you ship

  1. Eval templates expect specific input keys — check the template signature in `fi.evals.templates`.

  2. For RAG evaluators, pass the retrieved chunks as `context`, not the full document (see the sketch after this list).

  3. LLM-as-judge templates count against your eval-model token budget — switch to Turing flash for high-volume runs.
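
For gotcha 2, joining the retriever's top chunks is usually enough. A minimal sketch, where `retriever_hits` is a hypothetical list of chunk dicts from your RAG step:

# retriever_hits is hypothetical; substitute your retriever's output
retrieved_context = "\n".join(hit["text"] for hit in retriever_hits)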

Next: chain it with the other recipes

Evaluate is the first step. Most teams add an evaluator in the same week and start optimising or simulating once they have a baseline. Each recipe takes minutes to wire up.