How to Build a Self-Improving AI Agent Pipeline Using Open Source: Simulate, Evaluate, and Optimize

Build a self-improving AI agent pipeline using open-source Simulate, Evaluate, and Optimize SDKs that catch tool-call bugs and rewrite your prompt automatically.


Why Does Your Eval Suite Pass at 100% While Your Agent Fails Silently in Production?

You build an AI agent, you start evaluating, and your eval suite passes at 100%. That probably means your eval suite is broken, and I want to walk you through why.

I had this exact moment two weeks after shipping a voice agent that handled refund requests. Support pinged me. The agent had issued a refund to a customer who shouldn’t have qualified.

So I pulled the trace expecting something obvious: an exception, a schema violation, a hallucinated parameter. Nothing. The agent just called issue_refund(user_id, amount) without first running check_account_age and check_plan_type.

My system prompt said to verify both. In staging, it always did.

After listening to the call I figured it out. The user was angry. Voice tone escalated, sentences shortened, the demand repeated.

The empathy clauses in my prompt (“acknowledge frustration, prioritize resolution”) quietly outweighed the verification clauses (“always verify before issuing”). The model did exactly what I wrote, just not in the order I needed.

This is a tool-call sequence bug, and I think it’s the most common production failure mode for tool-using voice agents. Static unit tests can’t catch it because the failure needs sentiment-loaded input that pushes the model into a different reasoning path.

I have shipped this bug class three times across different agents, and the team always blames the LLM. The LLM is fine. The system prompt is the bug, and you can’t find it by reading transcripts.

So what do you actually need?

A self-improving agent pipeline. Not a one-shot prompt fix, but a system that catches its own failures, learns from them, and rewrites itself between deploys.

The agent doesn’t get smarter, the loop around it does. That’s the only kind of “self-improvement” that holds up in production.

The way I built this is a closed loop with three stages. You generate emotionally-loaded synthetic users at scale, score their transcripts on tool-call sequence (not just final answers), and rewrite the prompt automatically when the sequence drifts.

That’s exactly what the FutureAGI open-source stack composes. simulate-sdk drives persona-driven voice users at your live agent over LiveKit and captures the full conversation plus tool-call trace.

ai-evaluation scores those transcripts against function_call_accuracy, parameter_validation, task_completion, and 50+ other metrics. And agent-opt consumes the failure subset and rewrites the system prompt using algorithms like ProTeGi, Bayesian Search, or GEPA.

| Approach | Coverage | Speed per Iteration | Reproducible | Cost |
| --- | --- | --- | --- | --- |
| Manual call QA | Sample-only | Days | No | High (human time) |
| Static unit tests | Brittle | Minutes | Yes | Low but limited |
| Simulate → Evaluate → Optimize | Programmatic, broad | Hours | Yes | Mostly LLM tokens |

By the end of this guide, your refund agent’s verification rate moves from a 62% baseline to over 95% across a thousand-persona test set.

Let me walk you through how the three stages actually connect, because the architecture is the whole game here.

How Does the Simulate, Evaluate, and Optimize Loop Make an AI Agent Self-Improving?

The thing that took me a while to internalize is that these are three stages with one rule: every stage hands the next stage structured data, not human notes.

The moment a human is in the loop summarizing transcripts or labeling failures, the loop stops being self-improving.

| Stage | Library | What It Produces |
| --- | --- | --- |
| Simulate | simulate-sdk | Transcripts + audio + tool-call traces |
| Evaluate | ai-evaluation | Per-call scores + failure flags |
| Optimize | agent-opt | A new prompt that scores higher than v1 |
| Trace (cross-cutting) | traceAI | OpenTelemetry spans for staged + prod traffic |

Walking through it: stage one runs hundreds of personas at your agent. Stage two scores each call on the metrics that matter for the refund flow. Stage three reads the failures, treats them as a training set, and uses prompt-optimization algorithms to find a prompt that would have passed those calls. Then stage one runs again with the new prompt and you measure whether you actually moved the needle.
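Before any SDK code, here’s the whole loop as schematic Python. run_simulate, score_and_filter, and rewrite_prompt are stand-ins for the stage code we build below, and the convergence threshold is my own choice, not an SDK default:

PASS_THRESHOLD = 0.95  # my convergence bar: ship once 95% of calls pass

def improvement_loop(prompt_path="prompts/v1.txt", max_iters=5):
    for _ in range(max_iters):
        report = run_simulate(prompt_path)               # stage 1: personas vs. live agent
        failures, pass_rate = score_and_filter(report)   # stage 2: scores + failure subset
        if pass_rate >= PASS_THRESHOLD:
            break                                        # converged on this iteration
        prompt_path = rewrite_prompt(prompt_path, failures)  # stage 3: writes the next prompt
    return prompt_path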

Why does this converge instead of chasing its own tail? Two reasons.

First, the failure set is real signal: actual transcripts of model behavior under varied input distributions, not synthetic preference data.

Second, the optimizer is grounded by the same evaluator stage two used. No reward-model drift, no Goodhart’s law gap.

Every loop tightens the prompt against the exact failure modes the previous loop surfaced, and the cost of a new iteration drops as the failure set shrinks.

One thing worth flagging before code. Every artifact in the FutureAGI dashboard (datasets, eval templates, prompts, traces) has both a UI and an SDK that share the same backend, so you can switch between code and dashboard mid-pipeline without re-wiring anything.

Let’s actually build stage one.

Stage 1: Simulate (Drive Synthetic Users at the Live Agent)

Step one is fake users, but smart fake ones. The Agent Simulate SDK connects a scripted synthetic customer to your deployed agent inside a LiveKit room and records the conversation end to end.

You define who the customer is, what they want, and what success looks like. Then it runs the call against your live agent.

The trick is picking the right personas. I made the mistake on my first run of using “polite user with valid request” as my baseline, and of course the agent passed every one.

The whole point is to probe the variants that broke the tool-call sequence in production, which means the frustrated edge cases.

For the refund agent, that’s a frustrated user 25 days into an annual plan who raises their voice and should still succeed via the gating tools. A frustrated user 45 days in on a monthly plan, threatening churn, who must get denied politely but only after gating runs.

An ambiguous user who keeps switching between “refund” and “downgrade.” A compound request like refund plus account deletion plus invoice copy.

And the policy edge, usage overages on an annual plan. Five categories like that, twenty personas each, and you have a real test set.

You don’t write them by hand, that’s the whole point. The simulate-sdk has a synthetic data generator that takes one seed persona and expands it into N variants, mutating the situation, the mood intensity, the interruption patterns, and the demand phrasing while keeping the expected outcome stable.

You give it “frustrated user, 45 days in on monthly plan, demands refund” and it gives you back 20 versions of that conversation, each one phrased differently enough that your agent can’t pattern-match its way out.

The reason this matters is coverage. Hand-written personas drift toward the failure modes you already imagined, but synthetic generation gives you the ones you didn’t think of, which is where the production bugs actually live.
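I won’t reproduce the generator’s exact API here, but the shape of the expansion is easy to sketch in plain Python. The cross-product below is my own illustration built on the same Persona class the pipeline imports later; the real generator uses an LLM to mutate phrasing rather than a fixed template:

import itertools
from fi.simulate import Persona

def expand_seed(situation, outcome, moods, phrasings):
    # Cross mood intensity with demand phrasing; the expected outcome
    # stays fixed so every variant grades against the same contract.
    return [
        Persona(
            persona={"name": f"synthetic-{i}", "mood": mood},
            situation=f"{situation} {phrasing}",
            outcome=outcome,
        )
        for i, (mood, phrasing) in enumerate(itertools.product(moods, phrasings))
    ]

variants = expand_seed(
    "45 days into a monthly plan, demands a full refund.",
    "Agent denies politely, cites the 30-day rule.",
    moods=["irritated", "furious", "resigned", "threatening churn"],
    phrasings=["Interrupts twice.", "Repeats the demand verbatim.",
               "Opens with a cancellation threat.", "Escalates mid-call.",
               "Demands a supervisor."],
)  # 4 moods x 5 phrasings = 20 variants from one seed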

Now the actual pipeline code. Install everything in one go:

pip install agent-simulate ai-evaluation agent-opt python-dotenv

Drop your credentials in a .env file:

LIVEKIT_URL=wss://your-livekit-server.com
LIVEKIT_API_KEY=your_api_key
LIVEKIT_API_SECRET=your_api_secret
OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key
FI_API_KEY=your_fi_api_key
FI_SECRET_KEY=your_fi_secret_key

And save this as pipeline.py. We’re going to keep appending to this same file as we build out evaluate and optimize, so by the end you have one runnable script:

import asyncio, os, pickle
from dotenv import load_dotenv
from fi.simulate import AgentDefinition, Scenario, Persona, TestRunner

load_dotenv()

async def run_simulate():
    # Point the SDK at your live LiveKit deployment.
    # The agent under test reads its system prompt from disk so we can
    # swap in v2 later without changing this file.
    agent = AgentDefinition(
        name="refund-agent",
        url=os.environ["LIVEKIT_URL"],
        room_name="billing-support",
        system_prompt=open("prompts/v1.txt").read(),
    )

    # Each Persona is a tuple of (who they are, what they're asking,
    # what success looks like). The runner uses these to drive both
    # the synthetic user and the pass/fail check.
    scenario = Scenario(
        name="Refund Policy Edge Cases",
        dataset=[
            Persona(
                persona={"name": "Maya", "mood": "frustrated"},
                situation="45 days into a monthly plan, demands a full refund.",
                outcome="Agent denies politely, cites the 30-day rule.",
            ),
            Persona(
                persona={"name": "Devon", "mood": "calm"},
                situation="22 days into annual plan, wants a clean refund.",
                outcome="Agent verifies, issues refund.",
            ),
            # 20 to 100 more personas covering the variants above
        ],
    )

    runner = TestRunner()
    report = await runner.run_test(agent, scenario, record_audio=True)
    # Persist to disk so stage two can score it without re-running calls.
    os.makedirs("reports", exist_ok=True)
    pickle.dump(report, open("reports/v1.pkl", "wb"))
    return report

if __name__ == "__main__":
    report = asyncio.run(run_simulate())
    print(f"Ran {len(report.results)} calls")
    print(f"Sample tool calls: {report.results[0].tool_calls}")

Run python pipeline.py and you’ll see something like this for a failing call:

Ran 100 calls
Sample tool calls: ['issue_refund']

# Inspect one failing case:
> persona: Maya (frustrated, monthly plan, 45 days)
> transcript:
   Maya:  I want my money back NOW. This is ridiculous.
   Agent: I completely understand your frustration. I'm processing
          your refund right now. You'll see it in 3-5 business days.
> tool_calls:    ['issue_refund']
> duration_sec:  47.2

Look at that tool_calls list. One entry. The gating tools never fired.

Right there, that’s the bug, in machine-readable form. No transcript reading, no human review, just a list with the wrong members.
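And because it’s just a list with the wrong members, you can pre-screen every trace with deterministic Python before any LLM judge gets involved. This helper is my own, not part of any SDK:

GATING = {"check_account_age", "check_plan_type"}

def gating_respected(tool_calls):
    """True iff issue_refund never fires before both gating tools have run."""
    seen = set()
    for call in tool_calls:
        if call == "issue_refund" and not GATING <= seen:
            return False
        seen.add(call)
    return True

assert not gating_respected(["issue_refund"])                  # Maya's call
assert gating_respected(["check_account_age", "check_plan_type",
                         "issue_refund"])                      # the contract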

We have a serialized report on disk now, and we’re ready to score every call programmatically.

Stage 2: Evaluate (Turn Conversations into Pass/Fail Signals)

Now we have transcripts. By themselves they’re useless.

We need to turn each one into a structured failure flag so stage three can consume them as a training set. That means picking metrics that map to your tool-call contract, not just the final answer.

| Metric | What It Catches | Source |
| --- | --- | --- |
| function_call_accuracy | Did the agent call the right tools, in the right order? | ai-evaluation |
| parameter_validation | Were tool arguments well-formed and policy-valid? | ai-evaluation |
| task_completion | Did the agent reach the stated outcome? | ai-evaluation |
| audio_quality | Latency, interruption rate, clarity | simulate-sdk helper |
| verification_sequence (custom LLM judge) | Was issue_refund only called after both gating tools? | CustomLLMJudge |

The first three ship out of the box in ai-evaluation, and function_call_accuracy is the workhorse for this kind of bug. Append this to pipeline.py:

from fi.simulate.evaluation import evaluate_report
import pickle

# Load the report we serialized in stage one.
report = pickle.load(open("reports/v1.pkl", "rb"))

# Each eval_spec maps a metric to the fields on the report it should read.
# Templates are looked up by name from the ai-evaluation registry.
evaluated = evaluate_report(
    report,
    eval_specs=[
        {"eval_templates": ["function_call_accuracy"],
         "template_inputs": {"tool_calls": "tool_calls",
                             "expected": "expected_tool_sequence"}},
        {"eval_templates": ["task_completion"],
         "template_inputs": {"transcript": "transcript"}},
    ],
)

# Print three rows so we can sanity-check the scores.
for r in evaluated.results[:3]:
    name = r.persona.persona["name"]
    for e in r.evaluation_results:
        print(f"{name:8} {e.eval_template_name:25} {e.score:.2f}")

Output:

Maya     function_call_accuracy    0.00
Maya     task_completion           0.10
Devon    function_call_accuracy    1.00
Devon    task_completion           0.90
Priya    function_call_accuracy    0.00
Priya    task_completion           0.20

Maya and Priya both fail, and both are the frustrated-monthly-plan variants.

The pattern is already visible after three rows. That’s exactly what you want from your eval pipeline, signal that’s interpretable to a human even though it was produced entirely by code.

There’s one gap though. function_call_accuracy checks set membership, not order.

If the agent called all three gating tools but in the wrong order, you’d score 1.0 and miss the failure entirely. So I write a small custom judge that reads the trace directly:

from fi.evals import evaluate

def score_sequence(transcript, tool_trace):
    # The LLM judge reads the tool trace and the transcript together,
    # then returns a score plus a one-sentence reason. The reason
    # matters because it becomes input for the optimizer in stage 3.
    return evaluate(
        prompt=("Audit this refund call's tool-call trace.\n"
                "Required order: check_account_age -> check_plan_type -> "
                "issue_refund OR deny_refund.\n\n"
                "Trace: {output}\nTranscript: {context}\n\n"
                'Return JSON: {"score": 0-1, "reason": "..."}. '
                "Score 0 if issue_refund fired without both gating tools."),
        output=str(tool_trace),
        context=transcript,
        engine="llm",
        model="gemini/gemini-2.5-flash",
    )

# maya_transcript is the failing call's transcript from the v1 report.
maya_transcript = report.results[0].transcript
result = score_sequence(maya_transcript, ["issue_refund"])
print(result.score, "->", result.reason)
0.0 -> issue_refund fired without check_account_age or check_plan_type.
       Hard violation of the gating contract; agent prioritized empathy
       over verification.

The judge tells you exactly why the call failed, in one sentence.

Notice the reason it gives is something you could paste straight into a postmortem. That’s not a coincidence. It’s exactly the kind of signal we’ll feed into the optimizer in stage three.

If you’d rather skip the wrapper code, the FutureAGI dashboard runs the same evaluators through a UI. Import the simulate report as a dataset, pick function_call_accuracy and task_completion from the template gallery, and the platform scores every row with traces, latency, and cost stacked side by side.

The Failed Cases view filters straight to rows below your threshold, which is the slice you need next.

The last piece of stage two is filtering. We want a clean list of just the calls that failed, in the format the optimizer expects:

failures = []
for r in evaluated.results:
    # Anything below 0.7 on any metric counts as a failure for our purposes.
    bad = [e for e in r.evaluation_results if e.score < 0.7]
    if bad:
        failures.append({
            "input": r.persona.situation,
            "expected_outcome": r.persona.outcome,
            "actual_transcript": r.transcript,
            "actual_tool_calls": r.tool_calls,
            "failed_evals": [e.eval_template_name for e in bad],
        })

print(f"{len(failures)} / {len(evaluated.results)} calls failed")
31 / 100 calls failed

The first time I ran this on my own agent, I expected maybe a 5% failure rate. It came back at 31%.

Static QA had missed every single one of those failures because none of them tripped a unit test, but every single one violated the policy.

This list isn’t sample data and it isn’t training data we’d have to label. It’s the dataset stage three consumes directly.

Stage 3: Optimize (Rewrite the Prompt from Real Failures)

This is the stage I’m most opinionated about, so let me explain why I pick the optimizer I pick.

agent-opt ships six algorithms. Three of them matter for what we’re doing:

| Optimizer | Best For | Why |
| --- | --- | --- |
| ProTeGi | Iterative refinement on a failure set | Treats LLM critiques as textual gradients |
| BayesianSearchOptimizer | Few-shot example selection | Optuna-driven hyperparameter tuning |
| GEPAOptimizer | Multi-objective (policy + tone + brevity) | Genetic Pareto evolution |

I always start with ProTeGi for system-prompt rewrites driven by a failure set.

ProTeGi reads the failed conversations, generates what they call a textual gradient (a written critique of why the prompt failed), and rewrites the prompt to address that critique. The gradients read like senior-engineer PR comments, not like noise.

Append this to pipeline.py:

from fi.opt.optimizers import ProTeGi
from fi.opt.generators import LiteLLMGenerator
from fi.opt.base.evaluator import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.llm import LiteLLMProvider
from fi.evals.metrics import CustomLLMJudge

# The judge tells the optimizer whether a candidate prompt performed
# better. Same grading criteria as our stage 2 sequence judge, so
# optimizer and evaluator stay aligned (this is what prevents
# reward-model drift in the loop).
provider = LiteLLMProvider()
judge = CustomLLMJudge(
    provider=provider,
    config={
        "name": "verification_sequence_judge",
        "grading_criteria": (
            "Score 1.0 if the agent called check_account_age and "
            "check_plan_type before issue_refund, and stayed calm under "
            "pressure. Score 0.0 if issue_refund fired without both "
            "gating tools, regardless of how empathetic the response sounded."
        ),
    },
    model="gemini/gemini-2.5-flash",
    temperature=0.2,
)

evaluator = Evaluator(metric=judge)
# Tell the optimizer which fields on each failure record to feed in.
mapper = BasicDataMapper(key_map={
    "response": "actual_transcript",
    "expected_response": "expected_outcome",
})
# The "teacher" is the model that drafts the gradient and rewrites
# the prompt. GPT-4o is a strong default for this role.
teacher = LiteLLMGenerator(model="gpt-4o", prompt_template="{prompt}")

# num_gradients = how many critiques per round. beam_size = how many
# candidate prompts to keep alive between rounds. Higher = more
# thorough and more expensive.
optimizer = ProTeGi(teacher_generator=teacher, num_gradients=4, beam_size=4)

result = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=mapper,
    dataset=failures,
    initial_prompts=[open("prompts/v1.txt").read()],
)

print(f"Final score: {result.final_score:.3f}")
open("prompts/v2.txt", "w").write(result.best_generator.get_prompt_template())

Run it and you’ll see something like this across the rounds:

[round 1] best_score=0.61  candidates=4
[round 2] best_score=0.74  gradient: "agent skips gating under emotional input"
[round 3] best_score=0.89  gradient: "empathy must precede, not replace, verification"
[round 4] best_score=0.94  gradient: "explicit refusal pattern needed for usage overages"
Final score: 0.940

Look at those gradients. “Empathy must precede, not replace, verification.”

That’s the bug we diagnosed in section one, surfaced automatically by reading the failure transcripts. I didn’t write that critique, ProTeGi did. And then it used that critique to rewrite the prompt.

This is the moment where the pipeline actually becomes self-improving. v1 wrote v2 without you touching the prompt file.

The same loop will write v3 the next time the evaluator surfaces enough failures. You don’t need to know the exact rewrite, you just need to verify it’s better.

Re-run stage one with prompts/v2.txt, evaluate again, and stack the numbers:

| Metric | v1 (baseline) | v2 (optimized) | Delta |
| --- | --- | --- | --- |
| function_call_accuracy | 0.62 | 0.96 | +0.34 |
| verification_sequence | 0.58 | 0.94 | +0.36 |
| task_completion | 0.71 | 0.92 | +0.21 |
| Avg latency to first tool call | 1.8s | 1.6s | -0.2s |
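For completeness, here’s roughly how I compute that comparison: re-run stage two’s evaluate_report over both serialized reports and average per metric. The aggregation helper is mine, and it assumes the v2 re-run was saved to reports/v2.pkl:

import pickle
from statistics import mean
from fi.simulate.evaluation import evaluate_report

eval_specs = [  # same spec shape as stage two
    {"eval_templates": ["function_call_accuracy"],
     "template_inputs": {"tool_calls": "tool_calls",
                         "expected": "expected_tool_sequence"}},
    {"eval_templates": ["task_completion"],
     "template_inputs": {"transcript": "transcript"}},
]

def avg_scores(path):
    evaluated = evaluate_report(pickle.load(open(path, "rb")), eval_specs=eval_specs)
    by_metric = {}
    for r in evaluated.results:
        for e in r.evaluation_results:
            by_metric.setdefault(e.eval_template_name, []).append(e.score)
    return {name: mean(scores) for name, scores in by_metric.items()}

v1, v2 = avg_scores("reports/v1.pkl"), avg_scores("reports/v2.pkl")
for metric in v1:
    print(f"{metric:25} {v1[metric]:.2f} -> {v2[metric]:.2f} ({v2[metric] - v1[metric]:+.2f})")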

The new prompt added two guardrails the optimizer pulled out of the failure transcripts: a hard rule that issue_refund cannot fire without both gating tools succeeding, and an instruction to acknowledge frustration verbally while still running gating first.
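To give you the flavor of those guardrails, they read roughly like this. This is a condensed paraphrase, not the optimizer’s verbatim output:

HARD RULE: Never call issue_refund unless BOTH check_account_age and
check_plan_type have returned in this conversation. No exception for
upset callers.

When the caller is frustrated: acknowledge the frustration in one
sentence, say you are checking their account, then run the gating
tools. Empathy comes first in the transcript, never first in the
tool order.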

Honestly, the prompt agent-opt produced was longer and uglier than what I would have written by hand. It also worked better. After watching this happen on three different agents, I’ve stopped hand-tuning prompts entirely.

Conclusion: Close the Loop in CI/CD

So we have v2 and the numbers look good. Shipping v2 once isn’t really the point though. The point is the loop keeps running.

Before you push, re-run stage one with the same scenario set against v2 to confirm it doesn’t regress on what v1 already handled correctly. If v2 wins on the failure set without breaking the rest, ship it.
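In code, that regression gate is a dozen lines. Pairing the runs by persona name is my own convention, and evaluated_v1 / evaluated_v2 are the stage-two outputs for each prompt version:

def regressions(v1_evaluated, v2_evaluated, floor=0.7):
    """Names of personas that v1 passed on every metric but v2 now fails."""
    v1_pass = {r.persona.persona["name"]
               for r in v1_evaluated.results
               if all(e.score >= floor for e in r.evaluation_results)}
    return [r.persona.persona["name"]
            for r in v2_evaluated.results
            if r.persona.persona["name"] in v1_pass
            and any(e.score < floor for e in r.evaluation_results)]

broken = regressions(evaluated_v1, evaluated_v2)
if broken:
    raise SystemExit(f"v2 regressed on: {broken}")  # block the deploy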

Then you automate. The way I do it is to wire the simulate run into a GitHub Action, persist evaluation results to Postgres or S3, and trigger agent-opt only when failure counts cross a threshold so you don’t burn tokens on noise.
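The trigger itself is small. In this sketch, score_and_filter, persist_results, and run_optimizer are stand-ins wrapping the stage code above, and the 10% threshold is my choice, not a default:

import asyncio

FAILURE_TRIGGER = 0.10  # only pay for an optimization run when >10% of calls fail

def nightly_job():
    report = asyncio.run(run_simulate())             # stage 1, as above
    failures, pass_rate = score_and_filter(report)   # stage 2, as above
    persist_results(failures)                        # Postgres or S3, your choice
    if 1 - pass_rate > FAILURE_TRIGGER:
        run_optimizer(failures)                      # stage 3, writes the next prompt file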

The last piece is adding traceAI in production to instrument the live agent. When a real call gets a low function_call_accuracy score in production, you push it back into the next simulate run as a new persona.

That means production failures seed your test set instead of just your incident reports.
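Promoting a failed prod call is mostly a field mapping. The trace keys below are illustrative, not traceAI’s actual schema, so adapt them to whatever your exporter records:

from fi.simulate import Persona

def persona_from_prod_trace(trace):
    # Illustrative keys -- map them to the fields your traceAI
    # exporter actually records for the call.
    return Persona(
        persona={"name": f"prod-{trace['call_id']}",
                 "mood": trace.get("detected_sentiment", "frustrated")},
        situation=trace["opening_utterance"],
        outcome=trace["expected_outcome_per_policy"],
    )

# low_scoring_traces: prod calls that scored below your eval threshold.
new_personas = [persona_from_prod_trace(t) for t in low_scoring_traces]
# Append new_personas to the next Scenario's dataset before the simulate run.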

The Observe dashboard runs scheduled Eval Tasks against live traffic and surfaces regressions on function_call_accuracy the moment they appear. Failing traces are one click away from being promoted into your next dataset, which closes the prod-to-test loop without glue code.

Most teams stop at “we have monitoring.” Don’t.

That last piece is where the pipeline stops being a one-time refactor and becomes self-improving in the operational sense. Every prod failure seeds a new test case, every new test case can become a new gradient, and every gradient nudges the prompt closer to your policy.

The moment this clicked for me was watching a Monday-morning failure get auto-promoted into Tuesday’s test suite and drive Wednesday’s prompt fix. No standup, no Jira ticket, no human in the loop until the diff hit code review.

The pieces are open source: simulate-sdk, ai-evaluation, agent-opt, and traceAI.

The hard work isn’t the code. It’s deciding which metrics matter for your domain, and being willing to let an optimizer rewrite your prompt when the data says it should.

Frequently Asked Questions About Building a Self-Improving AI Agent Pipeline

What is a self-improving AI agent pipeline and how does it work?

A self-improving AI agent pipeline is a closed-loop system that runs synthetic user conversations against your live agent, scores the failures with structured evaluators, and feeds those failures into a prompt optimizer that rewrites the system prompt automatically.

How do you test an AI voice agent for tool-call sequence bugs before production?

You drive persona-driven synthetic users at the live agent over LiveKit using simulate-sdk, then score every conversation with function_call_accuracy from ai-evaluation to catch any call where the agent skipped or reordered required tools. The combination catches sentiment-loaded failures (like an agent skipping verification under emotional pressure) that static unit tests cannot reproduce.

What is ProTeGi and how does it optimize LLM system prompts automatically?

ProTeGi is a prompt-optimization algorithm in agent-opt that reads failed conversations, generates a textual gradient (a written critique of why the prompt failed), and rewrites the prompt to address that critique across multiple beam-search rounds. It’s the cleanest fit for system-prompt rewrites driven by a real failure dataset, because the gradients read like senior-engineer PR comments rather than noise.

How do you write a custom LLM judge for evaluating tool-call ordering?

Use the evaluate() function from fi.evals and pass a prompt that defines the required tool order, the actual trace, and asks for a JSON response with score and reason fields. The reason field matters more than the score, because it gives you a one-sentence postmortem you can feed directly into your prompt optimizer in the next stage of the loop.

Is the FutureAGI stack open source, and which repos do I need?

Yes, all four pieces are open source under MIT: simulate-sdk for persona-driven agent testing, ai-evaluation for scoring with 50+ metrics, agent-opt for ProTeGi/Bayesian/GEPA optimization, and traceAI for OpenTelemetry production tracing. You install them as a single pipeline and they share the same evaluator across stages, which is what makes the loop converge instead of drifting.
