
Self-Learning Agents in 2026: How to Build a Self-Improving Agent Loop with Future AGI

Self-learning AI agents in 2026: build the eval-and-optimize loop with Future AGI fi.opt optimizers, fi.evals scoring, and traceAI tracing in production.


“Self-learning agent” usually means one of two things, and most posts conflate them. This guide is about the practical one: a static base model wrapped in a loop that rewrites its own prompt and few-shot demos from production failures. The model weights do not change. The prompt does. Below is the code we run, the libraries it depends on, and the failure modes that bit us first.

Install:

pip install ai-evaluation traceai-openai agent-opt fi-simulate

Set the env vars before importing anything:

export FI_API_KEY=...
export FI_SECRET_KEY=...
export OPENAI_API_KEY=...

The six stages, at a glance

Stage | What runs | Package
1. Trace every turn | OpenTelemetry span per agent.turn | traceai-* + fi_instrumentation
2. Score every turn | LLM-as-judge or deterministic metric | fi.evals.evaluate + CustomLLMJudge
3. Surface failures | Filter spans by score, dedupe by input | Agent Command Center query
4. Optimize prompt | Textual gradient or Bayesian search | fi.opt.optimizers
5. Regression gate | Persona + scenario replay | fi.simulate.TestRunner
6. Gated rollout | Traffic split with auto-rollback | Agent Command Center

Skip the marketing layer: stages 1, 2, 4, and 5 are Apache 2.0 Python libraries you can run locally without an account. Stages 3 and 6 are the hosted dashboard pieces (Agent Command Center), useful when you want the failure query, the version history, and the canary rollout without writing the glue yourself.

Why the loop, not weight updates

  • Weight updates (RLHF on traces, RLAIF, DPO on preference data) are slow, expensive to roll back, and require a training pipeline most teams do not own. Useful for frontier labs and a handful of fine-tuned task models. Not the agent default.
  • Loop updates rewrite the prompt, demos, and routing against a failure dataset. Static weights. Reversible by reverting one file. This is what fi.opt is built for, and it is what production agent teams actually run in 2026.

Stage 1: Trace every turn with traceAI

The loop’s input is span-level traces. Without them you are guessing. traceAI is Apache 2.0 OpenTelemetry instrumentation (pip install traceai-openai or the framework variant you use, repo: github.com/future-agi/traceAI).

from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="self-learning-agent-prod",
)
tracer = FITracer(tracer_provider.get_tracer(__name__))


def handle_turn(user_input, session_id):
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("session.id", session_id)
        span.set_attribute("input.value", user_input)
        result = run_agent(user_input)
        span.set_attribute("output.value", result.text)
        return result

Two gotchas worth knowing before you wire this in:

  • register() adds a span processor to the global OTel TracerProvider. If you already have your own OTel setup, register against your existing provider instead of letting register() overwrite your setup. The tracer_provider you get back is the same global one.
  • Spans are batched and shipped async. If your process exits before flush, the last few turns vanish. Call tracer_provider.force_flush() in your shutdown hook.
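
A minimal flush hook, assuming the tracer_provider from the snippet above; atexit covers scripts and workers, and most web frameworks expose an equivalent shutdown event to hang this on:

import atexit

# Flush spans still sitting in the batch processor before the process exits,
# so the last few turns make it out of the buffer.
atexit.register(lambda: tracer_provider.force_flush())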

Stage 2: Score every turn with fi.evals

Traces are not a learning signal until each turn has a score attached. fi.evals.evaluate is the one-line entry point; it runs against the hosted cloud evaluators (turing_flash for inline, turing_large for batch).

from fi.evals import evaluate

# Run on every turn in production
faithfulness = evaluate(
    "faithfulness",
    output=agent_output,
    context=retrieved_context,
)

task_completion = evaluate(
    "task_completion",
    output=full_transcript,
    expected="invoice issued",
)

toxicity = evaluate(
    "toxicity",
    output=agent_output,
)

Tier latencies (from the docs): turing_flash 1 to 2s, turing_small 2 to 3s, turing_large 3 to 5s. Use turing_flash async per turn so it does not block user-facing latency. Use turing_large offline over the full transcript when you need a tighter score for the failure query.
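
One way to keep per-turn scoring off the request path is a small background pool around the evaluate() calls from the block above. The pool and the helper name here are illustrative, and wiring the returned scores back onto the span is left to your own glue:

from concurrent.futures import ThreadPoolExecutor

from fi.evals import evaluate

# Illustrative background pool: scoring runs after the response has gone out,
# so inline evaluator latency never lands on the user.
_score_pool = ThreadPoolExecutor(max_workers=4)


def score_turn_async(agent_output, retrieved_context):
    def _run():
        evaluate("faithfulness", output=agent_output, context=retrieved_context)
        evaluate("toxicity", output=agent_output)

    _score_pool.submit(_run)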

For anything domain-specific the built-in metrics will not cover, write a CustomLLMJudge:

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

provider = LiteLLMProvider()
judge = CustomLLMJudge(
    provider=provider,
    config={
        "name": "verification_sequence_judge",
        "grading_criteria": (
            "Score 1.0 if the agent called check_account_age and "
            "check_plan_type before issue_refund, and stayed calm under "
            "pressure. Score 0.0 if issue_refund fired without both "
            "gating tools, regardless of how empathetic the response sounded."
        ),
    },
    model="gemini/gemini-2.5-flash",
    temperature=0.2,
)

The rubric judge becomes the loss function in stage 4. Vague rubric language produces a vague optimizer. Write it the way you would write a grading sheet, not a tagline. If you cannot draw a clean line between a 1.0 and a 0.0, the optimizer cannot either.

Stage 3: Surface the failure dataset

The failure dataset is just a query against your spans: every turn where the score is below threshold, deduped by input similarity, with the prompt version and tool calls preserved. The Agent Command Center at /platform/monitor/command-center runs this for you and exports JSONL. If you want to run it yourself, the same query works against any OTel backend that received the traceAI spans.
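
If you run the query yourself, the shape is roughly the following. This is a sketch over a JSONL export of scored spans; the field names (score, input, output, prompt_version, tool_calls) and the 0.7 threshold are assumptions to adapt to whatever your OTel backend stores, and the dedupe is a plain normalized-string key rather than true input similarity:

import json


def build_failure_dataset(spans_path, threshold=0.7):
    """Filter scored turns below threshold, dedupe by normalized input."""
    seen, failures = set(), []
    with open(spans_path) as f:
        for line in f:
            row = json.loads(line)
            if row["score"] >= threshold:
                continue
            key = " ".join(row["input"].lower().split())
            if key in seen:
                continue
            seen.add(key)
            failures.append({
                "input": row["input"],
                "actual_transcript": row["output"],
                "prompt_version": row["prompt_version"],
                "tool_calls": row["tool_calls"],
            })
    return failures

The actual_transcript key is chosen here to line up with the stage 4 data mapper; whatever names you pick, keep prompt_version on every row.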

Two things that matter more than the query:

  • Refresh weekly. A stale failure set means the optimizer keeps fixing problems your last deploy already fixed: the score on the dev split goes up, but real traffic does not improve.
  • Keep prompt version on every row. When the optimizer is mid-run you need to know which rows came from which prompt or you cannot tell improvement from drift.

Stage 4: Optimize the prompt with fi.opt.optimizers

This is the stage where v1 of the prompt produces v2 without anyone touching the prompt file. fi.opt.optimizers ships three algorithms. Use ProTeGi first.

from fi.opt.optimizers import ProTeGi
from fi.opt.generators import LiteLLMGenerator
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper

# 1. Wrap the rubric judge from stage 2 as the loss function
evaluator = Evaluator(metric=judge)

# 2. Map fields on the failure record to the evaluator inputs
mapper = BasicDataMapper(key_map={
    "response": "actual_transcript",
    "expected_response": "expected_outcome",
})

# 3. The teacher model drafts critiques and rewrites the prompt
teacher = LiteLLMGenerator(
    model="gpt-4o",
    prompt_template="{prompt}",
)

# 4. ProTeGi searches the prompt space with textual gradients
optimizer = ProTeGi(
    teacher_generator=teacher,
    num_gradients=4,
    beam_size=4,
)

# `failures` is the failure dataset exported in stage 3
result = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=mapper,
    dataset=failures,
    initial_prompts=[open("prompts/v1.txt").read()],
)

print(f"Final score: {result.final_score:.3f}")
open("prompts/v2.txt", "w").write(result.best_generator.get_prompt_template())

A typical ProTeGi run looks like this in the logs:

[round 1] best_score=0.61  candidates=4
[round 2] best_score=0.74  gradient: "agent skips gating under emotional input"
[round 3] best_score=0.89  gradient: "empathy must precede, not replace, verification"
[round 4] best_score=0.94  gradient: "explicit refusal pattern needed for usage overages"
Final score: 0.940

The gradients are written critiques of why the prompt failed, generated by the teacher model from the failure transcripts. The new prompt is the result of repeatedly addressing those critiques. The optimizer ran four rounds and produced a v2 prompt that scored 0.94 against the rubric, up from 0.61.

When to pick a different optimizer:

  • BayesianSearchOptimizer. Small library of candidate prompts or few-shot demo sets, looking for the best combination on a labelled set. Optuna under the hood, faster when the search space is parameterizable.
  • GEPAOptimizer. Multiple objectives at once (policy plus tone plus brevity, or accuracy plus cost). Returns a Pareto front instead of a single best.

Deeper algorithm survey in the top 10 prompt optimization tools 2025 post.

Stage 5: Gate the new prompt with fi.simulate

A higher score on the offline dataset is necessary but not sufficient. The new prompt has to hold up against personas and scenarios the offline dataset did not cover.

from fi.simulate import TestRunner, AgentInput, AgentResponse


def my_agent(payload: AgentInput) -> AgentResponse:
    output = run_agent_with_prompt(payload.text, prompt=open("prompts/v2.txt").read())
    return AgentResponse(text=output)


runner = TestRunner(
    agent=my_agent,
    personas=["impatient_user", "domain_expert", "adversarial_user"],
    scenarios=["happy_path", "ambiguous_query", "out_of_scope", "high_emotion"],
)
report = runner.run(n_turns_per_scenario=10)
print(report.summary())

The runner drives the agent through every (persona, scenario) pair, captures the conversation, scores each turn with the same evaluators from stage 2, and returns a structured report. The promotion rule is one line: if any pair scores below threshold, the new prompt does not ship. See the simulation docs for the full contract.
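
The gate can be that literal. Here is a sketch, with the caveat that the report.results and r.score attribute names are assumptions about the report object rather than the documented fi.simulate surface, so adapt them to what report.summary() actually exposes:

# Hypothetical attribute names: adapt `results` / `score` to the real report object.
THRESHOLD = 0.85

if any(r.score < THRESHOLD for r in report.results):
    raise SystemExit("v2 failed the persona/scenario gate, keeping v1")
print("v2 passed every persona/scenario pair, promote to stage 6")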

Stage 6: Gated rollout through the Agent Command Center

Stage 6 is a canary deployment, except the artifact is a prompt version instead of a code commit. Route 5 to 10 percent of traffic to the new prompt, watch the composite score, cost, latency, and alert thresholds, and either ramp or auto-rollback based on the live score. Agent Command Center stores every version with the failure dataset that produced it, the optimizer config, and the train/dev/test splits, which together form the audit trail regulated teams need.
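
Outside the dashboard, the split logic itself is small. A sketch of a deterministic session-hash canary follows; the 10 percent bucket and the prompt file paths are illustrative, and rollback is literally pointing the canary path back at v1:

import hashlib

CANARY_FRACTION = 0.10  # keep the new prompt on 5 to 10 percent of sessions


def prompt_for_session(session_id: str) -> str:
    # Deterministic bucketing: a session always sees the same prompt version,
    # so conversations stay coherent and the live score is attributable.
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    path = "prompts/v2.txt" if bucket < CANARY_FRACTION * 100 else "prompts/v1.txt"
    with open(path) as f:
        return f.read()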

Three failure modes that bit us first

Plan for these. They are not edge cases.

  • Judge collapse. Optimizer and judge calling the same base model means the optimizer can learn to write prompts the judge likes but humans do not. Fix: run the judge on a different model family than the system. If the agent is on gpt-5-2025-08-07, judge on claude-opus-4-7 or gemini-3.x.
  • Overfitting to a narrow failure dataset. The optimizer maximizes the score on whatever you handed it. Narrow dataset, brittle prompt. Fix: a train/dev/test split on the failures (see the sketch after this list), persona-scenario coverage in stage 5, and a human-reviewed holdout at the end.
  • Rubric drift. Change the CustomLLMJudge config mid-loop and prompt versions stop being comparable. Fix: version the rubric the way you version the prompt. Re-baseline only when the rubric changes.
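
For the split itself, a minimal sketch over the failures list from stage 3; the 70/15/15 ratios and the fixed seed are illustrative choices, not anything the library mandates:

import random

# Shuffle once with a fixed seed so optimizer runs stay comparable,
# then cut the failure dataset into train / dev / test.
random.seed(7)
random.shuffle(failures)

n = len(failures)
train = failures[: int(0.70 * n)]
dev = failures[int(0.70 * n): int(0.85 * n)]
test = failures[int(0.85 * n):]

# Optimize on train, pick the winning candidate on dev, report the final score on test.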

Repos and docs

Frequently asked questions

What is a self-learning agent in 2026?
A self-learning agent is one whose prompts, few-shot demos, or policies are rewritten automatically by an optimizer based on a continuously refreshed dataset of production failures. The agent itself does not change its weights. The loop around the agent (trace, evaluate, optimize, simulate, deploy) is what learns. In 2026 the dominant pattern is prompt and demo optimization (ProTeGi, BayesianSearchOptimizer, MIPRO, DSPy) on a failure dataset surfaced from observability traces.
Does the agent fine-tune its own weights, or just its prompts?
In production stacks the loop almost always optimizes the prompt, few-shot demos, and tool routing rather than the base model weights. Fine-tuning is slower, riskier, and harder to roll back. The Future AGI fi.opt optimizers (ProTeGi, BayesianSearchOptimizer, GEPAOptimizer) all rewrite the prompt and demos, not the weights. Weight-level updates remain a research direction but are not the practical 2026 default for agentic systems.
How does Future AGI fit into the self-learning loop?
Future AGI covers all six loop stages around the agent framework. traceAI (Apache 2.0) instruments production for span-level traces. The fi.evals SDK (ai-evaluation, Apache 2.0) scores every turn against deterministic, rubric, or LLM-as-judge evaluators. The Agent Command Center at /platform/monitor/command-center filters low-score traces into a failure dataset. fi.opt.optimizers (ProTeGi, BayesianSearchOptimizer, GEPAOptimizer) rewrite the prompt against that dataset. fi.simulate gates the new prompt with persona and scenario replay before rollout. The agent framework itself (LangGraph, CrewAI, OpenAI Agents SDK, AutoGen, your own runner) sits at the center and stays yours.
What is the difference between fi.opt and fi.evals?
fi.evals is the scoring surface. It exposes deterministic metrics, LLM-as-judge rubrics, and code-based evaluators that score a single input-output pair. fi.opt is the optimization surface. It uses fi.evals scores as the loss function and searches the prompt and demonstration space for better candidates. The two are paired: an Evaluator in fi.opt.base wraps a metric from fi.evals.metrics (typically CustomLLMJudge), and the optimizer in fi.opt.optimizers calls that Evaluator on every candidate.
Which optimizer should I use first?
ProTeGi is the right first default when you have failure transcripts and want a critique-driven rewrite. The optimizer reads the failures, generates textual gradients (written critiques of what failed), and rewrites the prompt to address each critique. BayesianSearchOptimizer is the right choice when you have a small library of candidate prompts or demonstration sets and want to find the best combination on a labelled set (Optuna-driven hyperparameter search). GEPAOptimizer is the right choice when you optimize against multiple objectives at once (policy plus tone plus brevity).
How do I avoid the optimizer overfitting the evaluator?
Three guardrails matter. First, split the failure dataset into train, dev, and test like any ML pipeline so the final score is on data the optimizer has never seen. Second, use a different model for the judge than for the system under optimization so the optimizer cannot game its own grader (judge collapse). Third, gate the ship decision on a small human-reviewed holdout that fi.simulate replays before traffic shifts. Future AGI ships all three patterns in fi.opt and fi.simulate.
Can a self-learning loop run in regulated industries like healthcare and finance?
Yes, but with audit trails on every prompt version. The Agent Command Center stores each prompt version with the optimizer run that produced it, the failure dataset the optimizer used, and the evaluator scores on train, dev, and test splits. That artifact set is the auditable record regulators expect. Pair traceAI spans (Apache 2.0) with the version history and the human-review sign-off on the holdout for a complete audit chain.
What changed between 2025 and 2026 in self-learning agent practice?
Three shifts. First, prompt optimization libraries (ProTeGi, DSPy MIPRO, TextGrad, GEPA) consolidated into well-tested production tools, so writing your own optimizer is no longer the norm. Second, observability traces became the primary source of the failure dataset, replacing manual annotation. Third, the eval layer became the bottleneck. The optimizer can search, but only as well as the evaluator can grade, which is why custom rubric judges and the CustomLLMJudge pattern dominate the 2026 stack.