LLM Eval Feedback Loop Design: A 2026 Engineering Guide
How to design an LLM evaluation feedback loop that compounds: capture, join, calibrate, promote to dataset, gate in CI, and optimize. The six-stage shape and the dataset-promotion step most teams skip.
Table of Contents
A team ships a thumbs widget. Three months in, it has captured 4,200 thumbs-up and 760 thumbs-down. The dashboard shows a sentiment chart. Nobody can tell you which of those 760 rows is in the regression test that runs on the next pull request, which rubric was retuned because of them, or which prompt version they cluster against. The widget worked. The loop didn’t.
A feedback loop without dataset promotion is theater. The loop that compounds runs six stages in order: capture → join → calibrate → promote → CI gate → optimize. Skip promotion and the loop dies at calibrate, because the next failure pattern never gets a labeled case to learn from. Skip the gate and the loop is a sharper dashboard. Skip optimize and the agent never improves against the signal users send.
This guide is the working pattern: the six-stage shape, the implementation per stage, the anti-patterns that show up monthly, and the FAGI surface that ships the loop on one stack.
TL;DR: the loop in one table
| Stage | What you ship | Why it earns its keep |
|---|---|---|
| 1. Capture | Explicit (thumbs, ratings, comments) + implicit (retry, regenerate, abandon, copy-paste) | Thumbs alone cover 1-3% of conversations; implicit covers the other 97% |
| 2. Join | trace_id, prompt_version, route, user_cohort, score_source on every event | A feedback row without context is debug-only |
| 3. Calibrate | Two-annotator labels + ThresholdCalibrator sweep + rubric-prompt rewrites | The judge stops drifting because user labels keep pulling it back |
| 4. Promote | Cluster → SME labels → versioned golden-set entry | Without promotion the loop has nothing to write into |
| 5. CI gate | fi CLI runs the expanded set on every PR; per-cohort feedback delta blocks merge | Same failure cannot ship twice |
| 6. Optimize | Failure shape feeds prompt or threshold optimization via agent-opt | The agent moves with the signal, not behind it |
One sentence: the loop is judged on whether a thumbs-down today becomes a regression test tomorrow and a prompt fix next week, not on whether the widget got a click.
Why most “feedback loops” are dashboards
Three failure modes show up in every audit.
Capture-only. Thumbs land in a Postgres table. Nobody queries it. The dashboard renders a sentiment chart. No rubric changes, no dataset grows. The loop got to stage 1 and stopped.
Capture plus calibrate, no promote. The team retunes the judge against the aggregate. Thresholds shift, the dashboard improves. But the labeled case never lands in a golden set, the CI gate runs against the same fixture it ran against three months ago, and the next regression that fits the same failure shape ships anyway because the test does not exist.
Capture plus gate, no calibrate. The dataset grows, the CI gate runs new cases, but the rubric is still calibrated to last quarter’s reality. The gate fires green on real failures because the rubric does not know what to score.
The compound failure underneath all three: capture without join. A feedback row that does not carry trace_id, prompt_version, route, and user_cohort is debug-only no matter how many other stages you build. Wiring the schema on Day 1 costs an engineering day. Retrofitting it on Day 300 costs a quarter.
The six-stage loop
The shape that survives contact with production. Each stage is one job and earns one named artifact.
Stage 1: capture
Two signal classes, both first-class.
Explicit. Thumbs up and thumbs down on every user-facing response. Star ratings, written comments, structured rubric scores where the surface allows. Cheap to interpret, biased toward extreme experiences. High-precision, low-recall: most users (97-99%) never click.
Implicit. Retry on the same prompt with slightly different wording. Regenerate clicks. Abandonment mid-conversation. Copy-paste (the user trusted it enough to use elsewhere). Edit before sending. Escalation to a human agent. Implicit signals are two to three orders of magnitude denser than thumbs and cover the conversations the rating button never touches.
For non-user-facing surfaces (batch summarization, an internal compliance agent), wire a relabel UI for the eval team instead. Either way, every response gets a feedback hook. Discipline that matters: never sample feedback. Sample traces if cost requires it; a feedback event sampled away is gone.
Stage 2: join
The feedback event has to carry enough context that a SQL fork three months later returns useful rows. The schema that holds up:
{
"feedback_id": "fb_2026_05_142087",
"trace_id": "01HZX...PQ", # joins to the span tree
"prompt_version": "support-rag-v23", # joins to the prompt registry
"route": "refund", # joins to the per-route gate
"user_cohort": "enterprise_admin", # joins to the cohort delta
"signal_type": "thumbs_down", # or "regenerate", "abandon", "edit"
"score_source": "human", # vs "api", "auto_grader"
"comment": "Cited a policy that doesn't exist.",
"created_at": "2026-05-19T14:33:00Z"
}
Five fields earn their keep on every row. trace_id joins to the span tree (which already carries prompt, retrieval, tool calls, rubric scores). prompt_version compares feedback rates across prompt iterations. route scopes the gate so a refund regression does not block a sales PR. user_cohort makes the cohort delta meaningful: a new feature shipping to power users first should be scored against power users, not averaged. score_source distinguishes human, LLM-judge, and auto-grader labels so disagreement surfaces explicitly later.
Instrument the join on the same day you instrument the widget. The retrofit is the most expensive line item in the loop.
Stage 3: calibrate
User labels are the ground truth a judge calibrates against. Two artifacts move together.
Threshold calibration. Sweep the rubric threshold across a fixed grid (typically 0.3 to 0.9 in 13 steps) and pick the value that maximizes F1 or accuracy on the labeled validation set assembled from feedback. Manual tuning does not survive past three or four rubrics; a production stack runs 30 to 60.
from fi.evals import Evaluator, ThresholdCalibrator
from fi.evals.feedback import FeedbackRetriever, ChromaFeedbackStore
from fi.evals.templates import Groundedness
store = ChromaFeedbackStore(collection="grounding-feedback-prod")
retriever = FeedbackRetriever(store=store)
calibrator = ThresholdCalibrator(
template=Groundedness(),
retriever=retriever,
metric="f1",
)
tuned = calibrator.calibrate(validation_set=retriever.labeled_validation())
print(tuned.threshold, tuned.f1)
Rubric-prompt rewrites. The threshold sweep handles where to draw the line; the rubric prompt handles what to score. The in-product authoring agent reads the feedback aggregate, identifies clauses that disagree with the signal, proposes a rewrite, and surfaces the diff for review. Approved changes ship as configuration, not a code path. The contract: the platform proposes, the human disposes.
Two-annotator discipline still applies. If the labels feeding the calibrator are single-annotator, Cohen’s kappa is undefined and the sweep optimizes against one opinion. The LLM golden-set design guide covers the labeling discipline.
Stage 4: promote
The step most teams skip. Capture, join, and calibrate produce a tighter judge but no new test cases. Without promotion the loop dies one quarter in: the judge sharpens, the failure modes the judge sharpened against disappear from production (because the rubric caught them), and the calibration loop has nothing fresh to learn from.
Promotion turns a cluster of similar failures into one labeled golden-set entry that becomes a permanent regression test:
- Cluster. HDBSCAN soft clustering runs nightly over trace embeddings in ClickHouse. Similar failures collapse into one named cluster regardless of surface wording. A noise bucket holds one-offs so they do not force a centroid.
- Triage. A Sonnet 4.5 Judge reads each cluster (eight-tool agent, 30-turn budget, Haiku Chauffeur for large spans, 90% prompt-cache hit ratio so per-cluster cost stays in cents) and writes an
immediate_fixpayload with root cause, quick fix, long-term recommendation. A 4-dimensional trace score (factual_grounding,privacy_and_safety,instruction_adherence,optimal_plan_execution, 1-5 each) joins the metadata. - Label. Negative-feedback spans from each cluster flow into an Annotation Queue. SMEs review 5 to 10 representative cases, apply the rubric, resolve disagreements with a senior reviewer.
- Promote. Survivors export to a versioned dataset in one API call. The new version carries the parent reference, the cluster’s
immediate_fixtext, and annotator provenance.
from fi import Client
from fi.queues import AnnotationQueue
client = Client(fi_api_key="...", fi_secret_key="...")
queue = AnnotationQueue(fi_api_key="...", fi_secret_key="...")
# 1. Pull failing spans from the cluster into the queue
queue.add_items(queue_id, items=[
{"source_type": "observation_span", "source_id": span_id}
for span_id in cluster["representative_spans"]
])
# 2. After SME labeling, promote survivors to the next golden-set version
queue.export_to_dataset(
queue_id,
dataset_name="refund-regressions-2026.05",
parent_version="refund-regressions-2026.04",
)
Every promoted row carries annotator, guidance_version, source_trace_id, and cluster_id metadata. A dataset entry without provenance is a fixture, not a regression test. When the CI gate fires green a month later, the trail back to the original production failure has to be one query away.
Stage 5: CI gate
The expanded dataset is what the next pull request runs against. Two gates earn their keep on every PR.
Floor gate. Per-rubric absolute thresholds (Groundedness 0.85, AnswerRefusal 0.90, FactualAccuracy 0.80) catch catastrophic regressions. The smoke test.
Delta gate. A two-sample test (Welch’s t for continuous rubrics, two-proportion z for binary) compares the current PR against the trailing 7-day rolling baseline. Gate fires only when p is under 0.05 AND the effect size exceeds the rubric’s noise floor (around 0.03 on a 0-1 scale). The calibration-aware gate; stops the false alarms from judge variance.
The promoted dataset wires into the gate through pytest plus the fi CLI:
import pytest
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, FactualAccuracy, AnswerRefusal
from fi.testcases import TestCase
evaluator = Evaluator(max_workers=16)
templates = [Groundedness(), FactualAccuracy(), AnswerRefusal()]
@pytest.mark.parametrize("route", ["refund", "status", "escalation"])
def test_regression(route):
cases = [TestCase(**row) for row in load_dataset(f"{route}-regressions-2026.05")]
result = evaluator.evaluate(eval_templates=templates, inputs=cases)
assert result.pass_rate >= 0.92, f"{route}: regression set failed"
The fi CLI exposes pass_rate, avg_score, p50_score, p90_score, and p95_score as native assertion metrics, so the gate compares percentiles without grep-on-stdout heuristics. The CI/CD eval gate guide covers the full GitHub Actions workflow, exit-code partition, and the canary auto-rollback that lives downstream of merge.
The cohort delta sits next to the per-route gate. A new prompt that improves the average but drops 4 points on the enterprise-admin cohort should fail the gate even if the floor passes. Cohort regressions hide in averages; the join from stage 2 makes them queryable.
Stage 6: optimize
The loop closes when the failure shape feeds optimization, not just regression.
Threshold optimization. Already covered by ThresholdCalibrator in stage 3. The same sweep retunes as the dataset grows in stage 4.
Prompt optimization. agent-opt ships six optimizers across the search-strategy spectrum: RandomSearchOptimizer for cheap baselines, BayesianSearchOptimizer (Optuna-backed, resumable, teacher-inferred few-shot) for sample-efficient search, MetaPromptOptimizer and ProTeGi for prompt-text rewriting, GEPAOptimizer and PromptWizardOptimizer for structured prompt engineering. EarlyStoppingConfig bounds the budget.
from agent_opt import BayesianSearchOptimizer, EarlyStoppingConfig
from agent_opt.spaces import PromptSpace
space = PromptSpace.from_template(
system_prompt="You are a refund assistant. {tone} {format}",
variables={
"tone": ["formal", "concise", "balanced"],
"format": ["bullet", "paragraph", "table"],
},
)
opt = BayesianSearchOptimizer(
space=space,
metric="groundedness_f1",
resumable=True,
early_stopping=EarlyStoppingConfig(patience=8, min_delta=0.01),
)
best = opt.optimize(validation_set=load_dataset("refund-regressions-2026.05"), n_trials=80)
The optimization reads the same dataset the gate reads. That is the loop closing: the row a user thumbs-downed last week is the row the optimizer maximizes against this week. Pair with the automated agent optimization guide.
Roadmap honesty. The trace-stream-to-agent-opt connector is on the development surface. Today agent-opt runs against datasets you assemble (often via the same FeedbackRetriever that drives ThresholdCalibrator). The promote step in stage 4 is intentionally a human gate; only reviewed cases enter optimization.
Production patterns that hold up
Fork the signal at the source. Every feedback event lands in both the training store and the eval feedback store. The judge rubric and the model walk forward against the same signal. Most common failure mode: signal goes into RLHF only, the judge stays frozen, the better model fails the outdated gate and rolls back.
Span-attach every score. The eval score lives on the same span the trace produced (EvalTag does this natively in traceAI). The rubric version that scored a trace has to be recoverable from the trace; debugging a rubric without knowing which version ran is impossible.
Version the rubric like code. The rubric is configuration. Tag versions, store diffs, pin every CI run to a rubric version. Score histories that span a version change are not comparable; tag the change point.
Treat the judge as a dependency. Pin the judge model and temperature, version them in the rubric file, bump the version when they change. When a closed-API judge gets retrained behind the scenes, your scores quietly land on a different scale. The best LLM judge models guide covers the trade-offs.
Run the loop on Day 1. The join is the structural cost; the widget can wait. Wiring trace_id, prompt_version, route, and user_cohort from Day 1 is one engineering day. Retrofitting after three months is a quarter.
Humans on policy, not on grinds. Automate clustering, threshold sweeps, rubric proposals. Leave the label, the policy call, and the deprecation to humans. A loop that auto-approves rubric rewrites on a regulated workload without review is a compliance incident waiting to happen.
Anti-patterns we see every month
Capture-only. Widget ships, dashboard renders, no rubric ever changes. Most common pattern. Gives the illusion of a loop with none of the compounding.
Feedback feeding only model training. RLHF gets the signal, the judge does not. Fork at source.
No clustering. Engineering triages 500 failures a day one trace at a time. Per-trace triage feels rigorous and is the slowest possible way to learn.
Manual threshold tuning past four rubrics. An engineer eyeballs traces, picks 0.7, ships it. At 30 rubrics the approach collapses. Use ThresholdCalibrator.
One signal channel only. Thumbs alone are sparse and biased. The loop closes properly when implicit signals, outcome metrics, and drift detection all flow.
Calibrate without promote. The judge sharpens but no labeled cases land in the golden set. The CI gate runs the same dataset version forever. The loop dies one quarter in.
Promote without versioning. Cases append to “the dataset” without a tag. A regression that turns out to be a labeling change three months back cannot be diffed. Every promote is a new version with a parent reference.
Eval suite owned by no team. Shared repo, no owner, the rubric ages, the on-call routes every “eval is broken” page to whoever is online. Make eval a product with an owner, a backlog, and a metric.
How Future AGI ships the loop end-to-end
One stack covers the six stages.
traceAI captures every request as an OpenInference span with EvalTag rubrics attached. 50+ AI surface integrations across Python, TypeScript, Java (including Spring Boot, Spring AI, LangChain4j, Semantic Kernel), and C#. fi.span.kind, session.id, and tag.tags are native stratification keys. score_source distinguishes human, API, and auto-grader labels on the same span. (Stages 1, 2.)
ai-evaluation SDK (Apache 2.0) is the calibrate-and-gate plumbing. 60+ EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, AnswerRefusal, Toxicity, TaskCompletion, EvaluateFunctionCalling). ThresholdCalibrator sweeps the threshold grid against ChromaFeedbackStore. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, WILDGUARD_7B, SHIELDGEMMA_2B, plus FAGI TURING_FLASH and TURING_SAFETY). 8 sub-10ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Per-eval cost runs lower than Galileo Luna-2 on classifier-backed evals. (Stage 3.)
Annotation Queue is the promote stage. Six source types feed the queue (trace, observation_span, trace_session, call_execution, prototype_run, dataset_row). SMEs label with IAA per criterion; survivors export to a versioned dataset with one API call. Label types cover categorical, numeric, star, text, thumbs. (Stage 4.)
Error Feed is the cluster-to-fix layer that feeds the queue. HDBSCAN soft clustering over ClickHouse embeddings groups failures into named issues. A Sonnet 4.5 Judge agent (8 tools, 30-turn budget, Haiku Chauffeur for large spans, 90% prompt-cache hit ratio) writes immediate_fix per cluster plus the 4-dimensional trace score. Linear OAuth one-click pushes triage tickets; Slack, GitHub, Jira, and PagerDuty are on the integration surface. (Stage 4.)
fi CLI is the CI gate. Native assertion conditions cover pass_rate, avg_score, p50_score, p90_score, p95_score. Exit codes are the stable contract (0 success, 2 assertion fail, 3 strict-mode warning, 6 API error, 7 timeout) so the gate plugs into GitHub Actions, GitLab CI, Buildkite, Jenkins, or CircleCI without translation. (Stage 5.)
agent-opt is the optimize stage. Six optimizers from RandomSearchOptimizer baselines to BayesianSearchOptimizer (Optuna-backed, resumable) to MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer for prompt-text rewriting and structured prompt engineering. The promoted dataset is the validation set. (Stage 6.)
Agent Command Center is the SOC 2 Type II, HIPAA, GDPR, and CCPA certified hosted runtime that scores the canary downstream of merge. Per-key virtual budgets, the five-level budget tracker (org, team, user, key, tag), and gateway response headers (x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-cache, x-agentcc-guardrail-triggered) expose eval-relevant telemetry per request. The canary trips auto-rollback on a rolling-mean drop beyond the calibrated threshold while the rest of traffic stays on the incumbent.
pip install ai-evaluation futureagi, instrument with traceAI, attach feedback on the same span, and the loop closes without stitching three tools together. The LLM evaluation playbook is the pillar, the best LLM feedback collection tools comparison covers the capture surface, and deterministic vs LLM judge evals is the rubric-design companion.
The deeper insight: the dataset is the loop
The opinion this guide earns: the dataset is what compounds, not the dashboard. A team that ships the widget without promote has a sentiment chart. A team that adds calibrate without promote has a sharper sentiment chart. A team that runs the full six stages has a dataset that grows every week with cases the rubric did not catch the first time, a CI gate that blocks the same failure from shipping twice, and a judge that calibrates to the same signal the model trains on.
The eval suite is a product. The dataset is its source of truth. The feedback loop is how the dataset learns what production teaches. If the loop is not running on Day 1, it will not be running on Day 300 either; the team will just be paged more often, and the regression that broke production last quarter will break it again this quarter against a rubric that never knew it should care.
Ship the six stages in order. Skip none. Day 30 looks different from Day 1 because every failure has become a labeled case, every rubric has tightened around the patterns the data revealed, and every threshold sits where accuracy peaks against the real input distribution. That is the loop that compounds.
Frequently asked questions
What is an LLM eval feedback loop?
Why is dataset promotion the step most teams skip?
What is the difference between feedback going into a model versus into the evaluator?
How big should the feedback signal be before the loop is worth building?
What is the relationship between the feedback loop and the golden set?
Does the loop have to be fully automated?
How does Future AGI deliver the loop end-to-end?
Evaluating agent memory is four problems, not one: recall, freshness, contradiction handling, forgetting. A 2026 framework for Mem0, Zep, Letta, LangMem.
Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.
Evaluating browser-use agents in 2026: WebArena grades happy-path completion; production grades recovery from six failure modes nobody benchmarks.