Guides

LLM Eval Feedback Loop Design: A 2026 Engineering Guide

How to design an LLM evaluation feedback loop that compounds: capture, join, calibrate, promote to dataset, gate in CI, and optimize. The six-stage shape and the dataset-promotion step most teams skip.

·
Updated
·
13 min read
llm-evaluation feedback-loops dataset-promotion ci-gate evaluator-calibration agent-evaluation 2026
Editorial cover image for LLM Eval Feedback Loop Design: A 2026 Engineering Guide
Table of Contents

A team ships a thumbs widget. Three months in, it has captured 4,200 thumbs-up and 760 thumbs-down. The dashboard shows a sentiment chart. Nobody can tell you which of those 760 rows is in the regression test that runs on the next pull request, which rubric was retuned because of them, or which prompt version they cluster against. The widget worked. The loop didn’t.

A feedback loop without dataset promotion is theater. The loop that compounds runs six stages in order: capture → join → calibrate → promote → CI gate → optimize. Skip promotion and the loop dies at calibrate, because the next failure pattern never gets a labeled case to learn from. Skip the gate and the loop is a sharper dashboard. Skip optimize and the agent never improves against the signal users send.

This guide is the working pattern: the six-stage shape, the implementation per stage, the anti-patterns that show up monthly, and the FAGI surface that ships the loop on one stack.

TL;DR: the loop in one table

StageWhat you shipWhy it earns its keep
1. CaptureExplicit (thumbs, ratings, comments) + implicit (retry, regenerate, abandon, copy-paste)Thumbs alone cover 1-3% of conversations; implicit covers the other 97%
2. Jointrace_id, prompt_version, route, user_cohort, score_source on every eventA feedback row without context is debug-only
3. CalibrateTwo-annotator labels + ThresholdCalibrator sweep + rubric-prompt rewritesThe judge stops drifting because user labels keep pulling it back
4. PromoteCluster → SME labels → versioned golden-set entryWithout promotion the loop has nothing to write into
5. CI gatefi CLI runs the expanded set on every PR; per-cohort feedback delta blocks mergeSame failure cannot ship twice
6. OptimizeFailure shape feeds prompt or threshold optimization via agent-optThe agent moves with the signal, not behind it

One sentence: the loop is judged on whether a thumbs-down today becomes a regression test tomorrow and a prompt fix next week, not on whether the widget got a click.

Why most “feedback loops” are dashboards

Three failure modes show up in every audit.

Capture-only. Thumbs land in a Postgres table. Nobody queries it. The dashboard renders a sentiment chart. No rubric changes, no dataset grows. The loop got to stage 1 and stopped.

Capture plus calibrate, no promote. The team retunes the judge against the aggregate. Thresholds shift, the dashboard improves. But the labeled case never lands in a golden set, the CI gate runs against the same fixture it ran against three months ago, and the next regression that fits the same failure shape ships anyway because the test does not exist.

Capture plus gate, no calibrate. The dataset grows, the CI gate runs new cases, but the rubric is still calibrated to last quarter’s reality. The gate fires green on real failures because the rubric does not know what to score.

The compound failure underneath all three: capture without join. A feedback row that does not carry trace_id, prompt_version, route, and user_cohort is debug-only no matter how many other stages you build. Wiring the schema on Day 1 costs an engineering day. Retrofitting it on Day 300 costs a quarter.

The six-stage loop

The shape that survives contact with production. Each stage is one job and earns one named artifact.

Stage 1: capture

Two signal classes, both first-class.

Explicit. Thumbs up and thumbs down on every user-facing response. Star ratings, written comments, structured rubric scores where the surface allows. Cheap to interpret, biased toward extreme experiences. High-precision, low-recall: most users (97-99%) never click.

Implicit. Retry on the same prompt with slightly different wording. Regenerate clicks. Abandonment mid-conversation. Copy-paste (the user trusted it enough to use elsewhere). Edit before sending. Escalation to a human agent. Implicit signals are two to three orders of magnitude denser than thumbs and cover the conversations the rating button never touches.

For non-user-facing surfaces (batch summarization, an internal compliance agent), wire a relabel UI for the eval team instead. Either way, every response gets a feedback hook. Discipline that matters: never sample feedback. Sample traces if cost requires it; a feedback event sampled away is gone.

Stage 2: join

The feedback event has to carry enough context that a SQL fork three months later returns useful rows. The schema that holds up:

{
  "feedback_id": "fb_2026_05_142087",
  "trace_id": "01HZX...PQ",                  # joins to the span tree
  "prompt_version": "support-rag-v23",       # joins to the prompt registry
  "route": "refund",                         # joins to the per-route gate
  "user_cohort": "enterprise_admin",         # joins to the cohort delta
  "signal_type": "thumbs_down",              # or "regenerate", "abandon", "edit"
  "score_source": "human",                   # vs "api", "auto_grader"
  "comment": "Cited a policy that doesn't exist.",
  "created_at": "2026-05-19T14:33:00Z"
}

Five fields earn their keep on every row. trace_id joins to the span tree (which already carries prompt, retrieval, tool calls, rubric scores). prompt_version compares feedback rates across prompt iterations. route scopes the gate so a refund regression does not block a sales PR. user_cohort makes the cohort delta meaningful: a new feature shipping to power users first should be scored against power users, not averaged. score_source distinguishes human, LLM-judge, and auto-grader labels so disagreement surfaces explicitly later.

Instrument the join on the same day you instrument the widget. The retrofit is the most expensive line item in the loop.

Stage 3: calibrate

User labels are the ground truth a judge calibrates against. Two artifacts move together.

Threshold calibration. Sweep the rubric threshold across a fixed grid (typically 0.3 to 0.9 in 13 steps) and pick the value that maximizes F1 or accuracy on the labeled validation set assembled from feedback. Manual tuning does not survive past three or four rubrics; a production stack runs 30 to 60.

from fi.evals import Evaluator, ThresholdCalibrator
from fi.evals.feedback import FeedbackRetriever, ChromaFeedbackStore
from fi.evals.templates import Groundedness

store = ChromaFeedbackStore(collection="grounding-feedback-prod")
retriever = FeedbackRetriever(store=store)

calibrator = ThresholdCalibrator(
    template=Groundedness(),
    retriever=retriever,
    metric="f1",
)
tuned = calibrator.calibrate(validation_set=retriever.labeled_validation())
print(tuned.threshold, tuned.f1)

Rubric-prompt rewrites. The threshold sweep handles where to draw the line; the rubric prompt handles what to score. The in-product authoring agent reads the feedback aggregate, identifies clauses that disagree with the signal, proposes a rewrite, and surfaces the diff for review. Approved changes ship as configuration, not a code path. The contract: the platform proposes, the human disposes.

Two-annotator discipline still applies. If the labels feeding the calibrator are single-annotator, Cohen’s kappa is undefined and the sweep optimizes against one opinion. The LLM golden-set design guide covers the labeling discipline.

Stage 4: promote

The step most teams skip. Capture, join, and calibrate produce a tighter judge but no new test cases. Without promotion the loop dies one quarter in: the judge sharpens, the failure modes the judge sharpened against disappear from production (because the rubric caught them), and the calibration loop has nothing fresh to learn from.

Promotion turns a cluster of similar failures into one labeled golden-set entry that becomes a permanent regression test:

  1. Cluster. HDBSCAN soft clustering runs nightly over trace embeddings in ClickHouse. Similar failures collapse into one named cluster regardless of surface wording. A noise bucket holds one-offs so they do not force a centroid.
  2. Triage. A Sonnet 4.5 Judge reads each cluster (eight-tool agent, 30-turn budget, Haiku Chauffeur for large spans, 90% prompt-cache hit ratio so per-cluster cost stays in cents) and writes an immediate_fix payload with root cause, quick fix, long-term recommendation. A 4-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1-5 each) joins the metadata.
  3. Label. Negative-feedback spans from each cluster flow into an Annotation Queue. SMEs review 5 to 10 representative cases, apply the rubric, resolve disagreements with a senior reviewer.
  4. Promote. Survivors export to a versioned dataset in one API call. The new version carries the parent reference, the cluster’s immediate_fix text, and annotator provenance.
from fi import Client
from fi.queues import AnnotationQueue

client = Client(fi_api_key="...", fi_secret_key="...")
queue = AnnotationQueue(fi_api_key="...", fi_secret_key="...")

# 1. Pull failing spans from the cluster into the queue
queue.add_items(queue_id, items=[
    {"source_type": "observation_span", "source_id": span_id}
    for span_id in cluster["representative_spans"]
])

# 2. After SME labeling, promote survivors to the next golden-set version
queue.export_to_dataset(
    queue_id,
    dataset_name="refund-regressions-2026.05",
    parent_version="refund-regressions-2026.04",
)

Every promoted row carries annotator, guidance_version, source_trace_id, and cluster_id metadata. A dataset entry without provenance is a fixture, not a regression test. When the CI gate fires green a month later, the trail back to the original production failure has to be one query away.

Stage 5: CI gate

The expanded dataset is what the next pull request runs against. Two gates earn their keep on every PR.

Floor gate. Per-rubric absolute thresholds (Groundedness 0.85, AnswerRefusal 0.90, FactualAccuracy 0.80) catch catastrophic regressions. The smoke test.

Delta gate. A two-sample test (Welch’s t for continuous rubrics, two-proportion z for binary) compares the current PR against the trailing 7-day rolling baseline. Gate fires only when p is under 0.05 AND the effect size exceeds the rubric’s noise floor (around 0.03 on a 0-1 scale). The calibration-aware gate; stops the false alarms from judge variance.

The promoted dataset wires into the gate through pytest plus the fi CLI:

import pytest
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, FactualAccuracy, AnswerRefusal
from fi.testcases import TestCase

evaluator = Evaluator(max_workers=16)
templates = [Groundedness(), FactualAccuracy(), AnswerRefusal()]

@pytest.mark.parametrize("route", ["refund", "status", "escalation"])
def test_regression(route):
    cases = [TestCase(**row) for row in load_dataset(f"{route}-regressions-2026.05")]
    result = evaluator.evaluate(eval_templates=templates, inputs=cases)
    assert result.pass_rate >= 0.92, f"{route}: regression set failed"

The fi CLI exposes pass_rate, avg_score, p50_score, p90_score, and p95_score as native assertion metrics, so the gate compares percentiles without grep-on-stdout heuristics. The CI/CD eval gate guide covers the full GitHub Actions workflow, exit-code partition, and the canary auto-rollback that lives downstream of merge.

The cohort delta sits next to the per-route gate. A new prompt that improves the average but drops 4 points on the enterprise-admin cohort should fail the gate even if the floor passes. Cohort regressions hide in averages; the join from stage 2 makes them queryable.

Stage 6: optimize

The loop closes when the failure shape feeds optimization, not just regression.

Threshold optimization. Already covered by ThresholdCalibrator in stage 3. The same sweep retunes as the dataset grows in stage 4.

Prompt optimization. agent-opt ships six optimizers across the search-strategy spectrum: RandomSearchOptimizer for cheap baselines, BayesianSearchOptimizer (Optuna-backed, resumable, teacher-inferred few-shot) for sample-efficient search, MetaPromptOptimizer and ProTeGi for prompt-text rewriting, GEPAOptimizer and PromptWizardOptimizer for structured prompt engineering. EarlyStoppingConfig bounds the budget.

from agent_opt import BayesianSearchOptimizer, EarlyStoppingConfig
from agent_opt.spaces import PromptSpace

space = PromptSpace.from_template(
    system_prompt="You are a refund assistant. {tone} {format}",
    variables={
        "tone": ["formal", "concise", "balanced"],
        "format": ["bullet", "paragraph", "table"],
    },
)
opt = BayesianSearchOptimizer(
    space=space,
    metric="groundedness_f1",
    resumable=True,
    early_stopping=EarlyStoppingConfig(patience=8, min_delta=0.01),
)
best = opt.optimize(validation_set=load_dataset("refund-regressions-2026.05"), n_trials=80)

The optimization reads the same dataset the gate reads. That is the loop closing: the row a user thumbs-downed last week is the row the optimizer maximizes against this week. Pair with the automated agent optimization guide.

Roadmap honesty. The trace-stream-to-agent-opt connector is on the development surface. Today agent-opt runs against datasets you assemble (often via the same FeedbackRetriever that drives ThresholdCalibrator). The promote step in stage 4 is intentionally a human gate; only reviewed cases enter optimization.

Production patterns that hold up

Fork the signal at the source. Every feedback event lands in both the training store and the eval feedback store. The judge rubric and the model walk forward against the same signal. Most common failure mode: signal goes into RLHF only, the judge stays frozen, the better model fails the outdated gate and rolls back.

Span-attach every score. The eval score lives on the same span the trace produced (EvalTag does this natively in traceAI). The rubric version that scored a trace has to be recoverable from the trace; debugging a rubric without knowing which version ran is impossible.

Version the rubric like code. The rubric is configuration. Tag versions, store diffs, pin every CI run to a rubric version. Score histories that span a version change are not comparable; tag the change point.

Treat the judge as a dependency. Pin the judge model and temperature, version them in the rubric file, bump the version when they change. When a closed-API judge gets retrained behind the scenes, your scores quietly land on a different scale. The best LLM judge models guide covers the trade-offs.

Run the loop on Day 1. The join is the structural cost; the widget can wait. Wiring trace_id, prompt_version, route, and user_cohort from Day 1 is one engineering day. Retrofitting after three months is a quarter.

Humans on policy, not on grinds. Automate clustering, threshold sweeps, rubric proposals. Leave the label, the policy call, and the deprecation to humans. A loop that auto-approves rubric rewrites on a regulated workload without review is a compliance incident waiting to happen.

Anti-patterns we see every month

Capture-only. Widget ships, dashboard renders, no rubric ever changes. Most common pattern. Gives the illusion of a loop with none of the compounding.

Feedback feeding only model training. RLHF gets the signal, the judge does not. Fork at source.

No clustering. Engineering triages 500 failures a day one trace at a time. Per-trace triage feels rigorous and is the slowest possible way to learn.

Manual threshold tuning past four rubrics. An engineer eyeballs traces, picks 0.7, ships it. At 30 rubrics the approach collapses. Use ThresholdCalibrator.

One signal channel only. Thumbs alone are sparse and biased. The loop closes properly when implicit signals, outcome metrics, and drift detection all flow.

Calibrate without promote. The judge sharpens but no labeled cases land in the golden set. The CI gate runs the same dataset version forever. The loop dies one quarter in.

Promote without versioning. Cases append to “the dataset” without a tag. A regression that turns out to be a labeling change three months back cannot be diffed. Every promote is a new version with a parent reference.

Eval suite owned by no team. Shared repo, no owner, the rubric ages, the on-call routes every “eval is broken” page to whoever is online. Make eval a product with an owner, a backlog, and a metric.

How Future AGI ships the loop end-to-end

One stack covers the six stages.

traceAI captures every request as an OpenInference span with EvalTag rubrics attached. 50+ AI surface integrations across Python, TypeScript, Java (including Spring Boot, Spring AI, LangChain4j, Semantic Kernel), and C#. fi.span.kind, session.id, and tag.tags are native stratification keys. score_source distinguishes human, API, and auto-grader labels on the same span. (Stages 1, 2.)

ai-evaluation SDK (Apache 2.0) is the calibrate-and-gate plumbing. 60+ EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, AnswerRefusal, Toxicity, TaskCompletion, EvaluateFunctionCalling). ThresholdCalibrator sweeps the threshold grid against ChromaFeedbackStore. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, WILDGUARD_7B, SHIELDGEMMA_2B, plus FAGI TURING_FLASH and TURING_SAFETY). 8 sub-10ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Per-eval cost runs lower than Galileo Luna-2 on classifier-backed evals. (Stage 3.)

Annotation Queue is the promote stage. Six source types feed the queue (trace, observation_span, trace_session, call_execution, prototype_run, dataset_row). SMEs label with IAA per criterion; survivors export to a versioned dataset with one API call. Label types cover categorical, numeric, star, text, thumbs. (Stage 4.)

Error Feed is the cluster-to-fix layer that feeds the queue. HDBSCAN soft clustering over ClickHouse embeddings groups failures into named issues. A Sonnet 4.5 Judge agent (8 tools, 30-turn budget, Haiku Chauffeur for large spans, 90% prompt-cache hit ratio) writes immediate_fix per cluster plus the 4-dimensional trace score. Linear OAuth one-click pushes triage tickets; Slack, GitHub, Jira, and PagerDuty are on the integration surface. (Stage 4.)

fi CLI is the CI gate. Native assertion conditions cover pass_rate, avg_score, p50_score, p90_score, p95_score. Exit codes are the stable contract (0 success, 2 assertion fail, 3 strict-mode warning, 6 API error, 7 timeout) so the gate plugs into GitHub Actions, GitLab CI, Buildkite, Jenkins, or CircleCI without translation. (Stage 5.)

agent-opt is the optimize stage. Six optimizers from RandomSearchOptimizer baselines to BayesianSearchOptimizer (Optuna-backed, resumable) to MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer for prompt-text rewriting and structured prompt engineering. The promoted dataset is the validation set. (Stage 6.)

Agent Command Center is the SOC 2 Type II, HIPAA, GDPR, and CCPA certified hosted runtime that scores the canary downstream of merge. Per-key virtual budgets, the five-level budget tracker (org, team, user, key, tag), and gateway response headers (x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-cache, x-agentcc-guardrail-triggered) expose eval-relevant telemetry per request. The canary trips auto-rollback on a rolling-mean drop beyond the calibrated threshold while the rest of traffic stays on the incumbent.

pip install ai-evaluation futureagi, instrument with traceAI, attach feedback on the same span, and the loop closes without stitching three tools together. The LLM evaluation playbook is the pillar, the best LLM feedback collection tools comparison covers the capture surface, and deterministic vs LLM judge evals is the rubric-design companion.

The deeper insight: the dataset is the loop

The opinion this guide earns: the dataset is what compounds, not the dashboard. A team that ships the widget without promote has a sentiment chart. A team that adds calibrate without promote has a sharper sentiment chart. A team that runs the full six stages has a dataset that grows every week with cases the rubric did not catch the first time, a CI gate that blocks the same failure from shipping twice, and a judge that calibrates to the same signal the model trains on.

The eval suite is a product. The dataset is its source of truth. The feedback loop is how the dataset learns what production teaches. If the loop is not running on Day 1, it will not be running on Day 300 either; the team will just be paged more often, and the regression that broke production last quarter will break it again this quarter against a rubric that never knew it should care.

Ship the six stages in order. Skip none. Day 30 looks different from Day 1 because every failure has become a labeled case, every rubric has tightened around the patterns the data revealed, and every threshold sits where accuracy peaks against the real input distribution. That is the loop that compounds.

Frequently asked questions

What is an LLM eval feedback loop?
A feedback loop is the path a single user signal travels from a thumbs-down click to a recalibrated evaluator that blocks the same failure on the next release. The shape that compounds has six stages: capture (explicit and implicit signals), join (every event carries trace id, prompt version, route, cohort), calibrate (user labels tune the judge rubric), promote (the labeled cluster becomes a golden-set entry), CI gate (the new entry blocks regression on the next PR), and optimize (the failure shape feeds prompt or policy optimization). A loop that stops at capture is a dashboard. A loop that stops at calibrate without promotion is a sharper dashboard. A loop that closes through gate and optimize is the only one that compounds.
Why is dataset promotion the step most teams skip?
Because it is the boring step that requires owning a dataset like code. Capture is fun (widget plus event endpoint). Calibration is interesting (threshold sweeps, rubric rewrites). CI gating is satisfying (green check, merge proceeds). Dataset promotion sits in the middle: it asks who labels the row, who reviews it, what version it lands in, how the row is diffed against the previous version, and how the test gets pinned to a version. Without promotion the loop dies at calibration because the next failure pattern never gets a labeled case to learn from, and the judge that retuned today drifts again tomorrow.
What is the difference between feedback going into a model versus into the evaluator?
Most teams send thumbs-up and thumbs-down into an RLHF or DPO pipeline and stop there. The model improves. The evaluator that gates the new model still scores against last quarter's reality, so a better model fails an outdated rubric and rolls back. Fork the signal at the source: every feedback event lands in both the training store and the eval feedback store. The judge rubric, the threshold, and the model walk forward together against the same signal.
How big should the feedback signal be before the loop is worth building?
The loop is worth building on Day 1 because the join is what is expensive to retrofit. Wiring trace id, prompt version, route, and cohort into every feedback event on Day 1 costs roughly one engineering day. Doing it on Day 300 after feedback has been logged without context costs roughly two months. The size of the signal scales the labeling and judge-calibration cost, but the structural cost of the loop sits in the schema. Ship the schema before the signal, not after.
What is the relationship between the feedback loop and the golden set?
The golden set is the labeled artifact. The feedback loop is how the golden set learns what production teaches. The promote stage is the bridge: production failures cluster, a human labels representatives, the labeled cases land as new golden-set entries with full provenance. The CI gate runs the next PR against the expanded set, so the same failure cannot ship twice. The loop without the golden set has nothing to write into. The golden set without the loop ages out by week six.
Does the loop have to be fully automated?
No, and full automation is the wrong target. The cheap parts (clustering, threshold sweeps, rubric rewrites against a labeled aggregate) automate well. The expensive parts (the label itself, the policy call on a borderline case, the deprecation of a rubric) need humans. The right target is to give humans more time on the calls that matter, not to remove humans from the policy. A loop that auto-approves rubric rewrites without review on a regulated workload is a compliance incident waiting to happen.
How does Future AGI deliver the loop end-to-end?
traceAI captures every request with span-attached eval scores via EvalTag. Feedback events land on the same span with score_source distinguishing human, API, and auto-grader. Error Feed clusters failures nightly with HDBSCAN over ClickHouse-stored embeddings, then a Sonnet 4.5 Judge writes an immediate_fix per cluster. The Annotation Queue routes the cluster's representative spans to SMEs and exports survivors to a versioned dataset with one API call. ThresholdCalibrator sweeps the rubric thresholds on the new labeled set, and self-improving evaluators propose rubric-prompt rewrites against the same aggregate. The fi CLI runs the regression dataset on every PR and the canary keeps scoring through Agent Command Center. The trace-to-agent-opt connector is on the roadmap; eval-driven optimization on prompts and thresholds ships today.
Related Articles
View all