Evaluating GenAI in Production 2026: The Full Framework

How to evaluate GenAI in production in 2026. Pre-deploy CI evals, online metrics, LLM-as-judge calibration, drift, safety, and how to stand up a working stack.


Evaluating GenAI in production in 2026

Production GenAI is no longer a single LLM call returning a string. It is a multi-step agent, a RAG pipeline over private data, a gateway routing across providers, and a guardrail layer screening every input and output. Evaluating it well in 2026 means running five layers of evaluation in parallel: CI evals before deploy, online evals on live traffic, inline safety guardrails, drift monitoring across time, and trace-level observability for agents.

This is the full framework. It covers what to evaluate, how to score it, how often, and what to wire it into. It is the same stack we run internally at Future AGI and that we ship as a product.

TL;DR: the five layers of production GenAI evaluation

| Layer | What it scores | Cadence | Tools |
| --- | --- | --- | --- |
| Pre-deploy CI eval | Held-out prompts on every PR | Per commit | fi.evals + CI runner |
| Online traffic eval | Sampled live outputs | Continuous | fi.evals + traceAI |
| Safety guardrails | 100% of inputs + outputs | Inline | fi.evals.guardrails |
| Drift monitoring | Eval scores over time | Daily | Future AGI console |
| Trace observability | Agent trajectories | On every run | traceAI (OTel) |

Why benchmarks are not enough

Benchmarks (MMLU, HELM, BigBench, GPQA, MMLU-Pro) measure model capability on static multiple-choice tasks. Production GenAI faces:

  • Noisy user inputs (slang, typos, mixed languages).
  • Multi-turn context with reference resolution across turns.
  • Tool calls with retries and partial failures.
  • Retrieval over private data the model never saw in pretraining.
  • Adversarial prompts designed to extract data or bypass policy.

A model can top a benchmark and still hallucinate on your customer support corpus, leak PII through a tool call, or pick the wrong tool 30 percent of the time. Benchmarks are a coarse capability filter for model selection; they are not a substitute for evaluation on your own prompts, outputs, and traces.

The five layers

Layer 1: pre-deployment CI evaluation

The first line of defence is a held-out test set scored on every pull request. The pattern:

  1. Curate 100 to 1,000 prompts that represent the production distribution.
  2. Attach an expected behaviour or rubric to each prompt.
  3. Run evaluators in CI on every PR.
  4. Gate merge on regression past a tolerance threshold.

With fi.evals:

from fi.evals import evaluate

prompts = [
    {"input": "Return the EU population in 2024.", "context": "EU population was 449 million in 2024."},
    {"input": "Summarise the refund policy.", "context": "Refunds in 30 days, no exceptions."},
]

scores = []
for p in prompts:
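    # run_app is your application entrypoint; score each output for faithfulness to its context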
    result = evaluate("faithfulness", output=run_app(p["input"]), context=p["context"])
    scores.append(result.score)

avg = sum(scores) / len(scores)
assert avg >= 0.85, f"Faithfulness regressed to {avg:.2f}"

For subjective qualities, use CustomLLMJudge:

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    name="support_tone",
    rubric="Score 1 if the reply is empathetic and concise, else 0.",
    provider=LiteLLMProvider(model="gpt-4o-mini"),
)

for p in prompts:
    out = run_app(p["input"])
    r = judge.evaluate(output=out)
    print(p["input"], r.score)

Layer 2: online evaluation on live traffic

Pre-deploy evals catch known regressions on a fixed set. They do not catch the long tail of what users actually ask. Online evaluation samples a slice of live traffic and scores it with the same rubrics, giving you a daily signal on the real distribution.

Set up:

  1. Sample a fraction of live traffic (5 to 25 percent is typical).
  2. Attach the same evaluators used in CI.
  3. Stream scores into a time-series dashboard.
  4. Alert on day-over-day drops past a threshold.

traceAI captures each call as an OpenTelemetry span, including the model input, output, and any tool spans. Online evaluators run asynchronously against the sampled span and attach the score back to the trace.

from fi_instrumentation import register, FITracer

tracer_provider = register(project_name="support-bot", project_type="agent")
tracer = FITracer(tracer_provider.get_tracer(__name__))

@tracer.chain
def handle(message: str) -> str:
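    # retrieve and generate stand in for your own retrieval and generation steps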
    context = retrieve(message)
    reply = generate(message, context)
    return reply

Spans land in the Future AGI console where you can attach an evaluator (faithfulness over the retrieved context) to a sampled slice of traffic.
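If you prefer to run the online scorer in your own worker instead of attaching it in the console, a minimal sketch might look like the following. The evaluate call mirrors the Layer 1 example; the sample rate and the print sink are assumptions standing in for your own pipeline.

import random

from fi.evals import evaluate

SAMPLE_RATE = 0.10  # score roughly 10 percent of live traffic

def maybe_score(reply: str, context: str) -> None:
    # Quality evals only need a sample; skip most requests.
    if random.random() > SAMPLE_RATE:
        return
    # Same evaluator and rubric as the CI gate, so offline and online scores stay comparable.
    result = evaluate("faithfulness", output=reply, context=context)
    print("faithfulness", result.score)  # or push to your time-series store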

Layer 3: safety guardrails

Sampling is fine for quality metrics. Safety is not negotiable and must run on 100 percent of traffic. The standard pattern is a pre-call input screen and a post-call output screen.

from fi.evals.guardrails import Guardrails, GuardrailModel

screener = Guardrails(models=[GuardrailModel.TURING_FLASH])

def safe_handle(message: str) -> str:
    verdict = screener.screen_input(user_text=message)
    if verdict.flagged:
        return "I cannot help with that request."
    reply = handle(message)
    out_verdict = screener.screen_output(model_text=reply)
    if out_verdict.flagged:
        return "I cannot share that information."
    return reply

The turing_flash model returns in about 1 to 2 seconds of cloud latency and covers prompt injection, PII, toxicity, and category-specific policy violations. Choose turing_small (about 2 to 3 seconds) or turing_large (about 3 to 5 seconds) for higher-recall screens on high-risk surfaces.

Layer 4: drift and regression monitoring

Online evaluator scores form a time series. Drift monitoring compares today’s distribution to the last green window and alerts when it slips.

A reasonable default policy:

  • Track a rolling 7-day mean and standard deviation per evaluator.
  • Alert if today’s mean drops more than 1.5 standard deviations.
  • Page on safety-metric drops, ticket on quality drops.

The Future AGI console handles this natively. Drift monitoring usually catches issues like a model provider quietly updating a model behind the same name, a retrieval index growing stale, a prompt template change introducing a regression, or an upstream tool change breaking agent plans.
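The same default policy is easy to reproduce outside the console. A minimal sketch, assuming you already aggregate one mean score per evaluator per day:

from statistics import mean, stdev

def drift_alert(window: list[float], today_mean: float, k: float = 1.5) -> bool:
    # Alert when today's mean drops more than k standard deviations
    # below the rolling window (for example, the last seven green days).
    return today_mean < mean(window) - k * stdev(window)

history = [0.88, 0.87, 0.89, 0.88, 0.90, 0.87, 0.88]  # last 7 daily faithfulness means
print(drift_alert(history, today_mean=0.79))  # True: ticket on quality drops, page on safety drops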

Layer 5: agent trace observability

For agent products, the trace is the unit of analysis. A single user message can spawn a 12-step plan with retries, partial tool failures, and intermediate decisions. Trace-level observability is the only way to debug these.

traceAI captures every step as an OpenTelemetry span. The console lets you:

  • Filter to traces where a specific tool failed or an evaluator scored low.
  • Replay a trajectory step by step.
  • Attach a step-level evaluator (tool selection accuracy on a known input).
  • Run a regression suite via fi.simulate against a held-out trace set.

from fi.simulate import TestRunner, AgentInput, AgentResponse

def my_agent(input: AgentInput) -> AgentResponse:
    reply = handle(input.message)
    return AgentResponse(content=reply)

runner = TestRunner(agent=my_agent)
report = runner.run(suite_id="support-regression-v3")
print(report.pass_rate, report.regressions)

What to actually measure

For most production GenAI systems, five metric families cover the working surface.

| Family | Example metrics | Where it lives |
| --- | --- | --- |
| Faithfulness / RAG | faithfulness, groundedness, context relevance | fi.evals managed |
| Agent quality | tool selection accuracy, retry rate, plan validity | fi.simulate + traceAI |
| Safety | prompt injection, PII, toxicity, jailbreak resistance | fi.evals.guardrails |
| User outcome | resolution rate, CSAT, deflection, follow-up rate | product analytics + fi.evals |
| Operational | latency p50/p95, cost per query, cache hit rate | gateway + traceAI |

The mistake to avoid is over-instrumenting. Pick 1 to 3 metrics per family that map to a known failure mode, score them well, and skip the rest until they show up in a postmortem.

LLM-as-judge calibration

LLM-as-judge is the only practical way to score subjective qualities at scale. It is also the noisiest signal in the stack. Calibration is the difference between a useful evaluator and a vibes-driven dashboard.

The standard pattern:

  1. Human-label a calibration set. 50 to 200 prompts is usually enough. Two annotators per prompt, with disagreements resolved by a third.
  2. Tune the rubric. Iterate on the rubric text until judge scores correlate with human labels at Cohen’s kappa above 0.6.
  3. Pick a model. Smaller models like turing_flash or gpt-4o-mini are usually fine for binary rubrics. Reserve frontier models for complex multi-criteria rubrics.
  4. Audit on a held-out set. Hold back 20 percent of the calibration set and never expose it to the judge. Spot-check drift quarterly.
  5. Multi-judge for high-stakes. For decisions that gate releases, run two independent judges and require agreement.

CustomLLMJudge wraps the mechanics: you set the rubric and pick the provider. The calibration loop is the discipline of treating the judge like a tuned classifier, not a magic oracle.
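A minimal sketch of the kappa check in step 2, using scikit-learn and hypothetical labels; 0.6 is the threshold from the rubric-tuning step above:

from sklearn.metrics import cohen_kappa_score

# Adjudicated human labels and judge scores on the same calibration prompts (binary rubric).
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human, judge)
print(f"kappa = {kappa:.2f}")
if kappa < 0.6:
    print("Iterate on the rubric before trusting the judge.")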

Setting up a working stack

A reasonable phased rollout for a new GenAI product:

Week 1: instrumentation and CI

  • Install fi_instrumentation, set FI_API_KEY and FI_SECRET_KEY, decorate the main entrypoint with @tracer.chain or @tracer.agent.
  • Curate a 100-prompt held-out CI set.
  • Wire fi.evals into the CI runner with merge gating.
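A minimal sketch of that merge gate as a pytest test, reusing the Layer 1 evaluator; the evals/holdout.json path and the run_app import are assumptions about your repo layout:

import json

from fi.evals import evaluate
from app import run_app  # hypothetical import of your application entrypoint

def test_faithfulness_gate():
    with open("evals/holdout.json") as f:  # held-out prompts curated in week 1
        prompts = json.load(f)
    scores = [
        evaluate("faithfulness", output=run_app(p["input"]), context=p["context"]).score
        for p in prompts
    ]
    avg = sum(scores) / len(scores)
    # Failing the assert fails the CI check and blocks the merge.
    assert avg >= 0.85, f"Faithfulness regressed to {avg:.2f}"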

Week 2: online and safety

  • Sample 10 percent of live traffic into Future AGI.
  • Attach faithfulness + a custom tone judge as online evaluators.
  • Wrap inputs and outputs with Guardrails(models=[GuardrailModel.TURING_FLASH]).
  • Set up drift alerts on the online evaluators.

Week 3 onwards: agent and gateway

  • Route across providers (BYOK) through the Agent Command Center at /platform/monitor/command-center.
  • Set per-route policy (model, max cost, guardrail strictness).
  • Add fi.simulate regression suites for the agent’s known good and known bad trajectories.

Case studies (anonymised)

RAG over private docs. A support team’s bot was scoring 92 percent on a static eval set, but users complained about hallucinations. Online eval on faithfulness against retrieved context showed 78 percent in production. The gap came from the static set covering only high-frequency questions. Stratified sampling on the live distribution and 50 new long-tail prompts in CI closed the gap to 89 percent within three weeks.

Customer support agent. Tool selection accuracy dropped 8 points after a model provider quietly updated their underlying model. Online drift monitoring caught it within 24 hours. The team pinned the model version through the Agent Command Center and added a per-step tool-selection evaluator.

Document review assistant. Pre-deploy faithfulness was 88 percent. Live deployment showed 71 percent because real user documents had layout patterns missing from the test set. Filtering traceAI spans on retrieval recall and feeding low-recall documents back into CI as new prompts surfaced the source of the gap.

Trustworthy GenAI evaluation is a continuous loop

Production GenAI evaluation is not a quarterly review or a launch checklist. It is a continuous loop: CI on every PR, online evaluators on every shift, guardrails on every call, drift monitors over every week, and trace replays for every regression. The teams that ship reliable GenAI in 2026 treat evaluation infrastructure with the same rigour they apply to testing and observability in any other software domain.

Future AGI is built around exactly this loop: traceAI for OpenTelemetry-style observability (Apache 2.0), fi.evals for evaluators and judges, fi.simulate for agent regression, fi.evals.guardrails for safety screens, and the Agent Command Center at /platform/monitor/command-center for gateway and policy.

Frequently asked questions

What does it mean to evaluate GenAI in production?
Production evaluation means scoring live AI outputs against quality metrics on an ongoing basis, not just running benchmarks before launch. It combines pre-deployment CI evaluation on a held-out test set, online evaluators that score sampled production traffic, safety guardrails that block harmful inputs and outputs inline, and drift monitoring that flags when scores degrade. The goal is to catch regressions and harmful outputs within minutes, not weeks.
Why are benchmarks like MMLU not enough?
Benchmarks like MMLU, HELM, BigBench, and GPQA measure model capability on static multiple-choice tasks. Production GenAI faces noisy user inputs, multi-turn context, tool calls, retrieval over private data, and adversarial prompts. A model can top a benchmark and still hallucinate, leak PII, or fail a long-tail task. Benchmarks are useful as a coarse capability filter, but production evaluation needs task-specific evaluators on your own prompts and outputs.
How often should I run evaluations in production?
Pre-deployment evaluators should run on every pull request, gated as a CI check. Online evaluators on production traffic should sample continuously at a rate that fits your evaluation budget (commonly 5 to 25 percent of traffic). Critical safety screens (prompt injection, PII) should run on 100 percent of traffic inline. Drift comparisons should be computed at least daily, with alerting on a regression threshold.
What is LLM-as-a-judge and is it reliable?
LLM-as-a-judge uses a language model to score outputs against a rubric. It is the only practical way to evaluate subjective qualities (tone, helpfulness, faithfulness) at scale, but raw judges have well-documented biases (position, verbosity, self-preference). Reliable LLM-as-judge requires rubric calibration against a small human-labeled set, multi-judge ensembling, and ongoing audit. Future AGI's CustomLLMJudge wraps this pattern as a single class.
How do I evaluate an agent versus a single LLM call?
Agent evaluation has to score the full trajectory: tool choices, retries, plan quality, and final output. Single-call evaluators are insufficient because most agent failures are sequencing failures, not bad final tokens. The standard approach is to capture OpenTelemetry-style spans for every tool call, attach per-step evaluators (tool selection accuracy, hallucination per step), and use a simulation harness (fi.simulate) to replay agents on a fixed test set.
What metrics actually matter for production GenAI?
Five categories cover most production needs: faithfulness and groundedness for RAG systems, task success and tool selection accuracy for agents, safety metrics (toxicity, PII, prompt injection) for any user-facing surface, user-side outcomes (resolution rate, deflection, CSAT) for support and assistant products, and operational metrics (latency, cost per query, cache hit rate) for the platform layer.
How is GenAI evaluation different from traditional ML evaluation?
Traditional ML evaluation centers on labeled test sets, accuracy and F1 scores, and feature-level drift. GenAI evaluation centers on rubric-based scoring of free-form outputs, judge calibration, and trajectory-level evaluation for agents. The infrastructure is also different: ML drift detection compares feature distributions, while GenAI drift compares evaluator scores over time. Many teams keep their existing ML monitoring for tabular models and add a separate GenAI evaluation stack for LLM and agent surfaces.
What does a complete production GenAI evaluation stack look like in 2026?
A minimum production stack has: an OpenTelemetry-style instrumentation layer (traceAI) capturing every model call, tool invocation, and retry; a CI evaluation suite (fi.evals) running on every PR; an online evaluator pipeline scoring a sample of live traffic; pre and post-call guardrails screening for prompt injection, PII, and toxicity; a drift monitor comparing evaluator scores day over day; and a gateway (Agent Command Center) routing across providers with policy enforcement.