Evaluating GenAI in Production 2026: The Full Framework
How to evaluate GenAI in production in 2026. Pre-deploy CI evals, online metrics, LLM-as-judge calibration, drift, safety, and how to stand up a working stack.
Evaluating GenAI in production in 2026
Production GenAI is no longer a single LLM call returning a string. It is a multi-step agent, a RAG pipeline over private data, a gateway routing across providers, and a guardrail layer screening every input and output. Evaluating it well in 2026 means running five layers of evaluation in parallel: CI evals before deploy, online evals on live traffic, inline safety guardrails, drift monitoring across time, and trace-level observability for agents.
This is the full framework. It covers what to evaluate, how to score it, how often, and what to wire it into. It is the same stack we run internally at Future AGI and that we ship as a product.
TL;DR: the five layers of production GenAI evaluation
| Layer | What it scores | Cadence | Tools |
|---|---|---|---|
| Pre-deploy CI eval | Held-out prompts on every PR | Per commit | fi.evals + CI runner |
| Online traffic eval | Sampled live outputs | Continuous | fi.evals + traceAI |
| Safety guardrails | 100% of inputs + outputs | Inline | fi.evals.guardrails |
| Drift monitoring | Eval scores over time | Daily | Future AGI console |
| Trace observability | Agent trajectories | On every run | traceAI (OTel) |
Why benchmarks are not enough
Benchmarks (MMLU, HELM, BigBench, GPQA, MMLU-Pro) measure model capability on static multiple-choice tasks. Production GenAI faces:
- Noisy user inputs (slang, typos, mixed languages).
- Multi-turn context with reference resolution across turns.
- Tool calls with retries and partial failures.
- Retrieval over private data the model never saw in pretraining.
- Adversarial prompts designed to extract data or bypass policy.
A model can top a benchmark and still hallucinate on your customer support corpus, leak PII through a tool call, or pick the wrong tool 30 percent of the time. Benchmarks are a coarse capability filter for model selection; they are not a substitute for evaluation on your own prompts, outputs, and traces.
The five layers
Layer 1: pre-deployment CI evaluation
The first line of defence is a held-out test set scored on every pull request. The pattern:
- Curate 100 to 1,000 prompts that represent the production distribution.
- Attach an expected behaviour or rubric to each prompt.
- Run evaluators in CI on every PR.
- Gate merge on regression past a tolerance threshold.
With fi.evals:
```python
from fi.evals import evaluate

prompts = [
    {"input": "Return the EU population in 2024.", "context": "EU population was 449 million in 2024."},
    {"input": "Summarise the refund policy.", "context": "Refunds in 30 days, no exceptions."},
]

scores = []
for p in prompts:
    result = evaluate("faithfulness", output=run_app(p["input"]), context=p["context"])
    scores.append(result.score)

avg = sum(scores) / len(scores)
assert avg >= 0.85, f"Faithfulness regressed to {avg:.2f}"
```
For subjective qualities use CustomLLMJudge:
```python
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    name="support_tone",
    rubric="Score 1 if the reply is empathetic and concise, else 0.",
    provider=LiteLLMProvider(model="gpt-4o-mini"),
)

for p in prompts:
    out = run_app(p["input"])
    r = judge.evaluate(output=out)
    print(p["input"], r.score)
```
Layer 2: online evaluation on live traffic
Pre-deploy evals catch known regressions on a fixed set. They do not catch the long tail of what users actually ask. Online evaluation samples a slice of live traffic and scores it with the same rubrics, giving you a daily signal on real distribution.
Set up:
- Sample a fraction of live traffic (5 to 25 percent is typical).
- Attach the same evaluators used in CI.
- Stream scores into a time-series dashboard.
- Alert on day-over-day drops past a threshold.
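The sampling step is easy to get subtly wrong: random per-call sampling means different services disagree about which traces are in the eval slice. A deterministic hash of the trace ID keeps the decision stable everywhere. A minimal sketch in plain Python (not part of the traceAI API; the function name is ours):

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically decide whether a trace is in the eval sample.

    Hashing the trace ID makes the decision reproducible across
    services, so every span of a sampled trace is scored together.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash onto [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the hash is uniform, the realised sample fraction converges on the configured rate, and the same trace always gets the same verdict.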
traceAI captures each call as an OpenTelemetry span, including the model input, output, and any tool spans. Online evaluators run asynchronously against the sampled span and attach the score back to the trace.
```python
from fi_instrumentation import register, FITracer

tracer_provider = register(project_name="support-bot", project_type="agent")
tracer = FITracer(tracer_provider.get_tracer(__name__))

@tracer.chain
def handle(message: str) -> str:
    context = retrieve(message)
    reply = generate(message, context)
    return reply
```
Spans land in the Future AGI console where you can attach an evaluator (faithfulness over the retrieved context) to a sampled slice of traffic.
Layer 3: safety guardrails
Sampling is fine for quality metrics. Safety is not negotiable and must run on 100 percent of traffic. The standard pattern is a pre-call input screen and a post-call output screen.
```python
from fi.evals.guardrails import Guardrails, GuardrailModel

screener = Guardrails(models=[GuardrailModel.TURING_FLASH])

def safe_handle(message: str) -> str:
    verdict = screener.screen_input(user_text=message)
    if verdict.flagged:
        return "I cannot help with that request."
    reply = handle(message)
    out_verdict = screener.screen_output(model_text=reply)
    if out_verdict.flagged:
        return "I cannot share that information."
    return reply
```
The turing_flash model returns in about 1 to 2 seconds of cloud latency and covers prompt injection, PII, toxicity, and category-specific policy violations. Choose turing_small (about 2 to 3 seconds) or turing_large (about 3 to 5 seconds) for higher-recall screens on high-risk surfaces.
Layer 4: drift and regression monitoring
Online evaluator scores form a time series. Drift monitoring compares today’s distribution to the last green window and alerts when it slips.
A reasonable default policy:
- Track a rolling 7-day mean and standard deviation per evaluator.
- Alert if today’s mean drops more than 1.5 standard deviations.
- Page on safety-metric drops, ticket on quality drops.
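The default policy above fits in a few lines of plain Python. The 7-day window and 1.5-standard-deviation threshold come from the list; the function name and data shapes are illustrative:

```python
from statistics import mean, stdev

def drift_alert(history: list[float], today: float,
                window: int = 7, k: float = 1.5) -> bool:
    """Alert when today's mean eval score drops more than k standard
    deviations below the rolling-window mean."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to estimate variance
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return today < mu  # any drop from a flat baseline is notable
    return today < mu - k * sigma
```

Wired into the alerting path, a safety evaluator that trips this check pages on-call; a quality evaluator files a ticket.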
The Future AGI console handles this natively. Drift monitoring usually catches issues like a model provider quietly updating a model behind the same name, a retrieval index growing stale, a prompt template change introducing a regression, or an upstream tool change breaking agent plans.
Layer 5: agent trace observability
For agent products, the trace is the unit of analysis. A single user message can spawn a 12-step plan with retries, partial tool failures, and intermediate decisions. Trace-level observability is the only way to debug these.
traceAI captures every step as an OpenTelemetry span. The console lets you:
- Filter to traces where a specific tool failed or an evaluator scored low.
- Replay a trajectory step by step.
- Attach a step-level evaluator (tool selection accuracy on a known input).
- Run a regression suite via fi.simulate against a held-out trace set.
```python
from fi.simulate import TestRunner, AgentInput, AgentResponse

def my_agent(input: AgentInput) -> AgentResponse:
    reply = handle(input.message)
    return AgentResponse(content=reply)

runner = TestRunner(agent=my_agent)
report = runner.run(suite_id="support-regression-v3")
print(report.pass_rate, report.regressions)
```
What to actually measure
For most production GenAI systems, five metric families cover the working surface.
| Family | Example metrics | Where it lives |
|---|---|---|
| Faithfulness / RAG | faithfulness, groundedness, context relevance | fi.evals managed |
| Agent quality | tool selection accuracy, retry rate, plan validity | fi.simulate + traceAI |
| Safety | prompt injection, PII, toxicity, jailbreak resistance | fi.evals.guardrails |
| User outcome | resolution rate, CSAT, deflection, follow-up rate | product analytics + fi.evals |
| Operational | latency p50/p95, cost per query, cache hit rate | gateway + traceAI |
The mistake to avoid is over-instrumenting. Pick 1 to 3 metrics per family that map to a known failure mode, score them well, and skip the rest until they show up in a postmortem.
LLM-as-judge calibration
LLM-as-judge is the only practical way to score subjective qualities at scale. It is also the noisiest signal in the stack. Calibration is the difference between a useful evaluator and a vibes-driven dashboard.
The standard pattern:
- Human-label a calibration set. 50 to 200 prompts is usually enough. Two annotators per prompt, with disagreements resolved by a third.
- Tune the rubric. Iterate on the rubric text until judge scores correlate with human labels at Cohen’s kappa above 0.6.
- Pick a model. Smaller models like turing_flash or gpt-4o-mini are usually fine for binary rubrics. Reserve frontier models for complex multi-criteria rubrics.
- Audit on a held-out set. Hold back 20 percent of the calibration set and never expose it to the judge. Spot-check drift quarterly.
- Multi-judge for high-stakes. For decisions that gate releases, run two independent judges and require agreement.
CustomLLMJudge wraps the mechanics: you set the rubric and pick the provider. The calibration loop is the discipline you add on top, treating the judge as a tuned classifier rather than a magic oracle.
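The agreement gate in step 2 needs no library. A minimal Cohen's kappa for two binary raters, for example judge scores against human labels (the helper name is ours, not part of fi.evals):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary raters: observed agreement
    corrected for the agreement expected by chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label rates.
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always agree
    return (observed - expected) / (1 - expected)
```

A judge whose kappa against human labels stays above 0.6 is giving you signal; one near 0 is agreeing with humans no more often than a coin flip would.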
Setting up a working stack
A reasonable phased rollout for a new GenAI product:
Week 1: instrumentation and CI
- Install fi_instrumentation, set FI_API_KEY and FI_SECRET_KEY, decorate the main entrypoint with @tracer.chain or @tracer.agent.
- Curate a 100-prompt held-out CI set.
- Wire fi.evals into the CI runner with merge gating.
Week 2: online and safety
- Sample 10 percent of live traffic into Future AGI.
- Attach faithfulness + a custom tone judge as online evaluators.
- Wrap inputs and outputs with Guardrails(models=[GuardrailModel.TURING_FLASH]).
- Set up drift alerts on the online evaluators.
Week 3 onwards: agent and gateway
- Move to /platform/monitor/command-center for routing across providers (BYOK).
- Set per-route policy (model, max cost, guardrail strictness).
- Add fi.simulate regression suites for the agent’s known good and known bad trajectories.
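As an illustration of per-route policy only: the Agent Command Center's actual schema is not shown in this article, so every key and value below is a hypothetical shape, not the real config format:

```python
# Hypothetical per-route policy map; key names are illustrative,
# not the real Agent Command Center schema.
ROUTE_POLICIES = {
    "support-bot": {
        "model": "gpt-4o-mini",
        "max_cost_usd_per_query": 0.02,
        "guardrail_strictness": "high",
    },
    "internal-search": {
        "model": "claude-3-5-haiku",
        "max_cost_usd_per_query": 0.005,
        "guardrail_strictness": "standard",
    },
}

def policy_for(route: str) -> dict:
    """Unknown routes fall back to the strictest defaults."""
    default = {
        "model": "gpt-4o-mini",
        "max_cost_usd_per_query": 0.01,
        "guardrail_strictness": "high",
    }
    return ROUTE_POLICIES.get(route, default)
```

The design point is the fallback: an unconfigured route should inherit the most conservative policy, never a permissive one.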
Case studies (anonymised)
RAG over private docs. A support team’s bot was scoring 92 percent on a static eval set but users complained about hallucinations. Online eval on faithfulness against retrieved context showed 78 percent in production. The gap was the static set covering only high-frequency questions. Adding stratified sampling on the live distribution and adding 50 long-tail prompts to CI closed the gap to 89 percent within three weeks.
Customer support agent. Tool selection accuracy dropped 8 points after a model provider quietly updated their underlying model. Online drift monitoring caught it within 24 hours. The team pinned the model version through the Agent Command Center and added a per-step tool-selection evaluator.
Document review assistant. Pre-deploy faithfulness was 88 percent. Live deployment showed 71 percent because real user documents had layout patterns missing from the test set. Adding traceAI span filtering on retrieval recall and feeding low-recall documents back into CI as new prompts surfaced the gap.
Trustworthy GenAI evaluation is a continuous loop
Production GenAI evaluation is not a quarterly review or a launch checklist. It is a continuous loop: CI on every PR, online evaluators on every shift, guardrails on every call, drift monitors over every week, and trace replays for every regression. The teams that ship reliable GenAI in 2026 treat evaluation infrastructure with the same rigour they apply to testing and observability in any other software domain.
Future AGI is built around exactly this loop. traceAI for OpenTelemetry-style observability (Apache 2.0), fi.evals for evaluators and judges, fi.simulate for agent regression, fi.evals.guardrails for safety screens, and the Agent Command Center at /platform/monitor/command-center for gateway and policy.