How to Build an LLM Evaluation Framework From Scratch (2026)
Building an LLM eval framework is a one-week project and a one-year maintenance burden. The eight components, the honest cost map, and what to build vs buy.
You can write an LLM eval framework in two hundred lines of Python over a weekend. The hello-world demos beautifully. Six months in you’ve got three judges with three parsing regexes, a JSONL dataset nobody updated since November, a CI gate that fails on judge-model drift instead of prompt regressions, and the second engineer on rotation just asked “how does this even work.” The framework didn’t fail. The 80 percent of engineering effort the tutorials skipped did.
The hello-world is easy. Parsing variance, judge calibration, distributed orchestration, OTel emission, dataset versioning, and the four-line refactor that triggers every time a vendor changes their model are the framework.
The opinion this post earns: building an LLM evaluation framework from scratch is a one-week project and a one-year maintenance burden. Build the rubric. Buy the runner. The rubric encodes your domain (legal citation validity, clinical refusal taxonomy, ad-policy adherence). It’s hours of work and worth every minute because nobody else can write it. The runner is the judge engine, retry layer, cache, parallel executor, OTel emitter, classifier cascade, and self-improvement loop. Same six engineers at every company, same six months, same parsing bug. No edge in writing it twice.
This guide walks the build component by component, with the honest cost map for each. Code shaped against the ai-evaluation SDK and the fi CLI.
TL;DR: the eight components, scored build-or-buy
| Component | What it does | Build? | Why |
|---|---|---|---|
| 1. Rubric registry | Versioned scoring definitions | Build | Encodes your domain; nobody else can write it |
| 2. Dataset schema | Typed cases, JSONL, route tags | Build | The dataset is your worldview |
| 3. Judge engine | Model call, parse, retry, cache | Buy | Parsing variance and drift, same everywhere |
| 4. Parallel execution | Sharded, distributed for scale | Buy | Celery/Ray/Temporal wiring is not your edge |
| 5. Statistical gating | Welch’s t-test on deltas, percentile floors | Build | Thresholds are policy; math is library |
| 6. OTel emission | Span-attached scoring on live traffic | Buy | 50+ AI surfaces, four languages |
| 7. Calibration loop | Inter-rater reliability vs human labels | Build | The labels are yours; the math is shared |
| 8. Closed loop | Failing traces cluster, promote to dataset | Buy | HDBSCAN over span embeddings is generic |
Build four, buy four. The four you build make the framework specific to your product. The four you buy look identical at every company that gets it right.
The two questions that decide everything
Specific or generic? Rubrics are specific. Judge engines are not; every team writes the same backoff loop, the same regex on integer output, the same cache key. Specific work is worth owning.
Compounds or depreciates? A dataset that grows from production failures compounds; year-two you have a regression suite no competitor can replicate. A custom OTel emitter depreciates; the moment a vendor adds a span kind you don’t support, you’re behind.
Build the components that score high on both. Buy the rest.
Components 1 and 2: rubric registry and dataset (build)
Both belong in the same git history as the agent code. The rubric holds the prompt template, scoring scale, and pinned judge model; the dataset holds typed cases tagged by route.
```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Rubric:
    name: str
    version: str
    judge_model: str  # pin; bump deliberately
    template: str  # {input}, {output}, {expected}, {context}
    scale: tuple[int, int] = (1, 5)
    parser: Optional[Callable[[str], float]] = None
    route: Optional[str] = None


@dataclass
class EvalCase:
    id: str
    route: str
    input: str
    expected_output: Optional[str]
    retrieval_context: Optional[list[str]] = None
    expected_tool_calls: Optional[list[dict]] = None
    tenant_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)
```
Six rules decide whether these earn their keep:
- Rubrics in git, not a database. Diff invisibility is how drift compounds.
- Versioned with the prompt. Same PR, traceable regressions.
- Decomposed, not fuzzy. “Rate the quality” is noise. “Every claim supported, 5 = none unsupported, 1 = largely fabricated” is signal. Decomposed rubrics run 0.20-0.30 higher inter-rater reliability on public datasets.
- Dataset sampled from production. A 200-case set from imagination loses to a 100-case set from real traces. Error Feed (Component 8) promotes weekly.
- Skew to failures. Most regression signal comes from the hardest 10 percent.
- Route-tag every case. A PR touching the legal-RAG prompt shouldn’t rerun support-bot evals.
The 72-class ai-evaluation SDK ships Groundedness, ContextAdherence, Completeness, FactualAccuracy, AnswerRefusal, IsHarmfulAdvice, EvaluateFunctionCalling, TaskCompletion, and the rest as EvalTemplate classes. Use them where they fit; subclass CustomLLMJudge where they don’t. Dataset sweet spot: 100-200 cases per route for PR-blocking. Below 100, variance buries the signal; above 500, the bill grows faster than detection sharpens.
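The route-tagging rule is easiest to see as a filter over the dataset. The sketch below is a minimal illustration, not part of the SDK: `parse_cases`, `select_cases`, the trimmed-down `EvalCase`, and the sample rows are all hypothetical names and data chosen for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    # Trimmed-down version of the EvalCase schema, for illustration only.
    id: str
    route: str
    input: str
    expected_output: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def parse_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-empty line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        if line.strip():
            cases.append(EvalCase(**json.loads(line)))
    return cases


def select_cases(cases: list[EvalCase], touched_routes: list[str]) -> list[EvalCase]:
    """Keep only the cases whose route tag was touched by the PR."""
    touched = set(touched_routes)
    return [case for case in cases if case.route in touched]


# Hypothetical sample rows standing in for a real JSONL file.
SAMPLE = "\n".join([
    '{"id": "c1", "route": "legal-rag", "input": "Cite the statute."}',
    '{"id": "c2", "route": "support-bot", "input": "Reset my password."}',
    '{"id": "c3", "route": "legal-rag", "input": "Is this clause enforceable?"}',
])

selected = select_cases(parse_cases(SAMPLE), ["legal-rag"])
print([case.id for case in selected])
```

How the list of touched routes gets derived from a PR (path globs, a manifest, a label) is a policy choice the sketch leaves open.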
Components 3 and 4: judge engine and parallel execution (buy)
Three failure modes you didn’t budget for, every one a quarter of an engineer’s time.
Output parsing variance. Every judge returns scores differently. Anthropic prefers <score>4</score>; GPT-4 class models tend toward Score: 4; Gemini wraps in markdown; a stronger model rambles first and tucks the integer at the end. Your regex catches 95 percent on day one. On day 90 a vendor ships a finetune that wraps numbers in Unicode bold and your parser silently returns the scale midpoint on the long tail. The bug shows up as “evals trending up” because midpoint scores anchor toward the mean.
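One way to avoid the silent-midpoint failure is a parser that raises when it cannot find an in-range score, so a parse failure shows up as an error rather than as a plausible number. The following is a rough sketch under stated assumptions: the three patterns are illustrative guesses at common reply shapes, not a list any vendor documents, and `parse_score` and `ScoreParseError` are hypothetical names.

```python
import re
import unicodedata


class ScoreParseError(ValueError):
    """Raised when no in-range score can be extracted from judge output."""


# Patterns are tried in order. The last one is a fallback for replies that
# ramble first and put the number at the very end.
_PATTERNS = [
    re.compile(r"<score>\s*(\d+(?:\.\d+)?)\s*</score>", re.IGNORECASE),
    re.compile(r"score\W{0,5}(\d+(?:\.\d+)?)", re.IGNORECASE),
    re.compile(r"(\d+(?:\.\d+)?)\D*$"),
]


def parse_score(raw: str, scale: tuple[int, int] = (1, 5)) -> float:
    # NFKC folds styled digits (for example mathematical bold) to ASCII digits.
    text = unicodedata.normalize("NFKC", raw)
    low, high = scale
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            value = float(match.group(1))
            if low <= value <= high:
                return value
    # Never fall back to the scale midpoint: surface the failure instead.
    raise ScoreParseError(f"no score in [{low}, {high}] found in: {raw[:80]!r}")
```

Counting `ScoreParseError` per judge model over time gives an early signal that a vendor changed its output format, before the scores themselves start to move.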
Calibration drift. A judge that scored 4.1 last month scores 4.3 on the same input after a vendor bump. Your CI gate fires on prompts nobody changed. Pinned model, versioned judge, and the calibration loop in Component 7 are the fix.
Cost economics. A frontier judge at 1.2 seconds per call against a 10K-example nightly suite plus 5 percent of production traffic is a six-figure quarterly bill. The classifier cascade (NLI classifier triages, frontier judge adjudicates the long tail) is the lever, and it’s a months-long build to get right. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2; the cascade comes built in.
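The cascade idea can be sketched in a few lines: a cheap classifier settles the confident cases, and only the uncertain band reaches the frontier judge. This is an illustration of the routing logic only, not the Platform's implementation. The thresholds, `cascade_score`, and both stub scorers are hypothetical, and real thresholds would need tuning against human labels.

```python
from typing import Callable

# Hypothetical thresholds: tune them against a human-labeled hold-out.
PASS_ABOVE = 0.90
FAIL_BELOW = 0.10


def cascade_score(
    case: dict,
    cheap_classifier: Callable[[dict], float],
    frontier_judge: Callable[[dict], float],
) -> tuple[float, str]:
    """Return (score, tier). Escalate only when the classifier is unsure."""
    p_supported = cheap_classifier(case)  # probability in [0, 1]
    if p_supported >= PASS_ABOVE:
        return 1.0, "classifier"
    if p_supported <= FAIL_BELOW:
        return 0.0, "classifier"
    return frontier_judge(case), "judge"


# Stub scorers standing in for an NLI classifier and a frontier judge.
judge_calls = []


def stub_classifier(case: dict) -> float:
    return case["p"]


def stub_judge(case: dict) -> float:
    judge_calls.append(case["id"])
    return 0.5


cases = [{"id": "a", "p": 0.97}, {"id": "b", "p": 0.03}, {"id": "c", "p": 0.55}]
results = [cascade_score(c, stub_classifier, stub_judge) for c in cases]
print(results, judge_calls)
```

The share of cases that land in the uncertain band is what sets the bill, which is why the thresholds deserve the same review as the rubric.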
Parallel execution piles on. A 200-case suite at 1.2 seconds per judge call takes four minutes serially. Sharded across 16 workers, that drops to roughly 15 to 30 seconds, depending on backoff and stragglers. Above the single-host rate limit, you need a distributed runner (Celery, Ray, Temporal, Kubernetes) and a job graph that handles retries, partial failures, and budget caps. The SDK ships all four runners as drop-in.
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator(max_workers=16)  # FI_API_KEY / FI_SECRET_KEY from env

custom_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "faithfulness_v3",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Score whether the answer is faithful to the retrieval context. "
            "1 = largely fabricated. 5 = every claim supported."
        ),
    },
)

# run_agent and load_dataset are your own helpers, not SDK imports.
result = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextRelevance()],
    inputs=[
        TestCase(
            input=ex.input,
            output=run_agent(ex),
            context="\n\n".join(ex.retrieval_context or []),
            id=ex.id,
        )
        for ex in load_dataset("evals/datasets/support-bot.jsonl")
    ],
)
```
Pin the model. Cache on (rubric_version, input, output, model). Invalidate on version change, not on every PR. What you’d write yourself otherwise: parsing variance handler, Pydantic output schema, Jinja2 templates, worker pool, rate-limit backoff, partial-failure recovery, budget cap, sharding, dead-letter queue. None of it is your edge.
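The cache key described above can be sketched as a hash over the four fields. This is a minimal illustration, assuming an in-memory dictionary as the store; `judge_cache_key` and `JudgeCache` are hypothetical names, and a real runner would back the store with something persistent.

```python
import hashlib
import json


def judge_cache_key(rubric_version: str, judge_model: str,
                    case_input: str, agent_output: str) -> str:
    """Stable key over (rubric_version, input, output, model)."""
    payload = json.dumps(
        {
            "rubric_version": rubric_version,
            "model": judge_model,
            "input": case_input,
            "output": agent_output,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory store; the judge is called only on a cache miss."""

    def __init__(self):
        self._store = {}

    def get_or_score(self, key, score_fn):
        if key not in self._store:
            self._store[key] = score_fn()
        return self._store[key]
```

Because the rubric version is part of the key, bumping the version invalidates old entries without any explicit purge, while an unrelated PR that leaves input, output, rubric, and model unchanged keeps hitting the cache.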
Component 5: statistical gating
Build this. The math is library; the thresholds are policy.
The mistake every team makes once: “fail the PR if mean Groundedness drops below 0.85.” On a 30-case dataset, the 95 percent confidence interval on a mean score normalized to 0-1 is roughly ±0.07 (assuming a per-case standard deviation near 0.2). A two-point drop, 0.85 to 0.83, sits inside the noise. The gate fires green on real regressions and red on judge noise; engineers learn to ignore it inside a month.
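The ±0.07 figure is what a normal-approximation interval gives for scores normalized to 0-1 with a per-case standard deviation near 0.2. That standard deviation is an assumed value for illustration, not a measurement from any suite, and `ci_half_width` is a hypothetical helper.

```python
import math


def ci_half_width(std_dev: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% half-width for a sample mean."""
    return z * std_dev / math.sqrt(n)


# Assumed per-case standard deviation of 0.2 on a 0-1 score.
for n in (30, 100, 200, 500):
    print(n, round(ci_half_width(0.2, n), 3))
```

Under that assumption the half-width is about 0.07 at 30 cases and falls below 0.03 at 200, which is why a 0.02 drop in the mean is invisible on a small dataset.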
The fix is a two-threshold gate. An absolute floor catches catastrophic drops; a delta gate fires on statistically meaningful regressions against a 7-day rolling baseline.
```python
import statistics
from scipy import stats


def regression_gate(current, baseline, alpha=0.05, min_effect=0.03):
    """Return (passed, reason) for per-example scores vs. a rolling baseline."""
    delta = statistics.mean(current) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    # Welch's t-test: does not assume equal variances across the two runs.
    _, p = stats.ttest_ind(current, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    # Significant but too small to matter: pass on the effect floor.
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"
```
Welch’s t-test on per-example scores against the trailing baseline. Two-proportion z-test on pass-rate rubrics like citation validity. Percentile floors for long-tail failures: a regression pushing p95_score below the tail floor while the mean stays intact fails on the percentile and tells the on-call where to look. The fi CLI ships pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles as native assertion metrics.
```yaml
# fi-evaluation.yaml: native CI assertions
assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"  # gate on the tail, not the mean
    on_fail: "error"
  - template: "context_relevance"
    condition: "pass_rate >= 0.90"
    on_fail: "error"
```
fi run --check exits with code 2 on assertion failure (distinct from 1 for evaluation errors), 3 when --strict assertions warn, 6 on API failure. CI policies wire into GitHub Actions or Buildkite cleanly, without grep heuristics on stdout. The thresholds, the rolling-baseline policy, the per-route effect floors are the part you build. Build them once; they outlive the judge model you wrote them against.
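For pass-rate rubrics, the two-threshold idea (absolute floor plus a delta gate) can be sketched with a two-proportion z-test. This is a standard-library illustration under assumptions: the 0.80 floor and 0.05 alpha are placeholder policy values, and `two_proportion_z` and `pass_rate_gate` are hypothetical names rather than anything the `fi` CLI exposes.

```python
import math


def two_proportion_z(pass_cur, n_cur, pass_base, n_base):
    """Return (z, p) where p is the one-sided P(Z <= z) for a lower pass rate."""
    p_cur, p_base = pass_cur / n_cur, pass_base / n_base
    pooled = (pass_cur + pass_base) / (n_cur + n_base)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cur + 1 / n_base))
    if se == 0:
        return 0.0, 1.0  # both runs all-pass or all-fail: no evidence of change
    z = (p_cur - p_base) / se
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return z, p_value


def pass_rate_gate(pass_cur, n_cur, pass_base, n_base, floor=0.80, alpha=0.05):
    """Fail on the absolute floor first, then on a significant drop."""
    rate = pass_cur / n_cur
    if rate < floor:
        return False, f"pass_rate {rate:.3f} below absolute floor {floor}"
    z, p = two_proportion_z(pass_cur, n_cur, pass_base, n_base)
    if z < 0 and p < alpha:
        return False, f"regression: z={z:.2f}, p={p:.4f}"
    return True, f"ok: pass_rate={rate:.3f}, z={z:.2f}, p={p:.4f}"


print(pass_rate_gate(170, 200, 190, 200))
```

The floor catches the catastrophic case regardless of what the baseline did; the z-test catches the drop that stays above the floor but is too large to be sampling noise.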
Component 6: OpenTelemetry emission and span-attached scoring
Buy this. A framework that only runs in CI catches half the bugs. The same rubric has to run against live OTel spans and attach its score back to the trace.
Writing this yourself: an OTel install per framework (OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, Vercel AI SDK, Spring AI, Semantic Kernel, n vector DBs), a semantic convention you maintain forever, a span-attached score writer that doesn’t conflict with framework auto-instrumentation, a sampler that respects budget. A year of platform work in disguise.
traceAI (Apache 2.0) ships 50+ AI surfaces across Python (46), TypeScript (39), Java (24 including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C# (1 core). Phoenix, Langfuse, and DeepEval don’t ship JVM coverage. Pluggable semantic conventions at register() time (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest into Phoenix or Traceloop without re-instrumenting. EvalTag attaches the rubric you wrote for pytest as a span-attached scorer; 62 built-in evals run server-side at zero inline latency.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType,
)
from traceai_openai import OpenAIInstrumentor

eval_tags = [EvalTag(
    eval_name=EvalName.GROUNDEDNESS,
    type=EvalTagType.OBSERVATION_SPAN, value=EvalSpanKind.LLM, config={},
    mapping={"context": "retrieval.documents", "output": "output.value"},
)]

trace_provider = register(project_type=ProjectType.OBSERVE,
                          project_name="support-bot", eval_tags=eval_tags)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
```
Same rubric, two contexts. A regression in CI that doesn’t show in production means the dataset stopped being representative. A drift in production with green CI means the dataset is missing a failure mode. Both signals are content.
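The "sampler that respects budget" mentioned earlier in this section can be illustrated without any tracing library. The sketch below is a simplified stand-in, not traceAI's sampler: `BudgetedSampler` is a hypothetical class, the 5 percent rate echoes the production-traffic figure used earlier, and the cap is an arbitrary example value.

```python
import hashlib


class BudgetedSampler:
    """Deterministic trace sampling with a hard cap on scored spans."""

    def __init__(self, sample_rate: float, max_scored: int):
        self.sample_rate = sample_rate
        self.max_scored = max_scored
        self.scored = 0  # per-process counter; a shared store is needed across workers

    def should_score(self, trace_id: str) -> bool:
        if self.scored >= self.max_scored:
            return False
        # Hash the trace id into [0, 1) so the same trace always gets the
        # same sampling decision, whichever process sees it.
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
        if bucket < self.sample_rate:
            self.scored += 1
            return True
        return False


sampler = BudgetedSampler(sample_rate=0.05, max_scored=1000)
picked = sum(sampler.should_score(f"trace-{i}") for i in range(10_000))
print(picked)
```

Hash-based selection keeps every span of a trace in or out together, which matters when a score on one span needs the retrieval context from another.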
Components 7 and 8: calibration and closed loop
Calibration is half build, half buy. Pin a 50-200 case hold-out with human labels. Run the judge against it weekly. Track Cohen’s kappa or Spearman correlation; alarm when correlation drops below an agreed floor (0.55 is a typical starting threshold for subjective rubrics, higher for factual ones). The math is one library; the labels are work only your team can do. The Platform’s self-improving evaluators retune the rubric from thumbs feedback when calibration slips, so you don’t re-author the prompt every quarter.
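```python
from scipy.stats import spearmanr

class CalibrationDrift(Exception):
    """Raised when judge-to-human agreement falls below the agreed floor."""

def calibration_check(judge_scores, human_scores, threshold=0.55):
    rho, _ = spearmanr(judge_scores, human_scores)
    if rho < threshold:
        raise CalibrationDrift(f"judge correlation fell to {rho:.2f}")
    return rho
```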
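For pass/fail or categorical rubrics, Cohen's kappa is the usual alternative to rank correlation. A minimal standard-library sketch follows, assuming unweighted kappa over two equal-length label lists; `cohens_kappa` and the sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Unweighted Cohen's kappa for two raters over the same cases."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        (judge_freq[label] / n) * (human_freq[label] / n)
        for label in set(judge_freq) | set(human_freq)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels on a ten-case hold-out.
judge = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
human = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
kappa = cohens_kappa(judge, human)
print(round(kappa, 2))
```

Kappa corrects raw agreement for chance, so an 80 percent match rate on a balanced pass/fail set is weaker evidence than it first looks.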
The closed loop is pure buy. The dataset that grows weekly from production failures is your moat; the clustering and triage that get it there is not. Writing it yourself: HDBSCAN over span embeddings in ClickHouse, soft-cluster assignment that survives schema migrations, a triage agent that writes the RCA without running ten minutes per call, integrations into Linear/Slack/GitHub/Jira. Six months of platform work for an output identical at every company.
Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering groups every trace failure into a named issue; a Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 characters, 90 percent prompt-cache hit ratio) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). Fixes feed the Platform’s self-improving evaluators. An engineer (or a scheduled job) promotes representative traces into the dataset; the next CI run gates against them.
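The promotion step at the end of the loop is the part that touches your dataset, so it is worth seeing in miniature. The sketch below assumes clustering has already happened and each failing trace arrives with a `cluster_id`, `route`, `input`, and `score`. Those field names, `promote_failures`, and the two-per-cluster cap are all hypothetical, not Error Feed's schema.

```python
import hashlib
import json


def _fingerprint(text: str) -> str:
    """Case- and whitespace-insensitive hash used for deduplication."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def promote_failures(existing_cases, failing_traces, per_cluster=2):
    """Pick up to `per_cluster` unseen failing traces from each cluster."""
    seen = {_fingerprint(case["input"]) for case in existing_cases}
    taken = {}
    promoted = []
    # Worst scores first, so each cluster contributes its clearest failures.
    for trace in sorted(failing_traces, key=lambda t: t["score"]):
        cluster = trace["cluster_id"]
        fp = _fingerprint(trace["input"])
        if fp in seen or taken.get(cluster, 0) >= per_cluster:
            continue
        seen.add(fp)
        taken[cluster] = taken.get(cluster, 0) + 1
        promoted.append({
            "id": f"prod-{fp[:12]}",
            "route": trace["route"],
            "input": trace["input"],
            "expected_output": None,  # left for a human reviewer to fill in
            "metadata": {"source": "production", "cluster_id": cluster},
        })
    return promoted


existing = [{"input": "Reset my password"}]
traces = [
    {"cluster_id": "A", "route": "support-bot", "input": "reset my  password", "score": 0.1},
    {"cluster_id": "A", "route": "support-bot", "input": "Cancel my plan", "score": 0.2},
    {"cluster_id": "A", "route": "support-bot", "input": "Refund order 17", "score": 0.3},
    {"cluster_id": "A", "route": "support-bot", "input": "Change billing email", "score": 0.4},
    {"cluster_id": "B", "route": "legal-rag", "input": "Cite the statute", "score": 0.25},
]
promoted = promote_failures(existing, traces)
print("\n".join(json.dumps(case) for case in promoted))
```

Capping promotions per cluster keeps one noisy failure mode from flooding the dataset and skewing the gate toward a single issue.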
The honest cost map
Eight components, with the engineering cost you’d otherwise own.
| Component | Build cost | Maintenance | Fail mode if you skip the buy |
|---|---|---|---|
| Rubric registry | 2-3 days | 1 day/quarter | Drift between prompt and rubric |
| Dataset schema | 1-2 days | 1 hr/week (promotion) | 2024 dataset evaluating a 2026 product |
| Judge engine | 2-3 weeks + 1 engineer-quarter | 0.5 FTE | Parsing variance; midpoint anchoring on long tail |
| Parallel execution | 4-6 weeks initial | 0.25 FTE | 30-minute gates; engineers disable them by month two |
| Statistical gating | 3-5 days | 1 day/quarter | False positives drown out real regressions |
| OTel emission | 2-3 engineer-quarters per framework | 0.5 FTE | Production scoring drifts from CI |
| Calibration loop | 1 week + ongoing labels | 0.25 FTE | CI gates fire on prompts nobody changed |
| Closed loop | 6-9 months | 1 FTE | Dataset stops growing |
Roughly 1 to 1.5 engineers full-time once you ship two products. The cost shows up in the months you didn’t plan for: the judge-model upgrade, the OTel semantic convention bump, the dataset that quietly stopped covering the support-bot’s top failure mode. The hybrid pattern (Apache 2.0 SDK plus the Platform) is what most teams converge on by month four anyway. The time saved goes back into the rubrics and dataset where you have an edge.
When build-from-scratch is the right call (and when it isn’t)
Three cases where writing the full framework yourself is defensible:
- You’re a research lab benchmarking eval methodology. The framework is the product; owning the parsing variance is the contribution.
- You have a regulated isolation requirement. Public-sector, defense, or healthcare deployments where every dependency clears a six-month security review. The 17-MB self-hosted Agent Command Center Go binary clears most of these; if it doesn’t, write a thin custom runner against the OSS SDK rather than the full stack.
- Your domain is so narrow the generic stack is dead weight. A single-rubric, single-route, single-tenant product. Even here, copying the 200-line MVP is faster than maintaining your own retry loop forever.
Three signals you’ve crossed the line and should stop:
- You’re on engineer two. The framework stopped being your edge; it’s institutional knowledge in one person’s head.
- You’re maintaining more than one parser. The day you write the fourth `if response.startswith(...)` is the day to stop.
- CI cost is a quarterly conversation. Writing a classifier cascade is months of work; if the bill is climbing, buy the cascade.
Common pitfalls
- Framework as a side project. Eval definitions belong in the same repo as agent code, reviewed in the same PR as the prompt they score.
- Rubric prompts in a database. Diff invisibility is how drift compounds. Git or nothing.
- Floating judge model. Score drift across runs of “the same eval.” Pin and version alongside the rubric.
- 30-case dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow the dataset or gate on percentiles.
- Full LLM-judge sweep on every PR. Prices the gate out of existence. PR-blocking gets cheap rubrics; the LLM-judge sweep runs nightly.
- Custom parser per judge model. Three parsers means you’re maintaining the engine you should’ve bought.
- CI-only framework. Production drifts past the dataset inside a quarter. Span-attached scoring is the bridge.
- No calibration loop. Judge drift surfaces as “the gate started failing on random PRs.” The hold-out set is the cheap fix.
- Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite. Promote weekly.
Three deliberate tradeoffs
- The build-vs-buy line moves. A 200-line framework is fine for one product. By the time the team owns three products, three suites, three judges, and a custom OTel pipeline, the engineering cost beats a managed runner. The right time to switch is usually when the second engineer joins the rotation.
- Self-improving rubrics need oversight. Rubrics that learn from production traces can drift in unexpected directions. Pin a human-labeled hold-out; alarm when judge-to-human correlation drops below an agreed floor. Build the hold-out before turning the self-improvement loop on.
- LLM-as-judge cost matters at scale. Sample by failure signal, classifier models for high-volume passes, frontier models for adjudication. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2, which makes daily full-dataset reruns financially viable instead of weekly.
How Future AGI ships the eval stack as a package
A real eval framework needs eight components, four of which are the same six engineers, six months, at every company. Future AGI ships the four you’d otherwise rewrite as Apache 2.0 OSS, plus the Platform for the cost and scale parts. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics, classifier-backed cost, and the closed loop.
- ai-evaluation SDK (Apache 2.0): 72 `EvalTemplate` classes (`Groundedness`, `ContextAdherence`, `Completeness`, `FactualAccuracy`, `AnswerRefusal`, `IsHarmfulAdvice`, `TaskCompletion`, `EvaluateFunctionCalling`, and the rest). Real API: `from fi.evals import Evaluator`; `Evaluator(...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)])`. `CustomLLMJudge` with multi-modal inputs, 13 guardrail backends (9 open-weight + 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
- `fi` CLI: `fi init` scaffolds `fi-evaluation.yaml`. `fi run --check --strict --parallel 16` runs assertions on `pass_rate`, `avg_score`, `p50/p90/p95_score`, and runtime percentiles with CI-distinct exit codes (0/2/3/6). `--mode hybrid` routes between local classifier rubrics and cloud judges.
- Future AGI Platform: self-improving evaluators tuned by thumbs feedback; an in-product authoring agent that writes unlimited custom evaluators from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): 50+ AI surfaces across Python (46) / TypeScript (39) / Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel) / C# (1 core). Pluggable semantic conventions at `register()` time. 14 span kinds (Phoenix has 8, Langfuse 5) including `A2A_CLIENT` / `A2A_SERVER`. 62 built-in evals wired server-side via `EvalTag`.
- agent-opt: six optimizers under one API (`RandomSearchOptimizer`, `BayesianSearchOptimizer`, `MetaPromptOptimizer`, `ProTeGi`, `GEPAOptimizer`, `PromptWizardOptimizer`). Scores against any of the SDK’s 72 evaluators.
- Error Feed (inside the eval stack): HDBSCAN clustering over ClickHouse embeddings plus a Sonnet 4.5 Judge agent writes the `immediate_fix`; fixes feed the Platform’s self-improving evaluators.
- Agent Command Center: 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters plus OpenAI-compatible presets. RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA certified, AWS Marketplace.
You write the rubrics and the dataset. The package handles the rest. Closed loop ratchets the dataset, self-improving evaluators retune the rubric, classifier cascade keeps the bill bounded, and the same definitions run in CI and against live OTel spans on every product you ship.
Ready to write the rubric and stop maintaining the runner? pip install ai-evaluation, fi init, point the YAML at your eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, wire fi run --check --strict into your pull-request workflow. The eight components are already there; you bring the four that make the framework yours.