Build an LLM Evaluation Framework From Scratch in 2026: A Practical Guide
Build an LLM evaluation framework from scratch in 2026. Deterministic, rubric, LLM-as-judge, and agent evals, with working Python code and a CI gate.
TL;DR
| Step | What to build | Tools to reach for |
|---|---|---|
| 1. Pick evaluator types | Deterministic, rubric, LLM-as-judge, agent | fi.evals, deepeval, ragas, langchain.evaluation |
| 2. Dataset stack | Public baseline + domain golden set + adversarial slice | MMLU, GSM8K, IFEval, internal labels, jailbreak set |
| 3. Instrument traces | Auto-instrumentation around every LLM and agent call | traceai-langchain, traceai-openai, fi_instrumentation |
| 4. CI gate | Block PRs that fall below faithfulness, F1, latency thresholds | pytest, fi.evals.evaluate, GitHub Actions |
| 5. Online evals | Sample 1 to 5 percent of live traffic and score the same way | fi.evals on live spans, alerting on drift |
| 6. Human review | Weekly 50 lowest scoring + 50 random traces | Spreadsheets or the Agent Command Center labeling queue |
Or skip the from-scratch route: install ai-evaluation (fi.evals, Apache 2.0) and traceai-langchain (Apache 2.0), and you get all six rows out of the box.
What changed since 2025
Three shifts. First, agent stacks went mainstream. LangGraph 1.0 (release notes), the OpenAI Agents SDK, and CrewAI all ship production-ready trajectories now, and single-turn evals miss the failure modes that show up across multi-step tool calls. Second, LLM-as-judge moved from research curiosity to default evaluator. The G-Eval (arXiv 2303.16634) and Prometheus 2 (arXiv 2405.01535) lines of work convinced most teams that a calibrated LLM judge with a clear rubric is a better signal than BLEU or ROUGE for free-form generation. Third, public benchmarks rotated: MMLU has saturated for frontier models, IFEval (arXiv 2311.07911) replaced it as the instruction-following bar, and GPQA (arXiv 2311.12022) became the standard hard-reasoning slice.
The model side moved with it. Most production frameworks now generate against GPT-5, Claude Opus 4.x, Gemini 3, and Llama 4 variants, with o3 and o4-mini still common for hard-reasoning routes. Always pull the exact version off the vendor changelog when you ship.
Why a rigorous LLM evaluation framework matters
A model that scores 92 on MMLU and ships into your product can still hallucinate every fifth answer in your domain. Public benchmarks tell you the model is broadly capable, but only your evaluation framework tells you whether it works for your users on your data with your prompts. Without one you ship blind.
A real framework gives you three things:
- A floor. Threshold-gated CI prevents a prompt edit or model swap from regressing the chain below an agreed quality bar.
- A signal. Online evals on sampled live traffic surface drift, prompt regressions, and retriever decay days before users complain.
- A paper trail. Every release ships with a score and a trace, so post-mortems point at the actual failure span, not at a hunch.
The four evaluator types you need
Every production framework in 2026 covers at least these four. Skip any of them and you have a blind spot.
1. Deterministic evaluators
The fast, cheap, fully reproducible layer.
- Exact match, regex, JSON schema match, output-length bounds.
- Tool-call argument validation, function-signature checks.
- Latency, token usage, cost per request.
Run these on every single response. They cost nothing and they catch the most obvious regressions (model returns prose instead of JSON, output is too long, tool gets called with the wrong argument).
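A minimal sketch of what that layer can look like, using only the standard library plus `jsonschema`. The schema, regex, and length bounds below are illustrative placeholders, not a prescribed contract:

```python
import json
import re

from jsonschema import ValidationError, validate  # pip install jsonschema

# Illustrative contract for a structured answer; swap in your real schema.
ANSWER_SCHEMA = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {
        "answer": {"type": "string", "maxLength": 2000},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
}

def deterministic_checks(raw_output: str) -> dict:
    """Cheap, reproducible checks to run on every single response."""
    checks = {}
    # 1. Output must be valid JSON that matches the schema.
    try:
        payload = json.loads(raw_output)
        validate(payload, ANSWER_SCHEMA)
        checks["json_schema"] = True
    except (json.JSONDecodeError, ValidationError):
        checks["json_schema"] = False
        return checks
    # 2. No markdown fences or prose preamble leaking into the field.
    checks["no_fences"] = not re.search(r"```", payload["answer"])
    # 3. Length bounds.
    checks["length_ok"] = 1 <= len(payload["answer"]) <= 2000
    return checks
```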
2. Rubric / reference-based evaluators
Score against a reference answer.
- BLEU (Papineni 2002) for translation and paraphrase.
- ROUGE (Lin 2004) for summarization.
- BERTScore (Zhang 2019) for semantic similarity.
- Token F1 and exact-match for closed-domain QA.
These are useful but mechanical. They reward surface overlap, so a paraphrased correct answer can score low and a fluent wrong answer can score high. Treat them as one signal among many.
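Token F1 and exact match need no external library at all. A minimal sketch (the normalization is deliberately simple; SQuAD-style scoring scripts also strip articles and punctuation):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Illustrative normalization: lowercase and split on whitespace.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A paraphrased correct answer lands well below 1.0, which is exactly why
# these metrics are one signal among many, not the whole story.
print(token_f1("the capital of France is Paris", "Paris is France's capital"))
```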
3. LLM-as-judge evaluators
A bigger LLM scores the output against a rubric. This is the workhorse of 2026 for free-form generation.
- Faithfulness: is every claim grounded in the retrieved context?
- Relevance: does the answer address the actual user intent?
- Custom rubric: domain-specific scoring (e.g., “is the tone appropriate for legal advice?”).
The risk with LLM-as-judge is that the judge has its own biases. You calibrate it by running the judge against 100 to 200 human-labeled examples, computing Cohen’s kappa or Krippendorff’s alpha against the humans, and iterating the rubric until kappa is above roughly 0.6. Then re-sample 50 examples each month to catch judge drift.
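A sketch of that calibration loop, assuming you have parallel human and judge labels on the same examples and scikit-learn installed (the 1 to 5 scale and the sample values are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# 100-200 examples scored by both humans and the LLM judge on a 1-5 rubric.
human_scores = [5, 4, 2, 5, 1, 3, 4]   # replace with your labeled set
judge_scores = [5, 5, 2, 4, 1, 3, 4]

# Weighted kappa respects the ordinal scale: a 4-vs-5 disagreement
# costs less than a 1-vs-5 disagreement.
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
print(f"judge-vs-human kappa: {kappa:.2f}")

if kappa < 0.6:
    print("Rubric needs another iteration before the judge gates anything.")
```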
4. Agent-level evaluators
For multi-step agents, single-turn scoring misses the actual failure modes.
- Trajectory match: did the agent take a reasonable path through tools?
- Tool-use correctness: were tool arguments well-formed and the right tools chosen?
- Goal completion: did the final state match the user’s request?
Future AGI’s agent simulation docs and fi.simulate cover this category. Anthropic’s evaluator patterns (Building effective agents) and the OpenAI Agents SDK both ship similar trajectory hooks.
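A trajectory check can be as simple as comparing the ordered list of tool calls on a trace against an allowed set of paths. The trace shape and tool names below are a hypothetical simplification of what your instrumentation emits:

```python
# Hypothetical simplified trace: ordered tool calls pulled off the agent's spans.
trace_tool_calls = ["search_orders", "get_order_details", "issue_refund"]

# Acceptable trajectories for a "refund my order" request.
ALLOWED_TRAJECTORIES = [
    ["search_orders", "get_order_details", "issue_refund"],
    ["get_order_details", "issue_refund"],
]

REQUIRED_FINAL_TOOL = "issue_refund"

def trajectory_match(calls: list[str]) -> dict:
    return {
        "path_allowed": calls in ALLOWED_TRAJECTORIES,
        "goal_tool_reached": bool(calls) and calls[-1] == REQUIRED_FINAL_TOOL,
        "no_loops": len(calls) == len(set(calls)),  # crude repeated-tool check
    }

print(trajectory_match(trace_tool_calls))
```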
Setting up the dataset stack
A 2026 evaluation framework runs against three distinct dataset slices.
Public benchmarks (baseline)
- MMLU (source): broad knowledge, saturated for frontier models, still useful as a sanity check.
- IFEval (arXiv 2311.07911): instruction-following.
- GPQA (arXiv 2311.12022): hard-reasoning slice.
- GSM8K (arXiv 2110.14168) and MATH (arXiv 2103.03874): math.
- HumanEval (arXiv 2107.03374) and MBPP for code.
- SQuAD 2.0 (leaderboard) for closed-domain QA.
Run these on every new model version. They give you an external anchor.
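Pulling these baselines locally takes a few lines with the Hugging Face `datasets` library. The dataset ids and splits below are the commonly used ones; check the hub pages for the current canonical versions before wiring them into CI:

```python
from datasets import load_dataset

# GSM8K test split: grade-school math word problems with reference answers.
gsm8k = load_dataset("gsm8k", "main", split="test")

# IFEval prompts for instruction-following checks.
ifeval = load_dataset("google/IFEval", split="train")

print(len(gsm8k), gsm8k[0]["question"][:80])
```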
Domain golden set
The 200 to 500 examples that matter most to your users. Label every example with the desired output and the acceptable failure mode. Refresh on a fixed cadence (monthly is fine). This is the set CI gates against.
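One row of that golden set, in the JSONL shape the harness below expects. The field names beyond `id`, `question`, and `docs` are an illustrative schema; add whatever your evaluators need:

```python
import json
from pathlib import Path

Path("eval").mkdir(exist_ok=True)

# Illustrative golden-set row; one JSON object per line in eval/golden.jsonl.
row = {
    "id": "billing-0042",
    "question": "Can I get a refund after 30 days?",
    "docs": ["Refunds are available within 30 days of purchase..."],
    "reference": "No. Refunds are only available within 30 days of purchase.",
    "acceptable_failure": "Model asks a clarifying question about purchase date.",
}

with open("eval/golden.jsonl", "a") as f:
    f.write(json.dumps(row) + "\n")
```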
Adversarial slice
Jailbreaks, prompt injections, ambiguous phrasing, out-of-distribution dates, deliberately misleading questions. Build it once, freeze it, never train on it. See our jailbreaking and prompt injection guide for a concrete starter set.
Defining metrics and thresholds
Pick the metrics that match the task. A reasonable default for a RAG chat app:
- Faithfulness above 0.80 (LLM-as-judge against retrieved context).
- Context recall above 0.75.
- Token F1 above 0.60 on the closed-QA slice.
- p50 latency under 2 seconds, p95 under 6 seconds.
- Refusal rate on the adversarial slice above 0.90.
Write the thresholds into the test file. A merge that falls below any of them blocks. Adjust thresholds quarterly based on user feedback and online drift data, not weekly based on vibes.
Setting up the development environment
Libraries
For a from-scratch build:
- `transformers` and `datasets` from Hugging Face.
- `evaluate` for rubric metrics (`pip install evaluate`).
- `pytest` for the harness.
- One LLM client SDK (OpenAI, Anthropic, Google).
For a 30-line shortcut, install ai-evaluation (fi.evals) and traceai-* packages instead.
Trace instrumentation
Every model and agent call must emit a structured span. The pattern:
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor
tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="eval-framework-prod",
)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
traceai-langchain is Apache 2.0 (LICENSE). Env vars are FI_API_KEY and FI_SECRET_KEY. The same package family covers OpenAI, Anthropic, CrewAI, OpenAI Agents SDK, and LangGraph.
Infrastructure
- CI: GitHub Actions or similar, runs evaluators on every PR.
- Storage: trace storage (Agent Command Center, Phoenix, LangSmith) plus a regression bench of recent failures.
- Compute: GPU only if you run open-source judges. Cloud judges work fine for most teams.
Implementing the evaluator harness
Below is a small scaffold in fi.evals style. It loads a golden set, runs deterministic and LLM-as-judge evaluators, and prints a pass / fail verdict against your thresholds. Wire your real chain into run_chain() and the script runs end to end.
import json
from pathlib import Path

from fi.evals import evaluate
from fi.evals.metrics import CustomLLMJudge

GOLDEN = Path("eval/golden.jsonl")

def load_golden():
    with GOLDEN.open() as f:
        for line in f:
            yield json.loads(line)

def run_chain(question: str, docs: list) -> str:
    # Pseudocode. Replace with your real chain call.
    raise NotImplementedError("wire to your LangChain / Agents SDK chain here")

def score_row(row: dict) -> dict:
    answer = run_chain(row["question"], row["docs"])
    deterministic = answer.strip().endswith(".") and len(answer) > 0
    faithfulness = evaluate(
        "faithfulness",
        output=answer,
        context="\n\n".join(row["docs"]),
    )
    tone = CustomLLMJudge(
        rubric="Score 1-5: is the answer professional and not hedged?",
    )
    tone_score = tone(answer=answer)
    return {
        "id": row["id"],
        "deterministic_pass": bool(deterministic),
        "faithfulness": faithfulness,
        "tone": tone_score,
    }

FAITHFULNESS_FLOOR = 0.80
DETERMINISTIC_FLOOR = 0.98

def main():
    results = [score_row(row) for row in load_golden()]
    n = len(results)
    det_rate = sum(r["deterministic_pass"] for r in results) / n
    faith_avg = sum(r["faithfulness"] for r in results) / n
    print(f"deterministic pass rate: {det_rate:.3f}")
    print(f"avg faithfulness: {faith_avg:.3f}")
    if det_rate >= DETERMINISTIC_FLOOR and faith_avg >= FAITHFULNESS_FLOOR:
        print("VERDICT: PASS")
    else:
        print("VERDICT: FAIL")

if __name__ == "__main__":
    main()
A few notes on the snippet:
- `evaluate()` from `fi.evals` accepts a string template name (`"faithfulness"`, `"context_recall"`, etc.). See the Future AGI evals reference for the current metric catalog.
- `CustomLLMJudge` from `fi.evals.metrics` takes a free-form rubric and returns a score. Use it for domain-specific quality.
- The pattern is the same for online evals: read `output` and `context` off a live span instead of a JSONL row, and call the same `evaluate()`.
Wiring evals into CI
The minimal harness becomes a CI gate when you add thresholds. A pytest test that fails the build when scores drop:
import pytest
import statistics

from eval.harness import load_golden, score_row

@pytest.fixture(scope="module")
def scores():
    return [score_row(row) for row in load_golden()]

def test_faithfulness_floor(scores):
    avg = statistics.mean(r["faithfulness"] for r in scores)
    assert avg >= 0.80, f"faithfulness {avg:.3f} below 0.80"

def test_deterministic_pass_rate(scores):
    rate = sum(r["deterministic_pass"] for r in scores) / len(scores)
    assert rate >= 0.98, f"deterministic pass rate {rate:.3f} below 0.98"
Run the test on every PR. Keep a bench/ folder with the last 30 days of production failures, frozen as test cases, so the chain cannot regress on a known-bad input.
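That bench folder plugs into the same harness with one parametrized test. A sketch, assuming each frozen failure is a JSON file with the same fields as the golden set:

```python
import json
from pathlib import Path

import pytest

from eval.harness import score_row

BENCH = sorted(Path("bench").glob("*.json"))

@pytest.mark.parametrize("case_path", BENCH, ids=lambda p: p.stem)
def test_known_bad_input_stays_fixed(case_path):
    row = json.loads(case_path.read_text())
    result = score_row(row)
    # Every frozen production failure must stay above the same floors.
    assert result["deterministic_pass"], f"{case_path.stem} failed deterministic checks"
    assert result["faithfulness"] >= 0.80, f"{case_path.stem} faithfulness regressed"
```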
Online evals: sampling live traffic
Offline evals catch known regressions. Online evals catch the cases your golden set did not anticipate.
- Sample 1 to 5 percent of live traffic.
- Run the same evaluators against the live span (`fi.evals.evaluate` reads inputs directly off the trace).
- Alert when the rolling average drops below your threshold.
- Queue the lowest-scoring traces for the weekly human review.
The cost stays bounded because you sample, and the latency stays bounded because cloud judges like turing_flash run in roughly 1 to 2 seconds, turing_small in 2 to 3 seconds, and turing_large in 3 to 5 seconds (Future AGI cloud evals docs). For most teams a sampled turing_flash judge adds no perceptible p95.
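The online path reuses the same `evaluate()` call as the offline harness; only the input source changes. A sketch, where the span fields, the sample rate, and the alert hook are illustrative placeholders for whatever your stack already provides:

```python
import random

from fi.evals import evaluate

SAMPLE_RATE = 0.02            # score 2 percent of live traffic
FAITHFULNESS_FLOOR = 0.80
recent_scores: list[float] = []

def alert(message: str) -> None:
    print(f"[ALERT] {message}")   # swap for Slack, PagerDuty, etc.

def maybe_score_span(span: dict) -> None:
    """Call this from your span-export hook; the span keys here are illustrative."""
    if random.random() > SAMPLE_RATE:
        return
    score = evaluate(
        "faithfulness",
        output=span["output"],
        context=span["retrieved_context"],
    )
    recent_scores.append(score)
    # Alert on a rolling window, not on single outliers.
    window = recent_scores[-200:]
    if len(window) >= 50 and sum(window) / len(window) < FAITHFULNESS_FLOOR:
        alert(f"rolling faithfulness {sum(window) / len(window):.2f} below floor")
```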
Top tools and platforms for building an LLM eval framework in 2026
Here is the ranked list for the space where Future AGI competes. The ranking criterion: best fit for teams building an LLM evaluation framework end to end.
1. Future AGI
The eval superset. fi.evals ships deterministic, rubric, LLM-as-judge, and agent evaluators behind one Python SDK (ai-evaluation, Apache 2.0). fi.simulate covers agent trajectories. traceai-* auto-instrumentation covers LangChain, LangGraph, OpenAI Agents SDK, CrewAI, and direct provider SDKs. The Agent Command Center at /platform/monitor/command-center joins traces, scores, and alerts in one dashboard. Pick Future AGI when you want one stack that covers offline CI, online sampling, and agent simulation.
2. DeepEval
Open-source evaluator library by Confident AI (source). Strong rubric coverage and a pytest-style harness. Lighter on agent-level evals and trace storage than Future AGI; teams often pair it with a separate observability tool.
3. Ragas
Open source, focused on RAG metrics (source). Excellent faithfulness, context recall, and context precision evaluators. Narrower than Future AGI or DeepEval outside of RAG; usually one of several libraries in a stack rather than the whole stack.
4. LangSmith
LangChain’s evaluator + dataset platform (docs). Strong inside the LangChain ecosystem. Cross-framework teams (OpenAI Agents SDK, CrewAI, custom Python agents) usually pair it with a wider stack.
5. Braintrust
Polished UX for prompt experiments and dataset runs (docs). Strong on the “compare two prompts side by side” workflow, lighter on agent-level evals.
Handling common evaluation challenges
Hallucinations
A faithfulness evaluator catches them, but only if your retrieval is good enough to surface the right context in the first place. Score context recall in the same harness. If recall is low, the LLM cannot answer correctly no matter how good the prompt.
Bias and fairness
Slice your evaluation by user cohort (region, language, role) and watch for score gaps. A model that scores 0.85 overall but 0.55 on one cohort is broken for that cohort.
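Cohort slicing is a few lines once each result row carries a cohort tag. A sketch; the `cohort` field and the sample values are assumed additions to the rows the harness already produces:

```python
from collections import defaultdict

# Illustrative result rows; in practice these come from score_row() with a cohort tag added.
results = [
    {"cohort": "en-US", "faithfulness": 0.88},
    {"cohort": "en-US", "faithfulness": 0.84},
    {"cohort": "de-DE", "faithfulness": 0.57},
]

def scores_by_cohort(rows: list[dict]) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in rows:
        buckets[r.get("cohort", "unknown")].append(r["faithfulness"])
    return {cohort: sum(v) / len(v) for cohort, v in buckets.items()}

per_cohort = scores_by_cohort(results)
print(per_cohort)
# Gate on the worst cohort, not the overall average.
worst_cohort, worst = min(per_cohort.items(), key=lambda kv: kv[1])
if worst < 0.70:
    print(f"cohort gap: {worst_cohort} averages {worst:.2f}, far below the overall score")
```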
Overfitting
If your golden set scores keep going up but public benchmark scores stay flat, you are overfitting the eval set. Refresh the golden set on a fixed cadence and keep the adversarial slice frozen.
Latency bottlenecks
Profile by chain stage. Most latency lives in the LLM call, but retrievers, parsers, and tool calls each contribute. Use streaming where you can, cache aggressively, and pick a faster model for low-stakes routes.
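A per-stage timing sketch using only the standard library. The stage names and sleeps are placeholders for your real retriever, LLM, and parser calls; if every stage already emits a span, your tracing captures the same numbers:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Illustrative chain; replace each block with the real stage.
with stage("retrieve"):
    time.sleep(0.12)
with stage("llm_call"):
    time.sleep(1.40)
with stage("parse"):
    time.sleep(0.03)

total = sum(timings.values())
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {seconds:.2f}s  ({seconds / total:.0%} of total)")
```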
A/B testing and iteration
Once the framework is in place, the iteration loop is straightforward:
- A/B test two prompt or model variants on a percentage of traffic.
- Compare evaluator scores and latency p50/p95.
- Stress test the winner at 2x and 3x expected peak.
- Run an adversarial sweep on the winner before merge.
- Version every prompt template alongside its eval scores, so rollback is one config flip.
Conversational consistency tests deserve a separate pass: multi-turn threads regress in ways that single-turn evals miss. A 5-turn replay of the top 50 user conversations every release catches most of it.
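A minimal replay sketch for that pass, reusing the `run_chain()` and `evaluate()` calls from the offline harness. The exported conversation format and the naive history handling are illustrative assumptions, not a prescribed design:

```python
from fi.evals import evaluate

from eval.harness import run_chain  # same chain entry point as the offline harness

# Illustrative exported conversation: ordered user turns plus the docs involved.
conversation = {
    "id": "thread-017",
    "turns": [
        "What plans do you offer?",
        "Which of those includes SSO?",
        "And how much is that one per seat?",
    ],
    "docs": ["Pricing page text..."],
}

def replay(convo: dict) -> list[float]:
    history, scores = [], []
    for turn in convo["turns"]:
        # Naive history handling: prepend prior turns to the question.
        question = "\n".join(history + [turn])
        answer = run_chain(question, convo["docs"])
        scores.append(
            evaluate("faithfulness", output=answer, context="\n\n".join(convo["docs"]))
        )
        history.extend([turn, answer])
    return scores

print(replay(conversation))
```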
How Future AGI helps teams build an LLM evaluation framework
Install ai-evaluation and the traceai-* package for your stack. fi.evals gives you the four evaluator types, traceai gives you the spans, and the Agent Command Center dashboard at /platform/monitor/command-center joins them.
pip install ai-evaluation traceai-langchain
Then call evaluate() from fi.evals in CI and on live spans. Same code path, different inputs. Future AGI quickstart.
Frequently asked questions
What is an LLM evaluation framework in 2026?
Should I build my own evaluator stack or use a platform?
What metrics matter most for an LLM eval framework?
How do I calibrate LLM-as-judge evaluators?
Offline evals or online evals?
How does Future AGI fit into a custom evaluation framework?
What is the right cadence for human review?
How do I keep evaluator costs bounded?