Build an LLM Evaluation Framework From Scratch in 2026: A Practical Guide
Build an LLM evaluation framework from scratch in 2026. Deterministic, rubric, LLM-as-judge, and agent evals, with working Python code and a CI gate.
TL;DR
| Step | What to build | Tools to reach for |
|---|---|---|
| 1. Pick evaluator types | Deterministic, rubric, LLM-as-judge, agent | fi.evals, deepeval, ragas, langchain.evaluation |
| 2. Dataset stack | Public baseline + domain golden set + adversarial slice | MMLU, GSM8K, IFEval, internal labels, jailbreak set |
| 3. Instrument traces | Auto-instrumentation around every LLM and agent call | traceai-langchain, traceai-openai, fi_instrumentation |
| 4. CI gate | Block PRs that fall below faithfulness, F1, latency thresholds | pytest, fi.evals.evaluate, GitHub Actions |
| 5. Online evals | Sample 1 to 5 percent of live traffic and score the same way | fi.evals on live spans, alerting on drift |
| 6. Human review | Weekly 50 lowest scoring + 50 random traces | Spreadsheets or the Agent Command Center labeling queue |
Or skip the from-scratch route: install ai-evaluation (fi.evals, Apache 2.0) and traceai-langchain (Apache 2.0), and you get all six rows out of the box.
What changed since 2025
Three shifts. First, agent stacks went mainstream. LangGraph 1.0 (release notes), the OpenAI Agents SDK, and CrewAI all ship production-ready trajectories now, and single-turn evals miss the failure modes that show up across multi-step tool calls. Second, LLM-as-judge moved from research curiosity to default evaluator. The G-Eval (arXiv 2303.16634) and Prometheus 2 (arXiv 2405.01535) lines of work convinced most teams that a calibrated LLM judge with a clear rubric is a better signal than BLEU or ROUGE for free-form generation. Third, public benchmarks rotated: MMLU has saturated for frontier models, IFEval (arXiv 2311.07911) replaced it as the instruction-following bar, and GPQA (arXiv 2311.12022) became the standard hard-reasoning slice.
The model side moved with it. Most production frameworks now generate against GPT-5, Claude Opus 4.x, Gemini 3, and Llama 4 variants, with o3 and o4-mini still common for hard-reasoning routes. Always pull the exact version off the vendor changelog when you ship.
Why a rigorous LLM evaluation framework matters
A model that scores 92 on MMLU and ships into your product can still hallucinate every fifth answer in your domain. Public benchmarks tell you the model is broadly capable, but only your evaluation framework tells you whether it works for your users on your data with your prompts. Without one you ship blind.
A real framework gives you three things:
- A floor. Threshold-gated CI prevents a prompt edit or model swap from regressing the chain below an agreed quality bar.
- A signal. Online evals on sampled live traffic surface drift, prompt regressions, and retriever decay days before users complain.
- A paper trail. Every release ships with a score and a trace, so post-mortems point at the actual failure span, not at a hunch.
The four evaluator types you need
Every production framework in 2026 covers at least these four. Skip any of them and you have a blind spot.
1. Deterministic evaluators
The fast, cheap, fully reproducible layer.
- Exact match, regex, JSON schema match, output-length bounds.
- Tool-call argument validation, function-signature checks.
- Latency, token usage, cost per request.
Run these on every single response. They cost nothing and they catch the most obvious regressions (model returns prose instead of JSON, output is too long, tool gets called with the wrong argument).
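A minimal sketch of what that layer can look like, using only the standard library plus `jsonschema`. The schema, regex, and length bounds below are illustrative placeholders, not a prescribed contract:

```python
import json
import re

from jsonschema import ValidationError, validate  # pip install jsonschema

# Illustrative contract for a structured answer; swap in your real schema.
ANSWER_SCHEMA = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {
        "answer": {"type": "string", "maxLength": 2000},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
}

def deterministic_checks(raw_output: str) -> dict:
    """Cheap, reproducible checks to run on every single response."""
    checks = {}
    # 1. Output must be valid JSON that matches the schema.
    try:
        payload = json.loads(raw_output)
        validate(payload, ANSWER_SCHEMA)
        checks["json_schema"] = True
    except (json.JSONDecodeError, ValidationError):
        checks["json_schema"] = False
        return checks
    # 2. No markdown fences or prose preamble leaking into the field.
    checks["no_fences"] = not re.search(r"```", payload["answer"])
    # 3. Length bounds.
    checks["length_ok"] = 1 <= len(payload["answer"]) <= 2000
    return checks
```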
2. Rubric / reference-based evaluators
Score against a reference answer.
- BLEU (Papineni 2002) for translation and paraphrase.
- ROUGE (Lin 2004) for summarization.
- BERTScore (Zhang 2019) for semantic similarity.
- Token F1 and exact-match for closed-domain QA.
These are useful but mechanical. They reward surface overlap, so a paraphrased correct answer can score low and a fluent wrong answer can score high. Treat them as one signal among many.
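Token F1 and exact match need no external library at all. A minimal sketch (the normalization is deliberately simple; SQuAD-style scoring scripts also strip articles and punctuation):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Illustrative normalization: lowercase and split on whitespace.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A paraphrased correct answer lands well below 1.0, which is exactly why
# these metrics are one signal among many, not the whole story.
print(token_f1("the capital of France is Paris", "Paris is France's capital"))
```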
3. LLM-as-judge evaluators
A bigger LLM scores the output against a rubric. This is the workhorse of 2026 for free-form generation.
- Faithfulness: is every claim grounded in the retrieved context?
- Relevance: does the answer address the actual user intent?
- Custom rubric: domain-specific scoring (e.g., “is the tone appropriate for legal advice?”).
The risk with LLM-as-judge is that the judge has its own biases. You calibrate it by running the judge against 100 to 200 human-labeled examples, computing Cohen’s kappa or Krippendorff’s alpha against the humans, and iterating the rubric until kappa is above roughly 0.6. Then re-sample 50 examples each month to catch judge drift.
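A sketch of that calibration loop, assuming you have parallel human and judge labels on the same examples and scikit-learn installed (the 1 to 5 scale and the sample values are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# 100-200 examples scored by both humans and the LLM judge on a 1-5 rubric.
human_scores = [5, 4, 2, 5, 1, 3, 4]   # replace with your labeled set
judge_scores = [5, 5, 2, 4, 1, 3, 4]

# Weighted kappa respects the ordinal scale: a 4-vs-5 disagreement
# costs less than a 1-vs-5 disagreement.
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
print(f"judge-vs-human kappa: {kappa:.2f}")

if kappa < 0.6:
    print("Rubric needs another iteration before the judge gates anything.")
```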
4. Agent-level evaluators
For multi-step agents, single-turn scoring misses the actual failure modes.
- Trajectory match: did the agent take a reasonable path through tools?
- Tool-use correctness: were tool arguments well-formed and the right tools chosen?
- Goal completion: did the final state match the user’s request?
Future AGI’s agent simulation docs and fi.simulate cover this category. Anthropic’s evaluator patterns (Building effective agents) and the OpenAI Agents SDK both ship similar trajectory hooks.
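A trajectory check can be as simple as comparing the ordered list of tool calls on a trace against an allowed set of paths. The trace shape and tool names below are a hypothetical simplification of what your instrumentation emits:

```python
# Hypothetical simplified trace: ordered tool calls pulled off the agent's spans.
trace_tool_calls = ["search_orders", "get_order_details", "issue_refund"]

# Acceptable trajectories for a "refund my order" request.
ALLOWED_TRAJECTORIES = [
    ["search_orders", "get_order_details", "issue_refund"],
    ["get_order_details", "issue_refund"],
]

REQUIRED_FINAL_TOOL = "issue_refund"

def trajectory_match(calls: list[str]) -> dict:
    return {
        "path_allowed": calls in ALLOWED_TRAJECTORIES,
        "goal_tool_reached": bool(calls) and calls[-1] == REQUIRED_FINAL_TOOL,
        "no_loops": len(calls) == len(set(calls)),  # crude repeated-tool check
    }

print(trajectory_match(trace_tool_calls))
```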
Setting up the dataset stack
A 2026 evaluation framework runs against three distinct dataset slices.
Public benchmarks (baseline)
- MMLU (source): broad knowledge, saturated for frontier models, still useful as a sanity check.
- IFEval (arXiv 2311.07911): instruction-following.
- GPQA (arXiv 2311.12022): hard-reasoning slice.
- GSM8K (arXiv 2110.14168) and MATH (arXiv 2103.03874): math.
- HumanEval (arXiv 2107.03374) and MBPP for code.
- SQuAD 2.0 (leaderboard) for closed-domain QA.
Run these on every new model version. They give you an external anchor.
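Pulling these baselines locally takes a few lines with the Hugging Face `datasets` library. The dataset ids and splits below are the commonly used ones; check the hub pages for the current canonical versions before wiring them into CI:

```python
from datasets import load_dataset

# GSM8K test split: grade-school math word problems with reference answers.
gsm8k = load_dataset("gsm8k", "main", split="test")

# IFEval prompts for instruction-following checks.
ifeval = load_dataset("google/IFEval", split="train")

print(len(gsm8k), gsm8k[0]["question"][:80])
```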
Domain golden set
The 200 to 500 examples that matter most to your users. Label every example with the desired output and the acceptable failure mode. Refresh on a fixed cadence (monthly is fine). This is the set CI gates against.
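One row of that golden set, in the JSONL shape the harness below expects. The field names beyond `id`, `question`, and `docs` are an illustrative schema; add whatever your evaluators need:

```python
import json
from pathlib import Path

Path("eval").mkdir(exist_ok=True)

# Illustrative golden-set row; one JSON object per line in eval/golden.jsonl.
row = {
    "id": "billing-0042",
    "question": "Can I get a refund after 30 days?",
    "docs": ["Refunds are available within 30 days of purchase..."],
    "reference": "No. Refunds are only available within 30 days of purchase.",
    "acceptable_failure": "Model asks a clarifying question about purchase date.",
}

with open("eval/golden.jsonl", "a") as f:
    f.write(json.dumps(row) + "\n")
```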
Adversarial slice
Jailbreaks, prompt injections, ambiguous phrasing, out-of-distribution dates, deliberately misleading questions. Build it once, freeze it, never train on it. See our jailbreaking and prompt injection guide for a concrete starter set.
Defining metrics and thresholds
Pick the metrics that match the task. A reasonable default for a RAG chat app:
- Faithfulness above 0.80 (LLM-as-judge against retrieved context).
- Context recall above 0.75.
- Token F1 above 0.60 on the closed-QA slice.
- p50 latency under 2 seconds, p95 under 6 seconds.
- Refusal rate on the adversarial slice above 0.90.
Write the thresholds into the test file. A merge that falls below any of them blocks. Adjust thresholds quarterly based on user feedback and online drift data, not weekly based on vibes.
Setting up the development environment
Libraries
For a from-scratch build:
- `transformers` and `datasets` from Hugging Face.
- `evaluate` for rubric metrics (`pip install evaluate`).
- `pytest` for the harness.
- One LLM client SDK (OpenAI, Anthropic, Google).
For a 30-line shortcut, install ai-evaluation (fi.evals) and traceai-* packages instead.
Trace instrumentation
Every model and agent call must emit a structured span. The pattern:
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor
tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="eval-framework-prod",
)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
traceai-langchain is Apache 2.0 (LICENSE). Env vars are FI_API_KEY and FI_SECRET_KEY. The same package family covers OpenAI, Anthropic, CrewAI, OpenAI Agents SDK, and LangGraph.
Infrastructure
- CI: GitHub Actions or similar, runs evaluators on every PR.
- Storage: trace storage (Agent Command Center, Phoenix, LangSmith) plus a regression bench of recent failures.
- Compute: GPU only if you run open-source judges. Cloud judges work fine for most teams.
Implementing the evaluator harness
Below is a small scaffold in fi.evals style. It loads a golden set, runs deterministic and LLM-as-judge evaluators, and prints a pass / fail verdict against your thresholds. Wire your real chain into run_chain() and the script runs end to end.
import json
from pathlib import Path

from fi.evals import evaluate
from fi.evals.metrics import CustomLLMJudge

GOLDEN = Path("eval/golden.jsonl")

def load_golden():
    with GOLDEN.open() as f:
        for line in f:
            yield json.loads(line)

def run_chain(question: str, docs: list) -> str:
    # Pseudocode. Replace with your real chain call.
    raise NotImplementedError("wire to your LangChain / Agents SDK chain here")

def score_row(row: dict) -> dict:
    answer = run_chain(row["question"], row["docs"])
    deterministic = answer.strip().endswith(".") and len(answer) > 0
    faithfulness = evaluate(
        "faithfulness",
        output=answer,
        context="\n\n".join(row["docs"]),
    )
    tone = CustomLLMJudge(
        rubric="Score 1-5: is the answer professional and not hedged?",
    )
    tone_score = tone(answer=answer)
    return {
        "id": row["id"],
        "deterministic_pass": bool(deterministic),
        "faithfulness": faithfulness,
        "tone": tone_score,
    }

FAITHFULNESS_FLOOR = 0.80
DETERMINISTIC_FLOOR = 0.98

def main():
    results = [score_row(row) for row in load_golden()]
    n = len(results)
    det_rate = sum(r["deterministic_pass"] for r in results) / n
    faith_avg = sum(r["faithfulness"] for r in results) / n
    print(f"deterministic pass rate: {det_rate:.3f}")
    print(f"avg faithfulness: {faith_avg:.3f}")
    if det_rate >= DETERMINISTIC_FLOOR and faith_avg >= FAITHFULNESS_FLOOR:
        print("VERDICT: PASS")
    else:
        print("VERDICT: FAIL")

if __name__ == "__main__":
    main()
A few notes on the snippet:
- `evaluate()` from `fi.evals` accepts a string template name (`"faithfulness"`, `"context_recall"`, etc.). See the Future AGI evals reference for the current metric catalog.
- `CustomLLMJudge` from `fi.evals.metrics` takes a free-form rubric and returns a score. Use it for domain-specific quality.
- The pattern is the same for online evals: read `output` and `context` off a live span instead of a JSONL row, and call the same `evaluate()`.
Wiring evals into CI
The minimal harness becomes a CI gate when you add thresholds. A pytest test that fails the build when scores drop:
import pytest
import statistics

from eval.harness import load_golden, score_row

@pytest.fixture(scope="module")
def scores():
    return [score_row(row) for row in load_golden()]

def test_faithfulness_floor(scores):
    avg = statistics.mean(r["faithfulness"] for r in scores)
    assert avg >= 0.80, f"faithfulness {avg:.3f} below 0.80"

def test_deterministic_pass_rate(scores):
    rate = sum(r["deterministic_pass"] for r in scores) / len(scores)
    assert rate >= 0.98, f"deterministic pass rate {rate:.3f} below 0.98"
Run the test on every PR. Keep a bench/ folder with the last 30 days of production failures, frozen as test cases, so the chain cannot regress on a known-bad input.
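That bench folder plugs into the same harness with one parametrized test. A sketch, assuming each frozen failure is a JSON file with the same fields as the golden set:

```python
import json
from pathlib import Path

import pytest

from eval.harness import score_row

BENCH = sorted(Path("bench").glob("*.json"))

@pytest.mark.parametrize("case_path", BENCH, ids=lambda p: p.stem)
def test_known_bad_input_stays_fixed(case_path):
    row = json.loads(case_path.read_text())
    result = score_row(row)
    # Every frozen production failure must stay above the same floors.
    assert result["deterministic_pass"], f"{case_path.stem} failed deterministic checks"
    assert result["faithfulness"] >= 0.80, f"{case_path.stem} faithfulness regressed"
```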
Online evals: sampling live traffic
Offline evals catch known regressions. Online evals catch the cases your golden set did not anticipate.
- Sample 1 to 5 percent of live traffic.
- Run the same evaluators against the live span (`fi.evals.evaluate` reads inputs directly off the trace).
- Alert when the rolling average drops below your threshold.
- Queue the lowest-scoring traces for the weekly human review.
The cost stays bounded because you sample, and the latency stays bounded because cloud judges like turing_flash run in roughly 1 to 2 seconds, turing_small in 2 to 3 seconds, and turing_large in 3 to 5 seconds (Future AGI cloud evals docs). For most teams a sampled turing_flash judge adds no perceptible p95.
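The online path reuses the same `evaluate()` call as the offline harness; only the input source changes. A sketch, where the span fields, the sample rate, and the alert hook are illustrative placeholders for whatever your stack already provides:

```python
import random

from fi.evals import evaluate

SAMPLE_RATE = 0.02            # score 2 percent of live traffic
FAITHFULNESS_FLOOR = 0.80
recent_scores: list[float] = []

def alert(message: str) -> None:
    print(f"[ALERT] {message}")   # swap for Slack, PagerDuty, etc.

def maybe_score_span(span: dict) -> None:
    """Call this from your span-export hook; the span keys here are illustrative."""
    if random.random() > SAMPLE_RATE:
        return
    score = evaluate(
        "faithfulness",
        output=span["output"],
        context=span["retrieved_context"],
    )
    recent_scores.append(score)
    # Alert on a rolling window, not on single outliers.
    window = recent_scores[-200:]
    if len(window) >= 50 and sum(window) / len(window) < FAITHFULNESS_FLOOR:
        alert(f"rolling faithfulness {sum(window) / len(window):.2f} below floor")
```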
Top tools and platforms for building an LLM eval framework in 2026
Here is the ranked list for the space where Future AGI competes. The ranking criterion: best fit for teams building an LLM evaluation framework end to end.
1. Future AGI
The eval superset. fi.evals ships deterministic, rubric, LLM-as-judge, and agent evaluators behind one Python SDK (ai-evaluation, Apache 2.0). fi.simulate covers agent trajectories. traceai-* auto-instrumentation covers LangChain, LangGraph, OpenAI Agents SDK, CrewAI, and direct provider SDKs. The Agent Command Center at /platform/monitor/command-center joins traces, scores, and alerts in one dashboard. Pick Future AGI when you want one stack that covers offline CI, online sampling, and agent simulation.
2. DeepEval
Open-source evaluator library by Confident AI (source). Strong rubric coverage and a pytest-style harness. Lighter on agent-level evals and trace storage than Future AGI; teams often pair it with a separate observability tool.
3. Ragas
Open source, focused on RAG metrics (source). Excellent faithfulness, context recall, and context precision evaluators. Narrower than Future AGI or DeepEval outside of RAG; usually one of several libraries in a stack rather than the whole stack.
4. LangSmith
LangChain’s evaluator + dataset platform (docs). Strong inside the LangChain ecosystem. Cross-framework teams (OpenAI Agents SDK, CrewAI, custom Python agents) usually pair it with a wider stack.
5. Braintrust
Polished UX for prompt experiments and dataset runs (docs). Strong on the “compare two prompts side by side” workflow, lighter on agent-level evals.
Handling common evaluation challenges
Hallucinations
A faithfulness evaluator catches them, but only if your retrieval is good enough to surface the right context in the first place. Score context recall in the same harness. If recall is low, the LLM cannot answer correctly no matter how good the prompt.
Bias and fairness
Slice your evaluation by user cohort (region, language, role) and watch for score gaps. A model that scores 0.85 overall but 0.55 on one cohort is broken for that cohort.
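Cohort slicing is a few lines once each result row carries a cohort tag. A sketch; the `cohort` field and the sample values are assumed additions to the rows the harness already produces:

```python
from collections import defaultdict

# Illustrative result rows; in practice these come from score_row() with a cohort tag added.
results = [
    {"cohort": "en-US", "faithfulness": 0.88},
    {"cohort": "en-US", "faithfulness": 0.84},
    {"cohort": "de-DE", "faithfulness": 0.57},
]

def scores_by_cohort(rows: list[dict]) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in rows:
        buckets[r.get("cohort", "unknown")].append(r["faithfulness"])
    return {cohort: sum(v) / len(v) for cohort, v in buckets.items()}

per_cohort = scores_by_cohort(results)
print(per_cohort)
# Gate on the worst cohort, not the overall average.
worst_cohort, worst = min(per_cohort.items(), key=lambda kv: kv[1])
if worst < 0.70:
    print(f"cohort gap: {worst_cohort} averages {worst:.2f}, far below the overall score")
```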
Overfitting
If your golden set scores keep going up but public benchmark scores stay flat, you are overfitting the eval set. Refresh the golden set on a fixed cadence and keep the adversarial slice frozen.
Latency bottlenecks
Profile by chain stage. Most latency lives in the LLM call, but retrievers, parsers, and tool calls each contribute. Use streaming where you can, cache aggressively, and pick a faster model for low-stakes routes.
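A per-stage timing sketch using only the standard library. The stage names and sleeps are placeholders for your real retriever, LLM, and parser calls; if every stage already emits a span, your tracing captures the same numbers:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Illustrative chain; replace each block with the real stage.
with stage("retrieve"):
    time.sleep(0.12)
with stage("llm_call"):
    time.sleep(1.40)
with stage("parse"):
    time.sleep(0.03)

total = sum(timings.values())
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {seconds:.2f}s  ({seconds / total:.0%} of total)")
```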
A/B testing and iteration
Once the framework is in place, the iteration loop is straightforward:
- A/B test two prompt or model variants on a percentage of traffic.
- Compare evaluator scores and latency p50/p95.
- Stress test the winner at 2x and 3x expected peak.
- Run an adversarial sweep on the winner before merge.
- Version every prompt template alongside its eval scores, so rollback is one config flip.
Conversational consistency tests deserve a separate pass: multi-turn threads regress in ways that single-turn evals miss. A 5-turn replay of the top 50 user conversations every release catches most of it.
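A minimal replay sketch for that pass, reusing the `run_chain()` and `evaluate()` calls from the offline harness. The exported conversation format and the naive history handling are illustrative assumptions, not a prescribed design:

```python
from fi.evals import evaluate

from eval.harness import run_chain  # same chain entry point as the offline harness

# Illustrative exported conversation: ordered user turns plus the docs involved.
conversation = {
    "id": "thread-017",
    "turns": [
        "What plans do you offer?",
        "Which of those includes SSO?",
        "And how much is that one per seat?",
    ],
    "docs": ["Pricing page text..."],
}

def replay(convo: dict) -> list[float]:
    history, scores = [], []
    for turn in convo["turns"]:
        # Naive history handling: prepend prior turns to the question.
        question = "\n".join(history + [turn])
        answer = run_chain(question, convo["docs"])
        scores.append(
            evaluate("faithfulness", output=answer, context="\n\n".join(convo["docs"]))
        )
        history.extend([turn, answer])
    return scores

print(replay(conversation))
```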
How Future AGI helps teams build an LLM evaluation framework
Install ai-evaluation and the traceai-* package for your stack. fi.evals gives you the four evaluator types, traceai gives you the spans, and the Agent Command Center dashboard at /platform/monitor/command-center joins them.
pip install ai-evaluation traceai-langchain
Then call evaluate() from fi.evals in CI and on live spans. Same code path, different inputs. Future AGI quickstart.
Frequently asked questions
What is an LLM evaluation framework in 2026?
Should I build my own evaluator stack or use a platform?
What metrics matter most for an LLM eval framework?
How do I calibrate LLM-as-judge evaluators?
Offline evals or online evals?
How does Future AGI fit into a custom evaluation framework?
What is the right cadence for human review?
How do I keep evaluator costs bounded?