How to Build an LLM Evaluation Framework From Scratch (2026)

Building an LLM eval framework is a one-week project and a one-year maintenance burden. The eight components, honest cost map, build vs buy guidance.

May 5, 2026

Updated May 19, 2026

14 min read

llm-evaluation build-vs-buy ci-cd llm-observability python 2026

You can write an LLM eval framework in two hundred lines of Python over a weekend. The hello-world demos beautifully. Six months in you’ve got three judges with three parsing regexes, a JSONL dataset nobody updated since November, a CI gate that fails on judge-model drift instead of prompt regressions, and the second engineer on rotation just asked “how does this even work.” The framework didn’t fail. The 80 percent of engineering effort the tutorials skipped did.

The hello-world is easy. Parsing variance, judge calibration, distributed orchestration, OTel emission, dataset versioning, and the four-line refactor that triggers every time a vendor changes their model are the framework.

The opinion this post earns: building an LLM evaluation framework from scratch is a one-week project and a one-year maintenance burden. Build the rubric. Buy the runner. The full version of that build-vs-buy decision framework is worth reading alongside this guide. The rubric encodes your domain (legal citation validity, clinical refusal taxonomy, ad-policy adherence). It’s hours of work and worth every minute because nobody else can write it. The runner is the judge engine, retry layer, cache, parallel executor, OTel emitter, classifier cascade, and self-improvement loop. Same six engineers at every company, same six months, same parsing bug. No edge in writing it twice.

This guide walks the build component by component, with the honest cost map for each. Code shaped against the ai-evaluation SDK and the fi CLI.

TL;DR: the eight components, scored build-or-buy

Component	What it does	Build?	Why
1. Rubric registry	Versioned scoring definitions	Build	Encodes your domain; nobody else can write it
2. Dataset schema	Typed cases, JSONL, route tags	Build	The dataset is your worldview
3. Judge engine	Model call, parse, retry, cache	Buy	Parsing variance and drift, same everywhere
4. Parallel execution	Sharded, distributed for scale	Buy	Celery/Ray/Temporal wiring is not your edge
5. Statistical gating	Welch’s t-test on deltas, percentile floors	Build	Thresholds are policy; math is library
6. OTel emission	Span-attached scoring on live traffic	Buy	50+ AI surfaces, four languages
7. Calibration loop	Inter-rater reliability vs human labels	Build	The labels are yours; the math is shared
8. Closed loop	Failing traces cluster, promote to dataset	Buy	HDBSCAN over span embeddings is generic

Build four, buy four. The four you build make the framework specific to your product. The four you buy look identical at every company that gets it right.

The two questions that decide everything

Specific or generic? Rubrics are specific. Judge engines are not; every team writes the same backoff loop, the same regex on integer output, the same cache key. Specific work is worth owning.

Compounds or depreciates? A dataset that grows from production failures compounds; by year two you have a regression suite no competitor can replicate. A custom OTel emitter depreciates; the moment a vendor adds a span kind you don’t support, you’re behind.

Build the components that score high on both. Buy the rest.

Components 1 and 2: rubric registry and dataset (build)

Both belong in the same git history as the agent code. The rubric holds the prompt template, scoring scale, and pinned judge model; the dataset holds typed cases tagged by route.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass(frozen=True)
class Rubric:
    name: str
    version: str
    judge_model: str            # pin; bump deliberately
    template: str               # {input}, {output}, {expected}, {context}
    scale: tuple[int, int] = (1, 5)
    parser: Optional[Callable[[str], float]] = None
    route: Optional[str] = None

@dataclass
class EvalCase:
    id: str
    route: str
    input: str
    expected_output: Optional[str]
    retrieval_context: Optional[list[str]] = None
    expected_tool_calls: Optional[list[dict]] = None
    tenant_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)

Six rules decide whether these earn their keep:

Rubrics in git, not a database. Diff invisibility is how drift compounds.
Versioned with the prompt. Same PR, traceable regressions.
Decomposed, not fuzzy. “Rate the quality” is noise. “Every claim supported, 5 = none unsupported, 1 = largely fabricated” is signal. Decomposed rubrics run 0.20-0.30 higher inter-rater reliability on public datasets.
Dataset sampled from production. A 200-case set from imagination loses to a 100-case set from real traces. Error Feed (Component 8) promotes weekly.
Skew to failures. Most regression signal comes from the hardest 10 percent.
Route-tag every case. A PR touching the legal-RAG prompt shouldn’t rerun support-bot evals.

The 72-class ai-evaluation SDK ships Groundedness, ContextAdherence, Completeness, FactualAccuracy, AnswerRefusal, IsHarmfulAdvice, EvaluateFunctionCalling, TaskCompletion, and the rest as EvalTemplate classes. Use them where they fit; subclass CustomLLMJudge where they don’t. Dataset sweet spot: 100-200 cases per route for PR-blocking. Below 100, variance buries the signal; above 500, the bill grows faster than detection sharpens.
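
The route-tagging rule is easiest to see in a loader. What follows is a minimal sketch, not the SDK's loader: `load_dataset` and its `routes` argument are hypothetical names, and the file layout (one JSON object per line, keys matching the `EvalCase` fields above) is an assumption.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class EvalCase:
    # Same fields as the EvalCase above, repeated so this block runs on its own.
    id: str
    route: str
    input: str
    expected_output: Optional[str]
    retrieval_context: Optional[list[str]] = None
    expected_tool_calls: Optional[list[dict]] = None
    tenant_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def load_dataset(path: str, routes: Optional[Iterable[str]] = None) -> list[EvalCase]:
    """Read one JSON object per line; keep only cases on the requested routes.

    Hypothetical helper: `routes=None` means "every route".
    """
    wanted = set(routes) if routes is not None else None
    cases: list[EvalCase] = []
    seen_ids: set[str] = set()
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        case = EvalCase(**json.loads(line))
        if case.id in seen_ids:
            # Duplicate ids make regressions untraceable, so fail loudly.
            raise ValueError(f"{path}:{line_no}: duplicate case id {case.id!r}")
        seen_ids.add(case.id)
        if wanted is None or case.route in wanted:
            cases.append(case)
    return cases
```

A PR that only touches the legal-RAG prompt would then call `load_dataset(path, routes=["legal-rag"])` and leave the support-bot cases alone.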

Components 3 and 4: judge engine and parallel execution (buy)

Three failure modes you didn’t budget for, every one a quarter of an engineer’s time.

Output parsing variance. Every judge returns scores differently. Anthropic prefers <score>4</score>; GPT-4 class models tend toward Score: 4; Gemini wraps in markdown; a stronger model rambles first and tucks the integer at the end. Your regex catches 95 percent on day one. On day 90 a vendor ships a finetune that wraps numbers in Unicode bold and your parser silently returns the scale midpoint on the long tail. The bug shows up as “evals trending up” because midpoint scores anchor toward the mean.
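
The midpoint bug comes from a parser that falls back to a default. A sketch of the alternative, assuming a 1-5 scale and the output formats named above: normalize Unicode first, try the explicit formats in order, and raise when nothing in range is found. `parse_score` and `ScoreParseError` are illustrative names, not SDK API.

```python
import re
import unicodedata


class ScoreParseError(ValueError):
    """Raised when no in-range score can be extracted from judge output."""


# Tried in order; the most explicit formats come first.
_PATTERNS = [
    re.compile(r"<score>\s*(\d+(?:\.\d+)?)\s*</score>", re.IGNORECASE),
    re.compile(r"score\s*[:=]\s*\**\s*(\d+(?:\.\d+)?)", re.IGNORECASE),
    re.compile(r"(\d+(?:\.\d+)?)\s*/\s*\d+"),   # "4/5"
    re.compile(r"(\d+(?:\.\d+)?)[\s*.`]*$"),    # number tucked at the very end
]


def parse_score(raw: str, scale: tuple[int, int] = (1, 5)) -> float:
    # NFKC folds styled digits (for example mathematical bold) back to ASCII.
    text = unicodedata.normalize("NFKC", raw).strip()
    low, high = scale
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        value = float(match.group(1))
        if low <= value <= high:
            return value
        raise ScoreParseError(f"score {value} outside {scale}: {raw!r}")
    # No silent midpoint: an unparseable response is an error to count.
    raise ScoreParseError(f"no score found in: {raw!r}")
```

Counting `ScoreParseError` per judge model gives an early signal that a vendor changed its output format, before the averages move.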

Calibration drift. A judge that scored 4.1 last month scores 4.3 on the same input after a vendor bump. Your CI gate fires on prompts nobody changed. Pinned model, versioned judge, and the calibration loop in Component 7 are the fix.

Cost economics. A frontier judge at 1.2 seconds per call against a 10K-example nightly suite plus 5 percent of production traffic is a six-figure quarterly bill. The classifier cascade (NLI classifier triages, frontier judge adjudicates the long tail) is the lever, and it’s a months-long build to get right. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2; the cascade comes built in.
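
The cascade's control flow is small; the months go into training and calibrating the classifier. A sketch of the routing only, with hypothetical callables: `classifier` returns a score and a confidence, `judge` is the expensive frontier call, and the 0.9 confidence cut-off is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    score: float
    source: str  # "classifier" or "judge"


def cascade_score(
    example: str,
    classifier: Callable[[str], tuple[float, float]],  # -> (score, confidence)
    judge: Callable[[str], float],
    min_confidence: float = 0.9,
) -> CascadeResult:
    """Accept the cheap score when it is confident; escalate the rest."""
    score, confidence = classifier(example)
    if confidence >= min_confidence:
        return CascadeResult(score, "classifier")
    # Long tail: the classifier is unsure, so pay for the frontier judge.
    return CascadeResult(judge(example), "judge")
```

The share of results with `source == "judge"` is the number to watch: it is the fraction of traffic still billed at frontier prices.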

Parallel execution piles on. A 200-case suite at 1.2 seconds per judge call takes four minutes serially. Sharded across 16 workers, the ideal is about 15 seconds; budget closer to 30 once stragglers and rate-limit backoff are counted. Above the single-host rate limit, you need a distributed runner (Celery, Ray, Temporal, Kubernetes) and a job graph that handles retries, partial failures, and budget caps. The SDK ships all four as drop-in.
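
For the single-host case, the arithmetic and the worker pool fit in a few lines. This is a sketch with the standard library only; `run_sharded` and `ideal_wall_clock` are illustrative names, and the estimate ignores rate limits, retries, and stragglers, so real runs land above it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

Case = TypeVar("Case")


def ideal_wall_clock(n_cases: int, seconds_per_call: float, workers: int) -> float:
    """Lower bound on wall-clock time: full batches of `workers` calls."""
    batches = -(-n_cases // workers)  # ceiling division
    return batches * seconds_per_call


def run_sharded(
    cases: Sequence[Case],
    judge: Callable[[Case], float],
    max_workers: int = 16,
) -> list[float]:
    """Score cases concurrently; results come back in input order."""
    # Threads suit judge calls because the time is spent waiting on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge, cases))
```

With the numbers above, `ideal_wall_clock(200, 1.2, 1)` is the four-minute serial run and `ideal_wall_clock(200, 1.2, 16)` is 13 batches, a little under 16 seconds.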

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator(max_workers=16)  # FI_API_KEY / FI_SECRET_KEY from env

custom_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "faithfulness_v3",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Score whether the answer is faithful to the retrieval context. "
            "1 = largely fabricated. 5 = every claim supported."
        ),
    },
)
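# run_agent and load_dataset are your own helpers, not SDK imports.
# custom_judge (above) is defined for illustration; only the built-in
# templates are passed to this run.
result = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextRelevance()],
    inputs=[TestCase(input=ex.input, output=run_agent(ex),
                    context="\n\n".join(ex.retrieval_context or []), id=ex.id)
            for ex in load_dataset("evals/datasets/support-bot.jsonl")],
)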

Pin the model. Cache on (rubric_version, input, output, model). Invalidate on version change, not on every PR. What you’d write yourself otherwise: parsing variance handler, Pydantic output schema, Jinja2 templates, worker pool, rate-limit backoff, partial-failure recovery, budget cap, sharding, dead-letter queue. None of it is your edge.
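
The cache rule can be sketched as a hash over exactly those four fields. `cache_key` and `JudgeCache` are hypothetical names, and the in-memory dict stands in for whatever store you would actually use.

```python
import hashlib
import json
from typing import Callable


def cache_key(rubric_version: str, judge_model: str, case_input: str, output: str) -> str:
    """Stable key over (rubric_version, input, output, model)."""
    payload = json.dumps(
        {
            "rubric_version": rubric_version,
            "model": judge_model,
            "input": case_input,
            "output": output,
        },
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory stand-in; swap the dict for Redis or disk in practice."""

    def __init__(self) -> None:
        self._store: dict[str, float] = {}

    def get_or_score(self, key: str, score_fn: Callable[[], float]) -> float:
        if key not in self._store:
            self._store[key] = score_fn()  # only pay for the judge on a miss
        return self._store[key]
```

Because the rubric version and model are part of the key, bumping either one invalidates old entries on its own; a PR that changes neither reuses every cached score whose agent output is unchanged.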

Component 5: statistical gating

Build this. The math is library; the thresholds are policy.

The mistake every team makes once: “fail the PR if mean Groundedness drops below 0.85.” With 30 cases and a per-case standard deviation near 0.2, the 95 percent confidence interval on a rubric mean normalized to 0-1 is roughly ±0.07. A two-point drop (0.02) sits inside the noise. The gate fires green on real regressions and red on judge noise; engineers learn to ignore it inside a month.
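
The ±0.07 figure can be reproduced with a normal approximation if per-case scores have a standard deviation near 0.2 on a 0-1 scale. That standard deviation is an assumption made here to match the number, not a value stated in the text. The same formula, inverted, says how many cases a mean-based gate would need.

```python
import math
from statistics import NormalDist


def ci_half_width(std_dev: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval on a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 percent
    return z * std_dev / math.sqrt(n)


def cases_needed(std_dev: float, half_width: float, confidence: float = 0.95) -> int:
    """Smallest n whose interval is no wider than +/- half_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * std_dev / half_width) ** 2)


print(round(ci_half_width(0.2, 30), 3))   # 0.072
print(cases_needed(0.2, 0.02))            # 385
```

Under that assumption, resolving a 0.02 shift from a single mean takes a few hundred cases, which is why a 30-case mean gate mostly measures noise.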

The fix is a two-threshold gate. An absolute floor catches catastrophic drops; a delta gate fires on statistically meaningful regressions against a 7-day rolling baseline.

import statistics
from scipy import stats

def regression_gate(current, baseline, alpha=0.05, min_effect=0.03):
    delta = statistics.mean(current) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(current, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"
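
The function above is the delta half of the gate. A sketch of how the absolute floor could wrap it follows; `two_threshold_gate` is a hypothetical name, and the delta gate is passed in as a callable so this block stays free of the SciPy dependency.

```python
import statistics
from typing import Callable, Sequence

GateResult = tuple[bool, str]


def two_threshold_gate(
    current: Sequence[float],
    baseline: Sequence[float],
    floor: float,
    delta_gate: Callable[[Sequence[float], Sequence[float]], GateResult],
) -> GateResult:
    """Fail on a catastrophic drop first, then defer to the delta gate."""
    mean_now = statistics.mean(current)
    if mean_now < floor:
        # Below the floor the run fails no matter what the baseline looks like.
        return False, f"absolute floor breached: mean={mean_now:.3f} < {floor}"
    return delta_gate(current, baseline)
```

In use, `delta_gate` would be `regression_gate`, and `floor` is a policy value set per route; the floor is checked first so a slowly decaying baseline cannot excuse a collapse.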

Welch’s t-test on per-example scores against the trailing baseline. Two-proportion z-test on pass-rate rubrics like citation validity. Percentile floors for long-tail failures: a regression pushing p95_score below the tail floor while the mean stays intact fails on the percentile and tells the on-call where to look. The fi CLI ships pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles as native assertion metrics.
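
For pass/fail rubrics, the two-proportion z-test mentioned above can be written with the standard library. A minimal sketch, assuming a one-sided question ("is the current pass rate lower than baseline?"); the function name and the `alpha` default are illustrative.

```python
import math
from statistics import NormalDist


def pass_rate_regressed(
    cur_pass: int, cur_n: int, base_pass: int, base_n: int, alpha: float = 0.05
) -> tuple[bool, float]:
    """Pooled two-proportion z-test; returns (regressed, one_sided_p_value)."""
    p_cur = cur_pass / cur_n
    p_base = base_pass / base_n
    pooled = (cur_pass + base_pass) / (cur_n + base_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / cur_n + 1 / base_n))
    if std_err == 0:
        # Both runs are all-pass or all-fail: no evidence of a change.
        return False, 1.0
    z = (p_cur - p_base) / std_err
    p_value = NormalDist().cdf(z)  # small when current is well below baseline
    return p_value < alpha, p_value
```

A drop from 180/200 to 160/200 passes is flagged; 180/200 against 180/200 is not.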

# fi-evaluation.yaml: native CI assertions
assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"   # gate on the tail, not the mean
    on_fail: "error"
  - template: "context_relevance"
    condition: "pass_rate >= 0.90"
    on_fail: "error"

fi run --check exits with code 2 on assertion failure (distinct from 1 for evaluation errors), 3 when --strict assertions warn, 6 on API failure. CI policies wire into GitHub Actions or Buildkite cleanly, without grep heuristics on stdout. The thresholds, the rolling-baseline policy, the per-route effect floors are the part you build. Build them once; they outlive the judge model you wrote them against.
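
Distinct exit codes let the CI policy live in code instead of log scraping. The mapping below uses the codes listed above; treating 0 as a pass and retrying on API failure are assumptions of this sketch, and `ci_action` is a hypothetical policy, not CLI behavior.

```python
from enum import Enum


class GateOutcome(Enum):
    PASSED = "passed"
    EVAL_ERROR = "evaluation error"
    ASSERTION_FAILED = "assertion failed"
    STRICT_WARNING = "strict warning"
    API_FAILURE = "API failure"
    UNKNOWN = "unknown"


# Codes 1, 2, 3, 6 as described in the text; 0 as success is assumed.
_EXIT_CODES = {
    0: GateOutcome.PASSED,
    1: GateOutcome.EVAL_ERROR,
    2: GateOutcome.ASSERTION_FAILED,
    3: GateOutcome.STRICT_WARNING,
    6: GateOutcome.API_FAILURE,
}


def interpret_exit_code(code: int) -> GateOutcome:
    return _EXIT_CODES.get(code, GateOutcome.UNKNOWN)


def ci_action(outcome: GateOutcome) -> str:
    """Example policy: retry infrastructure failures, block everything else."""
    if outcome is GateOutcome.PASSED:
        return "merge"
    if outcome is GateOutcome.API_FAILURE:
        return "retry"  # an outage should not read as a quality regression
    return "block"
```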

Component 6: OpenTelemetry emission and span-attached scoring

Buy this. A framework that only runs in CI catches half the bugs. The same rubric has to run against live OTel spans and attach its score back to the trace.

Writing this yourself: an OTel install per framework (OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, Vercel AI SDK, Spring AI, Semantic Kernel, n vector DBs), a semantic convention you maintain forever, a span-attached score writer that doesn’t conflict with framework auto-instrumentation, a sampler that respects budget. A year of platform work in disguise.

traceAI (Apache 2.0) ships 50+ AI surfaces across Python (46), TypeScript (39), Java (24 including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C# (1 core). Phoenix, Langfuse, and DeepEval don’t ship JVM coverage. Pluggable semantic conventions at register() time (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest into Phoenix or Traceloop without re-instrumenting. EvalTag attaches the rubric you wrote for pytest as a span-attached scorer; 62 built-in evals run server-side at zero inline latency.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType,
)
from traceai_openai import OpenAIInstrumentor

eval_tags = [EvalTag(
    eval_name=EvalName.GROUNDEDNESS,
    type=EvalTagType.OBSERVATION_SPAN, value=EvalSpanKind.LLM, config={},
    mapping={"context": "retrieval.documents", "output": "output.value"},
)]
trace_provider = register(project_type=ProjectType.OBSERVE,
                          project_name="support-bot", eval_tags=eval_tags)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

Same rubric, two contexts. A regression in CI that doesn’t show in production means the dataset stopped being representative. A drift in production with green CI means the dataset is missing a failure mode. Both signals are content.

Components 7 and 8: calibration and closed loop

Calibration is half build, half buy. Pin a 50-200 case hold-out with human labels. Run the judge against it weekly. Track Cohen’s kappa or Spearman correlation; alarm when correlation drops below an agreed floor (0.55 is a typical starting threshold for subjective rubrics, higher for factual ones). The math is one library; the labels are work only your team can do. The Platform’s self-improving evaluators retune the rubric from thumbs feedback when calibration slips, so you don’t re-author the prompt every quarter.

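from scipy.stats import spearmanr

class CalibrationDrift(RuntimeError):
    """Raised when judge-to-human agreement falls below the agreed floor."""

def calibration_check(judge_scores, human_scores, threshold=0.55):
    # Rank correlation between judge scores and human labels on the hold-out.
    rho, _ = spearmanr(judge_scores, human_scores)
    if not (rho >= threshold):  # also trips on NaN (constant scores)
        raise CalibrationDrift(f"judge correlation fell to {rho:.2f}")
    return rho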
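
Cohen's kappa, the other statistic named above, fits categorical labels such as pass/fail or refusal classes. A standard-library sketch of the unweighted form follows; for ordinal 1-5 scores a weighted kappa or the Spearman check is usually the better fit, since unweighted kappa treats a 4-versus-5 disagreement the same as 1-versus-5.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

On `[1, 1, 2, 2]` against `[1, 2, 2, 2]` the raw agreement is 0.75, chance agreement is 0.5, and kappa comes out at 0.5.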

The closed loop is pure buy. The dataset that grows weekly from production failures is your moat; the clustering and triage that get it there is not. Writing it yourself: HDBSCAN over span embeddings in ClickHouse, soft-cluster assignment that survives schema migrations, a triage agent that writes the RCA without running ten minutes per call, integrations into Linear/Slack/GitHub/Jira. Six months of platform work for an output identical at every company.

Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering groups every trace failure into a named issue; a Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 characters, 90 percent prompt-cache hit ratio) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). Fixes feed the Platform’s self-improving evaluators. An engineer (or a scheduled job) promotes representative traces into the dataset; the next CI run gates against them.
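
The promotion step at the end of that loop is simple once clustering has run. This sketch assumes each failing trace arrives as a dict with `trace_id`, `cluster`, `membership`, `route`, and `input` keys; those field names are hypothetical, not Error Feed's schema. It relies on the HDBSCAN convention that label -1 means noise.

```python
import json
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN marks unclustered points with -1


def pick_representatives(traces: list[dict], per_cluster: int = 3) -> list[dict]:
    """Take the strongest members of each cluster; skip noise."""
    by_cluster: dict[int, list[dict]] = defaultdict(list)
    for trace in traces:
        if trace["cluster"] != NOISE_LABEL:
            by_cluster[trace["cluster"]].append(trace)
    picked: list[dict] = []
    for cluster_id in sorted(by_cluster):
        ranked = sorted(by_cluster[cluster_id], key=lambda t: t["membership"], reverse=True)
        picked.extend(ranked[:per_cluster])
    return picked


def promote(traces: list[dict], existing_ids: set[str], per_cluster: int = 3) -> list[str]:
    """Return JSONL lines for new dataset cases, skipping ids already present."""
    lines = []
    for trace in pick_representatives(traces, per_cluster):
        if trace["trace_id"] in existing_ids:
            continue
        lines.append(json.dumps({
            "id": trace["trace_id"],
            "route": trace["route"],
            "input": trace["input"],
            "expected_output": None,  # filled in during human review
            "metadata": {"cluster": trace["cluster"], "source": "production"},
        }))
    return lines
```

Capping promotions per cluster keeps one noisy failure mode from flooding the dataset while rarer clusters still get coverage.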

The honest cost map

Eight components, with the engineering cost you’d otherwise own.

Component	Build cost	Maintenance	Failure mode if you skip it
Rubric registry	2-3 days	1 day/quarter	Drift between prompt and rubric
Dataset schema	1-2 days	1 hr/week (promotion)	2024 dataset evaluating a 2026 product
Judge engine	2-3 weeks + 1 engineer-quarter	0.5 FTE	Parsing variance; midpoint anchoring on long tail
Parallel execution	4-6 weeks initial	0.25 FTE	30-minute gates; engineers disable them by month two
Statistical gating	3-5 days	1 day/quarter	False positives drown out real regressions
OTel emission	2-3 engineer-quarters per framework	0.5 FTE	Production scoring drifts from CI
Calibration loop	1 week + ongoing labels	0.25 FTE	CI gates fire on prompts nobody changed
Closed loop	6-9 months	1 FTE	Dataset stops growing

Roughly 1 to 1.5 engineers full-time once you ship two products. The cost shows up in the months you didn’t plan for: the judge-model upgrade, the OTel semantic convention bump, the dataset that quietly stopped covering the support-bot’s top failure mode. The hybrid pattern (Apache 2.0 SDK plus the Platform) is what most teams converge on by month four anyway. The time saved goes back into the rubrics and dataset where you have an edge.

When build-from-scratch is the right call (and when it isn’t)

Three cases where writing the full framework yourself is defensible:

You’re a research lab benchmarking eval methodology. The framework is the product; owning the parsing variance is the contribution.
You have a regulated isolation requirement. Public-sector, defense, or healthcare deployments where every dependency clears a six-month security review. The 17-MB self-hosted Agent Command Center Go binary clears most of these; if it doesn’t, write a thin custom runner against the OSS SDK rather than the full stack.
Your domain is so narrow the generic stack is dead weight. A single-rubric, single-route, single-tenant product. Even here, copying the 200-line MVP is faster than maintaining your own retry loop forever.

Three signals you’ve crossed the line and should stop:

You’re on engineer two. The framework stopped being your edge; it’s institutional knowledge in one person’s head.
You’re maintaining more than one parser. The day you write the fourth if response.startswith(...) is the day to stop.
CI cost is a quarterly conversation. Writing a classifier cascade is months of work; if the bill is climbing, buy the cascade.

Common pitfalls

Framework as a side project. Eval definitions belong in the same repo as agent code, reviewed in the same PR as the prompt they score.
Rubric prompts in a database. Diff invisibility is how drift compounds. Git or nothing.
Floating judge model. Score drift across runs of “the same eval.” Pin and version alongside the rubric.
30-case dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow the dataset or gate on percentiles.
Full LLM-judge sweep on every PR. Prices the gate out of existence. PR-blocking gets cheap rubrics; the LLM-judge sweep runs nightly.
Custom parser per judge model. Three parsers means you’re maintaining the engine you should’ve bought.
CI-only framework. Production drifts past the dataset inside a quarter. Span-attached scoring is the bridge.
No calibration loop. Judge drift surfaces as “the gate started failing on random PRs.” The hold-out set is the cheap fix.
Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite. Promote weekly.

Three deliberate tradeoffs

The build-vs-buy line moves. A 200-line framework is fine for one product. By the time the team owns three products, three suites, three judges, and a custom OTel pipeline, the engineering cost beats a managed runner. The right time to switch is usually when the second engineer joins the rotation.
Self-improving rubrics need oversight. Rubrics that learn from production traces can drift in unexpected directions. Pin a human-labeled hold-out; alarm when judge-to-human correlation drops below an agreed floor. Build the hold-out before turning the self-improvement loop on.
LLM-as-judge cost matters at scale. Sample by failure signal, classifier models for high-volume passes, frontier models for adjudication. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2, which makes daily full-dataset reruns financially viable instead of weekly.

How Future AGI ships the eval stack as a package

A real eval framework needs eight components, four of which are the same six engineers, six months, at every company. Future AGI ships the four you’d otherwise rewrite as Apache 2.0 OSS, plus the Platform for the cost and scale parts. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics, classifier-backed cost, and the closed loop.

ai-evaluation SDK (Apache 2.0) — 72 EvalTemplate classes (Groundedness, ContextAdherence, Completeness, FactualAccuracy, AnswerRefusal, IsHarmfulAdvice, TaskCompletion, EvaluateFunctionCalling, and the rest). Real API: from fi.evals import Evaluator; Evaluator(...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)]). CustomLLMJudge with multi-modal inputs, 13 guardrail backends (9 open-weight + 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
fi CLI — fi init scaffolds fi-evaluation.yaml. fi run --check --strict --parallel 16 runs assertions on pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles with CI-distinct exit codes (0/2/3/6). --mode hybrid routes between local classifier rubrics and cloud judges.
Future AGI Platform — self-improving evaluators tuned by thumbs feedback; an in-product authoring agent that writes unlimited custom evaluators from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
traceAI (Apache 2.0) — 50+ AI surfaces across Python (46) / TypeScript (39) / Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel) / C# (1 core). Pluggable semantic conventions at register() time. 14 span kinds (Phoenix has 8, Langfuse 5) including A2A_CLIENT/A2A_SERVER. 62 built-in evals wired server-side via EvalTag.
agent-opt — six optimizers under one API (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer). Scores against any of the SDK’s 72 evaluators.
Error Feed (inside the eval stack) — HDBSCAN clustering over ClickHouse embeddings plus a Sonnet 4.5 Judge agent writes the immediate_fix; fixes feed the Platform’s self-improving evaluators.
Agent Command Center — 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters plus OpenAI-compatible presets. RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA certified, AWS Marketplace.

You write the rubrics and the dataset. The package handles the rest. Closed loop ratchets the dataset, self-improving evaluators retune the rubric, classifier cascade keeps the bill bounded, and the same definitions run in CI and against live OTel spans on every product you ship.

Ready to write the rubric and stop maintaining the runner? pip install ai-evaluation, fi init, point the YAML at your eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, wire fi run --check --strict into your pull-request workflow. The four generic components are already there; you bring the four that make the framework yours.

Frequently asked questions

Should I build an LLM eval framework from scratch?

Build the parts only you can write: the rubrics that encode your domain (legal-citation validity, clinical refusal taxonomy, support-bot tone), the dataset sampled from your real traces, the CI policy that fits your release process. Buy the runner: judge engine, parallel execution, OTel span emission, classifier cascade, observability dashboard, evaluator self-improvement, eval-gated rollback. The first list is hours of work and unique to you. The second list is the same six engineers, the same six months, at every company that tries. The hybrid pattern (Apache 2.0 ai-evaluation SDK plus the Future AGI Platform for cost and scale) is what most teams converge on by month four anyway.

How much engineering does a real eval framework take to maintain?

More than you'd budget. The 200-line MVP is a weekend. The production version needs parsing variance fixes every time a judge model upgrades, calibration drift tracking against a human-labeled hold-out, distributed orchestration for 10K+ examples per nightly run, OTel span emission across every framework you ship on, dataset versioning that survives team turnover, a classifier cascade so the LLM-judge bill doesn't bury the quarter, and a self-improvement loop so the rubric ages with the product. Roughly 1 to 1.5 engineers full-time once you ship two products. The cost shows up in the months you didn't plan for.

What are the eight components of a real eval framework?

1. Rubric registry (versioned definitions, in code, alongside prompts). 2. Dataset schema (typed cases, JSONL on disk, git-tracked, route-tagged). 3. Judge engine (model invocation, output parsing, retry, cache). 4. Parallel execution (sharded by route, distributed runners for the long tail). 5. Statistical gating (Welch's t-test on deltas against a rolling baseline, not a fixed threshold on the mean; percentile floors on tails). 6. OTel emission and span-attached scoring (same rubric in CI and on live traffic). 7. Calibration loop (inter-rater reliability vs human labels, drift alarms). 8. Closed loop (failing production traces auto-cluster into named issues and promote back into the dataset). Skip any one and the framework becomes theatre by month six.

Why is the LLM-as-judge engine the worst part to build yourself?

Three reasons compound. Output parsing variance: every judge model returns score formats differently, every model upgrade changes the format, and your parser silently fails on the long tail. Calibration drift: a judge that scored 4.1 last month scores 4.3 on the same input this month after a model bump; you find this when CI gates start firing on prompts nobody changed. Cost economics: a frontier judge at 1.2 seconds per call against a 10K nightly dataset is a quarterly budget conversation. The Future AGI Platform runs classifier-backed judges at lower per-eval cost than Galileo Luna-2, which is what makes daily full-dataset reruns financially viable instead of weekly.

Where does the framework live in the repo?

Same repo as the agent code, alongside the prompts and tools. The eval definition is part of the contract the agent fulfills; treating evals as a separate project leads to drift between the prompt that ships and the rubric that scores it. Top-level evals/ directory: datasets/ (JSONL, route-tagged, versioned), rubrics/ (Python EvalTemplate classes or CustomLLMJudge configs), test_*.py (pytest), and a fi-evaluation.yaml the fi CLI assertion runner reads in CI. The rubric file is reviewed in the same PR as the prompt it scores.

What's the minimum viable eval framework?

Five pieces. A typed dataset class with route, input, expected output, and optional retrieval context. A rubric (prompt template plus scoring scale plus parser). A judge wrapper with cache and retry. A pytest fixture that loads the dataset, runs the rubric, asserts thresholds with a delta gate against a 7-day baseline. A span-attached version of the same rubric for live OTel traces. Five pieces in maybe 200 lines of Python; teams stop there and ship for six months before the maintenance burden compounds. The hybrid pattern moves the runner and the observability layer to the SDK on day one so the team owns only the rubric and the dataset.

Can the same rubric run in CI and against live production traces?

Yes, and the moment they diverge is the moment the framework stops working. The rubric definition lives in code, runs in pytest against the versioned dataset for the CI gate, and runs against live OTel spans via traceAI plus EvalTag for the production canary. A regression in CI that doesn't show in production means the dataset stopped being representative. A drift in production with green CI means the dataset is missing a failure mode the gate now needs to learn. Both signals are content, and the unified rubric is what makes the two signals comparable instead of arguable.

How does Future AGI fit the build-vs-buy line?

Future AGI ships the eval stack as a package designed for the hybrid pattern. The ai-evaluation SDK (Apache 2.0) is the runner you don't build: 72 EvalTemplate classes, CustomLLMJudge with multi-modal inputs, 13 guardrail backends, four distributed runners (Celery, Ray, Temporal, Kubernetes), a fi CLI with native CI assertions on pass_rate and p50/p90/p95_score. The Future AGI Platform is the operational layer: self-improving evaluators tuned by thumbs feedback, an in-product authoring agent that writes rubrics from natural-language descriptions, classifier-backed scoring at lower per-eval cost than Galileo Luna-2. traceAI (50+ AI surfaces across Python, TypeScript, Java, C#) carries the rubric to live traces. Error Feed sits inside the eval stack: HDBSCAN clustering plus a Sonnet 4.5 Judge writes the immediate_fix, fixes feed self-improving evaluators. You write the rubrics and the dataset. The package handles the rest.
