LLM Eval vs Classical ML Eval: A 2026 Bridge for ML Teams
Classical ML eval is closed-form. LLM eval is open-form. Here's the discipline that carries, the metrics that break, and the mapping that turns sklearn intuitions into a working LLM eval suite.
You hired the data scientists pre-2024. They came up on Kaggle, sklearn, MLflow, train/test splits, AUC, F1, RMSE, calibration curves. The discipline is excellent. The intuitions are sharp. Then you put them on an LLM eval workstream and half of those intuitions misfire in subtle, expensive ways.
The intuitions aren’t wrong. They’re under-specified for a new output space. Most posts on this topic frame LLM eval as a new discipline. It isn’t. It’s the same discipline against a different function. This post is the bridge: what carries cleanly from sklearn into LLM eval, what breaks, and how the ai-evaluation SDK makes the transition feel like extending a scikit-learn workflow rather than learning a new field.
The opinion this post earns: classical ML eval is closed-form. LLM eval is open-form. The mental model carries; the formulas don’t. Bring the discipline; throw away the metrics.
TL;DR: closed-form vs open-form
| Classical ML eval | LLM eval (2026) | Carries? |
|---|---|---|
| Labels in a CSV | Rubrics as prose | Discipline yes, formula no |
| Bounded output (classes, real numbers) | Unbounded text | Re-binarize via rubric |
| Metric is a formula (F1, AUC, RMSE) | Metric is a definition (judge or classifier) | Re-build per rubric |
| Train/test split | Frozen holdout plus rolling production sample | Mostly yes |
| Eval cost effectively free | Eval cost $0.001 to $0.50 per call | New budget axis |
| Drift = data drift over time | Data drift + model drift + judge drift | Partially |
| Stratified sampling | Stratified sampling | Yes, unchanged |
| Confusion matrix | Confusion matrix against rubric labels | Yes, unchanged |
| Cohen’s kappa | Cohen’s kappa for judge-versus-human | Yes, unchanged |
| Brier / ECE / reliability | Brier / ECE / reliability on judge scores | Yes, unchanged |
| CI gate thresholds | CI gate thresholds | Yes, unchanged |
Closed-form metrics break. Open-form discipline carries. Everything below is which is which.
What classical ML eval gets right (and LLM eval inherits)
The thing pre-2024 ML teams already do well is the part of LLM eval that matters most.
Define-measure-gate. You don’t ship a model on intuition. You define the metric, you measure against a held-out set, you gate the deploy on a CI threshold. That loop is identical for LLM systems. The only thing that changes is what sits behind the word “metric.”
Stratified sampling. If 5% of your traffic is high-risk legal questions, your eval set is 5% high-risk legal questions. Same for LLM eval. Random sampling underweights the long tail; stratified sampling doesn’t. The discipline transfers verbatim.
Holdout discipline. Keep a clean test set. Never let it leak into training. Score against it at the end. This still applies to LLM eval, with two caveats covered below (pre-training contamination and rubric leakage). The instinct is right.
Per-class reporting. Single-number scores hide problems. A 0.87 F1 hides a 0.4 recall on the minority class. Same for LLM eval. Report per-route, per-rubric, per-risk-tier. Aggregate scores lie.
Calibration as a first-class concern. Brier score, Expected Calibration Error, reliability diagrams. These tools tell you whether your model’s confidences mean what they claim to mean. They apply unchanged to LLM-judge scores and classifier confidences.
CI gating. Every classical ML team runs a build that fails if AUC drops two points. Run the same pattern with rubric scores. The discipline is the difference between “we have a metric somewhere” and “regressions can’t ship.”
This list is the discipline. Bring all of it.
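A minimal sketch of the define-measure-gate loop with per-slice reporting. The slice names, scores, and the two-point tolerance are made-up illustration values, not output from any real system. Note that the aggregate drop stays under the tolerance while one slice clearly regresses, which is the reason to gate per slice.

```python
from statistics import mean

# Hypothetical per-slice rubric pass rates; the slice names are made up.
baseline = {"billing": 0.92, "legal_high_risk": 0.88, "smalltalk": 0.96}
current = {"billing": 0.93, "legal_high_risk": 0.81, "smalltalk": 0.97}

MAX_DROP = 0.02  # gate: no slice may fall more than two points


def regressions(current, baseline, max_drop):
    """Return {slice: (old, new)} for every slice that dropped too far."""
    return {
        name: (baseline[name], score)
        for name, score in current.items()
        if baseline[name] - score > max_drop
    }


failed = regressions(current, baseline, MAX_DROP)
aggregate_drop = mean(baseline.values()) - mean(current.values())

for name, (old, new) in failed.items():
    print(f"REGRESSION {name}: {old:.2f} -> {new:.2f}")
print(f"aggregate drop: {aggregate_drop:.3f}")

# In CI, finish with `raise SystemExit(exit_code)` so the build fails.
exit_code = 1 if failed else 0
```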
What breaks: the closed-form metrics
The math you know assumes the world looks a certain way. LLM outputs don’t.
Accuracy assumes a single right answer. A model asked for the capital of France can answer “Paris,” “Paris, the capital of France,” “It’s Paris,” or “The capital is Paris (located in Europe).” Four responses, all correct, no two sharing a token sequence. Exact match against the reference “Paris” fails three of them, so accuracy computed that way measures string formatting, not correctness.
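The failure is easy to reproduce. The sketch below scores the four responses by exact match, then tries the obvious patch (substring containment) and shows that the patch also accepts a wrong answer. The wrong answer is an invented example.

```python
reference = "Paris"
responses = [
    "Paris",
    "Paris, the capital of France",
    "It's Paris",
    "The capital is Paris (located in Europe)",
]

# Exact match: only the first response equals the reference string.
exact_match = [r == reference for r in responses]
accuracy = sum(exact_match) / len(responses)
print(exact_match, accuracy)

# Substring containment rescues all four correct answers...
contains = [reference.lower() in r.lower() for r in responses]
print(contains)

# ...but it also passes an answer that is plainly wrong.
wrong_answer = "Paris is not the capital of France"
wrong_passes = reference.lower() in wrong_answer.lower()
print(wrong_passes)
```

Neither string rule separates correct from incorrect here, which is why the check has to move from a formula to a rubric.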
F1 assumes binary or multi-class labels. A faithfulness rubric returns 0.73 on a paragraph that’s mostly grounded but contains one unsupported clause. There’s no class to compute precision and recall against until you binarize at a threshold, and the threshold is a separate decision that has to be justified.
AUC assumes a ranking score and a binary truth. A judge returns a 0-to-1 score for “is this answer helpful.” That score can be ranked and thresholded like any classifier output, and treated as a probability once it has been calibrated, but the “truth” you’re scoring against is itself a rubric label produced by a different judge or a human, not a measured fact. The metric carries; the truth gets fuzzier.
RMSE assumes a numeric target with a meaningful distance. A response can be wrong in fifteen different ways. There is no real-valued ground truth to subtract from. The formula doesn’t apply.
BLEU and ROUGE assume the right answer is the reference. Paraphrase tanks the score. A better-worded answer scores worse than a worse-worded one that happens to match the reference. The metric measures surface overlap, not correctness. Useful as a cheap CI floor where the output is supposed to stay close to a reference string, and obsolete for “is this answer good.”
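To make the paraphrase problem concrete, here is a simplified unigram-overlap F1. It is a stand-in for ROUGE-1, not a full BLEU or ROUGE implementation, and the three sentences are invented for illustration. The correct paraphrase scores zero while the factually wrong answer scores high.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over shared lowercase tokens: a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens the two texts share
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the refund window is 14 days from purchase"
correct_paraphrase = "you can return items within two weeks of buying them"
wrong_but_similar = "the refund window is 30 days from purchase"

paraphrase_score = unigram_f1(correct_paraphrase, reference)
wrong_score = unigram_f1(wrong_but_similar, reference)
print(f"correct paraphrase: {paraphrase_score:.3f}")
print(f"wrong but similar:  {wrong_score:.3f}")
```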
The closed-form metrics aren’t wrong. They’re under-specified for an unbounded output space. The replacement isn’t “no metric.” It’s three new primitives that each handle one slice of the open-form problem. The gentle introduction to LLM evaluation covers them at a slower pace.
The three replacement primitives
Open-form evaluation needs three primitives. Each answers a different question. The mistake is reaching for one when another would have been cheaper or sharper.
Deterministic checks. A function with no model in the loop. Parse the response into JSON, check against a schema. Run a regex for refusal phrasings. Match the tool call against an expected signature. Microsecond-fast, free, never drifts. The right tool for closed-form contracts (schema, format, refusal regex). The wrong tool for “is this helpful.” Use as the CI floor under the judge so the judge doesn’t run on cases a parser already failed.
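A sketch of what such a floor can look like using only the standard library. The required fields and the refusal phrasings are hypothetical; a real contract would come from your own response schema.

```python
import json
import re

# Hypothetical response contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}

# Hypothetical refusal phrasings; extend with what your model actually says.
REFUSAL_RE = re.compile(
    r"\b(i can't|i cannot|i'm unable to|i am unable to)\b", re.IGNORECASE
)


def check_schema(raw):
    """True if raw parses as a JSON object with the required typed fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(k), t) for k, t in REQUIRED_FIELDS.items())


def is_refusal(text):
    """True if the text contains one of the known refusal phrasings."""
    return REFUSAL_RE.search(text) is not None


def passes_floor(raw):
    """Only responses that pass the floor are worth sending to a judge."""
    if not check_schema(raw):
        return False
    return not is_refusal(json.loads(raw)["answer"])


good = '{"answer": "The refund window is 14 days.", "sources": ["policy.md"]}'
missing_field = '{"answer": "The refund window is 14 days."}'
refused = '{"answer": "I cannot help with that.", "sources": []}'
print(passes_floor(good), passes_floor(missing_field), passes_floor(refused))
```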
Classifier-backed evals. A pre-trained classifier returns a label and a confidence. Toxicity, jailbreak, PII, language, bias. The output is multi-class so the math from classical ML applies unchanged: precision, recall, F1, confusion matrix, calibration. The ai-evaluation SDK ships 13 backends (nine open-weight, four API) behind one Guardrails class with RailType.INPUT/OUTPUT/RETRIEVAL and aggregation strategies (ANY, ALL, MAJORITY, WEIGHTED). Sub-100ms latency, $0.001 to $0.01 per call, sharp targets only.
LLM-as-a-judge. A capable model reads the rubric, reads the candidate, returns a score against a prose definition. The only general-purpose tool for semantic rubrics (faithfulness, refusal calibration, answer completeness, tool-use appropriateness). It is also the most expensive primitive and the one most prone to bias. The why LLM-as-a-judge post is the long-form on when and how to use it.
Three primitives, three jobs. Reach for the cheapest tool that answers the question honestly. The pattern shows up in every eval audit: a $0.04-per-call frontier judge running on a binary toxicity decision a 4B Gemma adapter answers in 65 milliseconds. Wrong tool, right answer, wasted money.
Mapping classical metrics to LLM-eval equivalents
The mapping is sharper than ML teams expect. Same shape, different inputs.
Precision and recall map to groundedness and refusal calibration. For groundedness, score each claim in the response against the retrieved context: a supported claim is a true positive on the “claims-supported-by-context” axis, and an unsupported one is a false positive. For refusals, take “refuse” as the positive class: an over-refusal (the model refuses a benign request) is a false positive, and an under-refusal (the model answers a request it should have declined) is a false negative. The 2x2 matrix is identical to the one a fraud-detection team already knows. The labels come from a rubric instead of a database, and the rest is the same.
AUC maps to judge discrimination and threshold selection. Sweep the threshold on a judge’s 0-to-1 score, plot precision against recall, and take the area; that is the area under the precision-recall curve (plot true-positive rate against false-positive rate instead for ROC AUC). The ThresholdCalibrator in ai-evaluation runs this sweep for any rubric: it sweeps across operating points and returns the threshold that maximizes the F-beta you actually care about. The classical PR-curve sweep transfers unchanged.
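The sweep itself is a few lines of plain Python. The sketch below is a generic threshold sweep, not a description of how ThresholdCalibrator is implemented, and the judge scores and human labels are invented. Changing beta moves the chosen operating point, which is the decision the text says has to be justified.

```python
def fbeta(precision, recall, beta):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores, labels, beta=1.0):
    """Try every observed score as a threshold; return (threshold, f_beta)."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(pred, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(pred, labels) if not p and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = fbeta(precision, recall, beta)
        if score > best[1]:
            best = (t, score)
    return best


# Hypothetical judge scores and binarized human labels (1 = good).
judge_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
human_labels = [1, 1, 1, 0, 1, 0, 0, 0]

balanced = best_threshold(judge_scores, human_labels, beta=1.0)
precision_heavy = best_threshold(judge_scores, human_labels, beta=0.5)
print("F1 optimum:  ", balanced)
print("F0.5 optimum:", precision_heavy)
```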
Confusion matrices apply unchanged. Binarize the rubric score at the gating threshold (faithfulness above 0.75 is “good,” below is “bad”) and you have the predictions for a literal confusion matrix. Slice it per-route, per-risk-tier, per-prompt-version. Feed the binarized predictions and the reference labels straight into sklearn’s classification_report.
```python
from fi.evals import Guardrails
from fi.evals.types import RailType, AggregationStrategy
from sklearn.metrics import classification_report

# Ensemble of two open-weight guard models, combined by majority vote.
guardrails = Guardrails(
    backends=["QWEN3GUARD_4B", "LLAMAGUARD_3_8B"],
    rail_type=RailType.OUTPUT,
    aggregation=AggregationStrategy.MAJORITY,
)

# eval_set: your labeled cases, each with .output and .label_is_unsafe.
predictions = [guardrails.check(text=case.output) for case in eval_set]
y_pred = [int(p.triggered) for p in predictions]
y_true = [int(case.label_is_unsafe) for case in eval_set]

print(classification_report(y_true, y_pred, target_names=["safe", "unsafe"]))
```
Cohen’s kappa maps to judge-versus-human agreement. Two annotators on the same fifty examples? Compute kappa. Judge versus human majority? Compute kappa. The gate to remember: a rubric with judge-human kappa below 0.7 is too subjective for CI. Tighten the rubric prompt, add anchor examples, or move to a classifier-backed eval where the labels are objective. This is the inter-rater-reliability discipline classical ML teams already know.
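For reference, kappa is observed agreement corrected for the agreement two raters would reach by chance. The sketch below computes it from scratch on invented labels (sklearn's `cohen_kappa_score` computes the same statistic). Raw agreement here is 80 percent, yet kappa lands below the 0.7 gate.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = response passes the rubric, 0 = it fails.
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(human, judge)
raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
print("rubric passes the 0.7 gate:", kappa >= 0.7)
```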
Brier score and ECE apply unchanged. Treat the judge’s 0-to-1 score as a probability and the binarized human label as truth, then compute Brier and Expected Calibration Error the way you would for a sklearn binary classifier. Reliability diagrams are built the same way. A calibrated judge score can be thresholded and compared across runs with confidence. An uncalibrated one can’t.
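Both quantities are short enough to write out. The sketch uses equal-width bins for ECE, which is one common convention; the judge scores and human labels are invented and describe a judge that is overconfident at the high end.

```python
def brier_score(scores, labels):
    """Mean squared difference between predicted score and 0/1 label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def expected_calibration_error(scores, labels, n_bins=5):
    """Weighted average gap between mean score and hit rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        index = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 joins the top bin
        bins[index].append((s, y))
    total = len(scores)
    error = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(s for s, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        error += len(members) / total * abs(confidence - accuracy)
    return error


# Hypothetical judge scores and binarized human labels.
judge_scores = [0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
human_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

brier = brier_score(judge_scores, human_labels)
ece = expected_calibration_error(judge_scores, human_labels)
print(f"Brier: {brier:.3f}, ECE: {ece:.3f}")
```

In this toy data the judge says 0.9 on cases that are good only 60 percent of the time, and that gap is what ECE surfaces.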
Stratified train/test split becomes a stratified eval-set construction.
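```python
from sklearn.model_selection import train_test_split

# production_traces: a pandas DataFrame with "route" and "risk_tier" columns.
# Keep a 5% sample whose route/risk-tier mix matches production traffic.
eval_set, _ = train_test_split(
    production_traces,
    train_size=0.05,
    stratify=production_traces[["route", "risk_tier"]],
    random_state=42,
)
```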
Same primitive, applied to LLM eval-set construction. The annotation effort downstream is what gets bigger.
Drift monitoring becomes three-axis drift monitoring. Classical ML drift means data drift over time. LLM eval drift has three sources at once: data drift (user inputs shift), model drift (the provider quietly upgraded the model behind the API), and judge drift (your judge model upgraded too). Pin the judge model version (gpt-4o-2024-08-06, not gpt-4o), hold a stable golden-set baseline the judge re-scores weekly, and decompose observed drift into “system regressed” versus “judge changed.” The best AI drift detection tools survey covers the operational pattern.
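One way to do that decomposition, sketched with invented scores: re-score the frozen golden outputs with the new judge to isolate the judge's contribution, then compare today's outputs against the golden outputs under the same judge to isolate the system's. The variable names are illustrative, and the approach assumes the golden outputs were stored, not regenerated.

```python
from statistics import mean

# Hypothetical rubric scores on the same five golden prompts.
golden_old_judge = [0.9, 0.8, 0.8, 0.9, 0.8]   # frozen outputs, old judge
golden_new_judge = [0.8, 0.8, 0.8, 0.8, 0.8]   # same frozen outputs, new judge
current_new_judge = [0.8, 0.7, 0.8, 0.8, 0.8]  # today's outputs, new judge

# Same outputs, different judge: any change is the judge's doing.
judge_shift = mean(golden_new_judge) - mean(golden_old_judge)

# Same judge, different outputs: any change is the system's doing.
system_shift = mean(current_new_judge) - mean(golden_new_judge)

# What a dashboard would show without the decomposition.
observed_shift = mean(current_new_judge) - mean(golden_old_judge)

print(f"observed: {observed_shift:+.2f}")
print(f"  judge changed:    {judge_shift:+.2f}")
print(f"  system regressed: {system_shift:+.2f}")
```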
The five transition mistakes we see most
These are the ones we see in eval reviews when ML teams onboard their first LLM eval workstream.
Defaulting to LLM-judge for everything. The most expensive mistake. If your eval bill is more than a few hundred dollars a week, run the cost decomposition: how many judge calls are on cases a classifier would have answered for one-hundredth the cost? Usually 70 to 90 percent. The cascade pattern (augment=True in ai-evaluation) runs the classifier first, routes only the ambiguous remainder to the judge, and cuts eval cost 60 to 80 percent without losing detection rate.
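The routing logic behind a cascade is simple. The sketch below is a generic illustration, not the SDK's `augment=True` implementation: the classifier scores, the stub judge, and the 0.2/0.8 confidence band are all hypothetical, and the per-call prices reuse the figures quoted earlier in the post.

```python
CLASSIFIER_COST = 0.001  # dollars per call, low end of the range in the text
JUDGE_COST = 0.04        # dollars per call, the frontier-judge figure above

# Hypothetical classifier confidences that each case is unsafe.
classifier_scores = {
    "case-0": 0.01, "case-1": 0.03, "case-2": 0.97, "case-3": 0.02,
    "case-4": 0.55, "case-5": 0.99, "case-6": 0.04, "case-7": 0.35,
    "case-8": 0.05, "case-9": 0.98,
}


def toy_classifier(case_id):
    return classifier_scores[case_id]


def toy_judge(case_id):
    # Placeholder for an LLM-judge call; here it only flags case-4.
    return case_id == "case-4"


def cascade(case_ids, classifier, judge, low=0.2, high=0.8):
    """Trust confident classifier calls; send the ambiguous band to the judge."""
    labels, judge_calls = {}, 0
    for case_id in case_ids:
        score = classifier(case_id)
        if score >= high:
            labels[case_id] = True
        elif score <= low:
            labels[case_id] = False
        else:
            labels[case_id] = judge(case_id)
            judge_calls += 1
    return labels, judge_calls


labels, judge_calls = cascade(list(classifier_scores), toy_classifier, toy_judge)
cascade_cost = len(labels) * CLASSIFIER_COST + judge_calls * JUDGE_COST
judge_only_cost = len(labels) * JUDGE_COST
print(f"judge calls: {judge_calls} of {len(labels)}")
print(f"cost: ${cascade_cost:.3f} vs ${judge_only_cost:.3f} judge-only")
```

The width of the ambiguous band is the knob: widen it and more cases reach the judge, narrow it and the classifier's errors near the boundary go unchecked.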
Trusting the holdout you curated. Classical ML teams curate a test set and trust it forever. Two leaks specific to LLM eval break this: pre-training contamination (the test cases you wrote are in the model’s training data, especially for benchmark-shaped prompts) and rubric leakage (you wrote the rubric, looked at scores, rewrote the rubric — the rubric just trained on the test set in a subtle way). The 2026 practice is twofold: a small frozen holdout (50 to 200 examples) for hard regression-gating, plus a rolling production-sampled set that refreshes weekly. The LLM benchmarks vs production evals treatment goes deeper.
Treating the rubric as ground truth. A rubric is opinionated prose. Two engineers writing the same rubric produce different scores on edge cases. The mitigation is the kappa gate: get two annotators, label fifty examples, compute Cohen’s kappa, only trust the rubric for CI if kappa exceeds 0.7. Skipping this step means your eval scores measure rubric-writer preference as much as model behavior. The evaluating LLM judge bias mitigation post is the deeper read.
Porting power calculations without rebuilding for cost. A classical statistical-power calculation says “you need 10,000 samples to detect a 2-point AUC drop with 80% power.” That calculation is free to honor when eval cost is microseconds. It’s $2,000 per run when the judge costs 20 cents per call. The right calculation balances detection power against eval budget, and usually lands on smaller, better-stratified eval sets refreshed weekly rather than huge static sets run once a quarter.
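A rough way to put numbers on that trade-off. The sketch uses the standard normal-approximation sample size for comparing two pass rates, which is a simplification: the example in the text is about AUC, and a rubric pass rate is substituted here because its formula is closed-form. The 0.90 baseline pass rate is an assumption.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(p_base, p_new, alpha=0.05, power=0.80):
    """Normal-approximation sample size for comparing two pass rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_new) ** 2)


JUDGE_COST = 0.20  # dollars per call, the figure used in the text
BASE_PASS_RATE = 0.90  # assumed baseline rubric pass rate

for drop in (0.02, 0.05, 0.10):
    n = samples_per_arm(BASE_PASS_RATE, BASE_PASS_RATE - drop)
    cost = n * JUDGE_COST
    print(f"detect a {drop:.0%} drop: {n} cases per arm, ${cost:,.0f} per arm")
```

Required sample size falls roughly with the square of the effect size, so deciding which regressions are actually worth catching is the cheapest budget lever available.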
Assuming retrain is the only lever. Classical ML has one fix: retrain on better data. LLM systems have at least six: rewrite the rubric, tune the classifier threshold, adjust the judge prompt, rewrite the system prompt, improve retrieval, fine-tune the model. Each has a different cost and cycle time. The ML team’s instinct (“retrain to fix”) is usually the last lever to reach for. The agent passes evals fails production post is the diagnostic flow for which lever to pull when.
The Future AGI surface for ML teams
The vocabulary is intentionally familiar. The ai-evaluation SDK is a sklearn-shape API.
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, Toxicity, Completeness
from fi.evals.types import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

results = evaluator.evaluate(
    eval_templates=[Groundedness(), Toxicity(), Completeness()],
    inputs=[
        TestCase(
            input="What's the refund window?",
            output="The refund window is 14 days from purchase.",
            context=["Refunds are accepted within 14 days of purchase."],
            expected_output="14 days",
        ),
    ],
)
```
That shape is deliberate. Evaluator(...).evaluate(eval_templates=[...], inputs=[TestCase(...)]) is the sklearn-shape API: instantiate, pass metrics and inputs, get scores back. 60+ EvalTemplate classes are the pre-built metric library, the equivalent of sklearn.metrics. Groundedness, ContextAdherence, FactualAccuracy, AnswerRefusal, TaskCompletion, EvaluateFunctionCalling, and the rest are pre-built rubrics that behave consistently across teams. Custom rubrics drop in via CustomLLMJudge with a Jinja2 prompt template and grading_criteria, the equivalent of writing a custom sklearn scorer.
For the classifier-first lever: the 13 guardrail backends sit behind the Guardrails class — nine open-weight (Llama Guard 3 8B/1B, Qwen3-Guard 8B/4B/0.6B, Granite Guardian 8B/5B, WildGuard 7B, ShieldGemma 2B) for self-hosted air-gapped use, four API (OpenAI Moderation, Azure Content Safety, Turing Flash, Turing Safety) for hosted convenience, all with the rail-stage and aggregation knobs a classifier ensemble would expect.
For calibration: ThresholdCalibrator sweeps operating points the way a precision-recall curve does. For the cost cascade: augment=True runs the classifier first and reserves the judge for the ambiguous remainder. For scaling: four distributed runners (Celery, Ray, Temporal, Kubernetes) — the Ray runner is the natural choice for ML teams who already use Ray for distributed training.
For closing the loop: the Future AGI Platform ships self-improving evaluators (the LLM-eval analog of automated hyperparameter tuning on rubric configs and classifier thresholds) and an Error Feed that clusters production failures via HDBSCAN soft-clustering over ClickHouse-stored embeddings, then a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span tools) writes the immediate_fix. Honest framing: Linear is the only Error Feed integration in production today; Slack, GitHub, Jira, and PagerDuty are roadmap. The end-state is the same as the team’s existing MLOps loop: production signal in, eval gate, retuning cycle, gated re-deploy.
The bigger point
LLM evaluation is not a different discipline. It’s the same one applied to an open-form output space. The math primitives still apply: stratified sampling, confusion matrices, kappa, Brier and ECE, CI gating thresholds. What changes is the function on the left-hand side and the cost shape of running it.
The closed-form metrics break. Accuracy assumes a single right answer. F1 needs binary labels. AUC needs a ranking score and a measured truth. RMSE needs a numeric target. None of those assumptions hold for an unbounded text output. The replacement is three primitives (deterministic, classifier, judge), each tuned for one slice of the open-form problem.
The discipline that classical ML teams already do well — define the metric, measure against a clean set, gate on regression, refresh the dataset — is the discipline that produces reliable LLM systems too. Bring all of it. Throw away the formulas that assume bounded outputs and objective labels. The intuitions are right; they just need a new function to run against.
The teams that ship reliable LLM systems in 2026 are the ones that treat LLM eval as an extension of their existing ML eval practice, not as a new department. The ai-evaluation SDK is designed for that handoff: a sklearn-shape API, a pre-built metric library, a classifier-first cost lever, distributed runners that scale the way ML teams already scale training, and a Platform layer that closes the loop with self-improving evaluators.
Ready to run your first LLM eval on your own workload? Install ai-evaluation, drop a Groundedness rubric against your last fifty production traces this afternoon, and wire the same rubric as an EvalTag on live spans via traceAI tomorrow. The same rubric in both places is what turns an LLM eval from a notebook experiment into a feedback loop that holds for two years.