What Is Reference-Based Evaluation?

An LLM evaluation method that scores generated output against a trusted reference answer, label, document, or structured field.

Reference-based evaluation is an LLM-evaluation method that scores a model or agent output against a trusted reference answer, label, document, or structured field. It belongs in eval pipelines where teams know the expected result, such as classification, extraction, RAG answers, and tool outputs. In production traces, it turns a model response into a pass/fail or graded score tied to gold data. FutureAGI anchors this workflow with GroundTruthMatch on the eval:GroundTruthMatch surface and companion metrics for paraphrase and factual agreement.

Why It Matters in Production LLM and Agent Systems

Reference-based evals fail loudly, but only when a reference is present. If you skip them on tasks with expected answers, regressions move downstream as plausible but wrong automation. A claims agent assigns billing instead of claim_status; an extraction model drops a required policy_id; a RAG assistant answers with the right tone but contradicts the approved answer. These are not style problems. They change routes, database writes, escalation decisions, and audit records.

The pain splits across teams. Developers lose a deterministic regression gate and start inspecting samples by hand. SREs see downstream error spikes with no model-level reason attached. Compliance teams cannot show that approved labels, refusal text, or regulated fields still match the reviewed dataset. Product teams see low trust, correction clicks, and support tickets after a prompt or model upgrade.

Agentic systems make the mistake compound. In 2026 multi-step pipelines, one wrong reference-backed decision can choose the wrong tool, feed stale context into a planner, or trigger a model fallback path that hides the original error. Symptoms include rising eval-fail-rate-by-cohort, disagreement between expected_response and llm.output, high thumbs-down rate on one intent, and regressions tied to a specific prompt version or retrieved document set.

How FutureAGI Handles Reference-Based Evaluation

FutureAGI’s approach is to keep the reference, output, trace context, and evaluator result in the same workflow. The specific FutureAGI surface for this entry is eval:GroundTruthMatch. Engineers use GroundTruthMatch when the row has a canonical answer or label, FactualConsistency when the reference can be paraphrased, and FuzzyMatch when minor wording differences should not fail the run. The row usually includes input, response, expected_response, dataset_version, prompt_version, and cohort tags.

Real example: a fintech support agent must classify messages into card_dispute, wire_status, kyc_update, or human_review. The team imports 8,000 reviewed conversations into a FutureAGI Dataset, attaches GroundTruthMatch for the class label, and runs the eval in CI for every prompt and model change. Production traces arrive through traceAI-langchain, so failed rows can link back to llm.output, tool observations, and the agent step that produced the label. If pass rate drops from 0.944 to 0.901 only on kyc_update, the engineer opens those failures, finds an instruction conflict added in the latest prompt, and blocks release until a regression eval passes.
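The release gate in this example can be sketched in plain Python. This is a minimal illustration, not FutureAGI's API: `pass_rate_by_cohort`, `regressed_cohorts`, and the `max_drop` threshold of 0.02 are hypothetical names and values, exact string equality stands in for GroundTruthMatch, and the card_dispute rates are made up for contrast.

```python
from collections import defaultdict


def pass_rate_by_cohort(rows):
    """Compute pass rate per cohort from rows with 'cohort', 'response', 'expected_response'.

    Exact equality is used here as a stand-in for a ground-truth match evaluator.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in rows:
        totals[row["cohort"]] += 1
        if row["response"] == row["expected_response"]:
            passes[row["cohort"]] += 1
    return {cohort: passes[cohort] / totals[cohort] for cohort in totals}


def regressed_cohorts(baseline, candidate, max_drop=0.02):
    """Return cohorts whose pass rate fell by more than max_drop (hypothetical threshold)."""
    return sorted(
        cohort
        for cohort, base_rate in baseline.items()
        if base_rate - candidate.get(cohort, 0.0) > max_drop
    )


# kyc_update rates come from the example above; card_dispute rates are illustrative.
baseline = {"card_dispute": 0.950, "kyc_update": 0.944}
candidate = {"card_dispute": 0.950, "kyc_update": 0.901}

failing = regressed_cohorts(baseline, candidate)
print(failing)  # ['kyc_update']
```

A CI job could exit non-zero whenever the returned list is non-empty, which blocks the release on a single-cohort regression even when the global average barely moves.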

Unlike Ragas answer correctness, which often asks an LLM judge to compare answer and reference, this workflow keeps canonical decisions auditable. The score points to the reference row, not only to a judge rationale.

How to Measure or Detect Reference-Based Evaluation

Measure it only when the task has a trusted reference. The useful signals are:

  • fi.evals.GroundTruthMatch - returns agreement between response and expected_response; aggregate into pass rate by cohort, prompt version, and dataset version.
  • fi.evals.FactualConsistency - checks whether generated claims agree with a reference when exact wording is not required.
  • fi.evals.FuzzyMatch - returns graded similarity for near matches where punctuation, case, or minor edits should not fail a row.
  • Dashboard signal - reference_eval_fail_rate split by model, route, task, and reviewer label; alert on cohort drops, not only global average.
  • User-feedback proxy - corrected-label rate, thumbs-down rate, and escalation rate on rows without gold coverage; these rows become candidates for the next golden dataset refresh.

Keep the reference field immutable for a given dataset version; otherwise a pass-rate change may reflect label churn instead of model behavior.
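One way to detect label churn is to fingerprint the reference field when a dataset version is published and recheck it before each eval run. The sketch below uses only the standard library; `reference_fingerprint` and the `row_id` field are hypothetical names, not part of the FutureAGI SDK.

```python
import hashlib
import json


def reference_fingerprint(rows):
    """Return a SHA-256 hex digest over (row_id, expected_response) pairs.

    Rows are sorted first, so the digest does not depend on row order.
    'row_id' is an assumed stable identifier for each golden row.
    """
    canonical = sorted((str(row["row_id"]), row["expected_response"]) for row in rows)
    payload = json.dumps(canonical, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


published = [
    {"row_id": 1, "expected_response": "kyc_update"},
    {"row_id": 2, "expected_response": "wire_status"},
]
edited = [
    {"row_id": 1, "expected_response": "human_review"},  # label changed after publish
    {"row_id": 2, "expected_response": "wire_status"},
]

# Differing digests mean the references moved under the same dataset version.
print(reference_fingerprint(published) == reference_fingerprint(edited))  # False
```

If the stored digest and the recomputed digest differ, the pass-rate comparison should be treated as invalid until the dataset version is bumped.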

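A minimal call on a single row:

```python
from fi.evals import GroundTruthMatch

# Compare one predicted label against its gold label.
metric = GroundTruthMatch()
result = metric.evaluate(
    response="kyc_update",             # label produced by the model
    expected_response="human_review",  # gold label; it differs, so a mismatch is expected
)
# Print the score and the evaluator's explanation for this row.
print(result.score, result.reason)
```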
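For the near-match case described in the FuzzyMatch bullet, a graded score can be approximated with the standard library. This is a rough stand-in built on `difflib.SequenceMatcher`, not the FuzzyMatch implementation; `normalize` and `fuzzy_score` are hypothetical helpers.

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_STRIP_PUNCT).split())


def fuzzy_score(response, expected):
    """Return a similarity ratio in [0, 1] between the normalized strings."""
    return SequenceMatcher(None, normalize(response), normalize(expected)).ratio()


print(fuzzy_score("Wire status.", "wire status"))   # 1.0: only case and punctuation differ
print(fuzzy_score("wire status", "human review"))   # low: different labels
```

A row would pass when the score clears a threshold chosen per task; extraction fields usually tolerate less slack than free-text answers.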

Reference-based vs reference-free evaluators in 2026

In our 2026 evals, the working rule is: use reference-based when the task has a canonical answer, use reference-free when the task is open-ended. The table below shows how we decide:

| Task | Eval mode | Evaluator |
| --- | --- | --- |
| Intent classification | Reference-based | GroundTruthMatch |
| Tool selection | Reference-based | ToolSelectionAccuracy |
| Entity extraction | Reference-based | FuzzyMatch, RegexMatch |
| Structured output | Reference-based | JSONValidation, schema diff |
| RAG answer (canonical) | Reference-based | FactualConsistency |
| RAG answer (open-ended) | Reference-free | Groundedness, Faithfulness |
| Tone, helpfulness | Reference-free | Tone, IsHelpful |
| Safety | Reference-free | Toxicity, PII |
| Summary quality | Mixed | ROUGE + CustomEvaluation judge |

Frontier 2026 models (Claude Opus 4.7, GPT-5.1, Gemini 3 Pro) paraphrase aggressively, which breaks naive exact match on RAG answers. On TruthfulQA’s 817 questions, frontier models score 60-80% on the “supported by reference” rubric but under 30% on naive string equality, a five-fold gap that argues for paraphrase-tolerant scoring on canonical-answer tasks. The 2026 pattern is reference-based on closed-form decisions (labels, tools, fields) and reference-free on open-ended answers, with a golden dataset row carrying both an expected_response and a rubric. Unlike Ragas answer correctness, which collapses both modes into one judge call, FutureAGI keeps the reference and the judge rubric next to the trace, so a release-gate failure points to a specific row and a specific reviewer label.
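The difference between string equality and paraphrase-tolerant scoring shows up on a single paraphrased answer. The sketch below compares exact match with token-level F1, a simple lexical overlap score used here only as an illustration; it is not how FactualConsistency works, and the example sentences are invented.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(response, reference):
    """True only when both texts produce the same token sequence."""
    return tokens(response) == tokens(reference)


def token_f1(response, reference):
    """Harmonic mean of token precision and recall against the reference."""
    resp, ref = tokens(response), tokens(reference)
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Refunds are issued within 5 business days"
response = "We issue refunds within 5 business days"

print(exact_match(response, reference))          # False
print(round(token_f1(response, reference), 3))   # 0.714
```

Token overlap still misses paraphrases that share few words with the reference, which is why the pattern above pairs the expected_response with a rubric for a semantic or judge-based check.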

Common Mistakes

The failure pattern is usually metric design, not model quality.

  • Using one reference for many valid answers. Open-ended generation needs semantic or judge scoring; one gold sentence makes correct paraphrases look wrong.
  • Updating references during the same release test. If prompt and gold data both move, the regression signal loses meaning.
  • Averaging across unrelated tasks. Classification, extraction, and RAG answer checks need separate thresholds because their error costs differ.
  • Ignoring reference provenance. A gold answer without reviewer, timestamp, dataset version, and source document is hard to defend in audits.
  • Treating reference-free evals as substitutes. LLM judges can grade style or helpfulness, but they cannot replace gold labels for canonical decisions.
  • Ignoring per-route reference drift. A reference set built against GPT-5.1 may not match Claude Opus 4.7 output for paraphrase-heavy tasks; use FuzzyMatch and FactualConsistency where exact match fails.
  • Skipping the trace join. A failing reference-based row is only useful when bound to the trace id, prompt version, and model route that produced it.
  • Treating one judge model as ground truth. A judge pinned to Claude Opus 4.7 evaluating GPT-5.1 outputs may not match a judge pinned to Gemini 3 Pro on the same paraphrase; rotate or ensemble judges for paraphrase-heavy cohorts.
  • Letting the reference set rot. Production traffic drifts; sample new failing traces into the golden dataset every release to keep the reference set alive.
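The provenance point above can be enforced with a small check when golden rows are imported. This is a sketch under assumptions: `GoldenRow`, `PROVENANCE_FIELDS`, and `missing_provenance` are hypothetical names, and the field set mirrors the reviewer, timestamp, dataset version, and source document listed in the bullet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldenRow:
    """One reference row plus the provenance needed to defend it in an audit."""
    input: str
    expected_response: str
    reviewer: Optional[str] = None
    reviewed_at: Optional[str] = None       # ISO 8601 timestamp as a string
    dataset_version: Optional[str] = None
    source_document: Optional[str] = None


PROVENANCE_FIELDS = ("reviewer", "reviewed_at", "dataset_version", "source_document")


def missing_provenance(row):
    """Return the names of provenance fields that are empty or unset."""
    return [name for name in PROVENANCE_FIELDS if not getattr(row, name)]


row = GoldenRow(
    input="I need to update my ID documents",
    expected_response="kyc_update",
    reviewer="analyst_1",
    dataset_version="v3",
)
print(missing_provenance(row))  # ['reviewed_at', 'source_document']
```

Rejecting rows with a non-empty result at import time keeps incomplete references out of the golden dataset before they reach a release gate.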

Frequently Asked Questions

What is reference-based evaluation?

Reference-based evaluation scores a model or agent output against a trusted answer, label, document, or structured field. It is the right eval pattern when the task has an expected result and teams need auditable pass rates.

How is reference-based evaluation different from reference-free evaluation?

Reference-based evaluation compares output to gold data. Reference-free evaluation judges quality without a gold answer, which helps for open-ended style, helpfulness, or safety checks but is weaker for canonical decisions.

How do you measure reference-based evaluation?

FutureAGI measures it with evaluators such as GroundTruthMatch for canonical answers and FactualConsistency for paraphrased claims. Track pass rate and eval-fail-rate-by-cohort across dataset, prompt, model, and trace slices.