Evaluation

What Is Reference-Based Evaluation?

Reference-based evaluation is an LLM-evaluation method that scores a model or agent output against a trusted reference answer, label, document, or structured field. It belongs in eval pipelines where teams know the expected result, such as classification, extraction, RAG answers, and tool outputs. In production traces, it turns a model response into a pass/fail or graded score tied to gold data. FutureAGI anchors this workflow with GroundTruthMatch on the eval:GroundTruthMatch surface and companion metrics for paraphrase and factual agreement.
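
At its core the check is a comparison against gold data. A minimal plain-Python sketch of the pass/fail form, with illustrative field names rather than any SDK call, looks like this:

# Illustrative only: bare-bones reference check, no SDK involved.
def reference_match(response: str, expected_response: str) -> float:
    """Return 1.0 when the output matches the gold label, else 0.0."""
    return 1.0 if response.strip().lower() == expected_response.strip().lower() else 0.0

print(reference_match("card_dispute", "card_dispute"))  # 1.0, pass
print(reference_match("billing", "claim_status"))        # 0.0, fail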

Why It Matters in Production LLM and Agent Systems

Reference-based evals fail loudly, but only when a reference is present. If you skip them on tasks with expected answers, regressions move downstream as plausible but wrong automation. A claims agent assigns billing instead of claim_status; an extraction model drops a required policy_id; a RAG assistant answers with the right tone but contradicts the approved answer. These are not style problems. They change routes, database writes, escalation decisions, and audit records.

The pain splits across teams. Developers lose a deterministic regression gate and start inspecting samples by hand. SREs see downstream error spikes with no model-level reason attached. Compliance teams cannot show that approved labels, refusal text, or regulated fields still match the reviewed dataset. Product teams see low trust, correction clicks, and support tickets after a prompt or model upgrade.

Agentic systems compound the mistake. In 2026, multi-step pipelines mean one wrong reference-backed decision can choose the wrong tool, feed stale context into a planner, or trigger a model fallback path that hides the original error. Symptoms include rising eval-fail-rate-by-cohort, disagreement between expected_response and llm.output, high thumbs-down rate on one intent, and regressions tied to a specific prompt version or retrieved document set.

How FutureAGI Handles Reference-Based Evaluation

FutureAGI’s approach is to keep the reference, output, trace context, and evaluator result in the same workflow. The specific FAGI surface for this entry is eval:GroundTruthMatch. Engineers use GroundTruthMatch when the row has a canonical answer or label, FactualConsistency when the reference can be paraphrased, and FuzzyMatch when minor wording differences should not fail the run. The row usually includes input, response, expected_response, dataset_version, prompt_version, and cohort tags.

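A hedged sketch of what one such row might look like; the field names mirror the list above, but the exact schema is whatever your dataset defines, and the values here are made up:

# Hypothetical eval row: field names follow the list above, not a fixed SDK schema.
eval_row = {
    "input": "Customer asks why a wire transfer has not arrived.",
    "response": "wire_status",            # what the model or agent produced
    "expected_response": "wire_status",   # reviewed gold label
    "dataset_version": "support-intents-v3",
    "prompt_version": "classifier-prompt-12",
    "cohort": {"intent": "wire_status", "channel": "chat"},
}
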
Real example: a fintech support agent must classify messages into card_dispute, wire_status, kyc_update, or human_review. The team imports 8,000 reviewed conversations into a FutureAGI Dataset, attaches GroundTruthMatch for the class label, and runs the eval in CI for every prompt and model change. Production traces arrive through traceAI-langchain, so failed rows can link back to llm.output, tool observations, and the agent step that produced the label. If pass rate drops from 0.944 to 0.901 only on kyc_update, the engineer opens those failures, finds an instruction conflict added in the latest prompt, and blocks release until a regression eval passes.

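That CI gate does not need anything exotic. A minimal sketch, assuming evaluator results have already been collected as plain dicts with a cohort tag and a pass flag, and using an illustrative 0.94 threshold rather than any FutureAGI default:

from collections import defaultdict

# Assumed shape: one dict per evaluated row, produced upstream by the eval run.
results = [
    {"cohort": "kyc_update", "passed": False},
    {"cohort": "kyc_update", "passed": True},
    {"cohort": "card_dispute", "passed": True},
]

THRESHOLD = 0.94  # illustrative release bar; set per task in practice

totals, passes = defaultdict(int), defaultdict(int)
for row in results:
    totals[row["cohort"]] += 1
    passes[row["cohort"]] += int(row["passed"])

failing = {
    cohort: round(passes[cohort] / totals[cohort], 3)
    for cohort in totals
    if passes[cohort] / totals[cohort] < THRESHOLD
}

if failing:
    raise SystemExit(f"Blocking release, cohorts below threshold: {failing}")
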
Unlike Ragas answer correctness, which often asks an LLM judge to compare answer and reference, this workflow keeps canonical decisions auditable. The score points to the reference row, not only to a judge rationale.

How to Measure or Detect Reference-Based Evaluation

Measure it only when the task has a trusted reference. The useful signals are:

  • fi.evals.GroundTruthMatch - returns agreement between response and expected_response; aggregate into pass rate by cohort, prompt version, and dataset version (see the aggregation sketch after the snippet below).
  • fi.evals.FactualConsistency - checks whether generated claims agree with a reference when exact wording is not required.
  • fi.evals.FuzzyMatch - returns graded similarity for near matches where punctuation, case, or minor edits should not fail a row.
  • Dashboard signal - reference_eval_fail_rate split by model, route, task, and reviewer label; alert on cohort drops, not only global average.
  • User-feedback proxy - corrected-label rate, thumbs-down rate, and escalation rate on rows without gold coverage; these rows become candidates for the next golden dataset refresh.

Keep the reference field immutable for a given dataset version; otherwise a pass-rate change may reflect label churn instead of model behavior.

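A single-row call with the GroundTruthMatch evaluator looks like the snippet below; the labels are deliberately mismatched, so the evaluator should flag the row and explain why.
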
from fi.evals import GroundTruthMatch

metric = GroundTruthMatch()
result = metric.evaluate(
    response="kyc_update",             # label the model produced
    expected_response="human_review",  # reviewed gold label; mismatched on purpose
)
print(result.score, result.reason)     # failing score plus the evaluator's reason
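
To turn row-level results into the pass-rate and fail-rate slices listed above, group by the tags stored on each row. A minimal sketch with pandas, assuming the rows have already been exported with model, route, task, and pass columns (the names are placeholders, not a fixed export format):

import pandas as pd

# Assumed columns; in practice they come from your eval run export or trace store.
df = pd.DataFrame([
    {"model": "model-a", "route": "support", "task": "classification", "passed": True},
    {"model": "model-a", "route": "support", "task": "classification", "passed": False},
    {"model": "model-a", "route": "claims",  "task": "extraction",     "passed": True},
])

# reference_eval_fail_rate per (model, route, task) slice, for dashboards and alerts
fail_rate = 1.0 - df.groupby(["model", "route", "task"])["passed"].mean()
print(fail_rate.sort_values(ascending=False))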

Common Mistakes

The failure pattern is usually a metric-design problem, not a model-quality problem.

  • Using one reference for many valid answers. Open-ended generation needs semantic or judge scoring; one gold sentence makes correct paraphrases look wrong.
  • Updating references during the same release test. If prompt and gold data both move, the regression signal loses meaning.
  • Averaging across unrelated tasks. Classification, extraction, and RAG answer checks need separate thresholds because their error costs differ.
  • Ignoring reference provenance. A gold answer without reviewer, timestamp, dataset version, and source document is hard to defend in audits.
  • Treating reference-free evals as substitutes. LLM judges can grade style or helpfulness, but they cannot replace gold labels for canonical decisions.

Frequently Asked Questions

What is reference-based evaluation?

Reference-based evaluation scores a model or agent output against a trusted answer, label, document, or structured field. It is the right eval pattern when the task has an expected result and teams need auditable pass rates.

How is reference-based evaluation different from reference-free evaluation?

Reference-based evaluation compares output to gold data. Reference-free evaluation judges quality without a gold answer, which helps for open-ended style, helpfulness, or safety checks but is weaker for canonical decisions.

How do you measure reference-based evaluation?

FutureAGI measures it with evaluators such as GroundTruthMatch for canonical answers and FactualConsistency for paraphrased claims. Track pass rate and eval-fail-rate-by-cohort across dataset, prompt, model, and trace slices.