What Is a Correctness Metric?
An eval metric that scores whether an AI output matches the expected answer, label, tool result, or ground-truth record for a task.
A correctness metric is an LLM-evaluation metric that measures whether an AI system’s output is right for the task, not merely fluent or relevant. It usually compares a response with an expected answer, label, tool result, or ground-truth record, then returns a score or pass/fail result. In production, correctness metrics appear in eval pipelines, release gates, and trace-level monitoring. FutureAGI uses fi.evals.Equals for deterministic equality and GroundTruthMatch for ground-truth-backed evaluation.
Why Correctness Metrics Matter in Production LLM and Agent Systems
Correctness is the boundary between a model sounding helpful and a system doing the job. If you ignore it, two failures show up: silent false positives and brittle release gates. A support agent may recommend an expired refund policy. A coding agent may choose the right tool but pass the wrong argument. A RAG answer may be grounded in context yet still answer the wrong user question because it selected the wrong row. Groundedness and relevance can both pass while correctness fails.
Developers feel this first because correctness failures look like flaky model behavior until the dataset is sliced. SREs see it as a steady rise in thumbs-down rate, escalation rate, and eval-fail-rate-by-cohort while latency and token cost stay flat. Product teams see users stop trusting the workflow because the answer is close enough to read as confident but wrong enough to create work. Compliance teams feel the audit gap when nobody can prove which answer was expected for a regulated decision.
In 2026-era agent pipelines, correctness is not only a final-answer metric. It applies to intermediate plans, retrieved record IDs, tool parameters, SQL outputs, classification labels, and final responses. One wrong intermediate value can be reused by three later steps, so the right metric must attach to the span or dataset row where the error enters.
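As an illustration of attaching the check at the step level, here is a minimal sketch that scores each stage of a hypothetical agent run with fi.evals.Equals. The steps list and its field names are invented stand-ins for whatever your trace spans actually carry; only the Equals surface comes from the SDK usage described in this article.

from fi.evals import Equals

equals = Equals()

# Hypothetical agent steps; in practice these come from your trace spans.
steps = [
    {"name": "tool_choice", "output": "lookup_invoice", "expected": "lookup_invoice"},
    {"name": "tool_args", "output": "inv_20481", "expected": "inv_20481"},
    {"name": "final_answer", "output": "refund approved", "expected": "refund approved"},
]

for step in steps:
    result = equals.evaluate(
        response=step["output"],
        expected_response=step["expected"],
    )
    # Record the score on the step itself so a wrong intermediate value
    # is caught where it enters, not three steps later.
    step["correct"] = result.score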
How FutureAGI Handles Correctness Metrics
FutureAGI’s approach is to split deterministic correctness from reference-backed semantic correctness. For closed-form tasks, fi.evals.Equals checks the response against expected_response and returns a binary result. That is the right surface for labels, IDs, enum values, single-token answers, and tool outputs where there is exactly one valid answer. For rows that carry a ground-truth reference, GroundTruthMatch is the FutureAGI evaluator class to use in an eval pipeline or dataset run.
A real workflow: a fintech team has a transaction-classification agent that returns one of 42 canonical categories. Each dataset row stores the merchant text, the expected category, and the model’s category. The engineer attaches Equals through the eval stack, watches correctness rate by merchant cohort, and sets a release threshold at 0.975. A model update that improves overall helpfulness but drops “medical billing” correctness from 0.99 to 0.91 gets blocked before deploy.
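A minimal sketch of that release gate, assuming a hypothetical rows list where each row carries a cohort, the expected category, and the model's category (the field names and sample data are invented for illustration):

from collections import defaultdict
from fi.evals import Equals

equals = Equals()
THRESHOLD = 0.975

# Hypothetical dataset rows; real rows come from the eval dataset run.
rows = [
    {"cohort": "medical_billing", "expected": "medical_billing", "predicted": "insurance"},
    {"cohort": "groceries", "expected": "groceries", "predicted": "groceries"},
]

by_cohort = defaultdict(list)
for row in rows:
    result = equals.evaluate(response=row["predicted"], expected_response=row["expected"])
    by_cohort[row["cohort"]].append(result.score)

# Gate on every cohort, not the global mean, so a regulated slice cannot hide.
failing = {c: sum(s) / len(s) for c, s in by_cohort.items() if sum(s) / len(s) < THRESHOLD}
if failing:
    raise SystemExit(f"Release blocked, cohorts below {THRESHOLD}: {failing}")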
For production traces, the same team instruments model calls with traceAI-openai, logs the input, output, and reference label when available, then writes the evaluator result beside the trace. Unlike Ragas answer_correctness, which is useful for judge-based QA scoring, Equals is deterministic and cheap enough to run on every eligible trace. When correctness drops, the engineer reviews failing examples, updates the golden dataset, or routes uncertain cases to human annotation before widening the rollout.
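The trace-side check can be sketched the same way. Everything below except Equals is hypothetical: the trace dict, the reference_label field, and record_eval are stand-ins for however your instrumentation (for example traceAI-openai) exposes and stores trace data.

from fi.evals import Equals

equals = Equals()

def record_eval(trace_id, metric, score):
    # Stand-in for writing the evaluator result beside the trace
    # in your observability backend.
    print(trace_id, metric, score)

def score_trace(trace):
    # Only traces that carry a reference label are eligible; the rest
    # are skipped rather than silently counted as correct.
    if trace.get("reference_label") is None:
        return None
    result = equals.evaluate(
        response=trace["output"],
        expected_response=trace["reference_label"],
    )
    record_eval(trace["trace_id"], metric="correctness", score=result.score)
    return result.score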
How to Measure or Detect Correctness
Measure correctness at the row, cohort, and release-gate levels:
- fi.evals.Equals — compares response with expected_response for deterministic tasks and returns a binary score.
- GroundTruthMatch — FutureAGI evaluator class for examples backed by a ground-truth answer or label.
- Correctness-rate-by-cohort — dashboard signal split by route, model version, customer segment, prompt version, and dataset slice.
- User-feedback proxy — thumbs-down rate, escalation rate, refund rate, or manual-review overturn rate for cases without gold labels.
Minimal Python:

from fi.evals import Equals

# Deterministic check: compares response with expected_response
# and returns a binary score.
metric = Equals()
result = metric.evaluate(
    response="billing_refund",
    expected_response="billing_refund",
)
print(result.score)
Use a hard threshold only when the answer space is canonical. For open-ended answers, pair correctness with FactualAccuracy, AnswerRelevancy, or a rubric-based judge so paraphrases do not fail just because they use different wording.
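One way to wire that pairing is sketched below. The judge_semantic function is a hypothetical stand-in for FactualAccuracy or a rubric-based judge, and the canonical answer set is invented; only the Equals call matches the SDK usage shown above.

from fi.evals import Equals

equals = Equals()

# Hypothetical closed answer space where exact equality is the right bar.
CANONICAL_ANSWERS = {"billing_refund", "billing_dispute", "shipping_delay"}

def judge_semantic(response, reference):
    # Stand-in for a judge-based evaluator such as FactualAccuracy;
    # replace with the real evaluator call in your eval stack.
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0

def correctness(response, reference):
    if reference in CANONICAL_ANSWERS:
        return equals.evaluate(response=response, expected_response=reference).score
    # Open-ended answers: let valid paraphrases pass via a semantic judge.
    return judge_semantic(response, reference)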
Common Mistakes
Engineers usually get correctness wrong by making the metric either too strict or too broad:
- Using exact equality for paraphrasable answers. Correct natural-language answers can differ from the reference and still satisfy the task.
- Treating groundedness as correctness. A response can be grounded in retrieved context and still answer the wrong user request.
- Averaging away cohort failures. A 98% global score can hide a 72% correctness rate for one regulated customer segment.
- Letting missing ground truth pass. Unknown expected values should be excluded or routed for annotation, not counted as correct (see the sketch after this list).
- Using one correctness rule for agents. Tool choice, tool arguments, intermediate state, and final answer often need separate metrics.
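The missing-ground-truth rule can be enforced mechanically. In this sketch the row fields and the annotation queue are invented stand-ins; the point is that an absent expected value never reaches the evaluator.

from fi.evals import Equals

equals = Equals()

def send_to_annotation(row):
    # Stand-in for enqueueing the row for human labeling.
    print("needs label:", row.get("id"))

def score_row(row):
    # Rows without an expected value are excluded and routed for
    # annotation instead of being counted as correct.
    if row.get("expected") is None:
        send_to_annotation(row)
        return None
    return equals.evaluate(response=row["output"], expected_response=row["expected"]).score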
Frequently Asked Questions
What is a correctness metric?
A correctness metric scores whether an LLM or agent output is right for the task, usually by comparing it with a reference answer, label, or ground-truth record. It is the eval signal teams use when fluency is not enough.
How is a correctness metric different from accuracy?
A correctness metric can score one response, one field, or one trace. Accuracy is usually the aggregate percentage of correct examples across a dataset or cohort.
How do you measure correctness?
FutureAGI uses fi.evals.Equals for deterministic equality and GroundTruthMatch for ground-truth-backed evaluation. Teams track correctness-rate-by-cohort and block releases when it drops below the release threshold.