What Is Completeness in LLM Evaluation?
An LLM-evaluation metric that checks whether an output includes all required facts, steps, fields, or constraints for the task.
Completeness is an LLM-evaluation metric that measures whether a model response covers every required part of the task, not just whether it is relevant or fluent. It shows up in eval pipelines, production traces, and structured-output tests when an answer omits required facts, fields, constraints, caveats, or steps. FutureAGI uses Completeness for free-form answer coverage and FieldCompleteness for schema-like outputs, so teams can catch partial responses before users treat them as finished.
Why Completeness Matters in Production LLM and Agent Systems
Completeness failures are quiet because the answer can still look useful. A support assistant lists refund eligibility but omits the claim deadline, so the user misses the window. A coding agent edits the handler but skips the migration. A RAG answer cites the right policy paragraph but leaves out the exception that changes the decision. The failure mode is silent omission: no contradiction, no syntax error, no toxic content, just missing material information.
The pain lands across the stack. Developers see high answer relevancy but low task success. SREs see clean request logs while escalation rate climbs. Compliance reviewers find audit records with mandatory disclosures missing. Product teams see users ask the same follow-up question because the first answer handled only half the job. In logs, the symptoms look like short final outputs after long contexts, missing JSON keys, empty required fields, tool results that never appear in the final answer, and thumbs-down clusters around multi-part prompts.
The usual dashboard trap is celebrating low error rate while completeness decays: the system did not crash, it merely stopped carrying all required facts forward. Agentic systems make completeness harder than single-turn chat. A modern pipeline may retrieve documents, call tools, hand off to a specialist agent, and synthesize a final answer. Each step can succeed locally while the final response drops one required item. Completeness is the metric that checks the whole answer against the whole obligation.
How FutureAGI Handles Completeness
FutureAGI’s approach is to separate semantic completeness from field completeness. Completeness is the evaluator class in fi.evals, documented as the cloud-template eval for whether a response completely answers the query. FieldCompleteness is the local metric that measures required field presence, optional field presence, and nested field coverage in structured output. That split matters: a prose answer and a JSON tool result fail in different ways.
Consider an insurance-claim assistant that must return eligibility, filing deadline, evidence needed, and escalation path. The team instruments its LangChain workflow with traceAI-langchain, captures final llm.output text plus each agent.trajectory.step, and stores a dataset row with the required items. They attach Completeness to the free-form answer and FieldCompleteness to the JSON envelope requiring decision, deadline, missing_documents, and next_action.
When a prompt change drops the completeness pass rate below 0.90 for claims involving missing receipts, the engineer inspects the failing traces, sees that the retrieval tool returned the deadline but the final answer omitted it, and fixes the synthesis prompt. If the JSON field score fails instead, the owner updates the schema instructions or blocks the release. Unlike BLEU or ROUGE, which reward word overlap with a reference, completeness asks whether the required obligations were actually covered.
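To make the field-level check concrete, here is an illustrative plain-Python version of the idea behind a field-completeness score. It is a sketch, not the SDK's FieldCompleteness implementation; the function name, schema, and envelope values are assumptions made for the example.

# Conceptual sketch of field-level completeness, not the FutureAGI SDK code.
def field_completeness(payload, required, optional=()):
    """Return a 0.0-1.0 score plus the missing required/optional fields."""
    def present(obj, path):
        # Walk dotted paths like "claim.deadline" to cover nested fields.
        for key in path.split("."):
            if not isinstance(obj, dict) or key not in obj:
                return False
            obj = obj[key]
        return obj not in (None, "", [])  # empty required values count as missing

    missing_required = [f for f in required if not present(payload, f)]
    missing_optional = [f for f in optional if not present(payload, f)]
    score = 1.0 - len(missing_required) / len(required) if required else 1.0
    return {
        "output": round(score, 2),
        "reason": f"missing required: {missing_required or 'none'}",
        "missing_required": missing_required,
        "missing_optional": missing_optional,
    }

# A claim envelope like the scenario above: valid JSON, but next_action is absent.
envelope = {"decision": "eligible", "deadline": "2026-07-01", "missing_documents": ["receipt"]}
print(field_completeness(envelope, ["decision", "deadline", "missing_documents", "next_action"]))

Here the envelope parses cleanly yet scores 0.75, because one of four required fields is missing; that is exactly the "valid JSON, incomplete answer" gap a syntax check cannot see.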
How to Measure or Detect Completeness
Wire completeness as a coverage gate, not a style score:
- fi.evals.Completeness - cloud-template evaluator with eval_name="completeness" and eval_id="10"; use it for whether the response completely answers the query.
- FieldCompleteness - structured-output metric that returns a 0.0-1.0 output, a reason, and missing required or optional fields.
- agent.trajectory.step spans - detect cases where a tool found required information but the final answer dropped it.
- Required-item recall - count required facts present divided by required facts expected; use it when the obligation list is explicit and stable (see the sketch after this list).
- Dashboard signals - track eval-fail-rate-by-cohort, missing-required-field rate, and completeness regressions by prompt version.
- User-feedback proxy - repeated follow-up questions and escalation rate often rise after completeness falls.
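A minimal sketch of the required-item recall and trajectory-drop checks from the list above, assuming required facts are stored as plain strings per dataset row. Substring matching is a deliberate simplification; production checks usually need normalization or an LLM judge.

# Required-item recall: required facts present divided by required facts expected.
def required_item_recall(answer, required_items):
    answer_lower = answer.lower()
    found = [item for item in required_items if item.lower() in answer_lower]
    return len(found) / len(required_items) if required_items else 1.0

# Trajectory-drop check: items a tool surfaced that the final answer failed to carry forward.
def dropped_by_synthesis(tool_outputs, answer, required_items):
    in_tools = {i for i in required_items if any(i.lower() in t.lower() for t in tool_outputs)}
    in_answer = {i for i in required_items if i.lower() in answer.lower()}
    return sorted(in_tools - in_answer)

required = ["receipt", "30-day deadline", "escalation path"]
tool_outputs = ["Policy: refunds need a receipt and fall under the 30-day deadline."]
answer = "Refunds are available with a receipt."
print(required_item_recall(answer, required))                # ~0.33
print(dropped_by_synthesis(tool_outputs, answer, required))  # ['30-day deadline']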
Minimal Python:
from fi.evals import Evaluator, Completeness
from fi.testcases import TestCase

# context lists the required items, giving the evaluator an explicit
# obligation list to check the output against.
case = TestCase(
    input="Explain refund eligibility.",
    output="Refunds are available with a receipt.",
    context="Must include deadline, receipt evidence, and escalation path.",
)

result = Evaluator().evaluate(
    eval_templates=[Completeness()],
    inputs=[case],
    model_name="turing_flash",
)
print(result)
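In this case the output covers the receipt but omits the deadline and escalation path named in context, so the eval should flag the answer as incomplete even though it is relevant and fluent.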
Common Mistakes
Most mistakes come from measuring polish or parseability and calling it coverage. The fix is to name the missing obligation before scoring the output.
- Treating relevance as coverage. A response can address the prompt and still miss required clauses, deadlines, citations, or tool results.
- Rewarding length instead of required-item coverage. A long answer can be incomplete if it expands the easy parts and skips the hard constraint.
- Checking JSON syntax but not field completeness. Valid JSON with missing required keys is a production failure, not a formatting success.
- Burying required items inside vague rubrics. Write explicit required facts or fields; “good answer” gives the evaluator nothing concrete to check.
- Using one global threshold. A support summary, tool call, and compliance disclosure need different completeness gates and failure handling.
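As an illustration of that last point, here is a minimal sketch of per-cohort gating; the output types, thresholds, and failure actions below are hypothetical choices for the example, not FutureAGI defaults.

# Hypothetical per-cohort completeness gates; tune thresholds to each obligation.
COMPLETENESS_GATES = {
    "support_summary":       {"threshold": 0.80, "on_fail": "flag_for_review"},
    "tool_call":             {"threshold": 0.95, "on_fail": "retry_with_schema_hint"},
    "compliance_disclosure": {"threshold": 1.00, "on_fail": "block_release"},
}

def gate(output_type, score):
    rule = COMPLETENESS_GATES[output_type]
    return "pass" if score >= rule["threshold"] else rule["on_fail"]

print(gate("compliance_disclosure", 0.90))  # block_release: disclosures allow no omissions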
Frequently Asked Questions
What is completeness in LLM evaluation?
Completeness is an eval metric that checks whether an answer contains every required fact, step, field, or constraint needed to satisfy the task. It catches plausible but partial answers before they ship.
How is completeness different from answer relevancy?
Answer relevancy asks whether the response addresses the prompt. Completeness asks whether the response covers all required parts; a relevant answer can still miss a deadline, caveat, tool result, or required field.
How do you measure completeness?
In FutureAGI, use Completeness for free-form answers and FieldCompleteness for structured outputs. Track the missing-required-item rate by dataset, prompt version, and trace cohort.