What is task completion in agent evaluation?

Task completion is a 0–1 metric that scores whether a multi-step agent actually finished the job — matching the expected final result, meeting listed success criteria, and reaching a final step rather than abandoning the trajectory.

How is task completion different from goal progress?

Task completion measures the end state — did the agent finish the task. Goal progress measures the journey — how close the agent got, even if it never finished. Use task completion as the pass/fail gate and goal progress for partial credit on hard tasks.

How do you measure task completion?

FutureAGI's fi.evals.TaskCompletion takes the agent trajectory, expected result, and a list of success criteria, then returns a score plus a reason. Pair it with TrajectoryScore for a weighted view across completion, efficiency, and tool selection.

What Is Task Completion? Agent Eval Definition (2026)

What Is Task Completion?

Task completion is an agent-evaluation metric that measures whether a multi-step agent actually finished the job it was given, not just whether its final message looked plausible. It compares the agent’s final result against an expected outcome, checks every listed success criterion, and verifies that the trajectory has an is_final step rather than running out the step budget. The metric returns a 0–1 score with a structured reason. In FutureAGI it is the TaskCompletion class in fi.evals, used as the headline pass/fail gate in agent regression suites.

Why It Matters in Production LLM and Agent Systems

A polite, fluent final message is not a finished task. Agents fail completion in three quiet ways: they stop early because they hit a tool error and apologise, they loop until the step budget runs out and emit a summary that never accomplished the goal, or they answer a different, easier question than the one asked. None of these get caught by answer-relevancy metrics — the response is relevant, just to the wrong task.

The pain lands on whoever owns the agent’s KPI. A support-automation team sees deflection rate creep down because the agent now closes tickets with “I’ve escalated this” instead of issuing the refund. A coding-agent team ships a release where the agent edits the wrong file and reports success. A travel-booking agent confirms the flight but never charges the card; the user sees a confirmation, the airline sees no booking, and the team learns about it from chargebacks.

In 2026-era agentic stacks — multi-step planners, MCP-connected tools, agent-handoff graphs — completion has to be evaluated at the trajectory level, not the message level. A single user request expands into ten or more spans across tools, retrievers, and sub-agents. Without a trajectory-aware completion score, regressions in any sub-agent surface as a slow erosion in user-reported “did this actually work” rates, with no signal in the logs.

How FutureAGI Handles Task Completion

FutureAGI’s approach is to reduce task completion to a deterministic check across three signals exposed by the agent trajectory. The fi.evals.TaskCompletion class consumes an AgentTrajectoryInput containing the trajectory steps, the final_result, the expected_result, and a task definition with success_criteria. Internally it weights three components: 20% for whether the trajectory marked a final step, 50% for outcome match against the expected result (using exact, substring, and keyword-overlap fallbacks), and 30% for the fraction of success criteria found in the result or trajectory observations. The output is a 0–1 score with a reason like “Agent reached final step. Result matches expected (90%). Criteria: 3/4 met. Unmet: refund_issued”.
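
The 20/50/30 weighting described above can be sketched in plain Python to make the arithmetic concrete. This is an illustrative approximation, not FutureAGI's implementation: the function names, the tokenisation rule, the order of the fallbacks, and the handling of an empty criteria list are all assumptions.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens used for the keyword-overlap fallback."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def outcome_match(final_result: str, expected: str) -> float:
    """Exact match, then substring, then keyword overlap (assumed fallback order)."""
    got, want = final_result.strip().lower(), expected.strip().lower()
    if not want:
        return 0.0
    if got == want or want in got:
        return 1.0
    want_tokens = _tokens(want)
    if not want_tokens:
        return 0.0
    return len(want_tokens & _tokens(got)) / len(want_tokens)


def completion_score(has_final_step, final_result, expected, criteria, observations):
    """Combine the three signals with the 20/50/30 weights described in the text."""
    # Criteria are searched for in the final result and the trajectory observations.
    haystack = " ".join([final_result, *observations]).lower()
    met = [c for c in criteria if c.lower() in haystack]
    # Assumption: an empty criteria list contributes nothing to the score.
    criteria_fraction = len(met) / len(criteria) if criteria else 0.0
    score = (0.2 * (1.0 if has_final_step else 0.0)
             + 0.5 * outcome_match(final_result, expected)
             + 0.3 * criteria_fraction)
    unmet = [c for c in criteria if c not in met]
    return round(score, 4), unmet
```

With a final step present, a full outcome match, and one of two criteria found, the sketch yields 0.2 + 0.5 + 0.15 = 0.85 and reports the missing criterion, which mirrors the shape of the reason string quoted above.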

Concretely: a team building a reimbursement agent on traceAI-langchain instruments their LangGraph workflow, captures each step as an agent.trajectory.step span, and attaches TaskCompletion to every offline regression run via Dataset.add_evaluation(). The success criteria list reads like an acceptance test — ["receipt_parsed", "policy_match_verified", "refund_request_filed"]. When a prompt change drops completion from 0.91 to 0.78 on the regression set, the eval-fail-rate-by-cohort dashboard surfaces it before deploy. The team pairs TaskCompletion with TrajectoryScore so they also see how it failed — was it a tool selection regression or an outcome mismatch — and routes the failure to the right owner.
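
A regression gate of the kind this workflow implies can be sketched as follows. The helper names, the row shape (a dict with an "unmet" list), and the 0.05 tolerance are hypothetical choices for illustration and do not come from the FutureAGI SDK.

```python
from collections import Counter
from statistics import mean


def regression_gate(baseline_scores, candidate_scores, max_drop=0.05):
    """Return (passed, drop) comparing mean completion across two runs."""
    drop = mean(baseline_scores) - mean(candidate_scores)
    return drop <= max_drop, round(drop, 4)


def unmet_criteria_counts(results):
    """Tally which success criteria were missed most often across rows."""
    counts = Counter()
    for row in results:
        counts.update(row["unmet"])
    return counts.most_common()
```

The gate blocks a deploy when mean completion falls by more than the tolerance, and the tally of unmet criteria gives a first hint about which step regressed before anyone opens a trace.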

How to Measure or Detect It

Measurement signals you can wire to a TaskCompletion eval:

fi.evals.TaskCompletion — returns a 0–1 score, a reason, and has_final_step / result_produced flags. Threshold at 0.7 for green, 0.5–0.7 for yellow, below 0.5 fails the regression.
agent.trajectory.step OTel attribute — the per-step span tag traceAI emits for agent frameworks (LangChain, LangGraph, OpenAI Agents SDK); a missing terminal is_final=true step is itself a completion-fail signal.
Eval-fail-rate-by-cohort dashboard — slice TaskCompletion by user segment, agent variant, and tool surface; sudden drops on one cohort point straight at the regression source.
User-feedback proxy — thumbs-down rate on agent runs trails completion failures by hours; alert on the eval signal first.
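
The signals above can be combined in a few lines of Python. This is a rough sketch under stated assumptions: steps are modelled as plain dicts with an is_final key, a score of exactly 0.7 counts as green, and the function names are hypothetical rather than part of traceAI or fi.evals.

```python
from collections import defaultdict


def band(score: float) -> str:
    """Map a 0-1 completion score to the green/yellow/fail bands listed above."""
    if score >= 0.7:
        return "green"
    if score >= 0.5:
        return "yellow"
    return "fail"


def has_terminal_step(steps) -> bool:
    """True when the last recorded step carries is_final=True."""
    return bool(steps) and bool(steps[-1].get("is_final", False))


def fail_rate_by_cohort(rows):
    """rows: iterable of (cohort, score); returns the share of failing rows per cohort."""
    totals, fails = defaultdict(int), defaultdict(int)
    for cohort, score in rows:
        totals[cohort] += 1
        fails[cohort] += band(score) == "fail"
    return {c: fails[c] / totals[c] for c in totals}
```

Slicing the fail band by cohort is what turns a single aggregate score into a pointer at the regression source.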

Minimal Python:

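from fi.evals import TaskCompletion

# `run` is assumed to be a previously captured agent run that exposes
# .trajectory (the recorded steps) and .final_result (the agent's last output).
metric = TaskCompletion()
result = metric.evaluate(
    trajectory=run.trajectory,
    final_result=run.final_result,
    expected_result="refund issued for $42.00",
    # Each criterion is checked against the result and trajectory observations.
    task={"success_criteria": ["refund_issued", "email_sent"]},
)
print(result.score, result.reason)  # 0-1 score plus the structured reason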

Common Mistakes

Grading on the final message instead of the trajectory. A “Done!” reply with no refund issued still passes a text-similarity check; only trajectory-level completion catches it.
Conflating task completion with goal progress. Completion is binary-ish (did it finish); progress is partial credit. Tracking only completion hides agents that consistently get 80% of the way and stall.
Empty success_criteria. Without criteria, the score collapses to “did the agent stop and produce a result”, which most agents trivially pass. Write the criteria list as an acceptance test.
Reusing the same judge model that drives the agent. When TaskCompletion’s LLM-judge mode is enabled, pin it to a different model family — self-evaluation inflates the score.
No regression eval on TaskCompletion. A green score in dev is a snapshot, not a guarantee; rerun it on every prompt and tool change.
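
To illustrate the second mistake, the sketch below reports a pass rate alongside a near-miss rate, so that agents which stall at around 80% of the task stay visible. The run dicts, the "progress" field, and both thresholds are hypothetical; they stand in for whatever completion and goal-progress scores your evals produce.

```python
def summarize(runs, pass_threshold=0.7, near_miss_progress=0.8):
    """runs: list of dicts with 'completion' and 'progress' scores in the 0-1 range."""
    if not runs:
        return {"pass_rate": 0.0, "near_miss_rate": 0.0}
    failed = [r for r in runs if r["completion"] < pass_threshold]
    # Near misses failed the completion gate but got most of the way there.
    near_misses = [r for r in failed if r["progress"] >= near_miss_progress]
    return {
        "pass_rate": (len(runs) - len(failed)) / len(runs),
        "near_miss_rate": len(near_misses) / len(runs),
    }
```

A high near-miss rate next to a low pass rate suggests the agent is stalling late in the trajectory rather than failing outright.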
