What Is Task Completion (Agent Eval)?
An agent-evaluation metric that scores whether a multi-step agent finished its assigned task by matching outcome and success criteria.
What Is Task Completion?
Task completion is an agent-evaluation metric that measures whether a multi-step agent actually finished the job it was given, not just whether its final message looked plausible. It compares the agent’s final result against an expected outcome, checks every listed success criterion, and verifies the trajectory has an is_final step rather than running out the step budget. The metric returns a 0–1 score with a structured reason. In FutureAGI it is the TaskCompletion class in fi.evals, used as the headline pass/fail gate in agent regression suites. By May 2026, with frontier agents (OpenAI Agents SDK on GPT-5.x, Claude Opus 4.7 sub-agents, Gemini 3.x with Vertex Agents, Llama 4-based open agents) routinely running 20-50 step trajectories across MCP-connected tool servers, completion has replaced answer-quality as the canonical headline metric.
Why task completion matters in production LLM and agent systems
A polite, fluent final message is not a finished task. Agents fail completion in three quiet ways: they stop early because they hit a tool error and apologise, they loop until the step budget runs out and emit a summary that never accomplished the goal, or they answer a different, easier question than the one asked. None of these is caught by AnswerRelevancy: the response is relevant, just to the wrong task.
The pain lands on whoever owns the agent’s KPI. A support-automation team sees deflection rate creep down because the agent now closes tickets with “I’ve escalated this” instead of issuing the refund. A coding-agent team ships a release where the agent edits the wrong file and reports success, exactly the SWE-Bench Verified-style failure mode that frontier labs spent 2025 hardening against. A travel-booking agent confirms the flight but never charges the card; the user sees a confirmation, the airline sees no booking, and the team learns about it from chargebacks. None of these are hallucinations in the classical sense: the agent is honest, it just stopped short.
In 2026-era agentic stacks (multi-step planners, MCP-connected tools, agent-handoff graphs, A2A sub-agent calls), completion has to be evaluated at the trajectory level, not the message level. A single user request expands into ten or more spans across tools, retrievers, and sub-agents. Without a trajectory-aware completion score, regressions in any sub-agent surface as a slow erosion in user-reported “did this actually work” rates, with no signal in the logs. We’ve found that the strongest leading indicator of a degraded agent release is not a drop in average TaskCompletion; it is the spread between the average and the worst decile, tracked per cohort, because a 5-point drop on the worst cohort often arrives weeks before the average moves.
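The average-versus-worst-decile spread can be computed from nothing more than per-run scores tagged with a cohort. The sketch below is illustrative only: the function names, the (cohort, score) input shape, and the choice to define the worst decile as the lowest 10% of runs (at least one) are assumptions, not part of any FutureAGI API.

```python
from collections import defaultdict
from statistics import mean


def worst_decile_gap(scores: list[float]) -> float:
    """Mean of all scores minus the mean of the lowest 10% (at least one score)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    k = max(1, len(ordered) // 10)  # size of the worst decile (assumed definition)
    return mean(ordered) - mean(ordered[:k])


def gaps_by_cohort(runs: list[tuple[str, float]]) -> dict[str, float]:
    """Group (cohort, score) pairs and compute the gap for each cohort."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for cohort, score in runs:
        grouped[cohort].append(score)
    return {cohort: worst_decile_gap(scores) for cohort, scores in grouped.items()}


# Hypothetical data: one cohort hides a total failure behind a 0.9 average.
runs = [("refunds", 1.0)] * 9 + [("refunds", 0.0)] + [("billing", 0.8)] * 10
gaps = gaps_by_cohort(runs)
```

A cohort whose gap widens between releases deserves a look even when its average is flat.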
Task completion vs adjacent agent metrics
Completion is one of several agent-eval metrics; the family only makes sense together:
| Metric | What it measures | When it fails | FutureAGI evaluator |
|---|---|---|---|
| TaskCompletion | Did the agent finish? Outcome match + criteria + final step | Early stop, loop until budget, wrong-task drift | TaskCompletion |
| Goal Progress | How far did it get? Partial credit on hard tasks | Stalls just before the finish on long trajectories | TaskCompletion with partial credit mode |
| Trajectory Score | Weighted blend of completion, efficiency, tool-selection | Catches “did it the wrong way but worked” | TrajectoryScore |
| Tool Selection Accuracy | Did the agent pick the right tools? | Hallucinated tool, wrong tool, missing required tool | ToolSelectionAccuracy |
| Step Efficiency | How few steps vs optimal? | Wandering, retry storms, redundant tool calls | Composite via TrajectoryScore |
| Function Call Accuracy | Did the call have correct name + args? | Schema violations, wrong types | FunctionCallAccuracy |
| Faithfulness | Are the agent’s claims grounded in tool results? | Confident hallucination on tool output | Faithfulness |
| Answer Relevancy | Does the final message address the user? | “Answered a different question” | AnswerRelevancy |
The pattern in 2026 production stacks: TaskCompletion is the gate; the others are the diagnosis. A failed TaskCompletion should be auto-routed to whichever component-level evaluator returned the lowest score, so the dashboard tells you where the agent broke without manual triage.
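The gate-then-diagnose routing can be sketched in a few lines. This is a hypothetical helper, not a FutureAGI function: it assumes the diagnostic scores arrive as a plain dict keyed by evaluator name, and it borrows 0.7 as the pass threshold.

```python
from typing import Optional


def route_failure(completion: float, diagnostics: dict[str, float],
                  threshold: float = 0.7) -> Optional[str]:
    """Return the name of the lowest-scoring diagnostic evaluator for a failed
    run, or None when the run passed or no diagnostics were recorded."""
    if completion >= threshold or not diagnostics:
        return None
    return min(diagnostics, key=diagnostics.get)


# Hypothetical scores for one failed run.
owner = route_failure(
    0.4,
    {"ToolSelectionAccuracy": 0.3, "FunctionCallAccuracy": 0.8, "Faithfulness": 0.9},
)
```

Here the failed run is routed to the tool-selection evaluator, since it returned the lowest diagnostic score.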
Agent benchmarks that score completion
The 2026 agent benchmarks all score some shape of completion:
- τ-bench (tau-bench): Sierra’s multi-turn customer-support benchmark; scoring requires the agent to satisfy a hidden state check after the dialog. Frontier scores cluster 55-70% in May 2026; the gap between models is real and meaningful.
- SWE-Bench Verified: 500 human-verified GitHub bug-fix tasks; pass criterion is hidden-test pass after the agent edits files. Frontier sits 70-78%; coding agents that score above 75% are the new bar.
- GAIA: multi-step assistant tasks with browsing, tool use, multimodal inputs; Level 3 defeats most frontier agents.
- OSWorld: real OS-level desktop tasks; frontier still under 40% in May 2026, which makes it the headroom benchmark.
- WebArena / VisualWebArena: agents driving real websites end-to-end.
- MLE-Bench: 75 Kaggle-style ML engineering tasks for ML-research agents.
If a vendor pitches an agent in 2026 without τ-bench, SWE-Bench Verified, or a domain analog, treat it the way you would a 2023 chatbot demo with no hallucination numbers. Internal TaskCompletion on a golden dataset is what decides; public agent benchmarks shortlist.
Three completion failure shapes you will see in production
The trajectory shape tells you the failure mode at a glance:
- Early-stop failures: agent runs 2-4 steps, hits a transient tool error, apologises, returns. is_final=true but no criteria met. Fix: better retry policy on tool spans, plus a pre-action check that the agent has actually attempted the required tool.
- Loop-exhaustion failures: agent runs to the step budget, then summarizes what it tried. is_final=false or step_budget_exceeded=true. Fix: loop detection, tighter step-efficiency scoring, and a planner prompt that biases toward decisive action.
- Wrong-task drift: agent finishes confidently but solved a related, easier task. High AnswerRelevancy, low criteria match, often paired with a sycophancy signal. Fix: stricter criteria, BiasDetection probe for easy-question substitution, and a CustomEvaluation judge that scores task-vs-result fit explicitly.
These three account for ~85% of the agent-completion incidents we have seen across customer regression suites in 2026.
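The three shapes can be told apart mechanically from a handful of per-run fields. The classifier below is a rough sketch: the RunSummary dataclass, its field names, and the cut-offs (4 steps for a "short" run, 0.8 for "high" relevancy) are illustrative assumptions, not FutureAGI types or documented thresholds.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical per-run summary; field names are assumptions."""
    is_final: bool
    step_budget_exceeded: bool
    step_count: int
    criteria_met: int
    criteria_total: int
    answer_relevancy: float  # 0-1 score from a relevancy evaluator


def failure_shape(run: RunSummary, short_run: int = 4, relevant: float = 0.8) -> str:
    """Map a run onto one of the three failure shapes, or a non-failure label."""
    if not run.is_final or run.step_budget_exceeded:
        return "loop_exhaustion"
    if run.criteria_total == 0:
        return "unclassified"  # nothing to judge the outcome against
    if run.criteria_met == run.criteria_total:
        return "completed"
    if run.criteria_met == 0 and run.step_count <= short_run:
        return "early_stop"
    if run.answer_relevancy >= relevant:
        return "wrong_task_drift"
    return "unclassified"


shape = failure_shape(RunSummary(True, False, 3, 0, 3, 0.4))
```

Tagging every failed run this way makes it possible to count failures per shape and send each to the fix that matches it.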
How FutureAGI handles task completion
FutureAGI’s approach is to reduce task completion to a deterministic check across three signals exposed by the agent trajectory, with an optional LLM-as-a-judge mode for ambiguous outcomes. The fi.evals.TaskCompletion class consumes an AgentTrajectoryInput containing the trajectory steps, the final_result, the expected_result, and a task definition with success_criteria. Internally it weights three components: 20% for whether the trajectory marked a final step, 50% for outcome match against the expected result (using exact, substring, and keyword-overlap fallbacks), and 30% for the fraction of success criteria found in the result or trajectory observations. The output is a 0–1 score with a reason like “Agent reached final step. Result matches expected (90%). Criteria: 3/4 met. Unmet: refund_issued.”
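To make the 20/50/30 weighting concrete, here is a minimal re-creation of the arithmetic. It is not the library's implementation: the helper names, the 0.9 credit for a substring match, the word-overlap formula, and the treatment of an empty criteria list are all assumptions chosen for illustration.

```python
def outcome_match(final_result: str, expected: str) -> float:
    """Exact match, then substring, then keyword-overlap fallback."""
    got, want = final_result.strip().lower(), expected.strip().lower()
    if got == want:
        return 1.0
    if want and want in got:
        return 0.9  # assumed credit for a substring match
    want_words = set(want.split())
    if not want_words:
        return 0.0
    return len(want_words & set(got.split())) / len(want_words)


def criteria_fraction(criteria: list[str], final_result: str,
                      observations: list[str]) -> float:
    """Fraction of criteria strings found in the result or the observations."""
    if not criteria:
        return 1.0  # assumed: with no criteria there is nothing to fail
    haystack = " ".join([final_result, *observations]).lower()
    return sum(1 for c in criteria if c.lower() in haystack) / len(criteria)


def completion_score(has_final_step: bool, final_result: str, expected: str,
                     criteria: list[str], observations: list[str]) -> float:
    """Weighted blend: 20% final step, 50% outcome match, 30% criteria."""
    return (0.2 * (1.0 if has_final_step else 0.0)
            + 0.5 * outcome_match(final_result, expected)
            + 0.3 * criteria_fraction(criteria, final_result, observations))


score = completion_score(
    True,
    "Refund request filed for $42",
    "refund request filed",
    ["receipt_parsed", "policy_match_verified", "refund_request_filed", "refund_issued"],
    ["receipt_parsed ok", "policy_match_verified ok", "refund_request_filed id=123"],
)
```

With a final step reached, a 0.9 outcome match, and 3 of 4 criteria met, the blend works out to 0.2 + 0.45 + 0.225 = 0.875.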
Concretely: a team building a reimbursement agent on traceAI-langchain instruments their LangGraph workflow, captures each step as an agent.trajectory.step span, and attaches TaskCompletion to every offline regression run via Dataset.add_evaluation(). The success criteria list reads like an acceptance test: ["receipt_parsed", "policy_match_verified", "refund_request_filed"]. When a prompt change drops completion from 0.91 to 0.78 on the regression set, the eval-fail-rate-by-cohort dashboard surfaces it before deploy. The team pairs TaskCompletion with TrajectoryScore so they also see how it failed (a tool-selection regression or an outcome mismatch) and routes the failure to the right owner. Versus Galileo’s “Agent Effectiveness” judge-LLM score (one number, no decomposition), the FutureAGI surface keeps the rule-based outcome check fast and deterministic, with LLM-judge as the fallback for fuzzy criteria, not the default.
Wiring TaskCompletion into release gates and runtime guardrails
TaskCompletion matters most at two places: pre-deploy (CI release gate) and runtime (Agent Command Center routing). At release time, the CI job runs the agent on a 200-1,000 row golden dataset, scores TaskCompletion, posts results back to the dataset, and either passes the build or blocks the deploy with a per-cohort diff link. Engineers see which task types failed, which step in the trajectory diverged, and which trace span shows the regression.
At runtime, the same evaluator runs against a 1-5% sample of production trajectories via traceAI. When a cohort’s TaskCompletion dips below threshold over a rolling window, the routing policy can shift that cohort to a fallback model, escalate to human review via AnnotationQueue, or trigger an alert. The same dataset, scored the same way, drives both gates, so there is no eval-vs-production drift.
The pattern at the dashboard layer: TaskCompletion is the headline KPI on the agent monitoring view, with TrajectoryScore, ToolSelectionAccuracy, Faithfulness, and AnswerRelevancy as the four sub-panels. When the headline dips, the operator’s first question is “which sub-metric moved with it”, and the trace drill-down answers it without another search.
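The rolling-window check behind such a routing policy might look like the sketch below. The class, its parameters, and the minimum-sample guard are hypothetical; in a real deployment the scores would come from sampled production trajectories, not be fed in by hand.

```python
from collections import deque


class RollingCompletionMonitor:
    """Tracks recent completion scores per cohort and flags threshold breaches."""

    def __init__(self, window: int = 50, threshold: float = 0.7,
                 min_samples: int = 10) -> None:
        self.window = window
        self.threshold = threshold
        self.min_samples = min_samples
        self._scores: dict[str, deque] = {}

    def record(self, cohort: str, score: float) -> None:
        # deque(maxlen=...) silently drops the oldest score once the window is full
        self._scores.setdefault(cohort, deque(maxlen=self.window)).append(score)

    def breached(self, cohort: str) -> bool:
        scores = self._scores.get(cohort, ())
        if len(scores) < self.min_samples:
            return False  # too little data to act on
        return sum(scores) / len(scores) < self.threshold


monitor = RollingCompletionMonitor(window=3, threshold=0.7, min_samples=3)
for s in (0.9, 0.9, 0.9):
    monitor.record("refunds", s)
healthy = monitor.breached("refunds")
for s in (0.2, 0.2):
    monitor.record("refunds", s)
degraded = monitor.breached("refunds")
```

A breach is the trigger point; whether it shifts the cohort to a fallback model, opens a human review, or only alerts is a policy decision left to the caller.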
How to measure task completion
Measurement signals you can wire to a TaskCompletion eval:
- fi.evals.TaskCompletion: returns a 0–1 score, a reason, and has_final_step / result_produced flags. Threshold at 0.7 for green, 0.5–0.7 for yellow, below 0.5 fails the regression.
- agent.trajectory.step OTel attribute: the per-step span tag traceAI emits for agent frameworks (LangChain, LangGraph, OpenAI Agents SDK, CrewAI, Pydantic-AI, MCP); a missing terminal is_final=true step is itself a completion-fail signal.
- Eval-fail-rate-by-cohort dashboard: slice TaskCompletion by user segment, agent variant, and tool surface; sudden drops on one cohort point straight at the regression source.
- Worst-decile vs average gap: average completion can hide a regression on one cohort; track the spread.
- Trajectory length distribution: completion drops with step-budget exhaustion show up as a right-shift in step count before the score moves; alert on the distribution, not just the mean.
- Token-cost-per-completion: a 0.05 completion lift bought at 2x token cost is a partial regression; track cost and completion together on the release-gate dashboard.
- User-feedback proxy: thumbs-down rate on agent runs trails completion failures by hours; alert on the eval signal first.
- Goal Progress: partial-credit metric that catches “the agent always gets 80% there but never finishes.”
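The green/yellow/fail banding from the first bullet is easy to encode once the boundaries are pinned down. In this sketch the function names are hypothetical, and treating exactly 0.7 as green and exactly 0.5 as yellow is an assumption, since the bands as stated leave the edges open.

```python
from collections import Counter


def band(score: float) -> str:
    """Map a 0-1 completion score onto green / yellow / fail."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.7:
        return "green"
    if score >= 0.5:
        return "yellow"
    return "fail"


def band_counts(scores: list[float]) -> Counter:
    """Count how many runs land in each band."""
    return Counter(band(s) for s in scores)


counts = band_counts([0.91, 0.78, 0.7, 0.62, 0.5, 0.31])
```

Reporting the band counts alongside the mean keeps a handful of hard failures from disappearing into an otherwise healthy average.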
Minimal Python:
```python
from fi.evals import TaskCompletion, TrajectoryScore

completion = TaskCompletion()
trajectory = TrajectoryScore()

# regression_runs is your own iterable of recorded agent runs
for run in regression_runs:
    # Outcome gate: final step, expected-result match, success criteria
    c = completion.evaluate(
        trajectory=run.trajectory,
        final_result=run.final_result,
        expected_result=run.expected_result,
        task={"success_criteria": run.criteria},
    )
    # Diagnostic view of how the agent got there
    t = trajectory.evaluate(trajectory=run.trajectory)
    run.attach_scores(completion=c, trajectory=t)
```
For online runtime scoring, wire TaskCompletion directly to a traceAI span so every production trajectory writes its score back to the trace, the same pattern frontier teams use to mirror τ-bench-style hidden-state scoring against live agent runs:
```python
from fi.evals import TaskCompletion, ToolSelectionAccuracy
from fi.traceAI import langchain_instrumentor

langchain_instrumentor().instrument()

@TaskCompletion.online(sample_rate=0.05, threshold=0.7)
@ToolSelectionAccuracy.online(sample_rate=0.05, threshold=0.85)
def agent_run(query: str, expected_criteria: list[str]):
    trajectory = my_agent.invoke(query)
    return {
        "trajectory": trajectory,
        "final_result": trajectory.final_message,
        "expected_result": expected_criteria,
        "task": {"success_criteria": expected_criteria},
        "span_attrs": {"agent.trajectory.step": trajectory.steps},
    }
```
Healthy completion eval: thresholded scores hold or improve across every production cohort, worst-decile stays bounded, step-count distribution does not drift right, and every regression has a per-cohort link to the trace that explains it. As a sanity anchor, frontier τ-bench scores (55-70%) and SWE-Bench Verified (70-78%) are the public reference points your internal TaskCompletion should pace against; if your internal score lags frontier by more than 10 points on equivalent tasks, the gap is usually in tool wiring, not the metric.
How TaskCompletion plays with the rest of the 2026 agent eval stack
TaskCompletion is the outcome metric, not the only metric. It pairs with:
- TrajectoryScore: the weighted composite that captures “did it the right way” beyond “did it at all.”
- ToolSelectionAccuracy and FunctionCallAccuracy: the two tool-level metrics that diagnose tool-driven completion failures.
- Faithfulness and Groundedness: claim-level grounding that explains when an agent says “Done” with a fabricated result.
- ContextRelevance and ContextRecall: when the agent is RAG-shaped, completion can fail because the retriever missed the needed context.
- BiasDetection, Toxicity, PII: the safety panel that catches “the agent finished but unsafely.”
- CustomEvaluation: for product-specific success rubrics no off-the-shelf metric encodes.
Treat them as a panel, not a hierarchy. Every regression eval should print all of them; every release gate should threshold a subset; every dashboard should surface the worst-mover, not the average.
Common mistakes
- Grading on the final message instead of the trajectory. A “Done!” reply with no refund issued still passes a text-similarity check; only trajectory-level completion catches it.
- Conflating task completion with goal progress. Completion is binary-ish (did it finish); progress is partial credit. Tracking only completion hides agents that consistently get 80% of the way and stall; pair both.
- Empty success_criteria. Without criteria, the score collapses to “did the agent stop and produce a result”, which most agents trivially pass. Write the criteria list as an acceptance test, with the same rigor as an integration test contract.
- Reusing the same judge model that drives the agent. When TaskCompletion’s LLM-judge fallback is enabled, pin it to a different model family; self-evaluation inflates the score. The Anthropic/OpenAI/Google rotation pattern is standard in 2026 production eval pipelines.
- No regression eval on TaskCompletion. A green score in dev is a snapshot, not a guarantee; rerun it on every prompt and tool change, plus quarterly against the model fallback chain.
- Treating step-budget exhaustion as “no big deal”. It is a completion fail. Alert on it explicitly via a step_budget_exceeded cohort, not just the headline score.
- Scoring TaskCompletion without tool-selection and function-call accuracy. Completion can fail because of wrong tool or wrong args; the diagnostic metrics tell you which. Run all three by default.
- Hand-curated criteria with no review cycle. Criteria written six months ago, when the product had three tools, will not catch failures in a current twelve-tool MCP environment. Refresh per quarter, ideally as a sub-task of the data flywheel routine.
- Ignoring multi-agent handoff failures. With agent-handoff and A2A becoming standard, TaskCompletion must score the end-to-end trajectory, not the local agent’s view; a sub-agent reporting “Done” while the orchestrator never finished is the canonical multi-agent failure mode.
- Self-scoring with the agent’s own model. When LLM-judge mode is on, pin the judge to a different family (e.g., Claude judges GPT, Gemini judges Claude); cross-family rotation has been the 2026 norm since Anthropic published their measured judge-leak rates in late 2025.
- No alerting on step-count drift. A 25% rightward shift in step distribution often precedes the completion-score drop by 24-48 hours; treat the distribution as a leading indicator, not a curiosity.
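The step-count drift alert from the last item can be approximated by comparing a summary of the current step distribution against a baseline. This sketch uses the median as that summary and 25% as the alert limit; both the function names and the choice of median (rather than, say, a percentile or a full distribution test) are assumptions.

```python
from statistics import median


def step_drift(baseline_steps: list[int], current_steps: list[int]) -> float:
    """Relative change in median step count versus the baseline window."""
    base = median(baseline_steps)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (median(current_steps) - base) / base


def drift_alert(baseline_steps: list[int], current_steps: list[int],
                limit: float = 0.25) -> bool:
    """True when the median step count has shifted right by at least `limit`."""
    return step_drift(baseline_steps, current_steps) >= limit


# Hypothetical step counts: baseline median 10, current median 13.
alert = drift_alert([8, 10, 12], [10, 13, 15])
```

Running this check on the same sampled trajectories that feed TaskCompletion keeps the leading indicator and the headline score on one data source.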
Frequently Asked Questions
What is task completion in agent evaluation?
Task completion is a 0–1 metric that scores whether a multi-step agent actually finished the job: matching the expected final result, meeting listed success criteria, and reaching a final step rather than abandoning the trajectory.
How is task completion different from goal progress?
Task completion measures the end state: did the agent finish the task? Goal progress measures the journey: how close the agent got, even if it never finished. Use task completion as the pass/fail gate and goal progress for partial credit on hard tasks.
How do you measure task completion?
FutureAGI's fi.evals.TaskCompletion takes the agent trajectory, expected result, and a list of success criteria, then returns a score plus a reason. Pair it with TrajectoryScore for a weighted view across completion, efficiency, and tool selection.