What Is Goal Progress?
An agent-eval metric measuring how far the agent advanced toward its goal across a trajectory, returning a 0–1 partial-credit score.
Goal progress is an agent-evaluation metric that measures how far an agent advanced toward its goal across a trajectory, even when it never finished. The metric extracts goal keywords from the task description and expected outcome, then walks the trajectory step-by-step, tracking cumulative keyword overlap between each step’s thought, action, and observation. It blends average progress (30%), final progress (50%), and peak progress (20%) into a single 0–1 score. In FutureAGI it is the GoalProgress class in fi.evals, the partial-credit complement to TaskCompletion.
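The 30/50/20 blend described above can be sketched in a few lines. This is an illustrative computation, not FutureAGI's implementation; the `progress_by_step` values below are made up:

```python
# Hypothetical cumulative progress curve, one value per trajectory step (0-1).
progress_by_step = [0.1, 0.3, 0.3, 0.6, 0.8]

avg_progress = sum(progress_by_step) / len(progress_by_step)
final_progress = progress_by_step[-1]   # where the agent ended
peak_progress = max(progress_by_step)   # best point it ever reached

# Weighted blend: average 30%, final 50%, peak 20%.
score = 0.3 * avg_progress + 0.5 * final_progress + 0.2 * peak_progress
print(round(score, 3))  # prints 0.686
```

Weighting the final value highest means a late stall still costs the agent, while the peak term keeps credit for runs that nearly finished before regressing.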
Why It Matters in Production LLM and Agent Systems
Most agent failures are not catastrophes — they are stalls. The agent gets 70% of the way to the goal, hits a tool error or a context limit, and emits a polite “I wasn’t able to complete this.” A binary task-completion score marks that as a 0; a goal-progress score marks it as a 0.7, and that distinction is what tells you whether your agent is fundamentally broken or is consistently failing in the last mile.
The pain shows up wherever you debug agents. ML engineers cannot prioritise fixes when every failed run scores 0 on completion — they need a gradient to find the agents that almost worked versus the ones that wandered off. Product owners cannot communicate progress to leadership when “agent reliability” only swings between fully-passed and fully-failed. SREs cannot tell the difference between an agent crashing at step one and an agent finishing nine of ten substeps before timing out — both look like “task incomplete” in the logs.
In 2026 stacks where agent loops run dozens of steps and multi-agent handoffs are routine, goal progress is the metric that surfaces the slow-degrade pattern: an agent that used to score 0.95 on progress now scores 0.78 because a sub-agent got slower and the loop runs out of step budget before completion. Without per-step progress tracking, that regression is invisible until you look at hours of traces by hand.
How FutureAGI Handles Goal Progress
FutureAGI’s approach is a deterministic walk over the trajectory with a monotonic, never-decreasing progress signal. The fi.evals.GoalProgress class consumes an AgentTrajectoryInput containing the trajectory and a task with description and expected_outcome. It extracts keyword sets from both task fields (filtered against a stopword list), then iterates each step computing the overlap between step-text keywords (thought + action + observation + tool names) and goal keywords. The progress curve is cumulative — each step’s score is max(this_step_overlap, prev_score) — so a successful step locks in the gain even if the next step regresses. The final score weights three views of that curve: average across all steps (30%), final value (50%, because where the agent ended matters most), and peak (20%, to credit agents that did reach near-completion before stalling).
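The cumulative keyword-overlap walk can be sketched as follows, assuming a toy stopword list and whitespace tokenisation; the real GoalProgress keyword extraction is richer than this, so treat it as a sketch of the mechanism, not the library:

```python
# Toy stopword list; the real implementation uses a fuller one.
STOPWORDS = {"the", "a", "and", "for", "to", "of"}

def keywords(text: str) -> set[str]:
    """Naive keyword extraction: lowercase, split on whitespace, drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def progress_curve(goal_text: str, steps: list[str]) -> list[float]:
    goal = keywords(goal_text)
    curve, best = [], 0.0
    for step_text in steps:
        # Fraction of goal keywords covered by this step's text.
        overlap = len(keywords(step_text) & goal) / max(len(goal), 1)
        # Monotonic: each step's score is max(this overlap, previous score),
        # so a successful step locks in the gain even if the next one regresses.
        best = max(best, overlap)
        curve.append(best)
    return curve

steps = [
    "search web for stripe revenue",
    "open investor page of stripe",
    "extract q1 2026 revenue figure and cite source",
]
print(progress_curve("find q1 2026 revenue for stripe and cite the source", steps))
```

Because the curve never decreases, the final value doubles as the peak whenever the agent ends on its best step; the peak term only diverges in the raw, non-cumulative overlap signal.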
Concretely: a research agent team running on traceAI-openai-agents instruments multi-step research tasks where each task has a description like “find the 2026 Q1 revenue for Stripe and cite the source.” The team attaches GoalProgress alongside TaskCompletion to every run. For one regression cohort they see completion drop from 0.91 to 0.62 after a model swap, while goal progress only drops from 0.93 to 0.81 — meaning the new model still got most of the way, but stopped short. That is a different fix from “the agent is broken” — they raise task.max_steps and the issue resolves. Compared with G-Eval’s LLM-judge “Trajectory Quality” prompt, GoalProgress runs in milliseconds with zero LLM cost and exposes the per-step trajectory directly, making it cheap to chart progress curves alongside latency.
How to Measure or Detect It
Measurement signals tied to GoalProgress:
- fi.evals.GoalProgress — returns a 0–1 score, a progress_by_step array, and a final_progress value. Alert on final_progress < 0.7 even when task completion is Failed.
- agent.trajectory.step OTel attribute — every step the eval reads from; pivoting goal-progress regressions by step number reveals where agents stall.
- Progress-vs-step-budget chart — overlaying progress_by_step against task.max_steps shows whether agents are running out of budget mid-progress.
- Cohort-level “stall” rate — fraction of runs with final_progress > 0.6 but task_completion = 0; this is your “almost worked” cohort and the highest-priority fix list.
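The cohort-level stall rate in the last bullet reduces to a one-line filter. Here `runs` is hypothetical eval output, shaped as (final_progress, task_completed) pairs; the 0.6 threshold comes from the definition above:

```python
# Hypothetical cohort of eval results: (final_progress, task_completed).
runs = [(0.92, True), (0.71, False), (0.65, False), (0.20, False), (0.88, True)]

# "Stalled" = got most of the way (final_progress > 0.6) but never completed.
stalled = [r for r in runs if r[0] > 0.6 and not r[1]]
stall_rate = len(stalled) / len(runs)
print(stall_rate)  # prints 0.4
```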
Minimal Python:
from fi.evals import GoalProgress

metric = GoalProgress()
result = metric.evaluate(
    trajectory=run.trajectory,
    task={"description": "find Stripe Q1 2026 revenue",
          "expected_outcome": "revenue figure cited"},
)
print(result.score, result.final_progress)
Common Mistakes
- Confusing goal progress with task completion. Completion is “did it finish”; progress is “how far did it get”. Tracking only one hides the agents that consistently stall.
- Empty task.description. Without keywords, the metric has nothing to score against and falls back to a 0.5 default — feed it real task descriptions.
- Treating progress as a substitute for completion. A high goal-progress score with a 0 completion is still a failed run from the user’s perspective; both metrics belong on the dashboard.
- Tuning thresholds on the headline score alone. The most diagnostic signal is progress_by_step — a flat curve means the agent never started; a curve that peaks early and stalls means a step-budget problem.
- Ignoring the gap between peak and final. When peak progress exceeds final progress (the cumulative score prevents this, but the underlying raw overlap signal can drop), the agent regressed mid-run — usually a tool-error or context-loss issue.
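The peak-vs-final check from the last mistake can be sketched on the raw, non-cumulative overlap signal, where a mid-run regression is visible. The `raw_overlap` values and the 0.2 gap threshold are illustrative assumptions:

```python
# Hypothetical per-step raw overlap (NOT the cumulative curve, which never drops).
raw_overlap = [0.1, 0.5, 0.7, 0.3, 0.3]

peak, final = max(raw_overlap), raw_overlap[-1]
gap = peak - final
# A large gap means the agent was close, then lost ground before stopping.
if gap > 0.2:
    print(f"regressed mid-run: peaked at {peak:.2f}, ended at {final:.2f}")
```

Charting this raw signal next to the cumulative one is what separates a step-budget stall (flat tail at the peak) from a genuine regression (tail below the peak).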
Frequently Asked Questions
What is goal progress in agent evaluation?
Goal progress is an agent-eval metric that measures how far an agent advanced toward its goal across a trajectory, returning a 0–1 partial-credit score blended from average, final, and peak step-level progress.
How is goal progress different from task completion?
Task completion is the binary outcome — did the agent finish. Goal progress is the journey — how close it got, even if it stopped short. A run can score 0 on completion and 0.7 on progress; that is the signal that the agent consistently stalls near the finish line.
How do you measure goal progress?
FutureAGI's fi.evals.GoalProgress consumes the agent trajectory and task description, extracts goal keywords, and returns a 0–1 score plus a per-step progress array so you can see exactly where the agent stalled or regressed.