What Is an Agent Scorecard?
A structured rubric that aggregates an AI agent's performance across multiple evaluation dimensions into a per-dimension dashboard view.
An agent scorecard is a structured agent-evaluation rubric that summarizes an AI agent’s quality, tool-use accuracy, safety, latency, and cost across a fixed scenario cohort. It shows up in eval pipelines and production trace reviews as a per-dimension release gate, not one opaque quality number. In a FutureAGI workflow, teams compare scorecards by agent profile version to decide whether a prompt, model, tool list, or routing change is safe to ship.
Why agent scorecards matter in production LLM and agent systems
A single number cannot ship an agent. A 0.82 “agent quality” score does not tell a product reviewer whether quality dropped on safety or on tool selection, and it does not tell an SRE whether the cost-per-trace is sustainable at next quarter’s traffic. Without a scorecard, every release decision is subjective, and every regression discussion turns into a re-evaluation argument. Unlike a LangSmith run trace or a Braintrust experiment summary viewed alone, a scorecard forces the release question into rows: which behavior improved, which regressed, and which owner must respond.
Different roles read different rows. A platform engineer cares about ToolSelectionAccuracy and StepEfficiency. A product reviewer cares about TaskCompletion and IsHelpful. A compliance lead cares about ActionSafety and PromptAdherence. A finance partner cares about cost-per-resolved-trace. A scorecard with one number per row makes that conversation tractable; a scorecard with a single composite makes everyone re-derive the breakdown they actually need.
In 2026-era multi-agent stacks, scorecards also get a per-sub-agent dimension. The triage agent and the billing agent in the same crew need separate scorecards because their failure modes are different. The OpenAI Agents SDK, CrewAI, and LangGraph all emit per-agent traces, which means scorecards can be sliced by agent name without bespoke code — provided your eval platform respects that boundary.
How FutureAGI handles agent scorecards
FutureAGI’s approach is to assemble scorecards from individual evaluators and present them as Dataset reports tied to a profile version. The relevant SDK surfaces are Dataset.add_evaluation, AggregatedMetric (which combines multiple metric evaluators into a single weighted score for one dimension), and the dataset reporting layer that renders per-cohort breakdowns. Each row of the scorecard is one evaluator: TaskCompletion, TrajectoryScore, ToolSelectionAccuracy, ActionSafety, PromptAdherence, plus latency and cost pulled from traceAI spans such as llm.token_count.prompt.
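Cost and latency rows come from trace data rather than evaluators. A minimal sketch of deriving them from exported spans; llm.token_count.prompt and llm.token_count.completion are traceAI span attributes, while the trace/span dict shape and the per-token prices here are illustrative assumptions:

from statistics import quantiles

# Illustrative per-1K-token prices; substitute your provider's actual rates.
PROMPT_PRICE_PER_1K = 0.0025
COMPLETION_PRICE_PER_1K = 0.0100

def operations_rows(traces):
    # traces: list of traces, each a list of span dicts (shape is an assumption).
    latencies_ms, costs = [], []
    for spans in traces:
        latencies_ms.append(sum(s["duration_ms"] for s in spans))
        tokens_in = sum(s["attributes"].get("llm.token_count.prompt", 0) for s in spans)
        tokens_out = sum(s["attributes"].get("llm.token_count.completion", 0) for s in spans)
        costs.append(tokens_in / 1000 * PROMPT_PRICE_PER_1K + tokens_out / 1000 * COMPLETION_PRICE_PER_1K)
    p95 = quantiles(latencies_ms, n=20)[18]  # 19 cut points; index 18 is p95
    return {"p95_latency_ms": p95, "cost_per_trace": sum(costs) / len(costs)}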
Concrete example: a coding-assistant agent ships v1.7 with a model swap and a tool-list change. The team runs the v1.7 profile against a frozen 300-scenario cohort. The FutureAGI scorecard shows: TaskCompletion +3 points, ToolSelectionAccuracy -6 points, ActionSafety flat, p95 latency +180ms, cost-per-trace +12%. The single-number view would have read “+1 quality, -2 cost” and missed the tool-selection regression entirely. Instead, the scorecard surfaces the trade-off, and the team chooses to keep v1.6 in prod while re-tuning the tool descriptions.
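The trade-off surfacing is just per-row subtraction across the two frozen-cohort runs. A sketch, assuming each scorecard is exported as a plain dict of dimension scores (the report shape is an assumption; the numbers mirror the example above):

def scorecard_delta(baseline, candidate):
    # Per-row delta; positive means the candidate version improved that dimension.
    return {row: round(candidate[row] - baseline[row], 3) for row in baseline}

v1_6 = {"TaskCompletion": 0.81, "ToolSelectionAccuracy": 0.90, "ActionSafety": 0.97}
v1_7 = {"TaskCompletion": 0.84, "ToolSelectionAccuracy": 0.84, "ActionSafety": 0.97}
print(scorecard_delta(v1_6, v1_7))
# {'TaskCompletion': 0.03, 'ToolSelectionAccuracy': -0.06, 'ActionSafety': 0.0}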
For multi-agent systems, the scorecard supports per-agent slicing: the same five evaluators run with agent.trajectory.step-aware filtering produce one row per sub-agent. A regression in the manager-agent column does not get masked by strong worker-agent rows.
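A sketch of that slice, assuming each evaluated row is tagged with the sub-agent name recovered from its trace (the row shape and the agent_name key are assumptions):

from collections import defaultdict

def slice_by_agent(rows):
    # rows: evaluated scenarios, each carrying per-evaluator scores and a sub-agent tag.
    per_agent = defaultdict(list)
    for row in rows:
        per_agent[row["agent_name"]].append(row)
    # One scorecard column per sub-agent: the mean of each evaluator's scores.
    return {
        agent: {
            metric: sum(r["scores"][metric] for r in group) / len(group)
            for metric in group[0]["scores"]
        }
        for agent, group in per_agent.items()
    }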
How to measure an agent scorecard
A scorecard is a composition, not a single metric:
- Quality dimension: TaskCompletion, IsHelpful, AnswerRelevancy — does the agent finish the user’s goal?
- Trajectory dimension: TrajectoryScore, StepEfficiency, ReasoningQuality — is the path sound?
- Tool dimension: ToolSelectionAccuracy, FunctionCallAccuracy, ParameterValidation — are tools called correctly?
- Safety dimension: ActionSafety, PromptAdherence, ProtectFlash — does the agent stay within policy?
- Operations dimension: p95 latency, cost-per-trace, escalation rate — pulled from traceAI spans.
AggregatedMetric: combines per-dimension metrics into a single weighted score when one summary number is needed alongside the breakdown.
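A minimal assembly of those rows with the fi SDK surfaces named above (evaluator constructor arguments may vary by SDK version):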
from fi.datasets import Dataset
from fi.evals import TaskCompletion, ToolSelectionAccuracy, ActionSafety

# One evaluator per scorecard row, run against the frozen scenario cohort.
dataset = Dataset(name="agent-scorecard-v1.7")
dataset.add_evaluation(TaskCompletion())
dataset.add_evaluation(ToolSelectionAccuracy())
dataset.add_evaluation(ActionSafety())

# Evaluate one profile version; the report renders the per-dimension breakdown.
report = dataset.run(agent_profile="coding-assistant-v1.7")
print(report.scorecard())
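Continuing the snippet, AggregatedMetric can roll several evaluators into one weighted summary row alongside the breakdown. The constructor arguments below (name, evaluators, weights) are assumptions about the signature, not a confirmed API; check the SDK reference:

from fi.evals import AggregatedMetric, IsHelpful, AnswerRelevancy

# Hypothetical usage: one weighted "Quality" row built from three evaluators.
quality = AggregatedMetric(
    name="Quality",
    evaluators=[TaskCompletion(), IsHelpful(), AnswerRelevancy()],
    weights=[0.5, 0.3, 0.2],  # illustrative weights, not recommended values
)
dataset.add_evaluation(quality)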
Common mistakes
- Collapsing the scorecard into a single number. A weighted average hides which dimension regressed; keep the breakdown visible at deploy time.
- Running the scorecard on a moving cohort. Freeze the scenario cohort; otherwise version-to-version deltas mix profile changes with cohort changes.
- Ignoring operations rows. Quality without latency and cost does not generalize to production economics; pull both from traces.
- Skipping per-agent slicing in multi-agent flows. A team-level scorecard masks which sub-agent regressed; slice by agent name.
- Treating the scorecard as a one-time artifact. Re-run on every profile version and dashboard the trend; the trend is the deploy gate, not the absolute score (a minimal gate sketch follows this list).
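A minimal sketch of that trend gate, assuming per-version scorecards kept as plain dicts, oldest first (the row names, budgets, and history shape are illustrative assumptions):

# Per-dimension regression budgets relative to the best recent score;
# values are illustrative and should be tuned per product.
REGRESSION_BUDGET = {
    "TaskCompletion": 0.02,
    "ToolSelectionAccuracy": 0.02,
    "ActionSafety": 0.0,  # no safety regression is tolerated
}

def gate_release(history: list[dict]) -> bool:
    # history: per-version scorecards, oldest first; the dict shape is an assumption.
    candidate, previous = history[-1], history[:-1]
    for row, budget in REGRESSION_BUDGET.items():
        best = max(scorecard[row] for scorecard in previous)
        if best - candidate[row] > budget:
            print(f"blocked: {row} is {best - candidate[row]:.3f} below best recent score")
            return False
    return True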
Frequently Asked Questions
What is an agent scorecard?
An agent scorecard is a structured rubric that aggregates an agent's performance across task completion, tool accuracy, safety, latency, and cost into a per-dimension dashboard view, not a single opaque score.
How is an agent scorecard different from a single evaluation metric?
A metric returns one number on one axis. A scorecard aggregates many metrics across multiple axes — quality, safety, cost — and presents them side-by-side so deploy decisions can be made on the full picture.
How do you build an agent scorecard with FutureAGI?
Combine evaluators with AggregatedMetric over a frozen synthetic-scenario cohort. FutureAGI dataset reports render the scorecard per profile version and per cohort slice.