Evaluation

What Is an Aggregated Metric?

An aggregated metric is a composite LLM-evaluation metric that combines multiple evaluator outputs into one score for a dataset, eval pipeline, or production trace cohort. It is useful when a release gate needs to balance several dimensions, such as groundedness, answer relevance, task completion, safety, and schema compliance. In FutureAGI, the eval:AggregatedMetric surface maps to fi.evals.AggregatedMetric, which produces a single score while preserving component sub-scores for alerts, regression analysis, and debugging.
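
In its simplest form, the aggregate is a weighted sum of normalized sub-scores. A minimal sketch, assuming each component score is already on a 0-1 scale (the values and weights below are hypothetical, not SDK output):

# Weighted sum of normalized sub-scores (hypothetical values, not the SDK API).
scores = {"groundedness": 0.91, "answer_relevance": 0.84}
weights = {"groundedness": 0.60, "answer_relevance": 0.40}

aggregate = sum(weights[k] * scores[k] for k in scores)  # 0.882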

Why aggregated metrics matter in production LLM and agent systems

Aggregated metrics matter because production quality is multidimensional. A RAG support agent can be grounded but irrelevant; a tool-using agent can pick the correct tool but produce invalid JSON; a safety-focused agent can refuse too often and lower task completion. If each metric sits in its own chart, teams ship from whichever number looks best and miss the combined risk.

Skipping aggregation produces recognizable failure modes: silent quality regressions hidden behind a healthy global average, over-optimized prompt changes that improve one metric while hurting another, and release gates that pass even though a severe sub-metric is failing. Developers experience this as unclear triage. SREs see alerts with no release-level severity. Product sees inconsistent user experience by cohort. Compliance has no auditable answer to “did this release meet the full policy?”

Logs usually show symptoms as scattered signals: Groundedness falls on long-context traces, AnswerRelevancy drops for one route, JSONValidation fails only after a tool call, and user thumbs-downs rise for one language. In 2026-era agentic systems, that fragmentation is normal. A single request can contain retrieval, planning, tool selection, tool execution, and final response synthesis. Aggregation gives the team one release decision, while the sub-scores explain the broken step.

How FutureAGI handles aggregated metrics

FutureAGI treats AggregatedMetric as a composition layer, not a replacement for individual evaluators. The specific FutureAGI surface is eval:AggregatedMetric; in the inventory, AggregatedMetric is a local metric that combines multiple metric evaluators into a single aggregated score. An engineer can combine Groundedness, AnswerRelevancy, TaskCompletion, and ToolSelectionAccuracy, assign weights, and set a release threshold.

Example: a customer-support agent answers account questions and can call billing tools. The team creates an aggregate named support_quality_release_gate with weights: 0.35 Groundedness, 0.25 AnswerRelevancy, 0.20 TaskCompletion, 0.15 ToolSelectionAccuracy, and 0.05 JSONValidation. Dataset rows store input, output, retrieved_context, expected tool decision, prompt version, model route, and cohort. The aggregate must score at least 0.90, and no safety-critical component can fall below its own threshold.
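
A sketch of that gate in plain Python, assuming the component sub-scores arrive as 0-1 floats; the weights and the 0.90 aggregate threshold mirror the example above, while the per-component floor value is illustrative:

# Release gate: weighted aggregate plus hard per-component floors.
WEIGHTS = {
    "groundedness": 0.35,
    "answer_relevancy": 0.25,
    "task_completion": 0.20,
    "tool_selection_accuracy": 0.15,
    "json_validation": 0.05,
}
FLOORS = {"json_validation": 0.95}  # safety-critical floor, illustrative value

def release_gate(sub_scores: dict[str, float], min_aggregate: float = 0.90) -> bool:
    aggregate = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    floors_ok = all(sub_scores[name] >= floor for name, floor in FLOORS.items())
    return aggregate >= min_aggregate and floors_ok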

When the same workflow runs on sampled traces from traceAI-langchain, fields such as llm.token_count.prompt and agent.trajectory.step explain why the aggregate changed. A failing aggregate can trigger an alert, block a prompt release, send low-scoring rows to an annotation queue, or require model fallback through Agent Command Center. Unlike Ragas-style RAG scores that focus mainly on retrieval-answer quality, the FutureAGI aggregate can include agent trajectory and structured-output checks in the same gate.
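
To see which cohort moved the aggregate, a simple group-by over evaluated rows is often enough. A sketch, assuming each row carries the aggregate score plus a cohort label (the row shape is hypothetical):

from collections import defaultdict

# Mean aggregate per cohort from evaluated rows (hypothetical row shape).
rows = [
    {"cohort": "en", "aggregate": 0.93},
    {"cohort": "de", "aggregate": 0.81},
    {"cohort": "en", "aggregate": 0.91},
]

by_cohort = defaultdict(list)
for row in rows:
    by_cohort[row["cohort"]].append(row["aggregate"])

for cohort, scores in sorted(by_cohort.items()):
    print(cohort, round(sum(scores) / len(scores), 3))  # de 0.81, en 0.92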

How to measure or detect aggregated metrics

Measure aggregated metrics by treating the aggregate and its components as separate observability surfaces:

  • fi.evals.AggregatedMetric result - returns the combined score plus component sub-scores, making it suitable for release gates.
  • Component evaluator scores - chart Groundedness, AnswerRelevancy, TaskCompletion, and ToolSelectionAccuracy separately.
  • Threshold behavior - track aggregate pass rate, hard-fail component thresholds, and borderline cases near the cutoff.
  • Dashboard signal - alert on eval-fail-rate-by-cohort, aggregate-by-prompt-version, and aggregate-by-model-route.
  • Trace fields - inspect llm.token_count.prompt, agent.trajectory.step, and trace_id when a sub-score changes.
  • User proxy - compare aggregate movement against thumbs-down rate, escalation rate, and corrected-answer rate.
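
The snippet below sketches minimal SDK usage; the evaluator classes and the evaluate() call follow the fi.evals surface described above, while the query, docs, and answer values are hypothetical placeholders: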
from fi.evals import AggregatedMetric, Groundedness, AnswerRelevancy

# Hypothetical example inputs; in practice these come from a dataset row or trace.
query = "How do I update my billing address?"
docs = ["Billing addresses can be changed under Settings > Billing."]
answer = "Open Settings > Billing and edit your billing address."

# Weighted composite of two component evaluators; weights sum to 1.0.
metric = AggregatedMetric(
    evaluators=[Groundedness(), AnswerRelevancy()],
    weights=[0.6, 0.4],
)
result = metric.evaluate(input=query, output=answer, context=docs)
print(result.score, result.sub_scores)  # combined score plus per-evaluator sub-scores

Common mistakes

These are the mistakes that make an aggregate look precise while hiding the real failure:

  • Averaging incompatible metrics. A 0-1 judge score, a boolean JSON check, and a raw latency value need normalization before aggregation (see the sketch after this list).
  • Hiding component failures. A passing aggregate should not mask PromptInjection, JSONValidation, or ToolSelectionAccuracy below its own safety threshold.
  • Choosing weights by preference. Use historical incidents, human labels, and cohort impact; do not let the loudest stakeholder set weights.
  • Using one aggregate for every workflow. RAG answer quality, function calling, and compliance review need different component metrics and thresholds.
  • Skipping variance checks. A 0.92 aggregate over 25 traces is not a stable release gate; track confidence intervals or minimum sample counts.
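
A minimal normalization sketch, bringing three signals with different scales onto 0-1 before weighting; the 2-second latency budget and all values are illustrative:

# Bring heterogeneous signals onto a 0-1 scale before weighting.
judge_score = 0.87   # LLM-judge output, already 0-1
json_valid = True    # boolean structural check
latency_s = 1.3      # raw seconds

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

normalized = {
    "judge": clamp01(judge_score),
    "json": 1.0 if json_valid else 0.0,         # boolean -> {0.0, 1.0}
    "latency": clamp01(1.0 - latency_s / 2.0),  # 2 s budget, linear penalty
}
weights = {"judge": 0.6, "json": 0.3, "latency": 0.1}
aggregate = sum(weights[k] * normalized[k] for k in weights)  # 0.857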

The aggregate should summarize a decision, not erase the evidence behind it.

Frequently Asked Questions

What is an aggregated metric?

An aggregated metric is a composite LLM-evaluation score that combines several evaluator results into one weighted signal. In FutureAGI, `AggregatedMetric` keeps both the final score and the component sub-scores available.

How is an aggregated metric different from an evaluation metric?

An evaluation metric is one score on one quality dimension. An aggregated metric combines several metrics into a weighted or rule-based composite for a dashboard, alert, or release gate.

How do you measure an aggregated metric?

Use `fi.evals.AggregatedMetric` with component evaluators such as `Groundedness`, `AnswerRelevancy`, `TaskCompletion`, and `ToolSelectionAccuracy`. Track aggregate score, sub-scores, thresholds, and eval-fail-rate-by-cohort.