Evaluation

A composite release-gate signal that aggregates eval, cost, and latency deltas to quantify the production impact of a model, prompt, or retrieval change.

What Is a Performance Impact Score?

A performance impact score is a composite release-gate signal that quantifies how much a specific change — a prompt update, a model swap, an adapter, a retrieval tweak — moves end-to-end task quality, cost, and latency against a baseline. It is not one evaluator; it is a weighted aggregate of several. The output is a single decision-ready number per release: a positive delta against the baseline means ship, a negative one means hold. It belongs to the evaluation family and lives in offline regression jobs, online release gates, and dashboards. FutureAGI computes it across versioned fi.datasets.Dataset runs.

Why a Performance Impact Score Matters in Production LLM and Agent Systems

Single metrics lie. A new prompt can lift AnswerRelevancy 3% while doubling token cost or breaking 2% of JSON outputs. Reviewing five evaluators by eye on every release does not scale. Without a single composite, releases get gated on whichever metric the loudest stakeholder happens to watch.

The pain shows up across roles. Engineering ships a model swap that aces the offline relevance score and tanks structured-output validity downstream. Finance gets a surprise bill when a “better” prompt grows token counts 40%. Product gets escalations because latency rose 600 ms even though answer quality nominally improved. Compliance asks how the team knows a release did not regress refusal behavior or PII exposure, and the answer is “we ran the eval suite” — but nobody can show a single number.

In the agentic stacks of 2026, the problem compounds. A multi-step agent has step-level evaluators (planner quality, tool-selection accuracy, retrieval relevance) plus end-to-end signals (TaskCompletion, total latency, total cost). A composite impact score forces a team to define what “better” means for that workflow before shipping, not after the incident.

How FutureAGI Computes a Performance Impact Score

FutureAGI exposes the building blocks: fi.datasets.Dataset for replaying baseline and candidate against the same inputs, Dataset.add_evaluation for attaching evaluators, and AggregatedMetric for combining them into a weighted composite. The named anchors here are AggregatedMetric plus the underlying evaluators (Groundedness, TaskCompletion, JSONValidation, AnswerRelevancy) and trace-derived signals like llm.token_count.prompt and end-to-end latency.
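
A sketch of the replay setup follows. The class and method names come from this article, but the constructor arguments, dataset names, and attachment pattern are illustrative; check the SDK reference for exact signatures:

from fi.datasets import Dataset
from fi.evals import Groundedness, AnswerRelevancy, JSONValidation

# Hypothetical dataset names; both runs replay the same versioned inputs
baseline = Dataset(dataset_name="support-rag-v12-baseline")
candidate = Dataset(dataset_name="support-rag-v13-candidate")

# Attach the evaluators whose per-row scores will feed the composite
for run in (baseline, candidate):
    run.add_evaluation(Groundedness())
    run.add_evaluation(AnswerRelevancy())
    run.add_evaluation(JSONValidation())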

A practical example: a RAG team defines an impact score as 0.4 * Groundedness + 0.3 * AnswerRelevancy + 0.2 * (1 - JSONValidation_fail_rate) + 0.1 * normalized_latency_inverse. They register a baseline run from last week’s Dataset and a candidate run from the new prompt. FutureAGI computes per-row scores, then AggregatedMetric produces the composite. The release gate compares candidate to baseline; if the impact score drops more than 0.02, the release is held. Compared with eyeballing five charts, this is one number with one threshold and one owner.
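
In plain Python, the per-row formula and the gate look like this. The weights and the 0.02 threshold mirror the example above; the row field names and helper functions are illustrative:

def impact_score(row):
    # Per-row composite; weights mirror the RAG example above
    return (0.4 * row["groundedness"]
            + 0.3 * row["answer_relevancy"]
            + 0.2 * (1 - row["json_fail_rate"])
            + 0.1 * row["latency_inverse_norm"])

def release_gate(baseline_rows, candidate_rows, max_drop=0.02):
    base = sum(impact_score(r) for r in baseline_rows) / len(baseline_rows)
    cand = sum(impact_score(r) for r in candidate_rows) / len(candidate_rows)
    delta = cand - base
    # Hold the release if the composite drops by more than the threshold
    return ("hold" if delta < -max_drop else "ship", delta)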

Custom variants are easy: wrap the formula in CustomEvaluation to expose an arbitrary score, label, and reason on every row. Compared with running the same logic in a notebook, the dataset and trace integration means every regression has reproducible evidence behind it.
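
The per-row contract is a score, a label, and a reason. A sketch of the logic a CustomEvaluation wrapper would expose; the wrapper's exact constructor lives in the SDK docs and is not shown here, and the 0.75 threshold is illustrative:

def score_row(row, threshold=0.75):
    # Reuses impact_score from the gate sketch above
    score = impact_score(row)
    label = "pass" if score >= threshold else "fail"
    reason = f"composite={score:.3f} against threshold={threshold}"
    return {"score": score, "label": label, "reason": reason}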

How to Measure or Detect It

The score itself is the measurement. The signals you compose it from are:

  • Evaluator scores — Groundedness, AnswerRelevancy, TaskCompletion, JSONValidation, plus any domain-specific judge rubric.
  • Cost — sum of llm.token_count.prompt and llm.token_count.completion weighted by the model’s price (a conversion sketch follows the code example below).
  • Latency — p50 and p99 end-to-end and per-step.
  • User-feedback proxies — thumbs-down rate and escalation rate, lagged but valuable for validating the weights.
  • Per-cohort scorecard — compute the impact score per cohort to expose changes that are net-positive globally and net-negative for a critical segment.
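
The smallest useful composite weights just two evaluator scores: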
from fi.evals import AggregatedMetric, Groundedness, AnswerRelevancy

# Example row: question, candidate answer, retrieved context
q = "What is the refund window?"
a = "Refunds are accepted within 30 days of purchase."
c = "Policy: refunds are accepted within 30 days of purchase."

# Weight groundedness 60% and relevancy 40%; weights should sum to 1
agg = AggregatedMetric(
    metrics=[Groundedness(), AnswerRelevancy()],
    weights=[0.6, 0.4],
)
result = agg.evaluate(input=q, output=a, context=c)
print(result.score)  # one decision-ready number for the row
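
The cost conversion referenced in the bullets above is a few lines over trace attributes. The attribute names come from this article; the per-token prices are placeholders, not real pricing:

def row_cost(span, prompt_price=3e-6, completion_price=15e-6):
    # Convert trace token counts to dollars; prices are illustrative
    return (span["llm.token_count.prompt"] * prompt_price
            + span["llm.token_count.completion"] * completion_price)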

Common Mistakes

  • Picking weights once and never revisiting. Weights should reflect current priorities; a security crunch may justify weighting PromptInjection higher.
  • Hiding regressions in the average. A composite that masks a 30% drop on one cohort is not a release gate, it is a vibe (a per-cohort sketch follows this list).
  • Skipping cost and latency. A quality lift that doubles spend or breaks the latency SLO should not pass the gate.
  • Using the same composite for every product surface. A chat assistant and a structured-output extractor have different weight profiles.
  • Comparing against an outdated baseline. Refresh the baseline on a schedule so cumulative drift does not silently make every release look great.
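
The per-cohort check flagged in the second mistake extends the gate sketch above: compute the delta per segment and hold if any critical cohort regresses past the threshold. A minimal sketch, assuming each row carries a cohort field:

from collections import defaultdict

def cohort_gate(baseline_rows, candidate_rows, max_drop=0.02):
    # Average the composite per cohort, then apply the delta check per group
    def mean_by_cohort(rows):
        groups = defaultdict(list)
        for r in rows:
            groups[r["cohort"]].append(impact_score(r))
        return {k: sum(v) / len(v) for k, v in groups.items()}
    base = mean_by_cohort(baseline_rows)
    cand = mean_by_cohort(candidate_rows)
    regressed = [k for k in base if cand.get(k, base[k]) - base[k] < -max_drop]
    return ("hold", regressed) if regressed else ("ship", [])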

Frequently Asked Questions

What is a performance impact score?

A performance impact score is a composite release-gate signal that combines eval, cost, and latency deltas to quantify how much a change moves end-to-end production quality.

How is it different from a single evaluation metric?

A single metric measures one dimension; a performance impact score weights several metrics into one decision-ready number, so a change that improves quality but blows up cost is flagged as net-negative.

How do you compute a performance impact score?

Run the same evaluator suite on the candidate and baseline through fi.datasets.Dataset, weight the deltas, and use AggregatedMetric to produce a per-release score gated by a threshold.