What Is a Performance Slice?

A defined subset of evaluation or production data used to compute per-segment model performance and surface behavior the global average hides.

A performance slice is a defined subset of evaluation or production data, segmented by an attribute such as language, length bucket, user cohort, route, tool name, or prompt version, used to compute per-segment model performance. The point is to surface behavior the global average hides — a 2% drop in mean accuracy can be a 30% drop on a small but critical slice. Slicing belongs to the model evaluation and observability layer, and is one of the most reliable ways to spot regressions early. FutureAGI computes per-slice scores across fi.datasets.Dataset and traceAI.

Why Performance Slices Matter in Production LLM and Agent Systems

Averages mislead. A model release looks fine because Spanish-language traffic only accounts for 4% of volume, so a complete failure on Spanish prompts barely moves the global metric. The team ships, the Spanish cohort breaks, and the failure surfaces through escalations a week later. Slicing would have caught it on day zero.

The pain shows up in every role. ML engineers see global eval scores hold steady while a domain-specific cohort silently regresses. SREs see traffic dashboards stay green while p99 latency spikes on long-context requests, which only land in one slice. Product managers see CSAT averages hold while a high-LTV cohort starts churning. Compliance teams need to prove fairness across demographic slices, which requires the slicing work be done up front.

In 2026 agent stacks, the slicing dimensions multiply. A multi-step agent should be sliced by route, by step type, by tool name, by agent.trajectory.step count, by retrieval source, by language, and by user segment. A regression in tool-selection accuracy on the billing-lookup tool inside the support-agent-secure route on long-context Spanish prompts is invisible until you slice on all four dimensions and read the cell.
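That drill-down is just a filter over trace records. A minimal sketch, assuming traces have already been exported as dictionaries — the field names and rows below are hypothetical stand-ins for span attributes:

```python
# Hypothetical trace records; fields stand in for span attributes
# such as route, tool name, language, and length bucket.
traces = [
    {"route": "support-agent-secure", "tool": "billing-lookup",
     "language": "es", "length": "long", "tool_choice_correct": False},
    {"route": "support-agent-secure", "tool": "billing-lookup",
     "language": "en", "length": "short", "tool_choice_correct": True},
    {"route": "search-agent", "tool": "web-search",
     "language": "es", "length": "long", "tool_choice_correct": True},
]

def slice_accuracy(rows, **dims):
    """Tool-selection accuracy within the cell matching every given dimension."""
    cell = [r for r in rows if all(r[k] == v for k, v in dims.items())]
    return sum(r["tool_choice_correct"] for r in cell) / len(cell) if cell else None

overall = slice_accuracy(traces)                     # 2/3 — looks tolerable
failing_cell = slice_accuracy(
    traces, route="support-agent-secure", tool="billing-lookup",
    language="es", length="long",
)                                                    # 0.0 — the regression
```

The global number hides the zero; only the four-dimension cell exposes it.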

How FutureAGI Surfaces Performance Slices

FutureAGI’s anchor is the Dataset and traceAI integration: every eval row carries slice attributes, every span carries route and tool context, and dashboards can group by any combination. Use Dataset.add_evaluation to attach evaluators, then group results by (language, route, tool, prompt.version) to produce a slice scorecard. There is no special “Slice” evaluator class because slicing is an aggregation pattern, not an evaluator.

Real example: a customer-support team slices its weekly eval by (language, ticket_category, model.version). They run Groundedness, AnswerRelevancy, and JSONValidation on a sampled production cohort. FutureAGI returns per-row scores; the team groups by the three attributes. The output: a per-cell scorecard showing English billing tickets at 0.92 grounding, Spanish billing at 0.71, English technical at 0.88. The Spanish billing cell is the regression. The team replays those traces, finds a retrieval misconfiguration for the Spanish corpus, and fixes it before broadening rollout. Compared with reading a single global score, slicing turns “something feels off” into “the bug is here, in this route, on this language, on this tool.”
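The grouping step itself needs no special tooling. A sketch of building a per-cell scorecard from tagged per-row results, using only the standard library — rows and scores below are illustrative, not real data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-row eval results, each tagged with its slice attributes.
rows = [
    {"language": "en", "ticket_category": "billing", "score": 0.92},
    {"language": "en", "ticket_category": "billing", "score": 0.90},
    {"language": "es", "ticket_category": "billing", "score": 0.70},
    {"language": "es", "ticket_category": "billing", "score": 0.72},
    {"language": "en", "ticket_category": "technical", "score": 0.88},
]

def scorecard(rows, keys):
    """Group per-row scores into cells keyed by the given slice attributes."""
    cells = defaultdict(list)
    for r in rows:
        cells[tuple(r[k] for k in keys)].append(r["score"])
    return {cell: round(mean(s), 2) for cell, s in cells.items()}

card = scorecard(rows, ("language", "ticket_category"))
# {('en', 'billing'): 0.91, ('es', 'billing'): 0.71, ('en', 'technical'): 0.88}
```

Sorting the cells by score puts the regressed Spanish-billing cell at the top of the list.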

For agent traces, agent.trajectory.step and tool name become natural slice keys; FutureAGI’s traceAI integration captures both automatically.

How to Measure or Detect It

Slicing produces metrics; the slice itself is a definition.

  • Eval-fail-rate-by-cohort — the canonical dashboard signal. Sort cohorts by fail rate, not by volume.
  • Per-slice evaluator score — Groundedness, AnswerRelevancy, TaskCompletion, JSONValidation computed within each slice.
  • Per-slice latency — p50 and p99 grouped by route, language, length bucket.
  • Per-slice token cost — llm.token_count.prompt and llm.token_count.completion summed by slice.
  • Slice volume sanity — make sure each slice has enough rows for a stable estimate; tiny slices need bootstrapping or wider time windows.
  • Drift across slices — track week-over-week change in the per-slice score; rising variance is a leading indicator of regression.
A minimal per-row evaluation; log the slice keys alongside each score so the results can be grouped afterward:

from fi.evals import AnswerRelevancy

evaluator = AnswerRelevancy()
result = evaluator.evaluate(
    input=user_query,
    output=model_response,
)
# log slice keys with the row: language, route, model.version
print(result.score)
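For the slice-volume sanity check, a percentile bootstrap turns a tiny slice's point estimate into an interval. A minimal sketch, assuming per-row scores in [0, 1]; the sample data and resample count are illustrative:

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a slice's mean score."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# A 12-row slice: the point estimate alone would look misleadingly precise.
tiny_slice = [0.9, 0.8, 1.0, 0.7, 0.9, 0.6, 1.0, 0.8, 0.9, 0.7, 1.0, 0.8]
lo, hi = bootstrap_ci(tiny_slice)
```

If the interval is too wide to act on, widen the time window rather than trusting the single number.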

Common Mistakes

  • Reporting only the global average. It hides every interesting failure.
  • Slicing by too many dimensions at once. Cells become tiny; estimates become noisy. Pick the 2–3 dimensions that matter most.
  • Ignoring small slices. A 1% slice can be the high-LTV cohort that pays the bills; weight by business impact, not volume.
  • Comparing slice scores across release boundaries without normalization. A new model may shift slice volumes; weight or stratify before comparing.
  • Forgetting to log slice keys. Without language, route, model.version on the trace, slicing later is archaeology.
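The normalization point deserves a concrete sketch: hold the slice mix fixed when comparing releases, so a traffic shift does not masquerade as a quality change. Numbers below are illustrative:

```python
def naive_avg(scores, volumes):
    """Volume-weighted average: moves whenever the traffic mix moves."""
    n = sum(volumes.values())
    return sum(scores[s] * volumes[s] / n for s in scores)

def stratified(scores, ref_weights):
    """Average re-weighted to a fixed reference mix of slices."""
    return sum(scores[s] * w for s, w in ref_weights.items())

scores = {"en": 0.90, "es": 0.70}      # identical per-slice quality in both releases
vol_v1 = {"en": 960, "es": 40}
vol_v2 = {"en": 800, "es": 200}        # v2's traffic shifted toward Spanish
ref = {"en": 0.96, "es": 0.04}         # fixed reference mix

naive_v1 = naive_avg(scores, vol_v1)   # 0.892
naive_v2 = naive_avg(scores, vol_v2)   # 0.860 — looks like a regression, isn't
strat_v1 = stratified(scores, ref)     # 0.892
strat_v2 = stratified(scores, ref)     # 0.892 — mix-adjusted, no change
```

The naive averages diverge while per-slice quality is identical; the stratified scores agree, which is the comparison you want across release boundaries.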

Frequently Asked Questions

What is a performance slice?

A performance slice is a defined subset of eval or production data used to compute per-segment performance, exposing regressions and improvements that a global average hides.

How is a slice different from a cohort?

The terms overlap; “slice” usually emphasizes a data-attribute filter (language, length bucket, route), while “cohort” often emphasizes a user grouping. In practice, both refer to a partition for per-segment evaluation.

How do you compute slice metrics?

Tag eval rows with slice attributes in `fi.datasets.Dataset`, attach evaluators with `Dataset.add_evaluation`, and group results by slice. FutureAGI surfaces eval-fail-rate-by-cohort directly in dashboards.