What Is Individual Conditional Expectation (ICE)?

A model-interpretability technique that plots how one instance's prediction changes as a single input feature varies, exposing heterogeneous effects.

Individual Conditional Expectation (ICE) is a model-interpretability technique that plots how a single instance’s predicted output changes as one input feature is varied, with all other features held at that instance’s actual values. Each instance produces one curve. Stacking many curves on the same axis exposes heterogeneous feature effects that an averaged Partial Dependence Plot (PDP) would hide. In FutureAGI’s world, the same per-instance lens applies to LLM evaluation — slice evaluator scores by one input attribute at a time and chart them per-row to find behavior the global mean smooths out.
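
To ground the classical definition, here is a minimal hand-rolled sketch of ICE curves for a tabular model. The synthetic data, the choice of scikit-learn regressor, and the feature being swept are illustrative assumptions, not anything from FutureAGI's API:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data where feature 0 helps some instances and hurts others,
# depending on feature 1: exactly the heterogeneity ICE exposes.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.where(X[:, 1] > 0.5, X[:, 0], -X[:, 0]) + rng.normal(0.0, 0.05, 200)
model = GradientBoostingRegressor().fit(X, y)

# ICE: for each instance, sweep ONE feature over a grid while holding that
# instance's other feature values fixed, and record the model's prediction.
grid = np.linspace(0.0, 1.0, 20)
ice_curves = []
for row in X[:50]:                        # one curve per instance
    sweep = np.tile(row, (len(grid), 1))
    sweep[:, 0] = grid                    # vary feature 0 only
    ice_curves.append(model.predict(sweep))

ice_curves = np.array(ice_curves)         # shape: (n_instances, n_grid_points)
pdp = ice_curves.mean(axis=0)             # the PDP is just the average of the ICE curves

Plotting each row of ice_curves gives one line per instance; roughly half slope up and half slope down, while the averaged pdp line stays nearly flat, which is exactly the effect a PDP alone would hide.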

Why It Matters in Production LLM and Agent Systems

Aggregate metrics are seductive and misleading. A PDP that says “raising temperature from 0.2 to 0.7 has no average effect on faithfulness” can hide that for half the prompts faithfulness drops 30 points and for the other half it rises 30 points. ICE-style per-row analysis surfaces both effects. The same logic applies whenever you change one input variable at a time — model variant, retriever top-k, system prompt rev, locale.
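
A toy calculation with assumed numbers makes the cancellation explicit:

import numpy as np

# Hypothetical per-prompt faithfulness deltas after the temperature change:
# half the prompts drop ~30 points, the other half rise ~30 points.
deltas = np.concatenate([np.full(50, -30.0), np.full(50, 30.0)])
print(deltas.mean())           # ~0: the averaged, PDP-style view says "no effect"
print(np.abs(deltas).mean())   # ~30: the per-instance view says every prompt moved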

The pain shows up across roles. ML engineers run an A/B test on a single global metric and ship a “neutral” change that breaks one cohort. Product managers cannot explain why feedback degraded for a slice of users when the rollup looks fine. Compliance leads asked to demonstrate fairness cannot answer with a single PDP — they need per-instance curves to show the model behaves consistently across protected attributes.

In 2026-era LLM stacks, this matters more, not less. Prompt edits and model swaps create heterogeneous effects: a prompt change might help reasoning-heavy queries by 12 points and hurt summary queries by 8 points, netting out to roughly zero on the global eval once both cohorts are averaged together. Without per-instance, per-cohort slicing, the regression hides until production users surface it.

How FutureAGI Handles ICE-Style Analysis

FutureAGI doesn’t train classical ML models, so we don’t compute literal ICE plots over feature space. We apply the pattern to evaluation instead: vary one input dimension at a time and chart the per-instance evaluator score, not just the cohort average. At the dataset level, Dataset.add_evaluation runs the same evaluator across multiple variants of a prompt or context and stores all rows. At the dashboard level, eval-fail-rate-by-cohort plots per-row scores so you can see the spread, not just the mean. At the evaluator level, BiasDetection and the bias suite return per-row scores that surface heterogeneous responses across protected attributes — the LLM-evaluation analogue of an ICE curve.

Concretely: a team is testing a prompt revision on traceAI-langchain. They build a dataset of 500 user queries and run Faithfulness on both prompt variants via Dataset.add_evaluation. Globally, faithfulness moves from 0.78 to 0.79 — looks neutral. But the per-row “ICE-style” view groups queries by topic and shows reasoning-heavy queries jumped from 0.62 to 0.81 while summary queries dropped from 0.91 to 0.74. The team rejects the prompt as a regression on the summary cohort and writes a regression eval that gates future deploys on both cohorts. Without the per-instance lens, they would have shipped the regression.
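
A minimal sketch of that per-cohort comparison, assuming the per-row results have already been collected into records with variant, topic, and score fields; the rows_with_topic list and its column names are illustrative, not a FutureAGI API:

import pandas as pd

# One record per (query, variant): the evaluator score plus a cohort label.
df = pd.DataFrame(rows_with_topic)  # columns: variant, topic, query, score

# Global view: the averages look neutral (0.78 vs 0.79).
print(df.groupby("variant")["score"].mean())

# ICE-style view: the same per-row scores, sliced by one attribute at a time.
per_cohort = df.groupby(["topic", "variant"])["score"].mean().unstack("variant")
per_cohort["delta"] = per_cohort["v2"] - per_cohort["v1"]
print(per_cohort)  # reasoning-heavy queries improve, summary queries regress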

How to Measure or Detect It

Treat ICE as a pattern, not a single number. The signals that matter:

  • Per-row evaluator scores (Dataset.add_evaluation output): keep the row-level table, not just the aggregate; charting all rows shows heterogeneity.
  • Per-cohort eval-fail-rate (dashboard signal): pass rate sliced by feature value, locale, or topic — the canonical heterogeneity alarm.
  • BiasDetection: returns per-row bias scores across protected attributes; aggregate hides exactly the cases ICE is designed to surface.
  • Spread / variance of scores: if global mean doesn’t change but variance jumps, the change is heterogeneous and likely hiding regressions.
  • Feature-by-feature counterfactuals: vary one input field, hold the rest, and re-score — the LLM analogue of perturbing a feature in classical ICE.

Minimal Python:

from fi.evals import Faithfulness

faithfulness = Faithfulness()
rows = []
# `dataset` is your iterable of (query, context) pairs and `run()` calls your
# application with the given prompt variant; both are placeholders to fill in.
for prompt_variant in ["v1", "v2"]:
    for query, context in dataset:
        output = run(query, prompt_variant)
        score = faithfulness.evaluate(input=query, output=output, context=context)
        rows.append({"variant": prompt_variant, "query": query, "score": score.score})
# Chart the per-row, per-variant scores: that's the ICE pattern.
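
A short follow-up on the rows collected above, sketched with pandas (an assumption, not part of the fi SDK): a flat mean combined with a jump in spread is the heterogeneity alarm from the signal list.

import pandas as pd

df = pd.DataFrame(rows)
print(df.groupby("variant")["score"].agg(["mean", "std"]))
# If the means barely move but the std widens on v2, the change is heterogeneous:
# some queries improved while others regressed, and the average cancelled them out.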

Common Mistakes

  • Averaging away the answer. Reporting only the global mean hides exactly the heterogeneous effects ICE was invented to surface; keep per-row scores.
  • Conflating ICE with PDP. PDP averages; ICE doesn’t. If you want the heterogeneity, you need the per-instance curve.
  • Varying multiple features at once. ICE assumes you change one feature; multi-feature changes mix effects and the curve becomes uninterpretable.
  • Skipping cohort metadata on rows. Without user.locale, topic, or other cohort labels on rows, you cannot slice the per-instance scores meaningfully.
  • Using ICE-style analysis without a baseline. Plotting one variant tells you nothing; ICE is a comparison technique — always plot the variant against the reference.

Frequently Asked Questions

What is Individual Conditional Expectation (ICE)?

ICE is an interpretability method that plots how a single instance's predicted output changes as one input feature is varied while all other features are held at that instance's actual values.

How is ICE different from a Partial Dependence Plot (PDP)?

PDP averages the effect of a feature across all instances, hiding heterogeneous effects. ICE shows one curve per instance, so you can see when a feature helps some users and hurts others.

How is ICE relevant to LLM evaluation?

FutureAGI applies the pattern to LLM evals: vary one input attribute (cohort, language, system-prompt variant) and chart per-instance evaluator scores rather than only the global mean, exposing heterogeneous behavior.