What Is a Feature Importance Heat Map?

A 2-D visualisation of feature importance scores across samples, cohorts, time, or model versions, used to expose patterns invisible in a flat ranking.

A feature importance heat map is a two-dimensional model explainability view that shows how input features, retrieved chunks, or source documents influence outputs across cohorts, traces, time windows, or model versions. Instead of one global ranking, it renders each importance score as a colored cell, so engineers can see where a feature matters, where it vanishes, and where it drifts. In FutureAGI workflows, the same idea appears in production trace dashboards for RAG attribution and cohort debugging.
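
The grid itself is just a matrix render. Below is a minimal sketch of that view using matplotlib; the feature names, cohort names, and scores are invented purely to show the shape, not taken from any real model.

import matplotlib.pyplot as plt
import numpy as np

# Illustrative importance matrix: rows are features, columns are cohorts.
features = ["tenure", "plan_type", "ticket_count"]
cohorts = ["free", "pro", "enterprise"]
importance = np.array([
    [0.05, 0.31, 0.02],
    [0.22, 0.04, 0.18],
    [0.01, 0.02, 0.40],
])

fig, ax = plt.subplots()
im = ax.imshow(importance, cmap="viridis")
ax.set_xticks(range(len(cohorts)))
ax.set_xticklabels(cohorts)
ax.set_yticks(range(len(features)))
ax.set_yticklabels(features)
fig.colorbar(im, ax=ax, label="importance score")
plt.show()

The same render applies whether the cells hold SHAP values, permutation importance, or chunk-attribution scores; only the source of the matrix changes.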

Why Feature Importance Heat Maps Matter in Production LLM and Agent Systems

Flat feature-importance rankings hide where, and for whom, a feature actually matters. A feature that ranks #20 globally may be the #1 driver for one user cohort. A model version’s importance ranking may shift between releases without changing aggregate accuracy. A RAG system may show one chunk dominating retrieval for half its traffic and contributing nothing for the rest. None of this is visible from a single bar chart.

The pain falls on different roles. ML engineers ship models whose decision-driving features differ across cohorts and discover the disparity only when a fairness audit lands. Product leads can’t tell whether a feature that drove a quarter’s results is durable or cohort-specific. Compliance teams writing model cards need importance stratified by cohort, not a single aggregate number. Customer-success teams field bug reports that trace back to a feature that works for most users but fails for one cohort.

In 2026-era stacks, the heat-map view scales to RAG and agent debugging. A chunk-attribution heat map across a week of traces shows which chunks dominate, when, and for whom. A spike in one row means a single chunk is over-indexing — usually a sign of embedding-space collapse or a metadata filter going wrong. FutureAGI surfaces this as a dashboard view because per-trace attribution alone cannot show pattern drift; the heat map can.
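
One way to turn that "spike in one row" observation into a check is to compare each chunk's share of total attribution across the window against the rest. The sketch below uses pandas on an invented chunks × hours frame; the chunk ids, scores, and threshold factor are assumptions for illustration, not FutureAGI defaults.

import pandas as pd

# Invented attribution grid: rows are chunks, columns are hourly windows.
heat_df = pd.DataFrame(
    {"09:00": [0.90, 0.12, 0.20], "10:00": [0.85, 0.18, 0.10], "11:00": [0.95, 0.10, 0.15]},
    index=["faq_generic", "pricing_v2", "setup_guide"],
)

# Share of total attribution captured by each chunk across the window.
row_share = heat_df.sum(axis=1) / heat_df.values.sum()

# Flag chunks whose share dwarfs the typical chunk; tune the factor to your traffic.
over_indexed = row_share[row_share > 2 * row_share.median()]
print(over_indexed)  # faq_generic dominates in this toy example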

How FutureAGI Handles Feature Importance Heat Maps

FutureAGI’s approach is to make chunk- and source-attribution data sliceable on the trace dashboard so a heat-map view falls out of the same evaluator runs. ChunkAttribution and ChunkUtilization produce per-chunk scores for every RAG trace. The dashboard groups by chunk-id, prompt version, user cohort, and time window so engineers can visualise importance as a 2-D grid: chunks × cohorts, chunks × prompt versions, or chunks × hours of traffic.
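
As a rough illustration of what one of those groupings produces, here is a chunks × prompt-versions grid built with pandas from flattened attribution records; the record fields and scores are invented for the example and are not the dashboard's export schema.

import pandas as pd

# Invented per-trace attribution records.
records = pd.DataFrame([
    {"chunk_id": "pricing_v2", "prompt_version": "v7", "score": 0.61},
    {"chunk_id": "pricing_v2", "prompt_version": "v8", "score": 0.18},
    {"chunk_id": "setup_guide", "prompt_version": "v7", "score": 0.22},
    {"chunk_id": "setup_guide", "prompt_version": "v8", "score": 0.47},
])

# Mean attribution per chunk, broken out by prompt version.
grid = records.groupby(["chunk_id", "prompt_version"])["score"].mean().unstack("prompt_version")
print(grid)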

A concrete workflow: a RAG team running on traceAI-langchain notices Faithfulness divergence between two user cohorts. They open the dashboard and pivot ChunkAttribution results into a heat map with cohorts on the x-axis and chunks on the y-axis. One row stands out: a single chunk dominates one cohort’s traffic but barely registers for the other. The chunk is a generic FAQ entry that the embedding pipeline keeps surfacing for a specific phrasing pattern. The team adds a metadata exclusion to the retriever for that phrasing and runs RegressionEval to confirm Faithfulness recovers.

For tabular ML pipelines wrapped through FutureAGI, permutation-importance heat maps stratified by cohort are stored alongside the Dataset artifact for audit. The Agent Command Center can also pin a fallback route for cohorts where importance drift suggests a different model variant performs better. Unlike SHAP’s summary_plot, which is a static notebook artifact, the heat-map view in FutureAGI is regenerated continuously from production traces.
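
For the tabular case mentioned above, one way to produce a cohort-stratified grid outside any particular platform is scikit-learn's permutation_importance run once per cohort. Everything below (synthetic data, cohort labels, model choice) is invented and only meant to show the shape of the computation.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic tabular data with two cohorts and a cohort-dependent signal.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 3)), columns=["tenure", "spend", "tickets"])
cohorts = pd.Series(np.where(rng.random(400) < 0.5, "free", "pro"))
y = ((X["spend"] + (cohorts == "pro") * X["tickets"]) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation importance computed separately for each cohort.
rows = {}
for cohort in cohorts.unique():
    mask = (cohorts == cohort).to_numpy()
    result = permutation_importance(model, X[mask], y[mask], n_repeats=10, random_state=0)
    rows[cohort] = result.importances_mean

heat_df = pd.DataFrame(rows, index=X.columns).T  # cohorts as rows, features as columns
print(heat_df)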

How to Measure Feature Importance Heat Maps

The heat-map view is built on the same evaluator data, sliced two-dimensionally:

  • ChunkAttribution × cohort: chunks dominating one cohort and not another flag a retrieval bias.
  • ChunkAttribution × time: shifts across days flag retriever or index drift.
  • ChunkAttribution × prompt version: importance changes between prompt revisions.
  • Permutation importance × cohort (tabular): SHAP or permutation values stratified by cohort.
  • Cell-level outlier detection: a single cell darker than the rest is a debug entry-point.
  • Eval-fail-rate-by-cohort overlay: high-importance chunks correlated with cohorts that have high eval failure rates.

A minimal sketch of that slicing in code, assuming `traces` is an iterable of production trace records carrying cohort and chunk metadata:

import pandas as pd

from fi.evals import ChunkAttribution

attr = ChunkAttribution()
rows = []
for trace in traces:
    r = attr.evaluate(input=trace.input, output=trace.output, context=trace.chunks)
    rows.append({"cohort": trace.cohort, "chunk_id": trace.chunk_id, "score": r.score})

# Pivot into a cohort x chunk grid, averaging repeated (cohort, chunk) pairs.
heat = pd.DataFrame(rows).pivot_table(index="cohort", columns="chunk_id", values="score", aggfunc="mean")

Common Mistakes

  • Plotting a heat map without normalisation. Different cohorts have different traffic; normalise by cohort size before colouring cells (see the sketch after this list).
  • Ignoring time as a dimension. Static snapshots miss drift; produce the heat map over rolling windows.
  • Confusing high importance with quality. A chunk dominating doesn’t mean it is correct; pair with Faithfulness.
  • Skipping outlier alerts. A single dark cell often signals a real bug; instrument cell-level alerting.
  • Plotting once at launch. Importance distributions shift as models, prompts, and indexes change.
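
A minimal sketch of that normalisation step, assuming you have raw per-cohort score sums and a count of traces per cohort; both frames below are invented.

import pandas as pd

# Invented raw grid: summed attribution per cohort (rows) and chunk (columns).
raw = pd.DataFrame(
    {"chunk_a": [120.0, 14.0], "chunk_b": [80.0, 11.0]},
    index=["free", "pro"],
)
cohort_counts = pd.Series({"free": 1000, "pro": 100})

# Divide each cohort's row by its trace count before colouring cells.
normalised = raw.div(cohort_counts, axis=0)
print(normalised)  # per-trace averages: the cohorts now sit on a comparable scale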

Frequently Asked Questions

What is a feature importance heat map?

A 2-D visualisation showing feature importance across a second dimension such as cohorts, samples, time, or model versions. Each cell encodes one importance score as colour, exposing patterns a flat ranking misses.

How is a heat map different from a feature importance bar chart?

A bar chart shows one ranking. A heat map shows many rankings stacked side by side, exposing how importance varies across cohorts, time, or releases. The heat map answers 'is this feature important the same way for everyone?'

What is the LLM equivalent?

A chunk-attribution heat map: rows are retrieved chunks, columns are traces or cohorts, cells are attribution scores. FutureAGI's `ChunkAttribution` results aggregated by cohort produce exactly this view.