What Is Integrated Gradients?

Integrated Gradients is a model attribution method that explains a neural network's prediction by integrating gradients along a path from a baseline input to the actual input. In production eval and debugging workflows, it produces one attribution score per feature, token, or pixel so engineers can inspect which inputs drove a decision. The scores should add up to the difference between the model's output on the real input and its output on the baseline. FutureAGI can evaluate those attribution outputs as side-data, even when IG itself runs outside the platform.
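
In the standard formulation, the attribution for feature i of an input x against a baseline x', for model output F, is

\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha

and the completeness property is \sum_i \mathrm{IG}_i(x) = F(x) - F(x'), the sum that the checks described below compare against.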

Why Integrated Gradients matters in production LLM and agent systems

Attribution problems become production problems when a model is accurate on average but wrong for the wrong reason. A classifier may pass aggregate accuracy while depending on a proxy for a protected attribute. A RAG reranker may prefer a boilerplate footer over the retrieved paragraph that actually answers the question. A vision-language model may cite the right object class while focusing on the background. Integrated Gradients gives engineers a per-example signal for those failures.

A vanilla saliency map reads the gradient at one point and can go silent when activations saturate. Unlike SHAP, which estimates feature contribution by perturbing coalitions, Integrated Gradients requires gradient access and a baseline path; in return, it gives a completeness check that catches many broken attribution runs. That tradeoff matters when a team must explain a single high-risk decision, not just summarize global feature importance.

The pain is shared across roles. ML engineers use IG to debug regressions across model variants. Compliance leads use attribution maps to show that sensitive features did not drive a decision. Product teams use them to explain edge cases to stakeholders. In 2026 LLM and agent stacks, token-level IG and integrated Jacobians are also useful when hallucinated output fails to trace back to the prompt, tool result, or retrieved context.

How FutureAGI handles Integrated Gradients

FutureAGI does not run Integrated Gradients inside the normal eval pipeline because IG needs model-internal gradient access. The platform boundary is after attribution generation: compute IG in Captum, transformers-interpret, or a custom training hook, then store the attribution vector on a FutureAGI dataset row beside the prompt, response, model output, and expected feature groups. FutureAGI’s approach is to treat IG as evidence attached to an eval row, not as a replacement for pass/fail quality metrics.
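
A minimal sketch of that handoff using Captum with a PyTorch classifier; the field names (ig_attribution, protected_indices, ig_completeness_error) mirror the examples on this page rather than any fixed FutureAGI schema, and how the row is uploaded depends on the team's dataset tooling:

from captum.attr import IntegratedGradients

def build_ig_row(model, inputs, baseline, target, prompt, response, protected_indices):
    # Compute IG outside the eval pipeline; Captum also reports the convergence
    # delta, i.e. how far the attributions are from summing to F(x) - F(baseline).
    ig = IntegratedGradients(model)
    attributions, delta = ig.attribute(
        inputs, baselines=baseline, target=target, return_convergence_delta=True
    )
    # Package the attribution vector as side-data next to the prompt and response
    # (single-example batch assumed).
    return {
        "prompt": prompt,
        "response": response,
        "ig_attribution": attributions.flatten().tolist(),
        "protected_indices": protected_indices,
        "ig_completeness_error": float(delta.abs().item()),
    }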

For example, a team auditing a transformer classification head can write ig_attribution, protected_indices, and ig_completeness_error onto each dataset example. A CustomEvaluation then returns the fraction of attribution mass on protected features. BiasDetection runs on the same row to check whether the output itself is biased. If the protected-attribution score rises above a threshold, the engineer opens the row, compares the token or feature attribution map, and blocks the model promotion.
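
The promotion gate itself can stay small. A sketch, assuming the CustomEvaluation score and the BiasDetection verdict have already been written back onto the row; the field names and the 0.05 threshold are illustrative, not platform defaults:

def block_promotion(row, protected_threshold=0.05):
    # Block when too much attribution mass sits on protected features,
    # or when BiasDetection flagged the output on the same row.
    protected_mass = row["ig_protected_mass"]      # score from the CustomEvaluation below
    biased_output = row["bias_detection_failed"]   # verdict from BiasDetection
    return protected_mass > protected_threshold or biased_output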

The same pattern works for LLM traces. If an app is instrumented through the langchain traceAI integration, the team can attach an attribution summary to the trace metadata and compare it with Groundedness or HallucinationScore outcomes. Rows where attribution concentrates away from retrieved context become regression candidates.
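
One way to build that attribution summary, assuming token-level attributions and a known span of retrieved-context token indices; the metadata keys and the 0.2 cutoff are illustrative:

def context_attribution_summary(token_attributions, context_token_indices):
    # Share of absolute attribution mass that lands on the retrieved-context tokens.
    total = sum(abs(a) for a in token_attributions)
    on_context = sum(abs(token_attributions[i]) for i in context_token_indices)
    share = on_context / max(total, 1e-9)
    return {
        "ig_context_share": share,
        # Low-share rows that also fail Groundedness or HallucinationScore
        # become regression candidates.
        "ig_off_context": share < 0.2,
    }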

How to measure or detect Integrated Gradients

Integrated Gradients is not a FutureAGI evaluator by itself. Measure the attribution output and the downstream eval result together:

  • Attribution-mass score: a CustomEvaluation that returns the share of attribution assigned to protected, irrelevant, or retrieved-context features.
  • BiasDetection pairing: flags rows where attribution lands on sensitive features and the output also shows bias.
  • Completeness error: checks whether attribution scores sum to the prediction-baseline difference; spikes usually mean the baseline or integration steps are wrong.
  • Eval-fail-rate-by-cohort: slices failed examples by attribution profile, so teams can see whether one feature group drives failures.
  • Trace review queue: stores high-attribution failures for manual review when Groundedness or HallucinationScore fails on the same example.

Minimal Python:

from fi.evals import CustomEvaluation

def attribution_mass_protected(row):
    # Share of absolute attribution mass assigned to protected feature indices.
    total = sum(abs(x) for x in row["ig_attribution"])
    protected = sum(abs(row["ig_attribution"][i]) for i in row["protected_indices"])
    return protected / max(total, 1e-9)  # guard against an all-zero attribution vector

ig_eval = CustomEvaluation(name="ig-protected-mass", scorer=attribution_mass_protected)

Common mistakes

  • Using a poorly chosen baseline. A zero baseline is not always meaningful; baseline choice changes the attribution and can invert the explanation.
  • Treating attribution as causation. IG explains what the model used, not what would happen counterfactually; use counterfactual tests for causal claims.
  • Ignoring the completeness axiom. If attributions do not approximately sum to the prediction-baseline difference, the IG run is broken; a minimal check is sketched after this list.
  • Comparing IG across model variants without recomputing. Attributions are model-specific; a v2 model’s IG output is not directly comparable to v1.
  • Using IG as the only fairness signal. Pair attribution with BiasDetection, cohort failure rates, and reviewer labels before making release decisions.
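
A minimal completeness check, assuming a PyTorch model with a scalar output (for example the target-class logit) and an attribution tensor shaped like the input:

import torch

def completeness_error(model, x, baseline, attributions):
    # |sum of attributions - (F(x) - F(baseline))|; large values usually mean
    # a bad baseline or too few integration steps.
    with torch.no_grad():
        delta = model(x) - model(baseline)
    return float((attributions.sum() - delta).abs().item())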

Frequently Asked Questions

What is Integrated Gradients?

Integrated Gradients is a neural-network attribution method that integrates the model's gradient along a path from a baseline input to the actual input. The output is a per-feature attribution that sums to the prediction difference.

How is it different from saliency maps?

A vanilla saliency map uses the gradient at one point and is noisy and prone to saturation. Integrated Gradients averages gradients along a path, which gives more stable attributions and satisfies completeness and sensitivity axioms.
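
A minimal path-averaging sketch, assuming a PyTorch model whose output is a single scalar (for example the target-class logit); it only illustrates the averaging, not production-grade quadrature:

import torch

def integrated_gradients(model, x, baseline, steps=50):
    # Riemann-sum approximation of the IG path integral; libraries such as
    # Captum use a more careful quadrature, so treat this as illustration only.
    total_grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        output = model(point).sum()  # assumes a single scalar output per call
        grad = torch.autograd.grad(output, point)[0]
        total_grads += grad
    # Average gradient along the path, scaled by the input-baseline difference.
    return (x - baseline) * total_grads / steps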

Where does Integrated Gradients fit in LLM evaluation?

FutureAGI does not compute IG during evaluation, but you can wrap external IG outputs as a `CustomEvaluation` and score whether attributions land on expected features. Pairing that score with `BiasDetection` is useful for fairness audits and debugging unexpected predictions.