What Is Population Stability Index (PSI)?

A binned distribution-drift metric that quantifies how much a current distribution has shifted from a baseline using a sum of weighted log-ratios.

Population Stability Index (PSI) is a distribution-drift metric. You bin both the baseline and current populations, compute the relative frequency in each bin, and sum (current − baseline) × ln(current / baseline) across all bins. The standard interpretation: PSI < 0.1 indicates a stable distribution, 0.1–0.25 a minor shift worth investigating, and > 0.25 a significant drift that usually requires action. PSI originated in credit-risk modelling but applies to any binnable feature — including LLM input-token distributions, embedding-cluster frequencies, and evaluator-score histograms.
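
Worked through on a toy two-bin split (the 50/50 and 60/40 proportions are invented for illustration), the formula is a one-liner:

```python
import math

# Toy two-bin example: baseline split 50/50, current split 60/40.
baseline = [0.5, 0.5]
current = [0.6, 0.4]

psi = sum((c - b) * math.log(c / b) for b, c in zip(baseline, current))
print(round(psi, 3))  # 0.041 — below 0.1, so the distribution counts as stable
```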

Why It Matters in Production LLM and Agent Systems

PSI’s job is to catch the silent shifts. A model is not down, latency is fine, no exception is logged — but the input distribution moved, and quality is degrading. Without a drift metric like PSI, the team only finds out when users complain or the next regression eval fails.

The pain shows up across roles. An ML platform engineer sees AUC slowly slip on a churn classifier — PSI on the customer-tenure feature would have flagged it three weeks earlier. An LLM team rolls a new system prompt that subtly biases retrieval; PSI on the retrieved-chunk-source distribution would catch the shift without a regression eval. A compliance lead is asked, post-incident, “when did the input distribution change?” and has no historical PSI series to point to.

For 2026 LLM-and-agent stacks, PSI applies far beyond tabular features. PSI on the distribution of detected user intents catches semantic drift in your product. PSI on per-tool invocation rates in an agent flags a planner that has started preferring one tool. PSI on the histogram of evaluator scores (AnswerRelevancy, Groundedness) catches output-quality drift before the eval-fail-rate threshold trips. The metric is general; the binning choice is what you tune.
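
For a categorical distribution like per-tool invocation rates, one bin per tool is the natural binning. A minimal sketch — the tool names and counts below are hypothetical:

```python
import math

# Hypothetical per-tool invocation counts from two windows of agent traces.
baseline_counts = {"search": 500, "calculator": 300, "code_exec": 200}
current_counts = {"search": 700, "calculator": 150, "code_exec": 150}

def categorical_psi(baseline, current, eps=1e-6):
    tools = set(baseline) | set(current)               # one bin per tool
    b_total = sum(baseline.values())
    c_total = sum(current.values())
    score = 0.0
    for tool in tools:
        b = max(baseline.get(tool, 0) / b_total, eps)  # floor zero bins
        c = max(current.get(tool, 0) / c_total, eps)
        score += (c - b) * math.log(c / b)
    return score

print(round(categorical_psi(baseline_counts, current_counts), 3))  # 0.186: a minor shift
```

Here the planner's swing toward "search" alone pushes PSI past the 0.1 threshold — exactly the tool-preference drift described above.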

How FutureAGI Tracks PSI-Equivalent Drift

FutureAGI does not ship a PSI evaluator class — PSI is a feature-distribution metric, not an output evaluator — but the platform exposes the inputs you need to compute it and the surfaces where drift matters most.

Concretely: a fintech LLM team baselines their input distribution against a Dataset versioned v3-baseline (March traffic). Every day, traceAI ingests production traces; the team computes PSI on three distributions — input-token-length histograms, retrieved-document-source distribution, and the histogram of Groundedness evaluator scores. PSI = 0.18 on retrieval-source after two weeks: minor shift. They sample those traces, find the retriever is over-pulling from one collection because of an embedding-index update, fix the index, and PSI returns to 0.06.
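
The daily triage in that scenario can be sketched with the conventional thresholds; the per-distribution PSI values below are illustrative, not from any real run:

```python
# Conventional PSI thresholds (0.1 / 0.25) applied to a day's readings.
def classify_psi(value):
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "minor shift"
    return "significant drift"

daily = {
    "input_token_length": 0.04,
    "retrieval_source": 0.18,    # the minor shift worth sampling traces for
    "groundedness_scores": 0.07,
}
for name, value in daily.items():
    print(f"{name}: {value} ({classify_psi(value)})")
```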

Because evaluator scores are stored against the trace, drift on the evaluator-score histogram is a downstream signal that captures the impact of feature drift, prompt drift, or model drift simultaneously — a higher-signal place to watch than any single feature. Pair PSI with the FutureAGI dashboard’s eval-fail-rate-by-cohort to localise where the drift is hurting.

How to Measure or Detect It

PSI is a small calculation; the discipline is in the inputs:

  • Bin the distribution carefully: 10 equal-width or equal-frequency bins is standard for continuous variables; categorical variables use one bin per level.
  • Pick a baseline window: use a stable training-data window or a known-good production week; do not let the baseline drift.
  • Track PSI as a daily time series: rising PSI is the signal, not a single measurement.
  • EmbeddingSimilarity: as a complement, the FutureAGI evaluator returns a 0–1 score; large swings in its average hint at embedding-space drift that PSI on raw features may miss.
  • Evaluator-score histogram drift: the most actionable distribution to monitor in LLM stacks.
A minimal NumPy implementation — proportions are floored with an epsilon so empty bins do not blow up the log:

import numpy as np

def psi(baseline, current, bins=10, eps=1e-6):
    # Derive bin edges from the baseline so both windows share the same bins.
    # Note: current values outside the baseline range fall outside the histogram.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b, _ = np.histogram(baseline, bins=edges)
    c, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions, then floor empty bins so the log stays finite.
    b = np.clip(b / b.sum(), eps, None)
    c = np.clip(c / c.sum(), eps, None)
    return float(np.sum((c - b) * np.log(c / b)))

Common Mistakes

  • Choosing too few bins. Coarse bins hide drift; default to 10 unless you have a reason.
  • Using zero-count bins without smoothing. A zero in either distribution blows up the log; add a small epsilon.
  • Letting the baseline drift. Recompute PSI against a fixed baseline; rolling baselines make the metric meaningless.
  • Treating PSI alone as causal. PSI tells you a distribution shifted, not why; pair with cohort eval-fail-rate to localise.
  • Ignoring evaluator-score PSI. The evaluator histogram is downstream of every other shift — monitor it first, then trace upstream.
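
To see the zero-bin problem concretely: without smoothing, ln(0.2 / 0) is undefined; with an epsilon floor the sum stays finite, though a bin that was empty at baseline still dominates it. The three-bin proportions below are illustrative:

```python
import math

eps = 1e-6
baseline = [0.7, 0.3, 0.0]   # one bin empty at baseline
current = [0.5, 0.3, 0.2]

# Floor the empty bin so the log-ratio is defined.
smoothed = [max(p, eps) for p in baseline]
psi = sum((c - b) * math.log(c / b) for b, c in zip(smoothed, current))
print(round(psi, 2))  # 2.51 — finite, but the once-empty bin dominates the total
```

The inflated value is itself a reason some practitioners merge sparse bins rather than rely on the epsilon alone.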

Frequently Asked Questions

What is Population Stability Index (PSI)?

PSI is a drift-detection metric that compares two binned distributions and sums weighted log-ratios across bins. Standard thresholds are 0.1 (stable) and 0.25 (significant drift).

How is PSI different from KL divergence?

PSI is the symmetrised KL divergence (the Jeffreys divergence): summing (current − baseline) × ln(current / baseline) across bins equals KL(current ‖ baseline) + KL(baseline ‖ current), so both directions are weighted equally. KL divergence on its own is asymmetric and does not come with PSI's conventional 0.1 / 0.25 thresholds.
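
The identity is easy to check numerically (the proportions are illustrative):

```python
import numpy as np

b = np.array([0.5, 0.3, 0.2])    # baseline bin proportions
c = np.array([0.6, 0.25, 0.15])  # current bin proportions

psi = np.sum((c - b) * np.log(c / b))
kl_cb = np.sum(c * np.log(c / b))  # KL(current || baseline)
kl_bc = np.sum(b * np.log(b / c))  # KL(baseline || current)
print(np.isclose(psi, kl_cb + kl_bc))  # True
```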

How do you use PSI for LLM monitoring?

Compute PSI on input-token distributions, embedding clusters, or evaluator-score histograms between a baseline window and a current window. In FutureAGI, monitor evaluator-score distribution shifts as a PSI-equivalent drift signal.