What Is Class Imbalance?

A dataset skew where one label or outcome appears far more often than another, distorting training, evaluation, and monitoring signals.

Class imbalance is a data reliability problem where one class, intent, outcome, or risk cohort dominates the examples used for training, evaluation, or monitoring. In LLM and agent systems, it appears in sdk:Dataset rows, regression eval samples, annotation queues, and production trace cohorts. The failure is usually hidden: overall accuracy or pass rate looks healthy while rare labels fail. FutureAGI helps teams inspect class mix before they trust metrics, release gates, or feedback-driven improvements.

Why Class Imbalance Matters in Production LLM and Agent Systems

Class imbalance creates quiet production regressions because the metric most people watch first still moves in the right direction. A support classifier with 92% billing questions and 8% refund-risk questions can report high accuracy while missing nearly every refund escalation. A RAG evaluator can pass on common onboarding intents while failing rare policy, safety, or localization cases. In an agent workflow, the planner may learn that a rarely labeled tool path is “unimportant” and skip it when the user actually needs it.
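The arithmetic behind that illusion is easy to reproduce. A minimal sketch, using made-up counts that mirror the 92/8 billing-vs-refund split above:

```python
# Hypothetical support-ticket labels: 92 billing, 8 refund-risk.
labels = ["billing"] * 92 + ["refund_risk"] * 8

# A degenerate classifier that always predicts the majority class.
predictions = ["billing"] * 100

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
refund_recall = (
    sum(p == y == "refund_risk" for p, y in zip(predictions, labels))
    / labels.count("refund_risk")
)

print(accuracy)       # 0.92 -- overall accuracy looks healthy
print(refund_recall)  # 0.0 -- every refund escalation is missed
```

The headline metric clears 90% while recall on the class that actually carries risk is zero, which is exactly the regression pattern described above.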

Developers feel it as flaky evals: one release looks better overall but worse for a small cohort. SREs see escalations, thumbs-down rate, or manual correction rate rise without an obvious latency or error spike. Compliance teams lose coverage for protected classes, regional policies, or adverse-action outcomes. Product teams ship with a blind spot that appears only after a rare user segment gets enough traffic.

Useful symptoms include a high majority-class pass rate, low recall on the minority label, confusion matrices with empty cells, class distributions that shift between the training and validation sets, and eval-fail-rate-by-cohort jumps after a dataset refresh. Imbalance matters even more for 2026-era multi-step systems because a rare class can decide the whole path: retrieve a policy, choose a tool, refuse safely, escalate, or ask a clarifying question.
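Several of these symptoms can be checked mechanically. A sketch, over hypothetical label/prediction pairs, that surfaces empty confusion-matrix cells, one of the symptoms listed above:

```python
from collections import Counter
from itertools import product

def confusion_counts(y_true, y_pred):
    """Count each (true_label, predicted_label) cell."""
    return Counter(zip(y_true, y_pred))

# Illustrative eval results: the rare class is never predicted correctly.
y_true = ["billing", "billing", "billing", "refund_risk", "billing"]
y_pred = ["billing", "billing", "billing", "billing", "billing"]

cells = confusion_counts(y_true, y_pred)
classes = sorted(set(y_true) | set(y_pred))

# Every empty cell is a (true, predicted) pair the eval set never exercises.
empty = [pair for pair in product(classes, classes) if cells[pair] == 0]
print(empty)
```

An empty ("refund_risk", "refund_risk") cell here means the dataset never shows the model getting the rare class right, so the decision boundary is untested.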

How FutureAGI Handles Class Imbalance

FutureAGI’s approach is to make class distribution visible at the dataset and evaluation boundary, not only after a model has failed in production. The specific anchor is sdk:Dataset, exposed in the SDK as fi.datasets.Dataset. A team building a support agent can store each row with input, expected_label, expected_response, cohort, source_trace_id, class_label, reviewer_status, and dataset_version. Before a regression run, they compare label counts across the full dataset, the holdout split, and the latest production trace sample.
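That pre-run comparison of label counts can be sketched in a few lines. Plain dicts stand in here for rows pulled from fi.datasets.Dataset; the class_label field name mirrors the schema above, and the counts are invented:

```python
from collections import Counter

def class_mix(rows):
    """Return each class_label's share of the rows."""
    counts = Counter(r["class_label"] for r in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical splits: the holdout under-represents the rare class.
full = [{"class_label": "billing"}] * 90 + [{"class_label": "refund_risk"}] * 10
holdout = [{"class_label": "billing"}] * 19 + [{"class_label": "refund_risk"}] * 1

for name, rows in [("full", full), ("holdout", holdout)]:
    print(name, class_mix(rows))
# refund_risk is 10% of the full dataset but only 5% of the holdout,
# so the holdout is hiding the rare class.
```

Running the same comparison against the latest production trace sample closes the loop between the stored dataset and live traffic.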

The same rows can carry evaluator outputs. GroundTruthMatch checks whether the agent’s answer matches the expected response, while class-conditioned slices reveal whether the failure is concentrated in refunds, privacy requests, chargebacks, or low-volume locales. If the application is traced with traceAI-langchain, fields such as llm.token_count.prompt and source_trace_id help separate data scarcity from prompt bloat or retrieval drift.

What the engineer does next is concrete. If the minority class falls below a 0.85 recall threshold while the overall pass rate stays above 0.94, the release is blocked. The team adds reviewed rows, resamples synthetic scenarios only for missing classes, or sends ambiguous examples to an annotation queue. Unlike an accuracy-only scikit-learn report, the FutureAGI workflow keeps the class mix, evaluator result, and production trace evidence connected in one review loop.
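The gate described above reduces to a small check. A sketch, assuming per-class recall has already been computed upstream; the threshold values are the ones from the paragraph, and the function name is illustrative:

```python
def release_allowed(overall_pass_rate, per_class_recall,
                    min_class_recall=0.85, min_pass_rate=0.94):
    """Block the release if any class falls below its recall floor,
    even when the overall pass rate clears its own threshold."""
    if overall_pass_rate < min_pass_rate:
        return False
    return all(r >= min_class_recall for r in per_class_recall.values())

# Overall pass rate looks healthy, but refund_risk recall fails the gate.
print(release_allowed(0.95, {"billing": 0.97, "refund_risk": 0.62}))  # False
print(release_allowed(0.95, {"billing": 0.97, "refund_risk": 0.90}))  # True
```

In practice the per-class thresholds need not be uniform; a safety-critical minority class can carry a stricter floor than a low-risk one.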

How to Measure or Detect Class Imbalance

Detect class imbalance before model scoring and after scoring; both views matter.

  • Label ratio: count each class_label in sdk:Dataset; alert when the smallest production-critical class drops below its minimum row count.
  • Split parity: compare class ratios across training, validation, test, and holdout data so one split does not hide the rare class.
  • Per-class recall and precision: recall shows missed minority cases; precision shows whether the model over-predicts a rare but high-cost label.
  • Confusion matrix cells: empty or near-empty off-diagonal cells often mean the dataset cannot expose a real decision boundary.
  • Eval-fail-rate-by-cohort: dashboard the pass rate by class, locale, policy version, and account tier instead of only by release.
  • Feedback proxy: compare thumbs-down rate, escalation rate, and manual correction rate for classes that have little eval coverage.
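The per-class recall and precision signals above can be computed without any ML library. A sketch over hypothetical (true, predicted) pairs:

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Per-class recall and precision from parallel label lists."""
    tp, fn, fp = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1   # missed the true class
            fp[p] += 1   # over-predicted the wrong class
    classes = set(y_true) | set(y_pred)
    return {
        c: {
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else None,
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else None,
        }
        for c in classes
    }

# Illustrative skewed eval: 8 billing rows, 2 refund_risk rows.
y_true = ["billing"] * 8 + ["refund_risk", "refund_risk"]
y_pred = ["billing"] * 9 + ["refund_risk"]
print(per_class_metrics(y_true, y_pred))
```

Here billing recall is 1.0 while refund_risk recall is 0.5, a gap that a single blended accuracy number would hide entirely.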

GroundTruthMatch can score row-level agreement, then the team groups those results by class_label.

from fi.evals import GroundTruthMatch

evaluator = GroundTruthMatch()

# row is a single sdk:Dataset record carrying the agent response
# and the reviewed expected_response for that input.
result = evaluator.evaluate(
    response=row["response"],
    expected_response=row["expected_response"],
)
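Grouping those row-level results by class is a one-pass aggregation. A sketch, assuming each row carries a class_label and a boolean passed verdict from the evaluator; both field names are illustrative:

```python
from collections import defaultdict

def pass_rate_by_class(rows):
    """rows: dicts with "class_label" and "passed" (evaluator verdict)."""
    totals = defaultdict(lambda: [0, 0])  # class -> [passed, total]
    for r in rows:
        totals[r["class_label"]][0] += int(r["passed"])
        totals[r["class_label"]][1] += 1
    return {c: passed / total for c, (passed, total) in totals.items()}

# Hypothetical evaluator verdicts across two classes.
rows = [
    {"class_label": "billing", "passed": True},
    {"class_label": "billing", "passed": True},
    {"class_label": "refund_risk", "passed": False},
    {"class_label": "refund_risk", "passed": True},
]
print(pass_rate_by_class(rows))  # per-class pass rates, not one blended number
```

Charting these per-class rates over dataset versions is what turns a one-off check into the eval-fail-rate-by-cohort dashboard described earlier.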

Common Mistakes

Common mistakes are usually metric mistakes, not sampling mistakes alone. The fix is to tie each decision to cohort-level pass rate, recall, and review status.

  • Trusting accuracy on skewed labels. A model that always predicts the majority class can look strong while missing every rare escalation.
  • Balancing only the training set. If validation, test, or production trace samples stay skewed, release gates still hide minority failures.
  • Oversampling without provenance. Duplicated minority rows can inflate confidence and make one annotation error appear many times.
  • Treating all minority classes equally. A rare refund intent and a rare self-harm intent do not carry the same risk or threshold.
  • Ignoring changing traffic mix. A launch, campaign, or new region can turn a rare class into the primary support path.

Frequently Asked Questions

What is class imbalance?

Class imbalance is a dataset condition where one label, intent, outcome, or risk cohort appears much more often than another. It can make overall accuracy or pass rate look healthy while rare classes fail.

How is class imbalance different from imbalanced data?

Class imbalance is the label-distribution problem: one class is overrepresented. Imbalanced data is broader and can include skew in features, cohorts, languages, channels, scenarios, or trace sources even when labels are balanced.

How do you measure class imbalance with FutureAGI?

Use `sdk:Dataset` through `fi.datasets.Dataset` to count `class_label` by split, cohort, and dataset version, then compare per-class precision, recall, and eval-fail-rate-by-cohort. `GroundTruthMatch` can score row-level agreement before grouping results by class.