What Is Regression?

Regression in machine learning is the class of supervised tasks where the target is a continuous value rather than a category — predicting a price, latency, score, sales total, risk number, or expected value. It contrasts with classification, where the target is a discrete label. Regression models include classical methods (linear regression, gradient-boosted trees), neural networks, and LLM-based scoring heads. Evaluation uses metrics like mean absolute error, mean squared error, root mean squared error, and R². FutureAGI evaluates regression model outputs through fi.evals and tracks them across versions with regression-eval workflows.
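
For concreteness, a minimal NumPy sketch of those headline metrics (the arrays are toy values, not outputs from a real model):

import numpy as np

# Toy predictions versus ground truth (illustrative values only).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 240.0])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))             # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))      # root mean squared error
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")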

Why Regression Matters in Production LLM and Agent Systems

A regression model embedded in an agentic stack is often the quiet, high-impact component. A pricing model regressing on dozens of features sits behind a user-facing chat assistant. A risk score feeds a tool-routing decision. A latency-prediction model gates a fallback to a faster model. A reranker emits a continuous relevance score before the top K is sliced. When any of these models degrades, the symptom surfaces downstream: wrong answer, wrong route, wrong tool. The eval challenge is to catch the degradation at the regression-model layer, not just at the user-visible layer.

The pain hits multiple roles. ML engineers see error metrics drift but cannot tell whether the underlying data shifted or the model broke. SREs watch latency-prediction errors trigger over-aggressive fallbacks that double cost. Product teams field pricing surprises and lose trust in the model. Compliance teams in regulated sectors need monotonic guarantees and explanation evidence that a regression model cannot easily provide.

In the LLM and agent systems of 2026, the same evaluation discipline applies whether the regression is gradient-boosted or LLM-scored. A judge model that rates a response on a 1–10 scale is, formally, a regression head; its calibration matters as much as a classical model's. Treating both the same way, with the same datasets, trace links, and regression evals, keeps the reliability story consistent across the stack.
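
A minimal sketch of that framing, scoring a judge's 1–10 outputs as regression predictions against human reference labels (both arrays are hypothetical):

import numpy as np

# Hypothetical judge scores versus human reference scores on a 1-10 rubric.
judge = np.array([7.0, 8.0, 6.0, 9.0, 5.0])
human = np.array([6.0, 8.0, 5.0, 9.0, 4.0])

residuals = judge - human
print(f"MAE  = {np.mean(np.abs(residuals)):.2f}")  # same error metric as any regression head
print(f"bias = {residuals.mean():+.2f}")           # positive: the judge systematically over-scores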

How FutureAGI Handles Regression

FutureAGI’s approach is to treat regression-model outputs as scores stored against ground-truth values in a Dataset, with traceAI linking each prediction to its production span. GroundTruthMatch covers categorical equivalence; for continuous outputs, teams pair it with deterministic numeric checks (within-tolerance) and an EmbeddingSimilarity proxy for textual continuous features. The RegressionEval workflow runs your regression model against a fixed dataset on every release candidate so error metrics are diff-able release over release.
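
A minimal sketch of the within-tolerance check (the within_tolerance helper and its default bands are illustrative, not part of fi.evals):

# Hypothetical helper: a deterministic numeric companion to GroundTruthMatch
# for continuous outputs.
def within_tolerance(prediction: float, ground_truth: float,
                     rel_tol: float = 0.02, abs_tol: float = 0.01) -> bool:
    """Pass if the prediction sits within a relative or absolute band of the label."""
    return abs(prediction - ground_truth) <= max(rel_tol * abs(ground_truth), abs_tol)

assert within_tolerance(101.5, 100.0)        # within the 2% relative band
assert not within_tolerance(110.0, 100.0)    # 10% off, fails the check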

A real workflow: a pricing team runs a gradient-boosted regression model behind a chat-driven quote assistant. They snapshot 20,000 historical (features, label) rows into a FutureAGI Dataset. Every model retrain runs RegressionEval over the dataset, computing MAE, RMSE, and per-cohort residual distributions. The release gate is “no regression on per-cohort MAE above 4% relative versus the previous deploy.” When a retrain ships, the test set is rerun, the diff dashboard shows the cohort that regressed, and the failing rows link back to the original traces and features. If the team also exposes the regression score through the chat agent, traceAI captures it as a span attribute so user-visible failures correlate with model errors directly.
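
A minimal sketch of that gate, computing per-cohort MAE and comparing it against the previous deploy (the rows, cohort names, and numbers are hypothetical):

import numpy as np

def cohort_mae(y_true, y_pred, cohorts):
    """Per-cohort mean absolute error, keyed by cohort label."""
    y_true, y_pred, cohorts = map(np.asarray, (y_true, y_pred, cohorts))
    return {c: float(np.mean(np.abs(y_true[cohorts == c] - y_pred[cohorts == c])))
            for c in set(cohorts.tolist())}

# Hypothetical eval rows: (label, prediction, cohort).
y_true = [100.0, 120.0, 80.0, 95.0]
y_pred = [103.0, 117.0, 85.0, 99.0]
cohorts = ["EMEA", "EMEA", "APAC", "APAC"]
new = cohort_mae(y_true, y_pred, cohorts)   # {'EMEA': 3.0, 'APAC': 4.5}

prev = {"EMEA": 3.1, "APAC": 4.0}           # MAE from the previous deploy

# Gate: fail any cohort whose MAE worsened by more than 4% relative.
regressed = [c for c in new if new[c] > prev.get(c, float("inf")) * 1.04]
print(regressed)                             # ['APAC']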

FutureAGI does not run the regression algorithm itself; it runs the regression-eval harness that decides whether a new model is safe to deploy.

How to Measure or Detect It

Regression evaluation is a layered combination of error metrics, residual analysis, and trace-linked monitoring:

  • MAE, RMSE, MAPE, R² — the headline scalars; report at every release and per cohort.
  • RegressionEval workflow — fixed dataset, repeatable eval, version diff for every retrain.
  • Per-cohort residual distributions — histograms by region, segment, product, model version.
  • GroundTruthMatch — for categorized regression buckets where exact numeric match is overkill (first sketch below).
  • Trace-linked monitoring — capture the regression score as a span attribute so user-visible incidents map back to model errors (second sketch below).
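
For the bucketed case, a minimal fi.evals example that string-matches predictions rounded to two decimals (model, x, and y_true stand in for your own fitted model and a labelled example):
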
from fi.evals import GroundTruthMatch

# Exact match on values rounded to two decimals: a coarse, deterministic
# check suited to bucketed regression outputs.
match = GroundTruthMatch()
result = match.evaluate(
    prediction=str(round(model.predict(x), 2)),  # model, x: your fitted model and features
    ground_truth=str(round(y_true, 2)),          # y_true: the labelled target
)
print(result.score)
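
For the trace-linked bullet, a generic OpenTelemetry-style sketch of capturing the score as a span attribute (traceAI's own API may differ; the span and attribute names are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Record the regression score on the serving span so user-visible incidents
# can later be joined against model errors.
with tracer.start_as_current_span("quote_assistant.price") as span:
    predicted = float(model.predict(x))  # model, x: placeholders as above
    span.set_attribute("regression.predicted_price", predicted)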

Common Mistakes

  • Reporting a single error metric. MAE and RMSE answer different questions; report both, plus a cohort breakdown.
  • Skipping calibration on continuous predictions. A model with low MAE but biased residuals produces systematic errors users feel (see the calibration sketch after this list).
  • Confusing regression-the-task with regression-eval-the-process. They share a name but mean different things; specify in writing.
  • Using global thresholds. A 5% MAPE may be excellent for one product line and dangerous for another; gate per-cohort.
  • Treating LLM judge scores as classification. A 1–10 rubric score is a regression target; calibrate and evaluate it as one.
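
A minimal calibration sketch for that second mistake: regress actuals on predictions, where a well-calibrated model shows slope near 1 and intercept near 0 (toy values only):

import numpy as np

y_pred = np.array([10.0, 20.0, 30.0, 40.0])
y_true = np.array([12.0, 23.0, 33.0, 44.0])  # actuals run consistently above predictions

# Least-squares fit of actuals against predictions.
slope, intercept = np.polyfit(y_pred, y_true, 1)
print(f"slope={slope:.2f}  intercept={intercept:.2f}")  # slope > 1: systematic under-prediction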

Frequently Asked Questions

What is regression in machine learning?

Regression is the class of supervised ML tasks where the target is a continuous value — a price, score, latency, or quantity — rather than a categorical label. It is the counterpart to classification.

How is regression different from classification?

Classification predicts a discrete label like 'fraud' or 'not fraud'; regression predicts a continuous value like a fraud probability or a transaction risk score. Their evaluation metrics differ: classification uses precision/recall/accuracy; regression uses MAE, MSE, RMSE, and R².

How do you evaluate regression models in production?

FutureAGI evaluates regression outputs through fi.evals against labelled datasets, tracks predicted-versus-actual residuals over time, and uses regression-eval workflows to detect cohort-level degradation across model versions.