
Coefficient of Determination (R²) in 2026: How to Interpret Low, Moderate, High, and Negative Values Correctly

How to interpret R² in regression in 2026: when 0.4 is great, when 0.9 means overfitting, the negative-R² trap, and the four metrics you must pair with it.


TL;DR: How to Read R² in 2026 Without Getting Fooled

R² value | What it usually means | What to do next
--- | --- | ---
Below 0 (test set) | Model is worse than predicting the mean | Check leakage, features, target encoding
0.0 to 0.3 | Weak fit, possibly an inherently noisy problem | Add features, try non-linear models, check noise floor
0.3 to 0.6 | Useful in social, behavioral, marketing data | Pair with RMSE/MAE, validate on holdout
0.6 to 0.85 | Common production range for ML regression | Watch for overfitting, run cross-validation
0.85 to 0.95 | Strong fit for engineering and physical systems | Confirm no leakage, sanity-check features
Above 0.97 | Almost always too good to be true | Run a leakage audit: identifier columns, target encoding, time-leak
Adjusted R² gap | R² grows but adj-R² drops | Drop the predictor, it does not earn its slot

R² is a relative variance-explained score. It does not tell you the magnitude of error, and it does not tell you whether your model is calibrated, fair, or useful in production. Always read it alongside RMSE, MAE, adjusted R², and residual diagnostics.

This post is the interpretation deep-dive. For the basic formula, derivation, and history, see the companion R-squared and model accuracy explainer.

What Is the Coefficient of Determination (R²) and Why Interpretation Trips People Up in 2026

R² is the share of variance in the dependent variable that a regression model explains. The formula is short enough to fit on a single line:

R^2 = 1 - (SS_res / SS_tot)

Where:

  • SS_res is the residual sum of squares: the sum of squared differences between observed values and model predictions, i.e., the variation the model leaves unexplained.
  • SS_tot is the total sum of squares: the sum of squared deviations of the target from its mean.

The headline number ranges from 0 to 1 in textbook in-sample OLS regression, with negative values possible on held-out data or on models that do not minimize squared error. The trouble is that R² is unitless, scale-free, and entirely relative. It tells you how much better your model is than predicting the target mean (for held-out R² as computed by sklearn.metrics.r2_score, that mean is the evaluation-set mean). It does not tell you whether the residual error is acceptable for your business problem, whether the model is fair, or whether the relationship is linear.
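
To make the formula concrete, here is a minimal sketch that computes R² by hand with NumPy on invented toy numbers and checks it against scikit-learn:

import numpy as np
from sklearn.metrics import r2_score

# Toy observed targets and model predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4, 10.9])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
print(1 - ss_res / ss_tot)        # manual R²
print(r2_score(y_true, y_pred))   # same value from scikit-learn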

What Changed in How Practitioners Read R² Since 2025

Three shifts make 2026 different. First, modern ML stacks routinely report cross-validated R² and holdout R² by default, so the in-sample 0 to 1 bound is no longer the value most teams see. Second, gradient-boosted regressors and stacked ensembles produce R² values that drift up with model complexity, which has reinforced the discipline of always pairing R² with RMSE or MAE on a true holdout set. Third, the broader move to mature CI for ML means most teams compute held-out or cross-validated R² via scikit-learn’s long-standing cross_val_score with scoring="r2" rather than reporting in-sample numbers, which removes most of the legacy confusion about “training R² versus test R²” in production pipelines.

How to Calculate R² in Python, R, and Excel With Worked Examples

Software does the arithmetic, but the workflow matters. In Python with scikit-learn:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# X is the feature matrix and y the numeric target, loaded beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(r2_score(y_test, preds))  # held-out R²; can be negative if the model is worse than the mean

In R, the linear-model summary returns it directly:

fit <- lm(y ~ x1 + x2, data = train)
summary(fit)$r.squared       # in-sample R²
summary(fit)$adj.r.squared   # penalized for the number of predictors

In Excel, RSQ(y_range, x_range) returns R² for a single predictor, and the regression output in the Data Analysis ToolPak reports both R² and adjusted R² for multivariable fits.

Always compute R² on a holdout set or via cross-validation. The in-sample number flatters every model and will rise even when you add randomly shuffled noise as a predictor.
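
A minimal cross-validation sketch, assuming the same X and y as above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated R²; each fold is scored on rows the model never saw
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())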

How to Interpret R² Values: Low, Moderate, High, and Negative R² Explained

Low R² (Below 0.3) Often Reflects a Noise-Dominated Problem, Not a Failed Model

A low R² is not automatically a bad model. In behavioral and social-science settings, R² of 0.1 to 0.3 can still surface a real, deployable signal because the underlying process is dominated by unmodeled human variation. The right reflex is to ask:

  • Is the irreducible noise floor of this problem genuinely high?
  • Are there missing predictors I can add at low cost?
  • Does the residual plot show patterns I can model with non-linear features?

If the answer to all three is no, a low R² may be the best honest fit you can produce, and what matters is the residual error and the downstream decision.
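
One quick way to answer the third question is a residual-versus-prediction plot. A hedged sketch with matplotlib, reusing preds and y_test from the earlier scikit-learn snippet:

import matplotlib.pyplot as plt

residuals = y_test - preds
plt.scatter(preds, residuals, alpha=0.5)      # curvature or a funnel shape suggests non-linear terms or heteroscedasticity
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()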

Moderate R² (0.3 to 0.7) Is Common in Noisy Business and Social Data and the Hardest to Read

This is the band where R² lies the most. It is typical for marketing-mix, customer-behavior, and social-science targets where the underlying process is noisy. Two models with identical R² can have very different residual distributions, very different fairness profiles, and very different cost in production. Always pair R² in this range with:

  • RMSE or MAE in the original units of the target.
  • Residual plots stratified by the most important feature.
  • A fairness audit across sensitive subgroups (see the bias detection guide).
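
For the first of those, a minimal sketch reusing y_test and preds from the earlier snippet (RMSE and MAE land in the target's original units):

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print("R²:  ", r2_score(y_test, preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds)))  # outlier-sensitive error in target units
print("MAE: ", mean_absolute_error(y_test, preds))          # more robust error in target units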

High R² (0.85 to 0.95) Usually Reflects Either a Clean Process or a Subtle Leakage Problem

A high R² is what every team wants and what every team should sanity-check. The first question to ask is: does any feature contain information that would not be available at prediction time? Target encoding, leaked timestamps, and identifier columns are the three most common sources of unrealistically high R². If you can rule those out and the residuals look unstructured, a high R² in this band is genuine and you should pair it with cross-validated R² before shipping.

R² Above 0.97 Should Trigger a Leakage Audit Before You Celebrate

In real-world tabular regression with messy features, R² above 0.97 on a holdout split is almost always too good. The most frequent culprits are: a feature that is essentially the target in disguise, an ID column the model has memorized, or a time-leak where future information bleeds into the training fold. Run a permutation test and inspect the top-ranked features.
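
A hedged sketch of that audit with scikit-learn's permutation_importance, reusing model, X_test, and y_test from the earlier snippet; a single feature whose permutation collapses R² is a leakage red flag:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much held-out R² drops
result = permutation_importance(model, X_test, y_test, scoring="r2", n_repeats=10, random_state=42)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(idx, result.importances_mean[idx])  # top five features by R² drop when permuted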

Negative R² Means Your Model Is Worse Than Predicting the Mean

A negative R² on the test set is unambiguous: the constant predictor y_hat = mean(y_test) (the baseline used by sklearn.metrics.r2_score) beats your model on the held-out data. The usual causes are:

  • Train and test distributions differ (covariate shift, time leakage in the wrong direction).
  • The model is severely misspecified (e.g., fitting a line to a U-shaped relationship).
  • You trained on a tiny set and overfit hard enough that holdout performance collapsed.

Negative R² is not a numeric bug. It is the formula doing its job.
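
A minimal sketch of the misspecification case, on synthetic data invented for illustration: fit a straight line to a U-shaped relationship under covariate shift and the held-out R² goes negative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)  # U-shaped target

train = x[:, 0] < 0  # train on the left arm, test on the right arm (deliberate shift)
model = LinearRegression().fit(x[train], y[train])
print(r2_score(y[~train], model.predict(x[~train])))  # well below zero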

Why R² Alone Misleads: The Four Limitations You Have to Plan Around

  • Adds-only behavior in-sample. In-sample OLS R² with an intercept never decreases when you add a predictor, even pure noise. That is why adjusted R² exists and why holdout or cross-validated R² is the only honest comparison for prediction work.
  • Linearity assumption. R² is computed against the variance around the mean, which is the right baseline for linear models but a weak one for non-linear relationships. For tree ensembles and neural regressors, the R² is still defined but the residuals carry the story.
  • Scale-free. R² will not tell you whether your residual error is acceptable in dollars, kilograms, or milliseconds. Always pair it with RMSE or MAE.
  • No fairness signal. A model with R² of 0.85 can still be biased against a protected subgroup. R² is silent on stratified residuals. For that you need an explicit fairness audit.
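
On the first limitation: scikit-learn does not report adjusted R² directly, so here is a minimal sketch of the standard formula (assuming a model with an intercept, n samples, and p predictors), with illustrative numbers:

def adjusted_r2(r2, n, p):
    """Penalize R² for the number of predictors p given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.80, n=100, p=5))   # ≈ 0.789
print(adjusted_r2(0.80, n=100, p=40))  # ≈ 0.664, heavy penalty for 40 predictors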

For a deeper write-up on metric pairing, see the LLM evaluation frameworks and metrics best-practices guide and the RAG evaluation metrics overview.

How to Read R² in Practice: Five Real-World Application Patterns

Finance: R² Captures Linear Exposure, Not Risk

In factor models and CAPM-style regressions, R² is read as the share of return variance explained by the factor set. A high R² in a Fama-French style model means residual variance is small; alpha (the intercept) still needs to be tested separately for statistical and economic significance. Risk teams pair R² with Value-at-Risk, drawdown, and out-of-sample Sharpe rather than treating R² as a quality score on its own.

Marketing Attribution: R² Around 0.3 to 0.5 Is Normal and Often Useful

Marketing-mix models routinely live in the 0.3 to 0.6 band because consumer behavior is noisy and many drivers are unobserved. The right comparison is incremental lift on a holdout media-spend variation, not the in-sample R² of the fit.

Healthcare: R² Must Be Read Stratified by Cohort

A length-of-stay or hospital-cost regression with R² of 0.6 across all patients can still mis-fit under-represented subgroups. Stratify the residuals by age, ethnicity, and primary diagnosis before deploying. Note that binary outcomes like 30-day readmission are classification problems, not regression, and should be scored with AUC or calibration metrics instead.
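
A hedged sketch of that stratification with pandas, assuming a DataFrame df of held-out rows with hypothetical columns age_band, actual, and predicted:

import pandas as pd

df["residual"] = df["actual"] - df["predicted"]
# A cohort with a large mean residual is being systematically over- or under-predicted
print(df.groupby("age_band")["residual"].agg(["mean", "std", "count"]))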

Manufacturing and Process Control: R² Above 0.9 Is Expected

Sensor data and physical processes have low intrinsic noise. R² of 0.85 in a continuous process-control regression (yield, thickness, temperature, or defect-rate as a numeric target) usually means a missing feature, not an inherently hard problem. If your “defect” target is binary, treat it as a classification problem and use AUC instead of R².

Real Estate and Hedonic Pricing: Watch for Leakage From Past Sale Price

An R² of 0.95 in a house-price model is almost always a sign that the past-sale price, ZIP-level price index, or a near-duplicate listing feature is leaking the target. Drop those features and re-evaluate.

R² Versus Other Regression Metrics: When to Use Which

Metric | What it tells you | When it wins | Where it fails
--- | --- | --- | ---
R² | Share of variance explained | Quick comparison across linear models on the same target | Silent on error magnitude and fairness
Adjusted R² | Variance explained, penalized for feature count | Comparing models with different numbers of predictors | Still in-sample biased without holdout
RMSE | Average error in target units, weighted toward outliers | When large errors are costly | Hard to compare across targets with different scales
MAE | Mean absolute error in target units | When you want a robust error measure in the target's original units | Less sensitive to large but rare errors
MAPE | Percentage error | When stakeholders want a percent | Blows up near zero target values

For regression-style scoring of LLM outputs (e.g., predicting human-rated quality from text features), pair these classical metrics with the model-graded LLM evaluators in ai-evaluation, Future AGI’s Apache 2.0 evaluation library. R² stays the right tool for the numeric prediction, while faithfulness and hallucination metrics handle the generative side.

Best Practices for Using R² in 2026

  • Always compute R² on a holdout split or via cross-validation. Reporting in-sample R² is a 2010s habit.
  • Report R² with RMSE and MAE in the same line. They cover different failure modes.
  • Use adjusted R² when comparing models with different feature counts. Or use AIC/BIC if you care about likelihood.
  • Stratify residuals. A single R² hides fairness, calibration, and tail-risk failures.
  • Treat R² above 0.97 as a leakage hypothesis until proven otherwise. Permutation importance is your friend.
  • Do not use R² for classification, ranking, or generative LLM output. Use AUC, NDCG, or model-graded evaluators instead.

Why R² Still Matters and What to Pair It With for 2026 Modeling

R² is the cheapest and oldest summary of regression fit, and it is still the right first number to look at. The trap is treating it as the only number. A 2026 regression workflow keeps R² as the headline and reports it on a true holdout split (or via cross-validation), pairs it with RMSE and MAE in target units, uses adjusted R² as an in-sample model-comparison aid when comparing linear models with different feature counts, and stratifies residuals by every variable that matters to fairness or operations. Pair the classical regression metrics with task-appropriate evaluators when your target is a generative LLM output rather than a continuous score, and you have a complete picture.

For LLM and RAG applications where the output is text rather than a number, replace R² with the deterministic evaluation metrics and the model-graded evaluators in best LLM eval libraries.

Frequently asked questions

What does R² actually measure in a regression model?
R² measures the proportion of variance in the dependent variable that the independent variables in your regression model explain. It ranges from 0 to 1 in textbook OLS regression: 0 means the model explains no variance and reduces to predicting the mean, 1 means the model perfectly explains every observation, and a negative R² on a held-out set means your model performs worse than always predicting the mean. R² is a relative goodness-of-fit score, not an absolute error measurement, so it says nothing about error magnitude and must be read alongside RMSE or MAE.
What is a good R² value for a regression model in 2026?
There is no universal threshold. In physics, engineering, and clean industrial sensor data, an R² below 0.9 often signals a problem. In behavioral, social, and economic settings, an R² of 0.2 to 0.4 can be considered strong because human behavior is intrinsically noisy. For machine-learning regression on tabular data, 0.6 to 0.8 is a common production range, with higher values often pointing to data leakage or overfitting rather than genuine skill.
Why can R² be negative on a test set even though the formula looks bounded?
The 0 to 1 bound holds only when R² is computed in-sample for an OLS model with an intercept. On a held-out test set, on cross-validation folds, or on any model that does not minimize squared error around its own training mean, R² can drop below zero. A negative held-out R² means your model is doing worse than a constant prediction of the evaluation-set mean (the baseline that sklearn.metrics.r2_score uses) and is a strong signal to revisit features, leakage, or model choice.
What is the difference between R² and adjusted R²?
R² rises mechanically whenever you add a predictor, even if that predictor is pure noise. Adjusted R² penalizes the number of predictors relative to sample size, so it goes up only when a new variable adds explanatory power beyond chance. Use adjusted R² when comparing models with different numbers of features. For prediction-focused work, prefer holdout R², cross-validated R², or RMSE on a test set instead.
Should I use R² to evaluate large language models or LLM applications?
No. R² is built for continuous numeric regression. For LLM applications you evaluate generative quality with task-specific metrics like faithfulness, groundedness, answer relevance, completeness, toxicity, and hallucination rate. Platforms like Future AGI ship these as model-graded evaluators that act as the LLM-eval analogue of using R² for a regression model. R² still applies if you are predicting a numeric score from text features, but not for free-form generation quality.
How do you compute R² in Python with scikit-learn?
Use sklearn.metrics.r2_score(y_true, y_pred) on a held-out set, or call .score(X_test, y_test) on a fitted LinearRegression model. For tree ensembles and gradient boosting, .score on a regressor also returns R². Always compute R² on a holdout split rather than the training set, and pair it with RMSE or MAE so you also know the error magnitude in the units of the target variable.
When should I avoid R² entirely?
Avoid R² for classification problems entirely. Use caution when residual structure, autocorrelation, or a heavy-tailed target distribution makes variance explained a poor proxy for downstream utility (e.g., time series with strong autocorrelation, or any setting where ranking, calibration, or cost dominates). Replace it with AUC, log-loss, or task-specific cost metrics in those cases. For LLM outputs, replace it with the faithfulness, hallucination, and task-completion evaluators in Future AGI ai-evaluation.