R-Squared (R²) Explained: Formula, Interpretation, and Limits
What R-squared means, how to compute it, when adjusted R² helps, when to switch to RMSE/MAE, and why LLM evaluation needs different metrics.
TL;DR
| Question | Short answer |
|---|---|
| What is R²? | Share of variance the regression model explains, on a 0 to 1 scale (training). |
| Formula | R² = 1 minus (residual sum of squares / total sum of squares). |
| Range | 0 to 1 on training; can be negative on test data. |
| When to use | Linear regression model comparison on a single dataset. |
| When to switch | Need error in real units (use RMSE / MAE); non-linear or non-OLS model (use task-specific metrics). |
| LLM evaluation | R² does not apply. Use faithfulness, context adherence, groundedness instead. |
R² is the right tool for regression on numeric targets. It is the wrong tool for classification, ranking, or LLM evaluation.
Why this metric still matters in 2026
Even in an era where most ML headlines are about transformers and agents, regression problems with numeric targets still dominate large parts of industry: pricing models, demand forecasting, risk scoring, A/B test analysis, and any system that outputs a number rather than a class or a paragraph. R-squared is the metric that statisticians and ML engineers reach for first to summarize how well a regression model fits.
The danger is that R² is also one of the most misused metrics in ML. A high R² on training data is not the same as a useful model in production. A negative R² on test data is possible and informative. A “low” R² in one domain can be a strong result in another. This article covers the formula, the correct interpretation, and the situations where you should reach for a different metric entirely.
What R-squared measures
R-squared, also called the coefficient of determination, answers the question: “How much of the variance in my target variable does my model explain?”
The formula is:
R² = 1 - SS_res / SS_tot
where:
- SS_res = residual sum of squares = sum of (y_actual - y_predicted)² across all observations.
- SS_tot = total sum of squares = sum of (y_actual - y_mean)² across all observations.
SS_tot is the total variation in the target. SS_res is the variation the model fails to capture. The ratio is the unexplained share, and R² is one minus the unexplained share, that is, the explained share.
Concrete example
Suppose you predict house prices and your test set has five houses:
| Actual price | Predicted price |
|---|---|
| 200,000 | 205,000 |
| 250,000 | 245,000 |
| 300,000 | 310,000 |
| 350,000 | 340,000 |
| 400,000 | 405,000 |
The mean of the actuals is 300,000. The residuals (actual minus predicted) are -5,000, 5,000, -10,000, 10,000, -5,000. The residual sum of squares is 5,000² + 5,000² + 10,000² + 10,000² + 5,000² = 275,000,000. The total sum of squares is 100,000² + 50,000² + 0 + 50,000² + 100,000² = 25,000,000,000. R² = 1 - 275,000,000 / 25,000,000,000 = 0.989. The model explains about 99 percent of the variance in this tiny example.
Python implementation
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([200000, 250000, 300000, 350000, 400000])
y_pred = np.array([205000, 245000, 310000, 340000, 405000])
# scikit-learn
print(r2_score(y_true, y_pred))
# Manual, identical result
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)
Both methods return the same number. Use sklearn.metrics.r2_score for production code and the manual version when you want to be sure you understand what the library is computing.
Interpreting the value
For a least-squares model with an intercept, evaluated on its own training data, the scale runs from 0 to 1, where:
- R² = 1: model explains all variance perfectly. Suspicious in real data; usually indicates leakage or a trivial target.
- R² close to 1: strong fit.
- R² near 0: model explains roughly nothing beyond the mean.
- R² negative on test data: model is worse than predicting the test-set mean baseline (the denominator used by sklearn.metrics.r2_score). The fit is broken, the distribution has shifted, or the assumed form is wrong; a minimal sketch showing a negative score follows this list.
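Here is that sketch, using deliberately bad predictions on made-up numbers:
import numpy as np
from sklearn.metrics import r2_score
# Hypothetical test targets and deliberately bad predictions (illustration only).
y_test = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_bad = np.array([25.0, 3.0, 30.0, 1.0, 28.0])
print(r2_score(y_test, y_bad))  # negative: worse than always predicting the mean
print(r2_score(y_test, np.full_like(y_test, y_test.mean())))  # exactly 0.0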
R² is unitless. An R² of 0.6 on a house price model and an R² of 0.6 on a stock return model are not directly comparable. Each domain has its own ceiling. Consumer-behavior data is noisy, and an R² of 0.2 to 0.3 may be considered strong. Physics simulations may demand R² above 0.99 to be useful.
Adjusted R-squared
Adding more predictors to a linear regression mechanically increases R², even when the new predictors are random noise. Adjusted R² penalizes the model for additional predictors:
adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
where n is the number of observations and k is the number of predictors. Adjusted R² falls when a new predictor does not improve the fit enough to offset the penalty. Use adjusted R² for multiple regression model selection when you are comparing models with different numbers of predictors.
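Applying the formula directly, with hypothetical numbers chosen only for illustration:
# Assume a fit with R² = 0.72 on n = 50 observations using k = 3 predictors (made-up values).
n, k, r2 = 50, 3, 0.72
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)  # ~0.702, slightly below the plain R²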
In Python with statsmodels:
import statsmodels.api as sm
# X_train and y_train are assumed to be your existing training features and target.
X = sm.add_constant(X_train)  # statsmodels does not add an intercept automatically
model = sm.OLS(y_train, X).fit()
print(model.rsquared)       # plain R²
print(model.rsquared_adj)   # penalized for the number of predictors
When R-squared is the wrong metric
R² is appropriate for least-squares regression on numeric targets, evaluated on a single dataset, with comparable models. It becomes the wrong tool in several common situations.
When you need error in real units
R² is unitless. A business stakeholder usually wants to know “how wrong is the model on average, in dollars?” That is RMSE (root mean squared error) or MAE (mean absolute error). MAE is more robust to outliers; RMSE penalizes large errors more heavily. For model ranking on a single dataset, R² and RMSE produce the same order, so the choice is about communication.
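As a quick illustration, here are both error metrics on the same five-house example from earlier:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = np.array([200000, 250000, 300000, 350000, 400000])
y_pred = np.array([205000, 245000, 310000, 340000, 405000])
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # about 7,416 dollars
print("MAE:", mean_absolute_error(y_true, y_pred))           # 7,000 dollars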
When the target is a class or a label
R² is meaningless for classification. Use accuracy, precision, recall, F1, log-loss, or AUC. Pseudo-R² metrics exist for logistic regression but they do not have the same interpretation as ordinary least squares R².
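For instance, statsmodels reports McFadden's pseudo-R² for a logistic fit; a minimal sketch on synthetic data (values made up purely for illustration):
import numpy as np
import statsmodels.api as sm
# Synthetic binary target driven by one predictor (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(size=200) > 0).astype(int)
logit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(logit.prsquared)  # McFadden's pseudo-R², not interchangeable with OLS R²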
When the model is non-linear or non-OLS
R² assumes the model partitions variance through least squares. For non-linear regressions, robust regressions, or quantile regressions, R² can mislead. Prefer RMSE, MAE, or task-specific metrics.
When the target is text
Large language model outputs are free-form text. There is no scalar target to compute variance for. R² simply does not apply. LLM evaluation requires text-native metrics:
- Faithfulness: does the output stick to the source material?
- Context Adherence: does the answer reference only the provided context?
- Groundedness: are all factual claims supported by retrieved evidence?
- Completeness: does the answer cover the required points?
- Semantic similarity: how close is the output to a reference, by embedding distance?
- Toxicity, PII, safety: domain-specific guardrails.
Future AGI ships these as runnable evaluators through the fi.evals library, which is the LLM-eval equivalent of sklearn.metrics:
from fi.evals import evaluate
result = evaluate(
"context_adherence",
output="Phi-4 is a small language model released by Microsoft.",
context="Phi-4 was released in December 2024 with 14B parameters.",
)
print(result.score)
print(result.reason)
The conceptual point is that “model accuracy” means different things for regression and language. R² answers it for regression; faithfulness and context adherence answer it for LLMs.
Common pitfalls
- Comparing R² across datasets. Different datasets have different irreducible noise; R² ceilings differ. Compare on the same data only.
- Reporting only training R². Always hold out a test set. Training R² is an upper bound on generalization, not a measure of it.
- Adding more predictors to chase R². Use adjusted R², regularization (Lasso, Ridge, Elastic Net), or cross-validation to keep models honest.
- Trusting a perfect R². R² of 1 on real data is almost always leakage. Check for the target appearing in the features, or a near-duplicate feature.
- Using R² for classification. It is a regression metric. For classification, use the classification toolbox.
- Forgetting that R² can be negative on test data. A negative test R² is information, not an error.
- Treating low R² as failure. In high-noise domains like behavior, sentiment, or financial returns, an R² of 0.05 to 0.20 can correspond to a genuinely useful model.
Worked example with cross-validation
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score, mean_squared_error
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42,
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Train R²:", model.score(X_train, y_train))
print("Test R²:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
# 5-fold cross-validated R² gives a more honest estimate of generalization.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("CV R² mean:", cv_scores.mean(), "± std:", cv_scores.std())
The cross-validated R² is what you would report as the model’s expected performance on new data. If it is much lower than the training R², the model is overfitting.
Quick reference
| You have | Use this metric |
|---|---|
| Linear regression on numeric target | R² and RMSE |
| Multiple regression | Adjusted R² for model selection |
| Classification | Accuracy, F1, log-loss, AUC |
| Imbalanced classification | Precision, recall, AUPRC |
| LLM generation | Faithfulness, context adherence, groundedness, completeness |
| Ranking | NDCG, MRR, MAP |
| Time-series forecast | MAPE, sMAPE, RMSE, walk-forward CV |
| Anomaly detection | Precision at K, ROC-AUC |
R² is a useful, simple, and well-understood metric for regression on numeric targets. Use it for what it is good for, pair it with RMSE for communication, validate on held-out data, and reach for a different toolbox when the target is text or a class. For LLM evaluation specifically, R² is not in scope; the LLM-native equivalents live in evaluation suites like Future AGI’s fi.evals library, which run the same metric the same way across every model, every prompt, and every production trace.
Related reading
- LLM evaluation frameworks, metrics, and best practices covers the LLM equivalents of regression metrics.
- RAG evaluation metrics walks through context adherence, groundedness, and faithfulness.
- Deterministic LLM evaluation metrics details numeric eval metrics that work like classical scorers.
- Top LLM evaluation tools compares evaluation platforms.
Frequently asked questions
What does R-squared (R²) actually measure?
Why can R-squared be negative?
What is the difference between R-squared and adjusted R-squared?
When should I use RMSE or MAE instead of R-squared?
Does R-squared work for non-linear regression?
Is a high R-squared always good?
Why is R-squared not the right metric for evaluating LLM outputs?
How do I compute R-squared in Python and scikit-learn?