R-Squared (R²) Explained: Formula, Interpretation, and Limits
What R-squared means, how to compute it, when adjusted R² helps, when to switch to RMSE/MAE, and why LLM evaluation needs different metrics.
TL;DR
| Question | Short answer |
|---|---|
| What is R²? | Share of variance the regression model explains, on a 0 to 1 scale (training). |
| Formula | R² = 1 minus (residual sum of squares / total sum of squares). |
| Range | 0 to 1 on training; can be negative on test data. |
| When to use | Linear regression model comparison on a single dataset. |
| When to switch | Need error in real units (use RMSE / MAE); non-linear or non-OLS model (use task-specific metrics). |
| LLM evaluation | R² does not apply. Use faithfulness, context adherence, groundedness instead. |
R² is the right tool for regression on numeric targets. It is the wrong tool for classification, ranking, or LLM evaluation.
Why this metric still matters in 2026
Even in an era where most ML headlines are about transformers and agents, regression problems with numeric targets still dominate large parts of industry: pricing models, demand forecasting, risk scoring, A/B test analysis, and any system that outputs a number rather than a class or a paragraph. R-squared is the metric that statisticians and ML engineers reach for first to summarize how well a regression model fits.
The danger is that R² is also one of the most misused metrics in ML. A high R² on training data is not the same as a useful model in production. A negative R² on test data is possible and informative. A “low” R² in one domain can be a strong result in another. This article covers the formula, the correct interpretation, and the situations where you should reach for a different metric entirely.
What R-squared measures
R-squared, also called the coefficient of determination, answers the question: “How much of the variance in my target variable does my model explain?”
The formula is:
R² = 1 - SS_res / SS_tot
where:
- SS_res = residual sum of squares = sum of (y_actual - y_predicted)² across all observations.
- SS_tot = total sum of squares = sum of (y_actual - y_mean)² across all observations.
SS_tot is the total variation in the target. SS_res is the variation the model fails to capture. The ratio is the unexplained share, and R² is one minus the unexplained share, that is, the explained share.
Concrete example
Suppose you predict house prices and your test set has five houses:
| Actual price | Predicted price |
|---|---|
| 200,000 | 205,000 |
| 250,000 | 245,000 |
| 300,000 | 310,000 |
| 350,000 | 340,000 |
| 400,000 | 405,000 |
The mean of the actuals is 300,000. The residuals (actual minus predicted) are -5,000, 5,000, -10,000, 10,000, -5,000. The residual sum of squares is 5,000² + 5,000² + 10,000² + 10,000² + 5,000² = 275,000,000. The total sum of squares is 100,000² + 50,000² + 0 + 50,000² + 100,000² = 25,000,000,000. R² = 1 - 275,000,000 / 25,000,000,000 = 0.989. The model explains about 99 percent of the variance in this tiny example.
Python implementation
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([200000, 250000, 300000, 350000, 400000])
y_pred = np.array([205000, 245000, 310000, 340000, 405000])
# scikit-learn
print(r2_score(y_true, y_pred))
# Manual, identical result
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)
Both methods return the same number. Use sklearn.metrics.r2_score for production code and the manual version when you want to be sure you understand what the library is computing.
Interpreting the value
For a least-squares model with an intercept, evaluated on its own training data, the scale runs from 0 to 1, where:
- R² = 1: model explains all variance perfectly. Suspicious in real data; usually indicates leakage or a trivial target.
- R² close to 1: strong fit.
- R² near 0: model explains roughly nothing beyond the mean.
- R² negative on test data: model is worse than predicting the test-set mean baseline (the denominator used by sklearn.metrics.r2_score). The fit is broken, the distribution has shifted, or the assumed form is wrong; a minimal sketch showing a negative score follows this list.
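Here is that sketch, using deliberately bad predictions on made-up numbers:
import numpy as np
from sklearn.metrics import r2_score
# Hypothetical test targets and deliberately bad predictions (illustration only).
y_test = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_bad = np.array([25.0, 3.0, 30.0, 1.0, 28.0])
print(r2_score(y_test, y_bad))  # negative: worse than always predicting the mean
print(r2_score(y_test, np.full_like(y_test, y_test.mean())))  # exactly 0.0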
R² is unitless. An R² of 0.6 on a house price model and an R² of 0.6 on a stock return model are not directly comparable. Each domain has its own ceiling. Consumer-behavior data is noisy, and an R² of 0.2 to 0.3 may be considered strong. Physics simulations may demand R² above 0.99 to be useful.
Adjusted R-squared
Adding more predictors to a linear regression mechanically increases R², even when the new predictors are random noise. Adjusted R² penalizes the model for additional predictors:
adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
where n is the number of observations and k is the number of predictors. Adjusted R² falls when a new predictor does not improve the fit enough to offset the penalty. Use adjusted R² for multiple regression model selection when you are comparing models with different numbers of predictors.
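Applying the formula directly, with hypothetical numbers chosen only for illustration:
# Assume a fit with R² = 0.72 on n = 50 observations using k = 3 predictors (made-up values).
n, k, r2 = 50, 3, 0.72
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)  # ~0.702, slightly below the plain R²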
In Python with statsmodels:
import statsmodels.api as sm
# X_train and y_train are assumed to be your existing training features and target.
X = sm.add_constant(X_train)  # statsmodels does not add an intercept automatically
model = sm.OLS(y_train, X).fit()
print(model.rsquared)       # plain R²
print(model.rsquared_adj)   # penalized for the number of predictors
When R-squared is the wrong metric
R² is appropriate for least-squares regression on numeric targets, evaluated on a single dataset, with comparable models. It becomes the wrong tool in several common situations.
When you need error in real units
R² is unitless. A business stakeholder usually wants to know “how wrong is the model on average, in dollars?” That is RMSE (root mean squared error) or MAE (mean absolute error). MAE is more robust to outliers; RMSE penalizes large errors more heavily. For model ranking on a single dataset, R² and RMSE produce the same order, so the choice is about communication.
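As a quick illustration, here are both error metrics on the same five-house example from earlier:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = np.array([200000, 250000, 300000, 350000, 400000])
y_pred = np.array([205000, 245000, 310000, 340000, 405000])
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # about 7,416 dollars
print("MAE:", mean_absolute_error(y_true, y_pred))           # 7,000 dollars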
When the target is a class or a label
R² is meaningless for classification. Use accuracy, precision, recall, F1, log-loss, or AUC. Pseudo-R² metrics exist for logistic regression but they do not have the same interpretation as ordinary least squares R².
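For instance, statsmodels reports McFadden's pseudo-R² for a logistic fit; a minimal sketch on synthetic data (values made up purely for illustration):
import numpy as np
import statsmodels.api as sm
# Synthetic binary target driven by one predictor (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(size=200) > 0).astype(int)
logit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(logit.prsquared)  # McFadden's pseudo-R², not interchangeable with OLS R²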
When the model is non-linear or non-OLS
R² assumes the model partitions variance through least squares. For non-linear regressions, robust regressions, or quantile regressions, R² can mislead. Prefer RMSE, MAE, or task-specific metrics.
When the target is text
Large language model outputs are free-form text. There is no scalar target to compute variance for. R² simply does not apply. LLM evaluation requires text-native metrics:
- Faithfulness: does the output stick to the source material?
- Context Adherence: does the answer reference only the provided context?
- Groundedness: are all factual claims supported by retrieved evidence?
- Completeness: does the answer cover the required points?
- Semantic similarity: how close is the output to a reference, by embedding distance?
- Toxicity, PII, safety: domain-specific guardrails.
Future AGI ships these as runnable evaluators through the fi.evals library, which is the LLM-eval equivalent of sklearn.metrics:
from fi.evals import evaluate
result = evaluate(
"context_adherence",
output="Phi-4 is a small language model released by Microsoft.",
context="Phi-4 was released in December 2024 with 14B parameters.",
)
print(result.score)
print(result.reason)
The conceptual point is that “model accuracy” means different things for regression and language. R² answers it for regression; faithfulness and context adherence answer it for LLMs.
Common pitfalls
- Comparing R² across datasets. Different datasets have different irreducible noise; R² ceilings differ. Compare on the same data only.
- Reporting only training R². Always hold out a test set. Training R² is an upper bound on generalization, not a measure of it.
- Adding more predictors to chase R². Use adjusted R², regularization (Lasso, Ridge, Elastic Net), or cross-validation to keep models honest.
- Trusting a perfect R². R² of 1 on real data is almost always leakage. Check for the target appearing in the features, or a near-duplicate feature.
- Using R² for classification. It is a regression metric. For classification, use the classification toolbox.
- Forgetting that R² can be negative on test data. A negative test R² is information, not an error.
- Treating low R² as failure. In high-noise domains like behavior, sentiment, or financial returns, an R² of 0.05 to 0.20 can correspond to a genuinely useful model.
Worked example with cross-validation
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score, mean_squared_error
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42,
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Train R²:", model.score(X_train, y_train))
print("Test R²:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
# 5-fold cross-validated R² gives a more honest estimate of generalization.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("CV R² mean:", cv_scores.mean(), "± std:", cv_scores.std())
The cross-validated R² is what you would report as the model’s expected performance on new data. If it is much lower than the training R², the model is overfitting.
Quick reference
| You have | Use this metric |
|---|---|
| Linear regression on numeric target | R² and RMSE |
| Multiple regression | Adjusted R² for model selection |
| Classification | Accuracy, F1, log-loss, AUC |
| Imbalanced classification | Precision, recall, AUPRC |
| LLM generation | Faithfulness, context adherence, groundedness, completeness |
| Ranking | NDCG, MRR, MAP |
| Time-series forecast | MAPE, sMAPE, RMSE, walk-forward CV |
| Anomaly detection | Precision at K, ROC-AUC |
R² is a useful, simple, and well-understood metric for regression on numeric targets. Use it for what it is good for, pair it with RMSE for communication, validate on held-out data, and reach for a different toolbox when the target is text or a class. For LLM evaluation specifically, R² is not in scope; the LLM-native equivalents live in evaluation suites like Future AGI’s fi.evals library, which run the same metric the same way across every model, every prompt, and every production trace.
Related reading
- LLM evaluation frameworks, metrics, and best practices covers the LLM equivalents of regression metrics.
- RAG evaluation metrics walks through context adherence, groundedness, and faithfulness.
- Deterministic LLM evaluation metrics details numeric eval metrics that work like classical scorers.
- Top LLM evaluation tools compares evaluation platforms.
Frequently asked questions
What does R-squared (R²) actually measure?
Why can R-squared be negative?
What is the difference between R-squared and adjusted R-squared?
When should I use RMSE or MAE instead of R-squared?
Does R-squared work for non-linear regression?
Is a high R-squared always good?
Why is R-squared not the right metric for evaluating LLM outputs?
How do I compute R-squared in Python and scikit-learn?