Why R-squared is a Flawed Regression Metric?

The lesser-known limitations of the R-squared metric.

Introduction

Once you have trained any regression model, the next step is determining how well the model is doing.

A handful of performance metrics are commonly used to evaluate a linear regression model, such as:

  • Mean Squared Error – Well, technically, this is a loss function. But MSE is often used to evaluate a model as well.
  • Visual inspection by plotting the regression line. Yet, this is impossible when you fit a high-dimensional dataset.
  • R-Squared – also called the coefficient of determination.
  • Residual analysis (for linear regression) – involves examining the distribution of residuals to check for normality, as demonstrated in an earlier Daily Dose of Data Science post, “Visualize The Performance Of Any Linear Regression Model With This Simple Plot”.
  • F-statistic – Used to test whether the model explains significantly more variability than a baseline that always predicts the mean.

We use them because evaluating the performance is crucial to ensure the model’s reliability and usefulness in a downstream application.

Also, performance quantification helps us make many informed decisions, such as:

  • Do we need more feature engineering?
  • Are we overfitting/underfitting the data, etc.?

The focus of this article is R-squared ($R^{2}$).

The $R^{2}$ metric is a popular choice among data scientists and machine learning engineers.

However, unknown to many, $R^{2}$ is possibly one of the most flawed evaluation metrics you could ever use to evaluate regression models.

Thus, in this blog, we’ll understand what the $R^{2}$ metric is, what it explains, and how to interpret it.

Further, we’ll examine why using the $R^{2}$ metric is not wise for evaluating regression models.

Let’s begin!

What is R-squared (R²)?

Before exploring the limitations of $R^{2}$, it is essential to understand its purpose and significance in linear regression analysis.

Essentially, when working with, say, a linear regression model, you intend to define a linear relationship between the independent variables (predictors) and the dependent variable (outcome).

$$ \Large \hat y = w_{1}*x_{1} + w_{2}*x_{2} + \dots + w_{n}*x_{n} $$

Thus, the primary objective is to find the best-fitting line (or a hyperplane in a multidimensional space) that best approximates the outcome.

R-squared serves as a key performance measure for this purpose.

Formally, $R^{2}$ answers the following question:

  • What fraction of variability in the actual outcome ($y$) is being captured by the predicted outcomes ($\hat y$)?

In other words, R-squared provides insight into how much of the variability in the outcome variable can be attributed to the predictors used in the regression model.

Let’s break that down a bit.

Imagine you have some dummy 2D data $(X, y)$. Now, in most cases, there will be some noise in the true outcome variable ($y$).

Left: The outcome variable has zero variance. Right: The outcome variable has a non-zero variance.

This means that the outcome has some variability, i.e., the variance of the outcome variable ($y$) is non-zero ($\sigma_{y}^{2} \ne 0$).

$R^{2}$ helps you precisely quantify the fraction of total variability in the outcome that the model is capturing.

Here lies an inherent assumption that a model capturing more variability is expected to be better than one capturing less variability.

Left: Model capturing low data variability. Right: Model capturing high data variability.

Thus, we mathematically define $R^{2}$ as follows:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

  • For a least-squares linear model with an intercept, evaluated on the data it was fit on, $\text{Variability captured by the model}$ is bounded between $[0, \text{Total variability in the data}]$.
  • Thus, in that setting, $R^{2}$ is bounded between $[0,1]$.

A perfect $R^{2}$ of $1$ would mean that the model is capturing the entire variability.

Interpretation of R2 on either extreme

And an $R^{2}$ of $0$ would mean that the model is capturing NO variability. Such a model is obtained when you simply predict the mean value.
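As a small illustration of the zero end of the scale, here is a minimal sketch on synthetic data. It assumes that "variability captured" is measured as the spread of the predictions around the mean of $y$; the data, seed, and variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100)  # synthetic outcome (assumption)

# Baseline "model": always predict the mean of y
y_pred = np.full_like(y, y.mean())

# Spread of the predictions around the mean (assumed measure of captured variability)
captured = np.sum((y_pred - y.mean()) ** 2)
# Total variability already present in y
total = np.sum((y - y.mean()) ** 2)

r2_mean_baseline = captured / total
print(r2_mean_baseline)
```

Since every prediction equals the mean, the numerator is zero and so is the ratio.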

To recap, we define $R^{2}$ as follows:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

Let’s consider the numerator and denominator separately now.

Total variability

As the name suggests, total variability indicates the variation that already exists in the data.

Thus, it can be measured by the spread of the true outcome variable $(y)$ around its mean, which is its variance up to a factor of $n$, as shown below:

$$ \Large \text{Total variability} = \sum_{i=1}^{n} (y_{i} - \mu_{y})^2$$

This is commonly referred to as the Total Sum of Squares (TSS).
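For a concrete feel of TSS, here is a tiny sketch with made-up numbers. It also shows how TSS relates to the variance: NumPy's `np.var` divides by $n$ by default, so TSS is that variance multiplied by the number of samples.

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0, 10.0])  # toy outcome values (assumption)

# Total Sum of Squares: squared deviations of y from its mean
tss = np.sum((y - y.mean()) ** 2)

# np.var uses ddof=0 (population variance) by default,
# so TSS equals n times that variance.
n = len(y)
print(tss, n * np.var(y))
```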

Variability captured by the model

Next, having trained a model, the goal is to determine precisely how much variation it is able to capture.

We can determine the variability captured by the model using the variability that is NOT captured by the model.

Essentially, variability captured and not captured should add to the total variability.

$$ {\text{Variability captured}} + {\text{Variability NOT captured}} = {\text{Total Variability}}$$

By rearranging, we get the following:

$$ {\text{Variability captured}} = {\text{Total Variability}} - {\text{Variability NOT captured}}$$

We have already determined the total variability above (also called TSS).

Next, we can define the variability not captured by the model as the sum of squared differences between each prediction $(\hat y_{i})$ and the corresponding true label $(y_{i})$:

$$ {\text{Variability NOT captured by the model}} = \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2} $$

Why?

The term “Variability NOT captured by the model” represents the sum of squared differences between the predicted values $(\hat y_{i})$ and the actual values $(y_{i})$ of the dependent variable for all data points $(i=1, 2, ..., n)$.

It quantifies the amount of variation in the outcome variable that remains unexplained by the linear regression model.

For instance, in a perfect fit scenario, where the R-squared value is $1$, the predicted values $(\hat y_{i})$ must perfectly match the actual values $(y_{i})$ for all data points.

The model captures the entire data variability

This results in a variability NOT captured by the model of $0$. In such a case, the model accounts for all the variation in the dependent variable, and there is no unexplained variability.

However, in cases where the R-squared value is less than $1$ (indicating a less-than-perfect fit), the predicted values $(\hat y_{i})$ deviate from the actual values $(y_{i})$ for at least some data points.

The model has non-zero uncaptured variability. 

Consequently, the variability NOT captured by the model increases, representing the amount of variance that remains unexplained by the model.

Thus, to recap, we can safely define the unexplained variance as follows:

$$ {\text{Variability NOT captured by the model}} = \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2} $$

The above is commonly referred to as the Residual Sum of Squares (RSS), which is also used in the Mean Squared Error (MSE) loss function in regression (considering the mean $\frac{1}{n}$ factor).

Also, we should note that the lower the RSS, the more variability the model captures.

As RSS decreases, the variability captured increases
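The relationships above can be checked numerically. The sketch below fits a least-squares line (with an intercept) to synthetic data and verifies that captured and uncaptured variability add up to TSS, and that MSE is RSS divided by $n$. The data, seed, and the use of `np.polyfit` are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=200)  # synthetic data (assumption)

# Least-squares line with an intercept; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

tss = np.sum((y - y.mean()) ** 2)           # total variability
rss = np.sum((y_hat - y) ** 2)              # variability NOT captured
captured = np.sum((y_hat - y.mean()) ** 2)  # variability captured
mse = np.mean((y_hat - y) ** 2)

print(captured + rss, tss)   # these two should agree
print(mse, rss / len(y))     # and so should these
```

Note that the exact additivity relies on the model being a least-squares fit with an intercept; for an arbitrary set of predictions the two parts need not sum to TSS.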

R² derivation

Substituting everything back into the formula:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

or

$$ R^{2} = \frac{\text{Total variability - Unexplained variability}}{\text{Total variability}} $$

$$ R^{2} = \frac{\text{TSS - RSS }}{\text{TSS}} $$

$$ \LARGE R^{2} = 1 - \frac{\text{RSS}}{\text{TSS}} $$

Finally, we get:

$$ \LARGE R^{2} = 1 - \frac{\sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \mu_{y})^2} $$

Here:

  • $\hat y_{i}$ is the model’s prediction.
  • $y_{i}$ is the true outcome variable.
  • $\mu_{y}$ is the mean of the outcome variable.

And to reiterate:

  • $\text{RSS}$ captures the unexplained variability in the data.
  • $\text{TSS}$ captures the total variability present in the data.
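The final formula translates almost directly into code. Here is a minimal sketch; the function name `r_squared` and the toy numbers are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_pred - y_true) ** 2)         # unexplained variability
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total variability
    return 1.0 - rss / tss

y_true = [3.0, 5.0, 4.0, 8.0, 10.0]  # toy values (assumption)
y_pred = [2.0, 5.0, 5.0, 8.0, 10.0]

print(r_squared(y_true, y_pred))  # RSS = 2 and TSS = 34 here
print(r_squared(y_true, y_true))  # perfect predictions
```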

An alternative formulation of R²

The typical way of interpreting the $R^{2}$ is to see it as the fraction of the variation of the dependent variable $y$ that the model explains.

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

However, another way of interpreting $R^{2}$ is to look at it as the Squared Pearson Correlation Coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$.

Pearson Correlation Coefficient $(r)$ between two variables $(x, y)$ is defined as:

$$ r = \frac{\text{Cov(}x, y\text{)}}{\sqrt{\text{Var(}x\text{)} \cdot \text{Var(}y\text{)}}} $$

Positive, zero and negative Pearson correlation

It quantifies the strength and direction of the linear relationship between two variables, $x$ and $y$, and takes values between $[-1, 1]$.

Thus, the Pearson Correlation Coefficient $(r)$ between observed values $(y)$ and the predicted values $(\hat y)$ will be:

$$ r = \frac{\text{Cov(}y, \hat y\text{)}}{\sqrt{\text{Var(}y\text{)} \cdot \text{Var(}\hat y\text{)}}} $$

Here’s a formal proof that $R^{2}$ is the Squared Pearson Correlation Coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$:

Proof of R2 Squared Pearson Correlation Coefficient. Source: Economic Theory Blog.
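The equivalence can also be checked numerically. The sketch below fits an ordinary least-squares model with an intercept on synthetic data and compares $1 - \text{RSS}/\text{TSS}$ against the squared Pearson correlation between $y$ and $\hat y$. The data, coefficients, and seed are assumptions for this example, and the equality is only expected for a least-squares fit with an intercept evaluated on its training data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Synthetic outcome: linear signal plus noise (assumption)
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=1.0, size=300)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(y, y_hat)[0, 1]  # Pearson correlation between y and y_hat

print(r2, r ** 2)  # the two values should agree
```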

Recap

Before proceeding to the next section, here’s a quick recap of what we have discussed so far:

  • $R^{2}$ is the fraction of the variation of the dependent variable $y$ that is explained by the model.

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

  • Total variability is the variance of the true outcome variable $(y)$.

$$ \large \text{Total variability} = \sum_{i=1}^{n} (y_{i} - \mu_{y})^2$$

  • Total variability is also called the Total Sum of Squares (TSS).
  • Variability captured by the model can be determined using unexplained variability.

$$ {\text{Variability captured}} = {\text{Total Variability}} - {\text{Variability NOT captured}}$$

$$ {\text{Variability NOT captured}} = \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2} $$

  • The above is also called the Residual Sum of Squares (RSS).
  • Another way of interpreting $R^{2}$ is to look at it as the Squared Pearson Correlation Coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$:

$$ R^{2} = r^{2}_{y, \hat y}$$

Limitations of R-squared (R²)

Now that we understand what $R^{2}$ is, how to derive it, and its relationship with the squared Pearson correlation, we must look at its limitations as an evaluation metric for regression models.

While $R^{2}$ is widely considered a reliable measure for assessing the goodness-of-fit of a regression model, the reality is that it can be immensely misleading.

Let’s discuss its limitations in detail!

Optimizing R² promotes overfitting

📓
This limitation is valid for all regression models and not just linear regression.

One of the most prominent limitations of $R^{2}$ is its tendency to promote overfitting.

Overfitting occurs when the model is too complex, and it captures random fluctuations in the training data that do not generalize to unseen data.

Thus, in many cases, a high $R^{2}$ value could be a potential sign of overfitting, where the model has modeled the random noise in the training data rather than capturing the true underlying patterns.

Let’s understand how $R^{2}$ promotes overfitting!

As discussed above, $R^{2}$ measured on the training data ranges between $0$ and $1$, and a value of $1$ suggests a perfect fit of the model to the training data.

Now, if we recall the definition of $R^{2}$, we learned it as the fraction of the variation of the dependent variable $y$ that the model captures:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

Therefore, a perfect $R^{2}$ of $1$ implies that the model has captured the entire variability in the data, including noise (if any).

More variability captured indicates more R2

This is because part of the variability in the data is directly attributable to the presence of noise.

Therefore, if one is solely optimizing for $R^{2}$, they would unknowingly make their model more and more complex, to the point where it entirely overfits the data.

This may lead to deceptive conclusions about the model’s performance.
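To make this concrete, here is a minimal sketch using polynomial regression on synthetic data whose true relationship is a straight line. The data-generating process, the degrees tried, and the use of held-out MSE as a sanity check are all choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a straight line plus Gaussian noise
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

def r_squared(y_true, y_pred):
    rss = np.sum((y_pred - y_true) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - rss / tss

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

train_r2, test_mse = [], []
for degree in (1, 3, 9):
    # Higher degree = more complex model fit to the same 15 points
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_r2.append(r_squared(y_train, np.polyval(coeffs, x_train)))
    test_mse.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree={degree}  train R2={train_r2[-1]:.3f}  held-out MSE={test_mse[-1]:.3f}")
```

Because each polynomial contains the lower-degree ones as special cases, the training $R^{2}$ cannot go down as the degree grows. The held-out error is printed alongside so you can see whether the extra complexity actually helps on new data.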

Takeaways and Action Steps

  • Don’t rely solely on $R^{2}$ without considering other regression metrics and validation techniques. This is because solely optimizing for $R^{2}$ will prompt you towards overfitting the data. Here’s a comprehensive guide by the Daily Dose of Data Science.
The Ultimate Categorization of Performance Metrics in ML
Regression and classification metrics in a single frame.

R² always increases as you add more predictors

📓
This limitation applies only to linear regression models.

Another major limitation of $R^{2}$ is its tendency to increase with the addition of more predictors $(x)$, regardless of whether those predictors are genuinely relevant or meaningful in explaining the dependent variable $(y)$.

R2 never decreases if we add more predictors

Let’s understand this in more detail!

In regression, each additional predictor adds more degrees of freedom to the model. This allows the model to capture more (even if slight) variance in the data.

In other words, the variance captured by the new predictor may even be zero, but it will never be negative.

So even if the added predictors are unrelated to the target variable, $R^{2}$ tends to increase (or, at the very least, it will NEVER decrease).

To understand this logically, consider a regression model which uses $n$ variables and gets a particular value of $R^{2}$.

Next, we add one more variable to the predictors.

After adding a new predictor, the variance explained by the model cannot decrease because the model always has the option to completely ignore the newly added variable you gave it and work with your first $n$ features instead.

An irrelevant feature ignored by the regression model

In other words, its coefficient will land close to zero, and it will not harm the variability that was previously captured.

The only option left for the new predictor is to explain some more (even if minimal) variance. Thus, adding a new variable can only improve the $R^{2}$, even if marginally, but never decrease it.

This phenomenon can be highly misleading, making it appear as if the model is performing better, when, in fact, it is not.

Instead, it might introduce unnecessary complexity without real improvement in predictive power.

We can also verify this experimentally:

R2 increases by adding a random predictor

As demonstrated above, adding a random feature increases the $R^{2}$ marginally.
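A minimal sketch of such an experiment, assuming synthetic data and a least-squares fit with an intercept, could look like this. The helper `fit_r2`, the seed, and the data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(size=n)    # synthetic outcome (assumption)
noise_feature = rng.normal(size=n)  # unrelated to y by construction

def fit_r2(features, y):
    # Build the design matrix with an intercept column, then solve least squares
    A = np.column_stack([np.ones(len(y))] + list(features))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_before = fit_r2([x], y)
r2_after = fit_r2([x, noise_feature], y)

print(r2_before, r2_after)  # r2_after is never smaller than r2_before
```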

One can calculate R² even before training a model

📓
This limitation is only valid for simple linear regression models.

Unknown to many, if you are training a simple linear regression model, $R^{2}$ can be determined before even training it.

Isn’t that weird?

Using $R^{2}$ as a performance metric feels like:

I know precisely how my model will perform before even fitting it.

We can prove that $R^{2}$ can be calculated before even fitting a simple linear regression model.

Recall the proof above which depicted that $R^{2}$ is the same as the Squared Pearson Correlation Coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$.

More specifically, consider the marked step below:

Pearson Correlation in terms of variance of predicted and true outcome. 

We see that:

$$ R^{2} = r_{y, \hat y}^{2} = \frac{Var(\hat y)}{Var(y)} $$

In a simple linear regression model, we can say that:

$$ \Large \hat y = a + b*x $$

...where a is the intercept and b is the slope.

Substituting this back into the above $R^{2}$ formulation, we get:

$$ R^{2} = \frac{Var(\hat y)}{Var(y)} = \frac{Var(a+b*x)}{Var(y)} $$

Using the following properties of variance, where $c$ is a constant:

$$ Var(x + c) = Var(x) \qquad Var(c \cdot x) = c^{2} \cdot Var(x) $$

...we get:

$$ R^{2} = \frac{Var(\hat y)}{Var(y)} = \frac{Var(a+b*x)}{Var(y)} = \frac{b^2 \cdot Var(x)}{Var(y)} $$
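The variance identity used in this step, $Var(a + b*x) = b^{2} \cdot Var(x)$, is easy to sanity-check numerically. The values of `a` and `b` and the data below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # arbitrary sample (assumption)
a, b = 4.0, -2.5                               # arbitrary intercept and slope

lhs = np.var(a + b * x)   # variance of the shifted and scaled variable
rhs = b ** 2 * np.var(x)  # shift drops out, scale comes out squared

print(lhs, rhs)
```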

Furthermore, it is easy to prove that the regression coefficient $b$ can be estimated as follows:

$$ b = \frac{Cov(x, y)}{Var(x)}$$


Here’s a quick proof:

The OLS solution of linear regression is given by:

$$ \beta = (X^T X)^{-1} X^T Y $$

In the above formula, we notice that:

  • $X^{T}X$ is proportional to the variance-covariance matrix of the predictors (assuming mean-centered data).
  • $X^{T}Y$ is proportional to the covariance between the predictors and $Y$ (again assuming mean-centered data).

As there is only a single predictor in the model:

  • $X^{T}X$ reduces to $n \cdot Var(x)$.
  • $X^{T}Y$ reduces to $n \cdot Cov(x, y)$.

Substituting these back into the OLS solution, the factor of $n$ cancels and we get:

$$ b = \frac{Cov(x, y)}{Var(x)}$$
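As a quick check of this closed form, the sketch below compares $Cov(x, y) / Var(x)$ with the slope returned by a least-squares fit on synthetic data. The data and seed are assumptions, and `np.polyfit` is just one convenient way to get the fitted line.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(0, 5, size=500)
y = 1.7 * x - 3.0 + rng.normal(scale=2.0, size=500)  # synthetic data (assumption)

# Closed-form slope; ddof=0 keeps Cov and Var on the same 1/n convention
b_formula = np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# Slope and intercept from an actual least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)

print(b_formula, slope)  # the two slopes should agree
```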


Heading back to the $R^{2}$ formula and substituting the value of coefficient $b$, we get:

$$ R^{2} = b^{2} \cdot \frac{Var(x)}{Var(y)} $$

$$ R^{2} = (\frac{Cov(x, y)}{Var(x)})^{2} \cdot \frac{Var(x)}{Var(y)} $$

Simplifying, we get:

$$ R^{2} = \frac{Cov(x, y)^2}{Var(x) \cdot Var(y)}$$

The above is the same as the Squared Pearson Correlation Coefficient $(r)$ between the input $x$ and true outcome variable $y$.

Now look at the above formulation of $R^{2}$ once again and notice something:

  • It DOES NOT depend on the model’s coefficients – slope and intercept.
  • It DOES NOT depend on the residuals.
  • It JUST depends on the input and the output, which we already know before even fitting a simple linear regression model.

Thus, in the case of a simple linear regression, it is possible to calculate $R^{2}$ without even fitting it through the data.
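Here is a minimal sketch of that claim on synthetic data: $R^{2}$ is first computed from $x$ and $y$ alone, and only afterwards is a line fitted to confirm the value. The data, seed, and variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=400)
y = 0.8 * x + rng.normal(scale=1.2, size=400)  # synthetic data (assumption)

# Step 1: R^2 WITHOUT fitting anything, using only x and y
r2_no_fit = np.corrcoef(x, y)[0, 1] ** 2

# Step 2: fit a simple linear regression and compute R^2 the usual way
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
r2_fitted = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_no_fit, r2_fitted)  # the two values should agree
```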

This is not mathematical trickery.

Instead, this is because, for simple linear regression, $R^{2}$ reduces to the squared correlation between two sequences – $x$ and $y$.

Conclusion and next steps

$R^{2}$ is quite popularly used all across data science and statistics.

Yet, I have increasingly seen it being interpreted as an ideal performance metric for evaluating model performance, when, in reality, it is not.

Instead, it is used to estimate the amount of variability in the data captured by the model.

And as seen above, capturing the entire variability does not directly correlate to having a model with high predictive and generalization power.

In other words, judging and optimizing the model's performance using $R^{2}$ inherently motivates one to strive for more, which eventually takes them to a stage where the model has learned and contorted its regression fit to all variations in the data, including noise.

This is similar to consistently optimizing for accuracy (for instance) on the training data when in reality, one should balance it with an unseen validation set.

So why can't we use $R^{2}$ on an unseen set?

Because, in most cases, $R^{2}$ is measured on the training data, and one should not use it on unseen data. The usual $R^{2}$ metric is intended to serve as a fitting measure, not a predictive measure.
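One concrete symptom is that the "fraction of variability captured" reading stops making sense on unseen data. The deliberately contrived sketch below fits a line on one tiny dataset and scores it on another where the trend is reversed; the numbers are invented purely to make the arithmetic easy to follow.

```python
import numpy as np

# Contrived training data: y increases with x (assumption)
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 2.0, 3.0])

# Contrived unseen data: the trend is reversed (assumption)
x_test = np.array([0.0, 1.0, 2.0, 3.0])
y_test = np.array([3.0, 2.0, 1.0, 0.0])

slope, intercept = np.polyfit(x_train, y_train, deg=1)
y_pred = slope * x_test + intercept

rss = np.sum((y_pred - y_test) ** 2)         # 9 + 1 + 1 + 9 = 20
tss = np.sum((y_test - y_test.mean()) ** 2)  # 2.25 + 0.25 + 0.25 + 2.25 = 5
r2_test = 1 - rss / tss

print(r2_test)  # negative: worse than predicting the mean of y_test
```

A negative value cannot be read as a fraction of anything, which is why the usual interpretation of $R^{2}$ is tied to the data the model was fit on.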

Therefore, it is important to remember that:

  • $R^{2}$ should never be used as the sole measure of goodness of fit.
  • $R^{2}$ should never be used to measure predictive power.

Instead, you can rely on other reliable performance metrics for regression, which we have previously covered in the Daily Dose of Data Science here:

The Ultimate Categorization of Performance Metrics in ML
Regression and classification metrics in a single frame.

What's more, if you intend to evaluate linear models, you can also validate the performance by verifying model assumptions.

In the case of simple linear regression, you can rely on visual inspection.

For multiple linear regression, you can determine if the residuals follow a normal distribution or not.

Here are the previous posts by Daily Dose of Data Science for more info:

Most ML Folks Often Neglect This While Using Linear Regression
The effectiveness of a linear regression model is determined by how well the data conforms to the algorithm’s underlying assumptions. One highly important, yet often neglected assumption of linear regression is homoscedasticity. A dataset is homoscedastic if the variability of residuals (=actual-pre…
Visualize The Performance Of Linear Regression With This Simple Plot
Assumption turned into performance validation.
Visualize The Performance Of Any Linear Regression Model With This Simple Plot
Assumption turned into performance validation.

Would you like to add something? Feel free to comment :)  

Thanks for reading!


