## Introduction

Once you have trained any regression model, the next step is determining how well the model is doing.

Thus, it is natural to reach for a handful of performance metrics to evaluate a regression model, such as:

- **Mean Squared Error (MSE)** – technically a loss function, but MSE is often used to evaluate a model as well.
- **Visual inspection** – plotting the regression line against the data. However, this becomes impossible when you fit a high-dimensional dataset.
- **R-Squared** – also called the coefficient of determination.
- **Residual analysis** (for linear regression) – examining the distribution of residuals to check for normality, as demonstrated in an earlier post of Daily Dose of Data Science.
- **F-statistic** – measures how much better the model does than simply predicting the mean.
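As a quick illustration of two of these metrics (MSE and $R^{2}$) in practice, here's a minimal sketch using scikit-learn on a synthetic dataset; the data and coefficients below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: 3 predictors with known coefficients plus mild noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)  # loss-style metric
r2 = r2_score(y, y_pred)             # coefficient of determination
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```

Note that this computes the metrics on the training data itself; in practice, you would evaluate them on a held-out set.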

We use them because evaluating the performance is crucial to ensure the model's reliability and usefulness in a downstream application.

Also, performance quantification helps us make many informed decisions, such as:

- Do we need more feature engineering?
- Are we overfitting or underfitting the data?

The highlight of this article is the R-squared ($R^{2}$).

The $R^{2}$ metric is a popular choice among data scientists and machine learning engineers.

However, unknown to many, $R^{2}$ is possibly one of the most flawed evaluation metrics you could ever use to evaluate regression models.

Thus, in this blog, we'll understand what the $R^{2}$ metric is, what it explains, and how to interpret it.

Further, we'll examine why relying on the $R^{2}$ metric is unwise for evaluating regression models.

Let's begin!

## What is R-squared (R²)?

Before exploring the limitations of $R^{2}$, it is essential to understand its purpose and significance in linear regression analysis.

Essentially, when working with, say, a linear regression model, you intend to define a linear relationship between the independent variables (predictors) and the dependent variable (outcome).

$$ \Large \hat y = w_{1}*x_{1} + w_{2}*x_{2} + \dots + w_{n}*x_{n} $$

Thus, the primary objective is to find the best-fitting line (or a hyperplane in a multidimensional space) that best approximates the outcome.

R-squared serves as a key performance measure for this purpose.

Formally, $R^{2}$ answers the following question:

**What fraction of variability in the actual outcome ($y$) is being captured by the predicted outcomes ($\hat y$)?**

In other words, R-squared provides insight into how much of the variability in the outcome variable can be attributed to the predictors used in the regression model.

Let's break that down a bit.

Imagine you have some dummy 2D data $(X, y)$. Now, in most cases, there will be some noise in the true outcome variable ($y$).

This means that the outcome has some variability, i.e., the variance of the outcome variable ($y$) is non-zero ($\sigma_{y} \ne 0$).

$R^{2}$ helps you precisely quantify the fraction of total variability in the outcome that the model is capturing.

Here lies an inherent assumption that a model capturing more variability is expected to be better than one capturing less variability.

Thus, we mathematically define $R^{2}$ as follows:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

- For a least-squares fit evaluated on the training data, $\text{Variability captured by the model}$ is bounded between $[0, \text{Total variability in the data}]$.
- Thus, in that setting, $R^{2}$ is bounded between $[0,1]$ (on unseen data, $R^{2}$ can even turn negative).

A perfect $R^{2}$ of $1$ would mean that the model is capturing the entire variability.

And an $R^{2}$ of $0$ would mean that the model is capturing NO variability. Such a model is obtained when you simply predict the mean value.
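We can verify this numerically: the mean-predicting baseline scores exactly $R^{2} = 0$. Here's a sketch on synthetic data, using scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up outcome variable with non-zero variance.
rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=50)

# A "model" that always predicts the mean of y.
y_pred_mean = np.full_like(y, y.mean())

print(r2_score(y, y_pred_mean))  # RSS equals TSS, so R^2 is 0
```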

To recap, we define $R^{2}$ as follows:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

Let's consider the numerator and denominator separately now.

### Total variability

As the name suggests, total variability indicates the variation that already exists in the data.

Thus, it can be precisely determined from the spread of the true outcome variable $(y)$ around its mean ($n$ times its population variance), as shown below:

$$ \Large \text{Total variability} = \sum_{i=1}^{n} (y_{i} - \mu_{y})^2$$

This is commonly referred to as the **Total Sum of Squares (TSS)**.
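As a quick sanity check (on made-up data), TSS is exactly $n$ times the population variance of $y$:

```python
import numpy as np

# Synthetic outcome variable.
rng = np.random.default_rng(1)
y = rng.normal(size=200)

# Total Sum of Squares: squared deviations from the mean.
tss = np.sum((y - y.mean()) ** 2)

# np.var uses ddof=0 by default (population variance), so TSS = n * Var(y).
assert np.isclose(tss, len(y) * np.var(y))
```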

### Variability captured by the model

Next, having trained a model, the goal is to determine how much of this variation it is able to capture.

We can determine the variability captured by the model using the variability that is NOT captured by the model.

Essentially, variability captured and not captured should add to the total variability.

$$ {\text{Variability captured}} + {\text{Variability NOT captured}} = {\text{Total Variability}}$$

By rearranging, we get the following:

$$ {\text{Variability captured}} = {\text{Total Variability}} - {\text{Variability NOT captured}}$$

We have already determined the total variability above (also called TSS).

Next, we can define the variability not captured by the model as the squared distance between the prediction $(\hat y_{i})$ and the true label $(y_{i})$:

$$ {\text{Variability NOT captured by the model}} = \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2} $$

Why?

The term "Variability NOT captured by the model" represents the sum of squared differences between the predicted values $(\hat y_{i})$ and the actual values $(y_{i})$ of the dependent variable for all data points $(i=1, 2, ..., n)$.

It quantifies the amount of variation in the outcome variable that remains unexplained by the linear regression model.

For instance, in a perfect fit scenario, where the R-squared value is $1$, the predicted values $(\hat y_{i})$ must perfectly match the actual values $(y_{i})$ for all data points.

This results in a variability NOT captured by the model of $0$. In such a case, the model accounts for all the variation in the dependent variable, and there is no unexplained variability.

However, in cases where the R-squared value is less than $1$ (indicating a less-than-perfect fit), the predicted values $(\hat y_{i})$ might deviate from the actual values $(y_{i})$.

Consequently, the variability NOT captured by the model increases, representing the amount of variance that remains unexplained by the model.

Thus, to recap, we can safely define the unexplained variance as follows:

$$ {\text{Variability NOT captured by the model}} = \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2} $$

The above is commonly referred to as the **Residual Sum of Squares (RSS)**, which also appears in the Mean Squared Error (MSE) loss function in regression (up to the $\frac{1}{n}$ averaging factor).

Also, we should note that the lower the RSS, the more variability the model captures.

### R² derivation

Substituting everything back into the formula:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

or

$$ R^{2} = \frac{\text{Total variability - Unexplained variability}}{\text{Total variability}} $$

$$ R^{2} = \frac{\text{TSS - RSS }}{\text{TSS}} $$

$$ \LARGE R^{2} = 1 - \frac{\text{RSS}}{\text{TSS}} $$

Finally, we get:

$$ \LARGE R^{2} = 1 - \frac{\sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \mu_{y})^2} $$

Here:

- $\hat y_{i}$ is the model's prediction.
- $y_{i}$ is the true outcome variable.
- $\mu_{y}$ is the mean of the outcome variable.

And to reiterate:

- $\text{RSS}$ captures the unexplained variability in the data.
- $\text{TSS}$ captures the total variability present in the data.
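To make this concrete, here's a from-scratch computation of $R^{2} = 1 - \frac{RSS}{TSS}$, checked against scikit-learn's `r2_score` on a synthetic dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic 1D regression data with known slope plus noise.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=80)
y = 3.0 * x + rng.normal(scale=2.0, size=80)

y_hat = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))

rss = np.sum((y_hat - y) ** 2)     # unexplained variability
tss = np.sum((y - y.mean()) ** 2)  # total variability
r2_manual = 1 - rss / tss

# The manual value matches the library implementation.
assert np.isclose(r2_manual, r2_score(y, y_hat))
print(f"R^2: {r2_manual:.4f}")
```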

### An alternative formulation of R²

The typical way of interpreting the $R^{2}$ is to see it as the fraction of the variation of the dependent variable $y$ that the model explains.

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

However, another way of interpreting $R^{2}$ is to look at it as the Squared Pearson Correlation Coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$.

Pearson Correlation Coefficient $(r)$ between two variables $(x, y)$ is defined as:

$$ r = \frac{\text{Cov(}x, y\text{)}}{\sqrt{\text{Var(}x\text{)} \cdot \text{Var(}y\text{)}}} $$

It quantifies the strength and direction of the linear relationship between two variables, $x$ and $y$, and takes values between $[-1, 1]$.

Thus, the Pearson Correlation Coefficient $(r)$ between observed values $(y)$ and the predicted values $(\hat y)$ will be:

$$ r = \frac{\text{Cov(}y, \hat y\text{)}}{\sqrt{\text{Var(}y\text{)} \cdot \text{Var(}\hat y\text{)}}} $$

It can be formally proven that, for a least-squares fit with an intercept, $R^{2}$ equals the squared Pearson correlation coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$.
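We can also sanity-check this equivalence numerically. Here's a sketch on synthetic data; the identity holds for a least-squares fit with an intercept, evaluated on the training data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with two predictors.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=1.0, size=120)

y_hat = LinearRegression().fit(X, y).predict(X)

# Pearson correlation between observed and predicted values.
r = np.corrcoef(y, y_hat)[0, 1]

# Its square matches R^2 on the training data.
assert np.isclose(r ** 2, r2_score(y, y_hat))
```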

### Recap

Before proceeding to the next section, here's a quick recap of what we have discussed so far:

- $R^{2}$ is the fraction of the variation of the dependent variable $y$ that is explained by the model.

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

- Total variability is the variance of the true outcome variable $(y)$.

$$ \large \text{Total variability} = \sum_{i=1}^{n} (y_{i} - \mu_{y})^2$$

- Total variability is also called the **Total Sum of Squares (TSS)**.
- Variability captured by the model can be determined using the unexplained variability.

$$ {\text{Variability captured}} = {\text{Total Variability}} - {\text{Variability NOT captured}}$$

$$ {\text{Variability NOT captured}} = \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2} $$

- The above is also called the **Residual Sum of Squares (RSS)**.
- Another way of interpreting $R^{2}$ is to look at it as the squared Pearson correlation coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$:

$$ R^{2} = r^{2}_{y, \hat y}$$

## Limitations of R-squared (R²)

Now that we understand what $R^{2}$ is, how to derive it, and its relationship with the squared Pearson correlation, we must look at its limitations as an evaluation metric for regression models.

While $R^{2}$ is widely considered a reliable measure for assessing the goodness-of-fit of a regression model, the reality is that it can be immensely misleading.

Let's discuss its limitations in detail!

### Optimizing R² promotes overfitting

One of the most prominent limitations of $R^{2}$ is its susceptibility to promote overfitting.

Overfitting occurs when the model is too complex, and it captures random fluctuations in the training data that do not generalize to unseen data.

Thus, in many cases, a high $R^{2}$ value could be a potential sign of overfitting, where the model has modeled the random noise in the training data rather than capturing the true underlying patterns.

Let's understand how $R^{2}$ promotes overfitting!

As discussed above, $R^{2}$ always ranges between $0$ and $1$, and a value of $1$ suggests a perfect fit of the model to the training data.

Now, if we recall the definition of $R^{2}$, we learned it as the fraction of the variation of the dependent variable $y$ that the model captures:

$$ R^{2} = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}} $$

Therefore, a perfect $R^{2}$ of $1$ implies that the model has captured the entire variability in the data, including noise (if any).

This is because part of the variability in real data typically comes from noise, so capturing the entire variability means fitting that noise too.

Therefore, if one is solely optimizing for $R^{2}$, they would unknowingly make their model more and more complex, to the point where it entirely overfits the data.

This may lead to deceptive conclusions about the model's performance.

**Takeaways and Action Steps**

- Don't rely solely on $R^{2}$; consider other regression metrics and validation techniques as well, because optimizing only for $R^{2}$ will push you towards overfitting the data.

### R² always increases as you add more predictors

Another major limitation of $R^{2}$ is its tendency to increase with the addition of more predictors $(x)$, regardless of whether those predictors are genuinely relevant or meaningful in explaining the dependent variable $(y)$.

Let's understand this in more detail!

In regression, each additional predictor adds more degrees of freedom to the model. This allows the model to capture more (even if slight) variance in the data.

In other words, the variance captured by the new predictor may even be zero, but it will never be negative.

So even if the added predictors are unrelated to the target variable, $R^{2}$ tends to increase, and at the very least, it will never decrease.

To understand this logically, consider a regression model which uses $n$ variables and gets a particular value of $R^{2}$.

Next, we add one more variable to the predictors.

After adding a new predictor, the variance explained by the model cannot decrease because the model always has the option to completely ignore the newly added variable you gave it and work with your first $n$ features instead.

In other words, its coefficient will land close to zero, and it will not harm the variability that was previously captured.

The only option left for the new predictor is to explain some more (even if minimal) variance. Thus, adding a new variable can only improve the $R^{2}$, even if marginally, but never decrease it.

This phenomenon can be highly misleading, making it appear as if the model is performing better, when, in fact, it is not.

Instead, it might introduce unnecessary complexity without real improvement in predictive power.

We can also verify this experimentally:
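Here's one such experiment (a sketch on synthetic data): fit an OLS model, append a feature that is pure noise, and compare the training $R^{2}$ before and after:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with 4 informative predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, -1.5, 2.0]) + rng.normal(scale=1.0, size=100)

r2_before = r2_score(y, LinearRegression().fit(X, y).predict(X))

# Append a feature that is pure noise, unrelated to y.
X_extra = np.hstack([X, rng.normal(size=(100, 1))])
r2_after = r2_score(y, LinearRegression().fit(X_extra, y).predict(X_extra))

print(r2_before, r2_after)
# Training R^2 can only go up (or stay put) when a predictor is added.
assert r2_after >= r2_before - 1e-10
```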

In such an experiment, adding a purely random feature still increases the $R^{2}$, even if only marginally.

### One can calculate R² even before training a model…

At least, this holds for **simple linear regression models**.

Unknown to many, if you are training a simple linear regression model, $R^{2}$ can be determined before even training it.

Isn't that weird?

Using $R^{2}$ as a performance metric feels like:

I know how my model will precisely perform before even fitting it.

We can prove that $R^{2}$ can be calculated before even fitting a simple linear regression model.

Recall from above that $R^{2}$ is the same as the squared Pearson correlation coefficient $(r)$ between the observed values $(y)$ and the predicted values $(\hat y)$.

More specifically, a key intermediate step of that derivation gives:

$$ R^{2} = r_{y, \hat y}^{2} = \frac{Var(\hat y)}{Var(y)} $$

In a simple linear regression model, we can say that:

$$ \Large \hat y = a + b*x $$

...where $a$ is the intercept and $b$ is the slope.

Substituting this back into the above $R^{2}$ formulation, we get:

$$ R^{2} = \frac{Var(\hat y)}{Var(y)} = \frac{Var(a+b*x)}{Var(y)} $$

Using the variance properties $Var(c) = 0$ for a constant $c$, and $Var(c \cdot x) = c^{2} \cdot Var(x)$, we get:

$$ R^{2} = \frac{Var(\hat y)}{Var(y)} = \frac{Var(a+b*x)}{Var(y)} = \frac{b^2 \cdot Var(x)}{Var(y)} $$

Furthermore, it is easy to prove that the regression coefficient $b$ can be estimated as follows:

$$ b = \frac{Cov(x, y)}{Var(x)}$$

Here's a quick proof:

The OLS solution of linear regression is given by:

$$ \beta = (X^T X)^{-1} X^T Y $$

In the above formula, assume the single predictor $x$ and the outcome $y$ are mean-centered. Then:

- $X^{T}X = \sum_{i} x_{i}^{2}$, which is $n \cdot Var(x)$.
- $X^{T}Y = \sum_{i} x_{i}y_{i}$, which is $n \cdot Cov(x, y)$.

Substituting these back into the OLS solution (the factors of $n$ cancel), we get:

$$ b = \frac{Cov(x, y)}{Var(x)}$$
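As a quick numeric check of this slope formula (synthetic data, NumPy only; note that the `ddof` convention must match between the covariance and the variance):

```python
import numpy as np

# Synthetic simple-linear-regression data with known slope and intercept.
rng = np.random.default_rng(5)
x = rng.uniform(-5, 5, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Cov(x, y) / Var(x), using the same ddof=0 convention on both sides.
b = np.cov(x, y, ddof=0)[0, 1] / np.var(x)

assert np.isclose(slope, b)
```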

Heading back to the $R^{2}$ formula and substituting the value of coefficient $b$, we get:

$$ R^{2} = b^{2} \cdot \frac{Var(x)}{Var(y)} $$

$$ R^{2} = \left(\frac{Cov(x, y)}{Var(x)}\right)^{2} \cdot \frac{Var(x)}{Var(y)} $$

Simplifying, we get:

$$ R^{2} = \frac{Cov(x, y)^2}{Var(x) \cdot Var(y)}$$

The above is the same as the Squared Pearson Correlation Coefficient $(r)$ between the input $x$ and true outcome variable $y$.

Now look at the above formulation of $R^{2}$ once again and notice something:

- It DOES NOT depend on the model's coefficients – slope and intercept.
- It DOES NOT depend on the residuals.
- It JUST depends on the input and the output, which we already know before even fitting a simple linear regression model.

Thus, in the case of a simple linear regression, it is possible to calculate $R^{2}$ without even fitting the model to the data.
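Here's a sketch of that claim on synthetic data: the squared Pearson correlation between the raw $x$ and $y$ matches the $R^{2}$ of the actually fitted model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic simple-linear-regression data.
rng = np.random.default_rng(11)
x = rng.uniform(0, 1, size=60)
y = 4.0 * x + rng.normal(scale=0.3, size=60)

# "Pre-training" R^2: squared Pearson correlation between x and y.
r2_pre = np.corrcoef(x, y)[0, 1] ** 2

# R^2 from an actually fitted model, on the training data.
y_hat = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))
r2_fit = r2_score(y, y_hat)

assert np.isclose(r2_pre, r2_fit)
```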

This is not mathematical trickery.

Instead, it follows because, in this setting, $R^{2}$ reduces to a measure of correlation between two sequences – $x$ and $y$.

## Conclusion and next steps

$R^{2}$ is quite popularly used all across data science and statistics.

Yet, I have increasingly seen it being interpreted as an ideal performance metric for evaluating model performance, when, in reality, it is not.

Instead, it is used to estimate the amount of variability in the data captured by the model.

And as seen above, capturing the entire variability does not directly correlate to having a model with high predictive and generalization power.

In other words, judging and optimizing the model's performance using $R^{2}$ inherently motivates one to strive for more, which eventually takes them to a stage where the model has learned and contorted its regression fit to all variations in the data, including noise.

This is similar to consistently optimizing for accuracy (for instance) on the training data when, in reality, one should balance it against an unseen validation set.

So why can't we just measure $R^{2}$ on an unseen set?

Because the usual $R^{2}$ metric is intended to serve as a **fitting measure**, not a predictive measure: it is almost always computed on the training data, and its interpretation as a fraction of explained variance does not carry over cleanly to unseen data.

Therefore, it is important to remember that:

- $R^{2}$ should not be used as the sole measure of goodness of fit.
- $R^{2}$ should never be used to measure predictive power.

Instead, you can rely on other reliable performance metrics for regression, which we have previously covered in the Daily Dose of Data Science.

What's more, if you intend to evaluate linear models, you can also validate the performance by verifying model assumptions.

In the case of simple linear regression, you can rely on visual inspection.

For multiple linear regression, you can determine if the residuals follow a normal distribution or not.

See the previous posts by Daily Dose of Data Science for more info.

Would you like to add something? Feel free to comment :)

**Thanks for reading!**