From time to time, I hear questions/concerns regarding the suitability of R2 as a measure of the predictive ability of a model. I think much of this stems from confusion as to what R2 really measures, especially in "nonstandard" settings. This document aims to clear up the confusion.
Proportional Reduction (PropRed): In predicting Y from a set of variables X, R2 is the proportional reduction in MSPE (mean squared prediction error) arising from using X to predict Y, rather than using the overall mean Y value as our prediction of Y.
This is easily understood, even by those with limited stat background, making it an especially attractive measure.
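To make PropRed concrete, here is a minimal Python sketch, assuming we already have a vector y of observed values and a vector yhat of predictions from some model (the function name prop_red and the variable names are mine, purely for illustration):

import numpy as np

def prop_red(y, yhat):
    # MSPE when every Y is predicted by the overall mean
    mspe_mean = np.mean((y - np.mean(y)) ** 2)
    # MSPE when each Y is predicted by the model's value
    mspe_model = np.mean((y - yhat) ** 2)
    # proportional reduction in MSPE
    return (mspe_mean - mspe_model) / mspe_mean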
What happens regarding R2 if we fit a linear model to data in which the linearity assumption is justified, but the normality and homoscedasticity assumptions are not? Answer: The PropRed property for the outputted value of R2 is still valid.
What happens regarding R2 if we fit a linear model to data in which the linearity assumption is not justified? Answer: The PropRed property for the outputted R2 value is still valid.
Is there an analog of R2 for nonlinear models, say random forests? Answer: Yes, in fact with the same formula.
The linear model assumes that on the population level,
E(Y | X=t) = β't
for some constant vector β. (I am incorporating a 1 in X and t in order to accommodate the intercept term.) For now, we'll assume the model holds (and of course we are in the setting of numeric Y).
It can be shown that h = β minimizes
E[(Y - h'X)²]
That motivates estimating β from sample data by the value of b that minimizes
∑i (Yi - b'Xi)²
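As a quick Python sketch of this least-squares fit (the simulated data and variable names are hypothetical, just to have something to fit):

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)             # a single predictor, for illustration
y = 2.0 + 1.5 * x + rng.normal(size=n)     # hypothetical data
X = np.column_stack([np.ones(n), x])       # incorporate the 1 to accommodate the intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # b minimizes the sum of squared errors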
The outputted value of R2 is then
R2 = (SSE1 - SSE2) / SSE1
where
SSE1 = ∑i (Ybar - Yi)²
and
SSE2 = ∑i (b'Xi - Yi)²
Note that in both SSE1 and SSE2, we are summing squared prediction errors, (Ypred - Yi)².
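Continuing the sketch above (reusing y, X and b from it), the computation of SSE1, SSE2 and R2 takes just a few lines:

sse1 = np.sum((np.mean(y) - y) ** 2)   # squared errors when predicting by Ybar
sse2 = np.sum((X @ b - y) ** 2)        # squared errors from the linear predictor b'Xi
r2 = (sse1 - sse2) / sse1              # the outputted R2

Since the fit includes an intercept, this matches the R2 reported by standard linear-model software for the same data.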
The definition of PropRed can now be made a bit more precise; call the refined version PropRedLin:
Proportional Reduction (PropRedLin): In predicting Y from a set of variables X, R2 is the proportional reduction in MSPE arising from using a linear model with X to predict Y, rather than using the overall mean Y value as our prediction of Y.
So we see that:
The outputted R2 value is exactly PropRedLin.
Neither the normality nor homoscedasticity assumptions play any role.
Though the relation between X and mean Y is never exactly linear, in some situations it is good enough. But what if that is not the case?
If we still compute b to minimize the usual sum of squares,
∑i (Yi - b'Xi)²
then what is b estimating? The answer is that it is estimating whatever value of h minimizes
E[(Y - h'X)²]
just as before. But now, with β defined as that minimizing value of h, β'X is merely the best-fitting linear predictor of Y based on X, not the true conditional mean of Y given X.
The key point then is that the outputted value of R2, i.e. (SSE1 - SSE2) / SSE1, is still as in PropRedLin. Nothing has changed.
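A small simulation sketch may make this concrete. Here the true regression function is a sine curve (a hypothetical choice of mine), so linearity clearly fails, yet the outputted R2 is still exactly the proportional reduction in MSPE relative to predicting by Ybar:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)  # decidedly nonlinear mean function
X = np.column_stack([np.ones(x.size), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)        # best-fitting *linear* predictor
sse1 = np.sum((np.mean(y) - y) ** 2)
sse2 = np.sum((X @ b - y) ** 2)
r2 = (sse1 - sse2) / sse1                        # near 0 here, since a line fits this data poorly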
Suppose we fit, say, a random forests model to our data. We can define R2 exactly as before, needing only to update the form of our predictor.
SSE2 = ∑i (rf(Xi) - Yi)²
where rf(Xi) is the predicted value of Yi based on our fitted random forests model.
The interpretation of R2 is then just as valid as before, just as easy to compute, and just as easy to explain to those with limited (if any) stat background.
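Here is a sketch with random forests, using scikit-learn's RandomForestRegressor merely as one convenient implementation (any fitted model that supplies predictions would do):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(x.reshape(-1, 1), y)
pred = rf.predict(x.reshape(-1, 1))    # rf(Xi) in the notation above
sse1 = np.sum((np.mean(y) - y) ** 2)
sse2 = np.sum((pred - y) ** 2)
r2 = (sse1 - sse2) / sse1              # same PropRed interpretation as before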
Other than a bit of elegant mathematical theory, there is nothing magical about the role of squared quantities in R2. One could define it as the proportional reduction in Mean Absolute Prediction Error, for instance, or, in the categorical Y case, as the proportional reduction in Overall Misclassification Error.
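For instance, here is a sketch of those two variants; the baselines (predicting by the overall mean for numeric Y, and by the most frequent class for categorical Y) are my own choices, purely for illustration:

import numpy as np

def prop_red_mae(y, yhat):
    # proportional reduction in Mean Absolute Prediction Error;
    # baseline: predict every Y by the overall mean (the median would be another option)
    mae_mean = np.mean(np.abs(y - np.mean(y)))
    mae_model = np.mean(np.abs(y - yhat))
    return (mae_mean - mae_model) / mae_mean

def prop_red_misclass(y, yhat):
    # proportional reduction in Overall Misclassification Error;
    # baseline: always predict the most frequent class
    vals, counts = np.unique(y, return_counts=True)
    base_err = 1.0 - counts.max() / y.size
    model_err = np.mean(y != yhat)
    return (base_err - model_err) / base_err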