
# Clearing the Confusion: on the Generality of R^2

From time to time, I hear questions/concerns regarding the suitability of R^2 as a measure of the predictive ability of a model. I think much of this stems from confusion as to what R^2 really measures, especially in "nonstandard" settings. This document aims to clear the confusion.

## Summary

**Proportional Reduction (PropRed):** In predicting Y from a set of variables X, R^2 is the proportional reduction in mean squared prediction error (MSPE) arising from using X to predict Y, rather than using the overall mean Y value as our prediction of Y.

This is easily understood, even by those with limited stat background, making it an especially attractive measure.

## Linear Models, Sample vs. Population

### Setting

The linear model assumes that on the population level,

E(Y | X=t) = β't

for some constant vector β. (I am incorporating a 1 in X and t in order to accommodate the intercept term.) For now, we'll assume the model holds (and of course we are in the setting of numeric Y).

It can be shown that h = β minimizes

E[(Y - h'X)^2]
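For readers who want to see why, here is a sketch of the standard argument (mine, not spelled out in the original document), using the model assumption E(Y | X) = β'X:

```math
\begin{aligned}
E[(Y - h'X)^2] &= E[\{(Y - \beta'X) + (\beta - h)'X\}^2] \\
               &= E[(Y - \beta'X)^2] + E[\{(\beta - h)'X\}^2]
                  + 2\,E[(Y - \beta'X)\,(\beta - h)'X]
\end{aligned}
```

Conditioning on X, E[Y - β'X | X] = E(Y | X) - β'X = 0, so the cross term vanishes; the middle term is nonnegative and equals 0 exactly when h = β.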

That motivates estimating β from sample data by the value of b that minimizes

∑_i (Y_i - b'X_i)^2

### R^2 as computed by, e.g., R's lm() function

The reported value of R^2 is then

R^2 = (SSE1 - SSE2) / SSE1

where

SSE1 = ∑_i (Ȳ - Y_i)^2

and

SSE2 = ∑_i (b'X_i - Y_i)^2

Note that in both SSE1 and SSE2, we are summing squared prediction errors, (Ypred - Y_i)^2; the two differ only in which predictor is used, Ȳ or b'X_i.
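As a quick check, here is a minimal sketch in R, using the built-in mtcars data (an illustrative choice of data and predictors, not from the original document):

```r
# Fit a linear model and compute R^2 by hand as PropRed.
fit <- lm(mpg ~ hp + wt, data = mtcars)

sse1 <- sum((mean(mtcars$mpg) - mtcars$mpg)^2)   # predict with Ybar
sse2 <- sum((fitted(fit) - mtcars$mpg)^2)        # predict with b'X_i

c((sse1 - sse2) / sse1, summary(fit)$r.squared)  # the two agree
```

The two printed numbers coincide: lm()'s reported R^2 is exactly the hand-computed proportional reduction in MSPE.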

The definition of PropRed can thus be made more precise, as PropRedLin:

**Proportional Reduction (PropRedLin):** In predicting Y from a set of variables X, R^2 is the proportional reduction in MSPE arising from using a linear model with X to predict Y, rather than using the overall mean Y value as our prediction of Y.

So we see that the R^2 reported by lm() is exactly PropRedLin.

## What if the linearity assumption is not justified?

Though the relation between X and mean Y is never exactly linear, in some situations it is good enough. But what if that is not the case?

If we still compute b to minimize the usual sum of squares,

∑_i (Y_i - b'X_i)^2

then what is b estimating? The answer is that it is estimating whatever value of h minimizes

E[(Y - h'X)^2]

just as before. But now β'X is merely the best-fitting linear predictor of Y based on X, rather than the conditional mean of Y.

The key point, then, is that the reported value of R^2, i.e. (SSE1 - SSE2) / SSE1, still has exactly the PropRedLin interpretation. Nothing has changed.
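To illustrate, here is a small simulated example (my own construction, not from the original document), in which the true regression function is decidedly nonlinear:

```r
# Simulate data whose conditional mean is far from linear in x.
set.seed(9999)
x <- runif(250)
y <- sin(6 * x) + rnorm(250, sd = 0.3)
fit <- lm(y ~ x)                                 # misspecified linear fit

sse1 <- sum((mean(y) - y)^2)
sse2 <- sum((fitted(fit) - y)^2)

c((sse1 - sse2) / sse1, summary(fit)$r.squared)  # still identical
```

The hand-computed proportional reduction and lm()'s reported R^2 coincide, even though the linear model is badly misspecified.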

## R^2 can be used unchanged in nonlinear models

Suppose we fit, say, a random forests model to our data. We can define R^2 exactly as before, needing only to update the form of our predictor:

SSE2 = ∑_i (rf(X_i) - Y_i)^2

where rf(X_i) is the predicted value of Y_i based on our fitted random forests model.

The interpretation of R^2 is then just as valid as before, just as easy to compute, and just as easy to explain to those with limited (if any) stat background.
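A minimal sketch, assuming the randomForest package (the package choice and the mtcars example are mine, not the document's; any regression method with a predict() method would do):

```r
library(randomForest)                            # assumed package
set.seed(9999)
rfFit <- randomForest(mpg ~ hp + wt, data = mtcars)
preds <- predict(rfFit, newdata = mtcars)        # rf(X_i), in-sample

sse1 <- sum((mean(mtcars$mpg) - mtcars$mpg)^2)
sse2 <- sum((preds - mtcars$mpg)^2)

(sse1 - sse2) / sse1                             # R^2 for the random forests fit
```

(Here the predictions are in-sample; applying the same formula on holdout data gives an honest predictive R^2.)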

## Variants

Other than a bit of elegant mathematical theory, there is nothing magical about the role of squared quantities in R^2. One could define it as the proportional reduction in Mean Absolute Prediction Error, for instance, or, in the categorical Y case, as the proportional reduction in Overall Misclassification Error.
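For instance, the absolute-error variant might be computed as follows (a sketch, again with the illustrative mtcars data; one could also argue for the median rather than the mean as the baseline predictor under absolute error):

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

maeBase  <- mean(abs(mtcars$mpg - mean(mtcars$mpg)))  # predict with the mean
maeModel <- mean(abs(mtcars$mpg - fitted(fit)))       # predict with the model

(maeBase - maeModel) / maeBase   # proportional reduction in MAPE
```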


