Explanations of overfitting in machine learning tend to be frustratingly vague. For instance, we hear "An overfitted model predicts poorly on new examples," which is a definition, not an explanation. Or we hear that overfitting results from "fitting the training data too well" or that we are "fitting the noise," again rather unsatisfying.
Somewhat better--but only somewhat--are the explanations based on the famous Bias-Variance Tradeoff:
Mean squared prediction error (MSPE) is the sum of squared bias and variance. Bias and variance are at odds with each other: the lower the bias, the higher the variance. As we move through a sequence of models of increasing complexity, eventually the reduction in bias is overwhelmed by the increase in variance, a net loss. Somewhere in that sequence of models is a best one, i.e. one with minimum MSPE.
OK, but the bias and variance of WHAT? Our treatment here is based on that notion, with emphasis on what it is that we are really talking about in discussing bias and variance.
Very basic statistical concepts + patience (the document is a bit long). Expected value notation, E(), is used in a couple of places, but is explained intuitively in the given context, and is not vital to the exposition.
Let Y denote our outcome variable, and X represent the vector of our predictors/features. We are predicting Y from X. For instance, Y might be human weight, with X being (height, age). We include binary classification problems, with Y = 1 or 0. (In multiclass problems, Y is a vector of 1s and 0s, with only one component being 1 at a time.)
Overfitting is often described as arising when we go too far towards the low-bias end of the Bias-Variance Tradeoff. Yes, but what does that really mean in the prediction context?
The regression function of Y on X, ρ(t), is defined to be the mean Y for the given X values t. In the weight/height/age example, ρ(68.2,32.9) is the mean weight of all people of height 68.2 and age 32.9.
Though some books use the term "regression" to mean a linear model, the actual definition is unrestricted; it applies just as much to, say, random forests as to linear models.
The ρ function is the best (i.e. minimum MSPE) predictor of weight based on height and age, the ideal. Note the phrase "based on," though. If we had a third predictor variable/feature available, say waist size, there would be another ρ for that 3-predictor setting, better than the 2-predictor one.
Note too that the ρ function is a population entity, which we estimate from our sample data. Denote the estimated ρ(t) by r(t). Say we are studying diabetic people. ρ(t) gives us the relation of weight vs. height and age in the entire population of diabetics; if our study involves 100 patients, that is considered a sample from the population, and our estimate r(t) is based on that sample.
(Note: "Sample" will mean our entire dataset; we will say, "A sample of 100 people," as in statistics courses, not "We have 100 samples.")
The term population is used in the statistics community. Those who come to ML from the computer science world tend to use terms like probabilistic generating process. Either way, it is crucial to keep in mind that we treat our data as randomly generated. If our data consists of 100 diabetes patients, we treat them as being randomly drawn from the conceptual population of all diabetics.
In a binary classification problem, the regression function reduces to the probability of class 1, since the mean of a 0,1 variable is the probability of a 1.
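In sample terms, the average of a 0/1 variable is just the proportion of 1s, which estimates that probability. A one-line illustration in R, with made-up labels:

y01 <- c(1, 0, 0, 1, 1)   # made-up class labels for 5 people
mean(y01)                 # 0.6, the estimated probability of class 1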
Say we use a linear model to predict weight from height, i.e. our model for ρ(height) is
ρ(height) = mean weight = β0 + β1 height
This is called a "parametric" model, in that it expresses ρ(t) in terms of some parameters, in this case the βi.
Our sample-based estimate of the function ρ(t) is a function r(t), with
r(height) = mean weight = b0 + b1 height
The bi, say computed using the familiar least-squares method, are the sample estimates of the βi.
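As a minimal sketch in R, using simulated stand-in data (the variable names ht and wt are made up for illustration):

# Least-squares estimates b0, b1, and the resulting estimate r(68.2).
set.seed(1)
dat <- data.frame(ht = rnorm(100, 68, 3))          # simulated heights
dat$wt <- 120 + 2.5*dat$ht + rnorm(100, 0, 10)     # simulated weights
fit <- lm(wt ~ ht, data = dat)                     # least squares
coef(fit)                                          # b0 and b1
predict(fit, data.frame(ht = 68.2))                # r(68.2)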
It will be convenient here to use as our nonparametric method the classic k-Nearest Neighbors approach. To estimate ρ(t), we find the k closest rows in our dataset to t, and average the Y values of those data points. Of course, k is a hyperparameter, chosen by the analyst.
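Here is a rough base-R sketch of the k-NN idea for a single predictor; knnEst is a made-up helper function, not part of qeML:

# Estimate rho(t) by averaging the Y values of the k rows closest to t.
knnEst <- function(x, y, t, k) {
   nearIdx <- order(abs(x - t))[1:k]   # indices of the k nearest neighbors
   mean(y[nearIdx])                    # average their Y values
}
# e.g., with the simulated dat above:  knnEst(dat$ht, dat$wt, t = 70.6, k = 12)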
Let's first review a famous instance of bias taught in elementary statistics courses, that of the sample variance. Say we have an i.i.d. sample X1,...,Xn from some population in which the variance of X is an unknown quantity σ².
(Note again: We use the word sample in the manner of the statistics community. A dataset of 100 rows is a sample (singular) of 100 cases, not "100 samples" or "100 examples" as in the ML community.)
How do we estimate σ² from the Xi?
At first, one might take as our estimate
S² = (1/n) Σᵢ₌₁ⁿ (Xi - Xbar)²,
where Xbar is the sample mean
(1/n) Σᵢ₌₁ⁿ Xi
But the pioneering statisticians didn't like this, because they found that
E(S²) = ((n-1)/n) σ²
In other words, averaged over all possible samples from this population, S² comes out...a...little...too...small. So, they declared that the estimator for σ² would be
s² = (n/(n-1)) S²
to make things come out exactly right.
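A quick simulation confirms the (n-1)/n factor; with n = 10 and σ² = 4, the average S² should come out near 3.6 rather than 4:

# Simulation illustrating E(S^2) = ((n-1)/n) * sigma^2.
set.seed(1)
n <- 10; sigma2 <- 4; nreps <- 100000
s2vals <- replicate(nreps, {
   x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
   mean((x - mean(x))^2)              # S^2, the divide-by-n version
})
mean(s2vals)                          # roughly 3.6, i.e. (9/10) * 4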
However, most estimators are biased, and we must deal with that. It's a major issue in prediction, as we now discuss.
Parametric methods such as linear and logistic models have a bias, in the sense of the fundamental inaccuracy of the model itself.
Though the true population relation may be somewhat linear, the linear model cannot be perfectly correct. So, no matter how much data we have, our estimated r(t) will not converge to the true ρ(t) as the sample size grows; r(t) WILL converge to something, but that something will be the best-fitting LINEAR function for the population ρ(t), not the best-fitting FUNCTION.
This is model bias. Consider predicting the weight of a person who is 68.2 inches tall. Our best prediction would be ρ(68.2), which we estimate from our data via r(68.2). Averaged over all possible samples, our prediction r(68.2) will NOT be equal to the true value, i.e.
E[r(68.2)] ≠ ρ(68.2)
For model-free, i.e. nonparametric, methods, the bias is subtler. Consider k-NN, again predicting weight from height. To calculate, say, r(70.6) with k=12, we find the 12 points in our data closest to 70.6, then take as r(70.6) the average weight among those 12 people.
Bias arises as follows: Say we wish to predict weight for a person of height 64.2, which is on the shorter end of the height range. Then most of the neighboring people in our dataset are likely to be taller than 64.2, and since our prediction will consist of the mean weight among the neighbors, this 64.2-inch tall person's weight will likely be overestimated, i.e.
E[r(64.2)] > ρ(64.2)
But there is an important difference between this and the parametric example: The larger the sample size n is (keeping k fixed), the smaller the bias. With a very large sample, the neighbors of the 64.2-inch tall person, for example, will tend to be pretty close to 64.2, producing a smaller bias. This is in contrast to the example of the linear model, in which the bias persists even as the sample size grows.
Note too that for fixed n, smaller k will result in smaller bias. Fewer neighbors means that the ones we have are close by.
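The edge bias is easy to see in a small simulation, under an assumed true ρ (linear here, purely for illustration), again using the made-up knnEst helper:

# Draw many samples, estimate r(64.2) by k-NN in each, and compare the
# average estimate to the true rho(64.2).
set.seed(1)
rho <- function(ht) 120 + 2.5*ht                 # assumed true regression function
knnEst <- function(x, y, t, k) mean(y[order(abs(x - t))[1:k]])
nreps <- 5000; n <- 100; k <- 12
ests <- replicate(nreps, {
   ht <- rnorm(n, 68, 3)                         # 64.2 is on the short end
   wt <- rho(ht) + rnorm(n, 0, 10)
   knnEst(ht, wt, 64.2, k)
})
mean(ests) - rho(64.2)                           # positive: E[r(64.2)] > rho(64.2)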
The bias described above is pointwise, meaning specific to a particular value of t, e.g. t = 64.2 (or t = (64.2,28.0) if we are predicting from both height and age). By contrast, the MSPE is averaged over the distribution of X, i.e.
MSPE = E[(Y - r(X))²]
So, a linear model has permanent bias, even for huge datasets, while for k-NN, random forests and so on the bias goes to 0 as the dataset grows. But on the other hand, a linear model will have a smaller variance, our next topic:
This refers to sampling variance, which measures the degree of instability of one's estimated r(t) from one sample to another. E.g. in the above linear model, there will be variation in the bi from one sample to another, thus causing variation in r(t) among samples.
The same holds for nonparametric models. In the k-NN example above, say n = 1000 and k = 25. Then we are essentially basing our r(64.2) on the average of 25 data points--whereas the linear model uses all 1000 data points. So, k-NN will have a larger variance.
Note too that since k-NN averages k values, then for fixed n, we will have that the larger k is, the smaller the variance.
The relevance of sampling variance is difficult for many nonspecialists in statistics/machine learning to accept. "But we only have one sample," some might object to the analysis in the preceding paragraphs. True, but we are interested in the probability that our sample is representative of the population. In a gambling casino, you may play a game just once, but you still would like to know the probability of winning. If the winnings vary rather little from one play of the game to another, and if the expected value is positive, your risk is less. But with large variability from one play to another, you will likely feel uneasy about taking the chance.
The same is true in data science, and that in turn means that sampling variance is key.
We stated above that a linear model cannot be exactly correct, even with an unlimited amount of data. So, why not a quadratic model?
ρ(height) = mean weight = β0 + β1 height + β2 height²
This includes the linear model as a special case (β2 = 0), but also is more general. In the population, the best-fitting quadratic model will be more accurate than the best-fitting linear one. How about one of degree 3? With higher and higher degree polynomials, we can reduce the bias to smaller and smaller amounts.
But in finite samples, the higher the degree of the polynomial, the larger the variance. (We have more parameters to estimate, and I like to think of them as "sharing" the data, with less data available to each one.)
So, we have a battle between bias and variance! As we increase the degree, first 1, then 2, then 3 and so on, the bias initially shrinks a lot while the variance grows only a little. But eventually, increasing the degree by 1 will reduce bias only a little, while variance increases a lot.
Result:
If we plot MSPE vs. degree, we typically will get a U-shaped curve. Bias reduction dominates variance increase initially, but eventually variance overpowers bias. Once we pass the low point of the curve, we are overfitting.
The same is true for k-NN, plotting prediction error against k. As noted, we can achieve smaller bias with smaller k. The latter situation means smaller neighborhoods, making the problem of "using tall people to predict a short person's weight" less likely. But smaller k means we are taking an average over a small set of people, which will vary a lot from one sample to another, i.e. larger variance. If we try a range of k values, the bias-variance tradeoff will typically result in a U-shaped curve.
Again, MSPE is a population quantity, but we can estimate it via cross-validation. For instance, for each value of k in a range, we can fit k-NN on our training data and then estimate the corresponding MSPE value by predicting on our test set.
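Here is a rough sketch of that procedure, with a single random holdout set and simulated stand-in data; a finer analysis would average over several holdout sets:

# Estimate MSPE over a grid of k values, using one random holdout set.
set.seed(1)
n <- 1000
ht <- rnorm(n, 68, 3)
wt <- 120 + 2.5*ht + rnorm(n, 0, 10)             # simulated data for illustration
tstIdx <- sample(n, 200)                         # random holdout set
knnEst <- function(x, y, t, k) mean(y[order(abs(x - t))[1:k]])
mspes <- sapply(c(1, 5, 10, 25, 50, 100, 250), function(k) {
   preds <- sapply(ht[tstIdx], function(t) knnEst(ht[-tstIdx], wt[-tstIdx], t, k))
   mean((wt[tstIdx] - preds)^2)                  # estimated MSPE for this k
})
mspes                                            # typically traces out a (possibly shallow) U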
Mean squared error is the sum of a decreasing quantity, squared bias, and an increasing quantity, variance. This typically produces a U-shape, but there is no inherent reason that it must come out this way. A double-U can occur (see below), or for that matter, multiple-U.
I used the Million Song Dataset. It consists of 90 audio measurements on about 500,000 songs. To reduce the amount of computation, I used only a random subset of 10,000 songs, and only two audio variables, fitting polynomial models in the two predictors, with increasing polynomial degree. (The package includes the dataset oneksongs, a random subset of 1,000 songs.)
As noted, mean squared prediction error is a sum of bias squared and variance. I calculated mean absolute prediction error (MAPE), but the principle is the same; it still combines bias and variance.
My goal was to show that:
As degree increases, MAPE at first falls but later rises, confirming the "U-shape" discussed above.
But as degree increases, variance always increases.
Combining these two points, we can observe the battle between bias and variance in their impact on MAPE. At first the reduction in bias dominates the increase in variance, but for larger degrees the opposite is true.
As a useful measure of variance, I took the first song in the dataset, and let t = the vector of audio characteristics of that particular song. I then found the variance of r(t) (a matrix quadratic form in t with the estimated covariance matrix of the vector of coefficients of the polynomial).
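In base R terms, that calculation looks roughly like this, shown here with simulated stand-in data rather than the song data:

# Var(r(t)) = x' V x, where x is the model-matrix row for the chosen point and
# V is the estimated covariance matrix of the polynomial coefficients.
set.seed(1)
d <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
d$y <- 3 + d$x1 - 2*d$x2 + rnorm(1000)            # stand-in for the song data
fit <- lm(y ~ poly(x1, x2, degree = 2), data = d) # degree-2 polynomial fit
x <- model.matrix(fit)[1, , drop = FALSE]         # row for the "first song"
drop(x %*% vcov(fit) %*% t(x))                    # estimated Var(r(t))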
I fit degrees 1 through 12. (Again, to save on computation, I used interaction terms only of degree 2.) For cross-validated MAPE values, I used 20 random holdout sets of size 1000. Here is the output, with variance and MAPE in the top and bottom rows, respectively:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.02440784 0.02718803 0.03184419 0.04229401 0.04945168 0.05450264
[2,] 7.86296820 7.74149419 7.74222432 7.66320622 7.70638321 7.69907202
[,7] [,8] [,9] [,10] [,11] [,12]
[1,] 0.06572677 0.07980077 0.114887 -0.008906266 NA NA
[2,] 7.73367546 7.70928618 7.733230 7.793813681 9.980302 17.36951
Starting with degree 10, there are serious numerical issues, arising from exact or nearly exact collinearity among the various terms in the polynomial. Thus variance is not reported for degrees 11 and 12, and is even negative (due to roundoff error) for degree 10.
In the top row, one can see variance steadily increasing through degree 9. MAPE shows a general "U" trend, albeit a shallow one.
Research in machine learning has, among other things, produced two major mysteries, which we address now.
In many ML applications, especially those in which neural networks are used, there may be far more parameters than data points, i.e. p >> n. Yet excellent prediction accuracy on new data has often been reported in spite of such extreme overfitting.
Much speculation has been made as to why these phenomena can occur. My own speculation is as follows.
These phenomena occur in classification problems in which there is a very low error rate, 1 or 2% or even under 1%.
In other words, the classes are widely separated, i.e. there is a large margin in the SVM sense. Thus a large amount of variance can be tolerated in the fitted model, and there is room for myriad high-complexity functions that separate the classes perfectly or nearly so, even for test data.
Thus there are myriad fits that will achieve 0 training error, and the iterative solution method used by ML methods such as neural networks is often of a nature that the minimum-norm solution is arrived at. In other words, of the ocean of possible 0-training-error solutions, this is the one of smallest norm, which intuitively suggests that it is of small variance. This may produce very good test error.
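For a linear model, the minimum-norm least-squares solution can be computed via the Moore-Penrose pseudoinverse; here is a small sketch with p > n, using random data purely for illustration:

# With p > n there are infinitely many coefficient vectors giving 0 training
# error; the pseudoinverse picks the one of minimum norm.
library(MASS)
set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n*p), n, p)
y <- rnorm(n)
b <- ginv(X) %*% y                 # minimum-norm solution
max(abs(X %*% b - y))              # essentially 0: the training data is interpolated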
Here is an example, using a famous image recognition dataset. (The plot was generated by code available here.)
There are 10 classes, each shown in a different color. The UMAP method (think of it as a nonlinear PCA) was applied for dimension reduction, down to dimension 2. There are some isolated points here and there, but almost all the data is separated into 10 "islands." Between any 2 islands, there are tons of high-dimensional curves, say high-degree polynomials, that one can fit to separate them. So we see that overfitting is just not an issue, even with high-degree polynomials.
The phenomenon of double descent is sometimes observed, in which the familiar graph of test set error vs. model complexity consists of two U shapes rather than one. The most intriguing cases are those in which the first U ends at a point where the model fully interpolates the training data, i.e. 0 training error--and yet even further complexity actually reduces test set error.
Below is an example again with the song data, but with a linear model rather than a polynomial one. The horizontal axis is p, the number of predictors. We see the double-U, though not with improvement in the second U over the first.
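The general shape of such an experiment can be sketched as follows, with simulated stand-in data rather than the song data: for each p, fit by minimum-norm least squares and record mean absolute error on a holdout set.

# Double-descent sketch: test error vs. number of predictors p.
library(MASS)
set.seed(1)
n <- 100; pMax <- 200
X <- matrix(rnorm(2*n*pMax), 2*n, pMax)
y <- X[, 1:5] %*% rep(1, 5) + rnorm(2*n)          # only the first 5 predictors matter
trn <- 1:n; tst <- (n+1):(2*n)
testErr <- sapply(seq(5, pMax, by = 5), function(p) {
   b <- ginv(cbind(1, X[trn, 1:p])) %*% y[trn]    # minimum-norm least squares
   preds <- cbind(1, X[tst, 1:p]) %*% b
   mean(abs(y[tst] - preds))                      # test MAPE
})
# Plotting testErr against p will often show the two-U pattern, peaking near p = n.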
The argument explaining double descent generally made by researchers in this field is actually similar to the argument I made for the "overfitting with impunity" analysis above, regarding minimum-norm estimators.
By the way, the function penroseLM() in the regtools package (on top of which qeML is built) does find the minimum-norm b in the case of overfitting a linear model. As mentioned, this has been observed empirically, and some theoretical work has proved it under certain circumstances.