Say one has some data and wishes to find a parametric distribution family. We address two questions here:
Why is this desirable?
How might it be done?
Many statistical and probability models assume there are some parametric distributions driving the various variables. Even in more basic settings, having a parametric model for our data allows us to compactly describe our data, with a small number of parameters sufficing.
In statistics, we treat our data as a sample from a parent population, in which the variable of interest, say X, has unknown density fX(t). We estimate that density from our data.
This is a widely-used dataset, available for instance from the mlbench package:
> library(mlbench)
> data(PimaIndiansDiabetes2)
> pima <- PimaIndiansDiabetes2 # shorter to type
> head(pima) # always look at your data
pregnant glucose pressure triceps insulin mass pedigree age diabetes
1 6 148 72 35 NA 33.6 0.627 50 pos
2 1 85 66 29 NA 26.6 0.351 31 neg
3 8 183 64 NA NA 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.288 33 pos
6 5 116 74 NA NA 25.6 0.201 30 neg
> bmi <- pima$mass
> bmi <- na.exclude(bmi) # exclude any row with NAs
So, let's plot the data. We'll use R's basic histogram function, hist(). A more advanced alternative is density(), which plots a smooth curve. In calling hist(bmi,freq=FALSE) (that second argument means we want area = 1.0), we produce this:
Remember, this is a sample estimate of fX, so it can be used for deciding whether to use a given parametric model.
We see a rather steep rise from 0, quickly reaching a peak, then a gradual tapering off toward 0. This suggests that the gamma distribution family may work well.
There is of course the question as to how well this data is fit by the gamma family. We will return to this later.
Once we decide to use a certain parametric family, we must decide which member of that family to use. In other words, what parameter values?
The two classic ways to estimate the parameters are Maximum Likelihood Estimation (MLE) and the Method of Moments (MM). I'm partial to MM, and will use that here.
The population kth moment of X is defined to be E(Xk). It can be estimated by its sample analog
Mk = (1/n) Σi=1n Xik
where our data are X1,...,Xn. (The two quantities are analogous because E(Xk) is the average of Xk in the population, while Mk is the average of Xk in the sample.) So M1 is simply our sample mean, "X-bar."
One may also use central moments, e.g. Var(X) and s2. Note that s2 = M2 - (M1)2.
The idea of MM is to set the population moments equal to their sample analogs. Since the former are functions of the parameters, we can solve for the parameters, which serve actually as our estimated parameters.
If we have q parameters, we form equations corresponding to the first q moments. For the gamma family with paraemeters r and λ, we use the first two moments. For convenience, we'll use variance rather than E(X2).
For the gamma family, EX = r/λ and Var(X) = r/λ2. So our equations are
M1 = rest / λest
s2 = rest / λest2
Luckily, these are easy to solve. We divide the first by the second, yielding
λest = M1 / s2
and
rest = M12 / s2
Let's superimpose the fitted gamma density onto the histogram:
> curve(dgamma(x,rest,lambest),0,70,add=TRUE)
(The function curve() plots a function x, which in this case has range (070); add=TRUE means superimpose this new graph onto the old one.)
Our fitted parametric density estimate is rather close to the one that is model-free, so the parametric model seems pretty good.
Now let's try comparing CDFs. Just as a histogram is a model-free estimate of fX, a model-free estimate of FX is the empirical CDF:
Fest(t) = proportion of Xi that are ≤ t.
The R function ecdf computes this and sets it up for plotting. (Actually, ecdf returns a function of class 'ecdf', so calling plot() on the return value invokes plot.ecdf().) Let's go ahead:
> plot(ecdf(bmi),pch=19,cex=0.1)
> curve(pgamma(x,ch,lh),0,70,add=TRUE) # range of data is (0,70)
The ECDF and the fitted CDF are almost identical, wonderful. We were lucky here; in most real applications, we do not achieve such a close fit, even though the fit is usable.
Related software:
The regtools package includes a function mm() for computing Method of Moments estimators, and base R has the mle() function for MLE.
Well then, how well did our gamma model fit?
First, note that there are several possible sources of discrepancies between the histogram and the fitted gamma:
The true fX is not gamma. No model is ever perfectly right. Keep in mind the famous George Box quote, "All models are false, but some are useful."
We have just a sample of values from the population, so even if our gamma model were exactly correct for the true fX, our sample estimate would not form a perfect match. The larger the sample, the more likely we will have a close fit, but never exact.
Again due to our finite sample, we have nonzero widths for the histogram intervals.
So, assessing fit is a matter of attempting to decide how much of the discrepancy is due to having a finite sample, and how much is due to model error.
There are formal methods of assessment, known as goodness of fit tests, but these days hypothesis testing and p-values are frowned upon for any type of statistical situation, for good reason. Testing is especially irrelevant in assessing model fit. They don't tell us whether our fit is "good enough" for our purposes.
If you feel you must do formal assessment, I suggest forming a Kolmogorov-Smirnov confidence band. We will not pursue that here.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.