fitDistr: Fitting distributions to observations/Monte Carlo simulations

Description Usage Arguments Details Value Author(s) References Examples

View source: R/fitDistr.R

Description

This function fits 32 different continuous distributions by (weighted) NLS to the histogram of Monte Carlo simulation results as obtained by propagate or any other vector containing large-scale observations. Finally, the fits are sorted by ascending BIC.

Usage

1
2
fitDistr(object, nbin = 100, weights = FALSE, verbose = TRUE, 
         brute = c("fast", "slow"), plot = c("hist", "qq"),  distsel = NULL, ...)

Arguments

object

an object of class 'propagate' or a vector containing observations.

nbin

the number of bins in the histogram.

weights

numeric or logical. Either a numeric vector of weights, or if TRUE, the distributions are fitted with weights = 1/(counts per bin).

verbose

logical. If TRUE, steps of the analysis are printed to the console.

brute

complexity of the brute force approach. See 'Details'.

plot

if "hist", a plot with the "best" distribution (in terms of lowest BIC) on top of the histogram is displayed. If "qq", a QQ-Plot will display the difference between the observed and fitted quantiles.

distsel

a vector of distribution numbers to select from the complete cohort as listed below, e.g. c(1:10, 15).

...

other parameters to be passed to the plots.

Details

Fits the following 32 distributions using (weighted) residual sum-of-squares as the minimization criterion for nls.lm:
1) Normal distribution (dnorm) => https://en.wikipedia.org/wiki/Normal_distribution
2) Skewed-normal distribution (propagate:::dsn) => https://en.wikipedia.org/wiki/Skew_normal_distribution
3) Generalized normal distribution (propagate:::dgnorm) => https://en.wikipedia.org/wiki/Generalized_normal_distribution
4) Log-normal distribution (dlnorm) => https://en.wikipedia.org/wiki/Log-normal_distribution
5) Scaled and shifted t-distribution (propagate:::dst) => GUM 2008, Chapter 6.4.9.2.
6) Logistic distribution (dlogis) => https://en.wikipedia.org/wiki/Logistic_distribution
7) Uniform distribution (dunif) => https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)
8) Triangular distribution (propagate:::dtriang) => https://en.wikipedia.org/wiki/Triangular_distribution
9) Trapezoidal distribution (propagate:::dtrap) => https://en.wikipedia.org/wiki/Trapezoidal_distribution
10) Curvilinear Trapezoidal distribution (propagate:::dctrap) => GUM 2008, Chapter 6.4.3.1
11) Gamma distribution (dgamma) => https://en.wikipedia.org/wiki/Gamma_distribution
12) Inverse Gamma distribution (propagate:::dinvgamma) => https://en.wikipedia.org/wiki/Inverse-gamma_distribution
13) Cauchy distribution (dcauchy) => https://en.wikipedia.org/wiki/Cauchy_distribution
14) Laplace distribution (propagate:::dlaplace) => https://en.wikipedia.org/wiki/Laplace_distribution
15) Gumbel distribution (propagate:::dgumbel) => https://en.wikipedia.org/wiki/Gumbel_distribution
16) Johnson SU distribution (propagate:::dJSU) => https://en.wikipedia.org/wiki/Johnson_SU_distribution
17) Johnson SB distribution (propagate:::dJSB) => https://www.mathwave.com/articles/johnson_sb_distribution.html
18) Three-parameter Weibull distribution (propagate:::dweibull2) => https://en.wikipedia.org/wiki/Weibull_distribution
19) Two-parameter beta distribution (dbeta2) => https://en.wikipedia.org/wiki/Beta_distribution#Two_parameters_2
20) Four-parameter beta distribution (propagate:::dbeta2) => https://en.wikipedia.org/wiki/Beta_distribution#Four_parameters_2
21) Arcsine distribution (propagate:::darcsin) => https://en.wikipedia.org/wiki/Arcsine_distribution
22) Von Mises distribution (propagate:::dmises) => https://en.wikipedia.org/wiki/Von_Mises_distribution
23) Inverse Gaussian distribution (propagate:::dinvgauss) => https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution
24) Generalized Extreme Value distribution (propagate:::dgevd) => https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution
25) Rayleigh distribution (propagate:::drayleigh) => https://en.wikipedia.org/wiki/Rayleigh_distribution
26) Chi-square distribution (dchisq) => https://en.wikipedia.org/wiki/Chi-squared_distribution
27) Exponential distribution (dexp) => https://en.wikipedia.org/wiki/Exponential_distribution
28) F-distribution (df) => https://en.wikipedia.org/wiki/F-distribution
29) Burr distribution (dburr) => https://en.wikipedia.org/wiki/Burr_distribution
30) Chi distribution (dchi) => https://en.wikipedia.org/wiki/Chi_distribution
31) Inverse Chi-square distribution (dinvchisq) => https://en.wikipedia.org/wiki/Inverse-chi-squared_distribution
32) Cosine distribution (dcosine) => https://en.wikipedia.org/wiki/Raised_cosine_distribution

All distributions are fitted with a brute force approach, in which the parameter space is extended over three orders of magnitude (0.1, 1, 10)\times β_i when brute = "fast", or five orders (0.01, 0.1, 1, 10, 100)\times β_i when brute = "slow". Approx. 20-90s are needed to fit for the fast version, depending mainly on the number of bins.

The goodness-of-fit (GOF) is calculated with BIC from the (weighted) log-likelihood of the fit:

\rm{ln}(L) = 0.5 \cdot ≤ft(-N \cdot ≤ft(\rm{ln}(2π) + 1 + \rm{ln}(N) - ∑_{i=1}^n log(w_i) + \rm{ln}≤ft(∑_{i=1}^n w_i \cdot x_i^2\right) \right) \right)

\rm{BIC} = - 2\rm{ln}(L) + (N - k)ln(N)

with x_i = the residuals from the NLS fit, N = the length of the residual vector, k = the number of parameters of the fitted model and w_i = the weights.

In contrast to some other distribution fitting softwares (i.e. Easyfit, Mathwave) that use residual sum-of-squares/Anderson-Darling/Kolmogorov-Smirnov statistics as GOF measures, the application of BIC accounts for increasing number of parameters in the distribution fit and therefore compensates for overfitting. Hence, this approach is more similar to ModelRisk (Vose Software) and as employed in fitdistr of the 'MASS' package. Another application is to identify a possible distribution for the raw data prior to using Monte Carlo simulations from this distribution. However, a decent number of observations should be at hand in order to obtain a realistic estimate of the proper distribution. See 'Examples'.
The code for the density functions can be found in file "distr-densities.R".

IMPORTANT: It can be feasible to set weights = TRUE in order to give more weight to bins with low counts. See 'Examples'. ALSO: Distribution fitting is highly sensitive to the number of defined histogram bins, so it is advisable to change this parameter and inspect if the order of fitted distributions remains stable.

Value

A list with the following items:

stat: the by BIC value ascendingly sorted distribution names, including RSS and MSE.
fit: a list of the results from nls.lm for each distribution model, also sorted ascendingly by BIC values.
par: a list of the estimated parameters of the models in fit.
se: a list of the parameters' standard errors, calculated from the square root of the covariance matrices diagonals.
dens: a list with all density function used for fitting, sorted as in fit.
bestfit: the best model in terms of lowest BIC.
bestpar: the parameters of bestfit.
bestse: the parameters' standard errors of bestfit.
fitted: the fitted values of bestfit.
residuals: the residuals of bestfit.

Author(s)

Andrej-Nikolai Spiess

References

Continuous univariate distributions, Volume 1.
Johnson NL, Kotz S and Balakrishnan N.
Wiley Series in Probability and Statistics, 2.ed (2004).

A guide on probability distributions.
R-forge distributions core team.
http://dutangc.free.fr/pub/prob/probdistr-main.pdf.

Univariate distribution relationships.
Leemis LM and McQueston JT.
The American Statistician (2008), 62: 45-53.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
## Not run: 
## Linear example, small error
## => Normal distribution.
EXPR1 <- expression(x + 2 * y)
x <- c(5, 0.01)
y <- c(1, 0.01)
DF1 <- cbind(x, y)
RES1 <- propagate(expr = EXPR1, data = DF1, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES1)

## Ratio example, larger error
## => Gamma distribution.
EXPR2 <- expression(x/2 * y)
x <- c(5, 0.1)
y <- c(1, 0.02)
DF2 <- cbind(x, y)
RES2 <- propagate(expr = EXPR2, data = DF2, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES2)

## Exponential example, large error
## => Log-Normal distribution.
EXPR3 <- expression(x^(2 * y))
x <- c(5, 0.1)
y <- c(1, 0.1)
DF3 <- cbind(x, y)
RES3 <- propagate(expr = EXPR3, data = DF3, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES3)

## Rectangular input distributions result
## in Curvilinear Trapezoidal output distribution.
A <- runif(100000, 20, 25)
B <- runif(100000, 3, 3.5)
DF4 <- cbind(A, B)
EXPR4 <- expression(A + B)
RES4 <- propagate(EXPR4, data = DF4, type = "sim", 
                 use.cov = FALSE, do.sim = TRUE)
fitDistr(RES4)        

## Fitting with 1/counts as weights.
EXPR5 <- expression(x + 2 * y)
x <- c(5, 0.05)
y <- c(1, 0.05)
DF5 <- cbind(x, y)
RES5 <- propagate(expr = EXPR5, data = DF5, type = "stat", 
                  do.sim = TRUE, verbose = TRUE, weights = TRUE)
fitDistr(RES5)

## Using only selected distributions.
EXPR6 <- expression(x + sin(y))
x <- c(5, 0.1)
y <- c(1, 0.2)
DF6 <- cbind(x, y)
RES6 <- propagate(expr = EXPR6, data = DF6, type = "stat", 
                  do.sim = TRUE)
fitDistr(RES6, distsel = c(1:10, 15, 28))

## End(Not run)

anspiess/propagate documentation built on May 14, 2019, 3:09 a.m.