fitCDF2: Nonlinear fit of a commutative distribution function
In genomaths/usefr: Utility Functions for Statistical Analyses

fitCDF2

R Documentation

Nonlinear fit of a commutative distribution function

Description

Usually the parameter estimation of a cumulative distribution function (*CDF*) are accomplished using the corresponding probability density function (*PDF*). Different optimization algorithms can be used to accomplished this task and different algorithms can yield different estimated parameters. Hence, why not try to fit the CDF directly?

Usage

fitCDF2(varobj, ...)

## S4 method for signature 'numeric'
fitCDF2(
  varobj,
  distNames,
  plot = FALSE,
  plot.num = 1L,
  distf = NULL,
  start = NULL,
  loss.fun = c("linear", "huber", "smooth", "cauchy", "arctg"),
  min.val = NULL,
  only.info = FALSE,
  maxiter = 1024,
  maxfev = 1e+05,
  ptol = 1e-12,
  nls.model = FALSE,
  algorithm = "default",
  xlabel = "x",
  mar = c(4, 4, 3, 1),
  mgp = c(2.5, 0.6, 0),
  las = 1,
  cex.main = 1,
  cex.text = 0.8,
  cex.point = 0.5,
  verbose = TRUE,
  ...
)

## S4 method for signature 'list_OR_matrix_OR_dataframe'
fitCDF2(
  varobj,
  distNames,
  plot = FALSE,
  plot.num = 1L,
  distf = NULL,
  start = NULL,
  loss.fun = c("linear", "huber", "smooth", "cauchy", "arctg"),
  min.val = NULL,
  only.info = FALSE,
  maxiter = 1024,
  maxfev = 1e+05,
  ptol = 1e-12,
  nls.model = FALSE,
  algorithm = "default",
  xlabel = "x",
  mar = c(4, 4, 3, 1),
  mgp = c(2.5, 0.6, 0),
  las = 1,
  cex.main = 1,
  cex.text = 0.8,
  cex.point = 0.5,
  num.cores = 1L,
  verbose = TRUE,
  ...
)

Arguments

`varobj`	A a vector, a named list, a matrix or a data.frame, containing the observations from the variable for which the CDF parameters will be estimated. When the argument is a matrix or a data.frame, the columns must be named, carrying the objective variables.
`...`	(Optional) Further graphical parameters (see `par`). Graphical parameter will simultaneously affect all the plots.
`distNames`	a vector of distribution numbers to select from the listed below in details section, e.g. c(1:10, 15). If 'distNames' is not any of current 20 named distributions, then it can be any arbitrary character string, but the argument 'distf' must be given (see below).
`plot`	Logical. Default FALSE Whether to produce the plots for the best fitted CDF.
`plot.num`	The number of distributions to be plotted.
`distf`	A character string naming a cumulative distribution function(s) (CDF) present in the R session environment . For example, gamma or norm, etc, from where, internally, we can get: density, distribution function, quantile function and random generation as: dnorm, pnorm, qnorm, and rnorm, respectively. If the function is not present in the environment, then an error will be returned. It must given only if 'distNames' is not any of current 20 named distributions (see details below). Default is NULL.
`start`	A named numerical vector giving the parameters to be optimized with initial values or a list of numerical vectors (only when varobj is a list, a matrix or a data.frame). This can be omitted for some of the named distributions (see Details). This argument will be used if provided for only one distribution. The default parameter values are: norm = c( mean = MEAN, sd = SD ) lnorm = c(meanlog = mean( log1p(X), na.rm = TRUE), sdlog = sd( log1p( X ), na.rm = TRUE)) hnorm = c(theta = sqrt(pi)/(SDsqrt(2))) gnorm = c( mean = MEAN, sigma = SD, beta = 2) tgnorm = c( mean = MEAN, sigma = SD, beta = 2) laplace = c( mean = MEAN, sigma = sqrt( VAR)) gamma = c(shape_scale(X, gg = FALSE)) gamma3p = c(shape_scale(X, gg = FALSE), mu = 0), ggamma = c(shape_scale(X, gg = TRUE), mu = MIN, psi = 1) ggamma = c(shape_scale(X, gg = TRUE), psi = 1) weibull = c( shape = log( 2 ), scale = Q) weibull3p = c( mu = MIN, shape = log( 2 ), scale = Q) beta = c(shape1 = 1, shape2 = 2) beta3 = c(shape1 = 1, shape2 = 2, a = MIN) beta4 = c(shape1 = 2, shape2 = 3, a=0.9 MIN, b=1.1 * MAX) bweibull = c(alpha=1, beta=2, shape = log( 2 ), scale = Q) gbeta = c( shape1 = 1, shape2 = 2, lambda = 1) rayleigh = c( sigma = SD ) exp = c( rate = 1) exp2 = c( rate = 1, mu = 0) geom = c(prob = ifelse(MEAN > 0, 1/(1 + MEAN), 1)) bgeom = c(shape1 = 1, shape2 = 1) lgamma = shape_scale(log1p(X), gg = FALSE) lpgamma3p = c(shape_scale(log1p(X), gg = FALSE), mu = 0)
`loss.fun`	Loss function(s) used in the regression (see (Loss function)). After `z = 1/2 * sum((f(x) - y)^2)` we have: "linear": linear function which gives a standard least squares: `loss(z) = z`. "huber": Huber loss, `loss(z) = ifelse(z <= 1, z, sqrt(z) -1)`. "smooth": Smooth approximation to the sum of residues absolute values: `loss(z) = 2*(sqrt(z + 1) - 1)`. "cauchy": Cauchy loss: `loss(z) = log(z + 1)`. "arctg": arc-tangent loss function: `loss(x) = atan(z)`.
`min.val`	A number denoting the lower bound of the domain where CDF is defined. For example, for Weibull and GGamma *min.val = 0*.
`only.info`	Logic. Default TRUE. If true, only information about the parameter estimation is returned.
`maxiter`, `maxfev`, `ptol`	Parameters to control of various aspects of the Levenberg-Marquardt algorithm through function `nls.lm.control` from minpack.lm package.
`nls.model`	Logical. Whether to return the best fitted model as an object from nlsModel class. Default is FALSE. If TRUE, then the estimated parameters are used new fitting with `nls` function.
`algorithm`	Only if nls.model = TRUE. The same as for `nls` function.
`xlabel`	(Optional) Label for variable *varobj. Default is xlabel = "x"*.
`mar`, `mgp`, `las`, `cex.main`	(Optional) Graphical parameters (see `par`).
`cex.text`, `cex.point`	Numerical value to scale text and points.
`verbose`	Logic. If TRUE, prints the function log to stdout

Details

This function works as function fitCDF, but without parallel computation. The reason is the parallelization fails in Windows OS. CDFs defined out the package are not passed to the parallel processes on Windows (so far).

The nonlinear fit (NLF) problem for CDFs is addressed with Levenberg-Marquardt algorithm implemented in function nls.lm from package *minpack.lm*. The Stein's rho for adjusted R squared (rho) is applied as an estimator of the average cross-validation predictive power [1]. This function is inspired in a script for the function fitDistr from the package propagate [2]. Some parts or script ideas from function fitDistr are used, but here we to estimate CDF and not the PDF as in the case of "fitDistr. More informative results are given now. The studentized residuals are provided as well. The list (so far) of possible CDFs is:

Normal (Wikipedia)
Log-normal (Wikipedia). This This function is set to fit log(1+x). Users can transform their variable by themself and then try the fitting to Normal distribution.
Half-normal (Wikipedia). An Alternatively using a scaled precision (inverse of the variance) parametrization (to avoid issues if \sigma is near zero), obtained by setting \theta=sqrt(\pi)/\sigma*sqrt(2).
Generalized Normal (Wikipedia)
T-Generalized Normal [3].
Laplace (Wikipedia)
Gamma (Wikipedia)
3P Gamma [4].
Generalized 4P Gamma [4] (Wikipedia)
Generalized 3P Gamma [4].
Weibull (Wikipedia)
3P Weibull (Wikipedia)
Beta (Wikipedia)
3P Beta (Wikipedia)
4P Beta (Wikipedia)
Beta-Weibull ReliaWiki
Generalized Beta (Wikipedia)
Rayleigh (Wikipedia)
Exponential (Wikipedia)
2P Exponential (Wikipedia)
Geometric (Wikipedia)
"Beta-Geometric" https://www.statisticshowto.com/beta-geometric-distribution/
Log-Gamma (Mathematica)
Log-Gamma 3P (Mathematica)

Where, shape_scale function is an internal function that can be retrieve by typing: usefr:::shape_scale.

Value

After return the plots, a list with following values is provided:

aic: Akaike information creterion
fit: list of results of fitted distribution, with parameter values
bestfit: the best fitted distribution according to AIC
fitted: fitted values from the best fit
rstudent: studentized residuals
residuals: residuals

After cdf = fitCDF2( varobj, ...), attributes( cdf$bestfit ) shows the list of objects carry on cdf$bestfit:

names: "par" "hessian" "fvec" "info" "message" "diag" "niter" "rsstrace" "deviance"
class: "nls.lm"

And fitting details can be retrieved with summary(cdf$bestfit)

Author(s)

Robersy Sanchez (https://genomaths.com).

References

Stevens JP. Applied Multivariate Statistics for the Social Sciences. Fifth Edit. Routledge Academic; 2009.
Andrej-Nikolai Spiess (2014). propagate: Propagation of Uncertainty. R package version 1.0-4. http://CRAN.R-project.org/package=propagate
Abramowitz, M. and Stegun, I. A. (1972) Handbook of Mathematical Functions. New York: Dover. Chapter 6: Gamma and Related Functions.
Hand-book on STATISTICAL DISTRIBUTIONS for experimentalists (pag 73) by Christian Walck. Particle Physics Group Fysikum. University of Stockholm (e-mail: walck@physto.se).

Examples

set.seed(1230)
x1 <- rnorm(10000, mean = 0.5, sd = 1)
cdfp <- fitCDF2(x1, distNames = "Normal", plot = FALSE)
summary(cdfp$bestfit)

## Add some cosmetics to the plots
cdfp <- fitCDF2(x1,
    distNames = "Normal", xlabel = "My Nice Variable Label",
    plot = T, font.lab = 3, font = 2, font.axis = 2, family = "serif",
    cex.lab = 1.3, cex.axis = 1.3
)

## Fitting a Weibull distribution with 3 paramaters
x1 <- rweibull3p(1000, shape = 0.5, scale = 1, mu = 0.1)
cdfp <- fitCDF2(x1,
    distNames = "3P Weibull",
    xlabel = "My Nice Variable Label",
    plot = T, font.lab = 3, font = 2, font.axis = 2, family = "serif",
    cex.lab = 1.3, cex.axis = 1.3, cex.main = 1.1,
    mgp = c(2.5, 1, 0)
)

genomaths/usefr documentation built on June 10, 2025, 9:18 p.m.