# hellinger.disc: Estimate a Discrete Mixture's Complexity Based on Hellinger Distance

## Description

Estimation of a discrete mixture's complexity as well as its component weights and parameters by minimizing the squared Hellinger distance to the empirical probability mass function.

## Usage

```r
hellinger.disc(obj, j.max = 10, threshold = "SBC", control = c(trace = 0))

hellinger.boot.disc(obj, j.max = 10, B = 100, ql = 0.025, qu = 0.975,
                    control = c(trace = 0), ...)
```

## Arguments

- `obj`: object of class `datMix`.
- `j.max`: integer stating the maximal number of components to be considered.
- `threshold`: function or character string in `c("AIC", "SBC")` specifying which threshold should be used to compare two mixture estimates of complexities j and j+1. If the difference in minimized squared distances is smaller than the relevant threshold, j is returned as the complexity estimate.
- `B`: integer specifying the number of bootstrap replicates.
- `ql`: numeric between 0 and 1 specifying the lower quantile to which the observed difference in minimized squared distances is compared.
- `qu`: numeric between 0 and 1 specifying the upper quantile to which the observed difference in minimized squared distances is compared.
- `control`: control list of optimization parameters, see `solnp`.
- `...`: further arguments passed to the `boot` function.

## Details

Define the complexity of a finite discrete mixture F as the smallest integer p, such that its probability mass function (pmf) f can be written as

f(x) = w_1*g(x;θ_1) + … + w_p*g(x;θ_p).
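As a small illustration of this definition (not package code; `dmix_pois` is a hypothetical helper), the pmf of a p = 3 component Poisson mixture can be written out directly:

```r
## illustrative sketch: f(x) = w_1*dpois(x, lambda_1) + ... + w_3*dpois(x, lambda_3)
dmix_pois <- function(x, w, lambda) {
  rowSums(sapply(seq_along(w), function(i) w[i] * dpois(x, lambda[i])))
}
dmix_pois(0:5, w = c(0.45, 0.45, 0.1), lambda = c(1, 5, 10))
```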

Further, let g, f be two probability mass functions. The squared Hellinger distance between g and f is given by

H^2(g,f) = ∑_x (√{g(x)} - √{f(x)})^2,

where √{g(x)} and √{f(x)} denote the square roots of the two probability mass functions at the point x, and the sum runs over the support.
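As a hedged illustration of this distance (not the package's internal computation), it can be evaluated between an empirical pmf and a candidate pmf on a common finite grid of support points:

```r
## squared Hellinger distance between two pmfs given on the same support
hellinger_sq <- function(g, f) sum((sqrt(g) - sqrt(f))^2)

## example: empirical pmf of Poisson(5) data vs. the Poisson(5) pmf on 0:20
set.seed(1)
x       <- rpois(500, lambda = 5)
support <- 0:20
emp     <- tabulate(factor(x, levels = support)) / length(x)
hellinger_sq(emp, dpois(support, lambda = 5))
```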

To estimate p, `hellinger.disc` iteratively increases the assumed complexity j and finds the “best” estimates of the pmfs of mixtures with j and j+1 components, i.e. the parameters that minimize the squared Hellinger distance to the empirical probability mass function. Once these parameters have been obtained, the difference in minimized squared distances is compared to a predefined `threshold`. If this difference is smaller than the threshold, the algorithm terminates and the complexity is estimated as j; otherwise j is increased by 1 and the procedure is repeated. The predefined thresholds are the `"AIC"`, given by (d+1)/n, and the `"SBC"`, given by ((d+1)log(n))/(2n), where n is the sample size and d the number of component parameters, i.e. θ ∈ R^d. Note that a customized threshold function may only take the arguments `j` and `n`, so if the user wants the number of component parameters d to enter the threshold, it has to be entered explicitly in the function body.
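A sketch of such a customized threshold is given below; it reuses the `datMix` object `pois.dM` constructed in the Examples section, and the SBC-type formula with d = 1 (a Poisson component has a single parameter) is only an assumption for illustration:

```r
## user-supplied threshold: may only take the arguments j and n, so the
## number of component parameters d is fixed inside the function body
my.threshold <- function(j, n) {
  d <- 1                          # single component parameter (e.g. lambda)
  ((d + 1) * log(n)) / (2 * n)    # SBC-type threshold
}

## assuming the 'datMix' object 'pois.dM' from the Examples section below
res.custom <- hellinger.disc(pois.dM, j.max = 10, threshold = my.threshold)
```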

`hellinger.boot.disc` works similarly to `hellinger.disc`, except that the difference in squared distances is not compared to a predefined threshold but to a value generated by a bootstrap procedure. At every iteration, the function sequentially tests p = j versus p = j+1 for j = 1, 2, …, using a parametric bootstrap to generate `B` samples of size n from a j-component mixture with the previously calculated “best” parameter values. For each bootstrap sample, the “best” estimates corresponding to pmfs with j and j+1 components are calculated again, as well as their difference in squared Hellinger distances from the empirical probability mass function. The null hypothesis H_0: p = j is rejected and j increased by 1 if the original difference in squared distances lies outside the interval defined by the `ql` and `qu` empirical quantiles of the bootstrapped differences; otherwise, j is returned as the complexity estimate.
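A minimal usage sketch of the bootstrap variant (again assuming the `pois.dM` object from the Examples section below; the small `B` is only meant to limit run time):

```r
set.seed(1)
res.boot <- hellinger.boot.disc(pois.dM, j.max = 5, B = 50,
                                ql = 0.025, qu = 0.975)
plot(res.boot)
```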

To calculate the minimum of the Hellinger distance (and the corresponding parameter values), the solver `solnp` is used. The initial values supplied to the solver are calculated as follows: the data is clustered into j groups by the function `clara`, and the data corresponding to each group is given to `MLE.function` (if supplied to the `datMix` object `obj`; otherwise numerical optimization is used here as well). The relative sizes of the groups are taken as the initial component weights and the MLEs as the initial component parameter estimates.
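A rough sketch of this initialisation scheme (not the package's internal code; `init_vals` and its arguments are hypothetical):

```r
library(cluster)   # for clara()

init_vals <- function(dat, j, mle.fct = mean) {
  cl     <- clara(dat, k = j)$clustering          # cluster the data into j groups
  w0     <- as.numeric(table(cl)) / length(dat)   # relative group sizes -> initial weights
  theta0 <- sapply(split(dat, cl), mle.fct)       # per-group MLE -> initial parameters
  list(w = w0, theta = theta0)
}

init_vals(rpois(500, lambda = 5), j = 2)
```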

## Value

Object of class `paramEst` with the following attributes:

- `dat`: data based on which the complexity is estimated.
- `dist`: character string stating the (abbreviated) name of the component distribution, such that the function `ddist` evaluates its probability mass function and `rdist` generates random numbers.
- `ndistparams`: integer specifying the number of parameters identifying the component distribution, i.e. `ndistparams` = d if θ ∈ R^d.
- `formals.dist`: string vector specifying the names of the formal arguments identifying the distribution `dist` and used in `ddist` and `rdist`; e.g. for a Gaussian mixture (`dist = "norm"`) these are `mean` and `sd`, as they are the formal arguments used by `dnorm` and `rnorm`.
- `discrete`: logical indicating whether the underlying mixture distribution is discrete; always `TRUE` in this case.
- `mle.fct`: the `MLE.function` attribute of `obj`.
- `pars`: if the complexity estimate equals some j, `pars` is a numeric vector of size (d+1)*j-1 containing the component weight and parameter estimates, ordered as (w_1, ..., w_{j-1}, θ_{1,1}, ..., θ_{1,j}, θ_{2,1}, ..., θ_{d,j}).
- `values`: numeric vector of function values gone through during optimization at iteration j, the last entry being the value at the optimum.
- `convergence`: indicates whether the solver has converged (0) or not (1 or 2) at iteration j.
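Assuming the fields listed above are stored as R attributes of the returned object (and reusing `res` from the Examples section below), they can be inspected directly:

```r
names(attributes(res))    # fields available on the 'paramEst' object
attr(res, "pars")         # component weight and parameter estimates
attr(res, "convergence")  # 0 if solnp converged at the final iteration
```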

## References

M.-J. Woo and T. Sriram, "Robust estimation of mixture complexity for count data", Computational Statistics and Data Analysis 51, 4379-4392, 2007.

## See Also

`L2.disc` for the same estimation method using the L2 distance, `hellinger.cont` for the same estimation method for continuous mixtures, `solnp` for the solver, `datMix` for the creation of the `datMix` object.

## Examples

```r
## create 'Mix' object
poisMix <- Mix("pois", w = c(0.45, 0.45, 0.1), lambda = c(1, 5, 10))

## create random data based on 'Mix' object (gives back 'rMix' object)
set.seed(0)
poisRMix <- rMix(1000, obj = poisMix)

## create 'datMix' object for estimation

# generate list of parameter bounds
poisList <- vector(mode = "list", length = 1)
names(poisList) <- "lambda"
poisList$lambda <- c(0, Inf)

# generate MLE function
MLE.pois <- function(dat) {
  mean(dat)
}

# generate 'datMix' object
pois.dM <- RtoDat(poisRMix, theta.bound.list = poisList,
                  MLE.function = MLE.pois)

## complexity and parameter estimation
set.seed(0)
res <- hellinger.disc(pois.dM)
plot(res)
```
