L2.disc: Estimate a Discrete Mixture's Complexity Based on L2 Distance

Description

Estimation of a discrete mixture's complexity as well as its component weights and parameters by minimizing the squared L2 distance to the empirical probability mass function.

Usage

```r
L2.disc(obj, j.max = 10, n.inf = 1000, threshold = "SBC",
        control = c(trace = 0))

L2.boot.disc(obj, j.max = 10, n.inf = 1000, B = 100, ql = 0.025,
             qu = 0.975, control = c(trace = 0), ...)
```

Arguments

- `obj`: object of class `datMix`.
- `j.max`: integer stating the maximal number of components to be considered.
- `n.inf`: integer; the L2 distance contains an infinite sum, which will be approximated by a sum ranging from 0 to `n.inf`.
- `threshold`: function or character string in `c("LIC", "SBC")` specifying which threshold should be used to compare two mixture estimates of complexities j and j+1. If the difference in minimized squared distances is smaller than the relevant threshold, j will be returned as the complexity estimate.
- `B`: integer specifying the number of bootstrap replicates.
- `ql`: numeric between 0 and 1 specifying the lower quantile to which the observed difference in minimized squared distances will be compared.
- `qu`: numeric between 0 and 1 specifying the upper quantile to which the observed difference in minimized squared distances will be compared.
- `control`: control list of optimization parameters, see `solnp`.
- `...`: further arguments passed to the `boot` function.

Details

Define the complexity of a finite discrete mixture F as the smallest integer p, such that its probability mass function (pmf) f can be written as

f(x) = w_1*g(x;θ_1) + … + w_p*g(x;θ_p).

Further, let g, f be two probability mass functions. The squared L2 distance between g and f is given by

L_2^2(g,f) = ∑_{x=0}^{∞} (g(x) - f(x))^2.
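As a minimal sketch (not the package's internal code), the squared L2 distance between a candidate Poisson mixture pmf and the empirical pmf can be computed with base R, truncating the infinite sum at `n.inf` as the function does; the helper name `L2.squared` is illustrative.

```r
## Illustrative sketch: truncated squared L2 distance between a Poisson
## mixture pmf (weights w, rates lambda) and the empirical pmf of dat.
L2.squared <- function(dat, w, lambda, n.inf = 1000) {
  x <- 0:n.inf
  ## mixture pmf on the truncated support: sum_k w_k * dpois(x, lambda_k)
  mix.pmf <- rowSums(sapply(seq_along(w),
                            function(k) w[k] * dpois(x, lambda[k])))
  ## empirical pmf on the same support
  emp.pmf <- tabulate(dat + 1, nbins = n.inf + 1) / length(dat)
  sum((mix.pmf - emp.pmf)^2)
}

## e.g. for data from a single Poisson(3):
set.seed(1)
dat <- rpois(1000, 3)
L2.squared(dat, w = 1, lambda = 3)   # small, since the model fits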

To estimate p, `L2.disc` iteratively increases the assumed complexity j and finds the “best” estimates for the pmfs of mixtures with j and j+1 components by calculating the parameters that minimize the squared L2 distances to the empirical probability mass function. The infinite sum in the objective function is approximated by a sum ranging from 0 to `n.inf`, set to 1000 by default. Once the “best” parameters have been obtained, the difference in minimized squared distances is compared to a predefined `threshold`. If this difference is smaller than the threshold, the algorithm terminates and the true complexity is estimated as j; otherwise j is increased by 1 and the procedure starts over. The predefined thresholds are the `"LIC"`, given by (0.6*log((j+1)/j))/n, and the `"SBC"`, given by (0.6*log(n)*log((j+1)/j))/n, n being the sample size. Note that, if a customized function is to be used, it may only take the arguments `j` and `n`.
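Written out as R functions of j and n, the two predefined thresholds look as follows; a user-supplied threshold must have the same signature. The custom example `my.threshold` is hypothetical, not part of the package.

```r
## The two predefined thresholds from the Details section, as functions
## of the current complexity j and the sample size n:
LIC <- function(j, n) 0.6 * log((j + 1) / j) / n
SBC <- function(j, n) 0.6 * log(n) * log((j + 1) / j) / n

## A customized threshold must likewise take only `j` and `n`, e.g.
## (hypothetical example):
my.threshold <- function(j, n) log((j + 1) / j) / sqrt(n)
```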

`L2.boot.disc` works similarly to `L2.disc`, except that the difference in squared distances is compared not to a predefined threshold but to an interval generated by a bootstrap procedure. At every iteration, the function sequentially tests p = j versus p = j+1 for j = 1, 2, …, using a parametric bootstrap to generate `B` samples of size n from a j-component mixture with the previously calculated “best” parameter values. For each bootstrap sample, the “best” estimates corresponding to pmfs with j and j+1 components are calculated again, as well as their difference in squared L2 distances from the empirical probability mass function. The null hypothesis H_0: p = j is rejected and j increased by 1 if the original difference in squared distances lies outside the interval [ql, qu], specified by the `ql` and `qu` empirical quantiles of the bootstrapped differences. Otherwise, j is returned as the complexity estimate.
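The decision rule at each iteration can be sketched schematically as below; `diff.obs` and `diff.boot` are illustrative names for the observed difference in minimized squared distances and the `B` bootstrapped differences, not objects exposed by the package.

```r
## Schematic sketch of the bootstrap decision rule: keep j (i.e. accept
## H_0: p = j) if the observed difference falls inside the empirical
## [ql, qu] quantile interval of the bootstrapped differences.
keep.j <- function(diff.obs, diff.boot, ql = 0.025, qu = 0.975) {
  q <- quantile(diff.boot, probs = c(ql, qu))
  diff.obs >= q[1] && diff.obs <= q[2]   # TRUE: return j; FALSE: try j+1
}
```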

To calculate the minimum of the L2 distance (and the corresponding parameter values), the solver `solnp` is used. The initial values supplied to the solver are obtained as follows: the data are clustered into j groups by the function `clara`, and the data within each group are passed to `MLE.function` (if it was supplied to the `datMix` object `obj`; otherwise numerical optimization is used here as well). The relative sizes of the groups serve as initial component weights and the MLEs as initial component parameter estimates.
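A sketch of this initial-value construction for a one-parameter component distribution, assuming `clara` from the `cluster` package and a user-supplied `MLE.function`; the helper name `init.vals` is illustrative, and the package's actual implementation may differ in detail.

```r
## Illustrative sketch of the initial values handed to solnp:
## cluster the data into j groups, then use group sizes as weights and
## per-group MLEs as component parameter starting values.
library(cluster)

init.vals <- function(dat, j, MLE.function) {
  cl <- clara(dat, k = j)
  w0 <- as.numeric(table(cl$clustering)) / length(dat)        # weights
  theta0 <- sapply(split(dat, cl$clustering), MLE.function)   # parameters
  list(w = w0, theta = theta0)
}
```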

Value

Object of class `paramEst` with the following attributes:

- `dat`: data based on which the complexity is estimated.
- `dist`: character string stating the (abbreviated) name of the component distribution, such that the function `ddist` evaluates its density/mass function and `rdist` generates random numbers.
- `ndistparams`: integer specifying the number of parameters identifying the component distribution, i.e. if the component parameter θ lies in R^d, then `ndistparams` = d.
- `formals.dist`: string vector specifying the names of the formal arguments identifying the distribution `dist` and used in `ddist` and `rdist`; e.g. for a Gaussian mixture (`dist = norm`) these amount to `mean` and `sd`, as these are the formal arguments used by `dnorm` and `rnorm`.
- `discrete`: logical indicating whether the underlying mixture distribution is discrete; always `TRUE` in this case.
- `mle.fct`: attribute `MLE.function` of `obj`.
- `pars`: say the complexity estimate is equal to some j. Then `pars` is a numeric vector of length (d+1)*j-1 specifying the component weight and parameter estimates, given as (w_1, ..., w_{j-1}, θ1_1, ..., θ1_j, θ2_1, ..., θd_j).
- `values`: numeric vector of the function values gone through during optimization at iteration j, the last entry being the value at the optimum.
- `convergence`: indicates whether the solver has converged (0) or not (1 or 2) at iteration j.

References

T. Umashanger and T. Sriram, "L2E estimation of mixture complexity for count data", Computational Statistics and Data Analysis 51, 4379-4392, 2007.

See Also

`hellinger.disc` for the same estimation method using the Hellinger distance, `solnp` for the solver, `datMix` for the creation of the `datMix` object.
Examples

```r
## create 'Mix' object
poisMix <- Mix("pois", w = c(0.45, 0.45, 0.1), lambda = c(1, 5, 10))

## create random data based on 'Mix' object (gives back 'rMix' object)
set.seed(1)
poisRMix <- rMix(10000, obj = poisMix)

## create 'datMix' object for estimation
# generate list of parameter bounds
poisList <- vector(mode = "list", length = 1)
names(poisList) <- "lambda"
poisList$lambda <- c(0, Inf)

# generate MLE function
MLE.pois <- function(dat){
  mean(dat)
}

# generating 'datMix' object
pois.dM <- RtoDat(poisRMix, theta.bound.list = poisList,
                  MLE.function = MLE.pois)

## complexity and parameter estimation
set.seed(1)
res <- L2.disc(pois.dM)
plot(res)
```