This vignette document provides an introduction to users who wish to use normalregMix package to perform estimation and testing the number of components for normal mixture regression models and normal mixture models. The algorithm of normalregMix is based on Kasahara and Shimotsu (2015).^[Kasahara, H. and Shimotsu, K. (2015), Testing the Number of Components in Normal Mixture Regression Models, Journal of the American Statistical Association, 110:512, 1632-1645.]; for more details about the package, please refer to the public Github repository of this package.

General settings

Consider the following linear regression model: $$ \begin{aligned} y = \mu + x'\beta + z'\gamma + \sigma\varepsilon \end{aligned} $$ where $x$ and $\beta$ are $q \times 1$ columns, $z$ and $\gamma$ are $p \times 1$ columns, and $\varepsilon$ follows standard normal distribution and $\sigma > 0$. Then, the density of $y$ given $x, z, \gamma, \phi, \sigma^2$ is given by $$ \begin{aligned} f(y|x,z; \gamma,\theta, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp \left( \frac{y-\mu - x'\beta - z'\gamma}{2\sigma^2} \right) \end{aligned} $$ Now, consider an $m$-component normal mixture regression model, where each $j$th ($1 \leq j \leq m$) subcomponent follows the distribution above with parameter $\theta_j = (\mu_j, \beta_j')'$ and $\sigma_j^2$ with proportion $\alpha_j$. Note that $\gamma$ is identical across all components. In this case, an $i$th observation $y_i$ from $j$th component is generated by $$ y_i = \mu_j + x'\beta_j + z'\gamma + \sigma_j\varepsilon $$ Then, this $m$-component normal mixture regression model has a density of $$ \begin{aligned} f_m(y | x, z; \vartheta) = \sum_{j=1}^m \alpha_j f(y| x, z; \gamma, \theta_j, \sigma_j) \end{aligned} $$ whose parameter is specified as $\mathbf{\vartheta} = (\mathbf{\alpha}', \gamma', \theta_1', ..., \theta_m', \sigma^2_1, ..., \sigma^2_m)'$, where $\mathbf{\alpha}' = (\alpha_1,...,\alpha_{m-1})'$; an $m-$component normal mixture model is a special case the model above by dropping regressors $x$ and $z$. Given $m_0 \in \mathbb{N}$, this package aims to estimate $\mathbf{\vartheta}$ and test a hypothesis $H_0 : m = m_0$ against the alternative $H_A: m = m_0 + 1$ based on likelihood ratio test (LRT).

Estimating normal mixture (regression) models

$\verb|normalmixPMLE|$ and $\verb|regmixPMLE|$ (for $q = \text{dim}(X) > 1$) can be used to obtain the (penalized) maximum likelihood estimator of $\mathbf{\vartheta}$. Running the code will also return other useful information as well, such as the values of log-likelihood and penalized log-likelihood at PMLE computed, AIC, and BIC.

library(normalregMix)
data(faithful)
attach(faithful)

# mixture with two components
normalmixPMLE(y = eruptions, m = 2)

# mixture regression with three components (regressors are waiting)
regmixPMLE(y = eruptions, x = waiting, m = 3)

Testing the number of components in normal mixture models

Suppose that you would like to test a hypothesis $H_0: m = m_0$ against the alternative $H_A: m = m_0 + 1$ in normal mixture models. In order to do that, specify the number of components in a null hypothesis as m and call normalmixMEMtest as follows:

# test the null hypothesis of m = 1 versus m = 2
normalmixMEMtest(y = eruptions, m = 1)
# test the null hypothesis of m = 3 versus m = 4
normalmixMEMtest(y = eruptions, m = 3)

Note that critical values are not printed in a default option due to heavy computation required for normalmixMEMtest. Specify crit.method in a parameter to compute them; in normalregMix, users can use either asy (based on asympotic distribution of LRT statistic) and boot (based on bootstrap) as follows:

# test the null hypothesis of m = 1 versus m = 2 using asympotic methods
normalmixMEMtest(y = eruptions, m = 1, crit.method = 'asy')
# test the null hypothesis of m = 2 versus m = 3 using bootstrap methods
normalmixMEMtest(y = eruptions, m = 2, crit.method = 'boot')

Testing the number of components in normal mixture regression models

When regressors are present, regmixMEMtest can be called to test the number of components in normal mixture regression models by following a similar procedure. Note that given $n$ observations of $y$, the data for $x$ must be an $n \times q$ matrix.

data(ToothGrowth)
attach(ToothGrowth)
# test the null hypothesis of m = 1 versus m = 2 using asympotic methods
regmixMEMtest(y = log(len), x = log(dose), m = 1, crit.method = 'asy')
# test the null hypothesis of m = 2 versus m = 3 using asympotic methods
regmixMEMtest(y = log(len), x = log(dose), m = 2, crit.method = 'asy')

Parallelization

When $q > 0$ or the number of observations is large, computing modified EM-statistics can take more than minutes. To reduce computation time required to perform statistical tests, we provide an option to users to let calculation done by parallelization. In this package, doParallel has been used to implement this feature. To enable parallel computing, set parallel as the proportion of CPUs to be used for computation. If parallel is set to be zero, the procedure will be run in a singular process.

# use all available cores
system.time(result <- normalmixMEMtest(y = eruptions, m = 5, 
                                       parallel = 1, crit.method = "asy"))
result

# run in a singular process
system.time(result <- normalmixMEMtest(y = eruptions, m = 5, 
                                       parallel = 0, crit.method = "asy"))
result

Sequentailly Testing Hypotheses Using normalMixMEMtestSeq

normalmixMEMtestSeq and regMixMEMtestSeq allow users to test the null hypothesis against $H_0: m = m_0$ for $m_0 = 1, 2, ..., \verb|maxm|$ sequentially to avoid hassles of testing each of them manually.

normalmixMEMtestSeq(y = eruptions)


hkasahar/normalregMix documentation built on May 17, 2019, 4 p.m.