Maximum likelihood estimation of the mean and covariance matrix of multivariate normal (MVN) distributed data with a monotone missingness pattern. Through the use of parsimonious/shrinkage regressions (e.g., plsr, pcr, ridge, lasso, etc.), where standard regressions fail, this function can handle an (almost) arbitrary amount of missing data
1 2 3 4
logical indicating whether pre-processing of the
describes the type of parsimonious
(or shrinkage) regression to
be performed when standard least squares regression fails.
From the pls package we have
when performing regressions,
maximal number of (principal) components to include
indicates whether the columns with equal missingness should be processed together using a multi-response regression. This is more efficient if many OLS regressions are used, but can lead to slightly poorer, even unstable, fits when parsimonious regressions are used
method for cross validation when applying
a parsimonious regression method. The default setting
logical indicating whether or not to (additionally)
compute a mean vector and covariance matrix based only on the observed
data, without regressions. I.e., means are calculated as averages
of each non-
whether or not to print progress indicators. The default
pre = TRUE then
monomvn first re-arranges the columns
y into nondecreasing order with respect to the number of
NA) entries. Then (at least) the first column should
be completely observed. The mean components and covariances between
the first set of complete columns are obtained through the standard
Next each successive group of columns with the same missingness pattern
is processed in sequence (assuming
batch = TRUE).
Suppose a total of
j columns have
been processed this way already. Let
y2 represent the non-missing
contingent of the next group of
k columns of
with and identical missingness pattern, and let
y1 be the
j-1 columns of
containing only the rows
corresponding to each non-
NA entry in
nrow(y1) = nrow(y2). Note that
y1 contains no
NA entries since the missing data pattern is monotone.
k next entries (indices
j:(j+k)) of the mean vector,
j:(j+k) rows and columns of the covariance matrix are
obtained by multivariate regression of
The regression method used (except in the case of
"factor" depends on the number of rows and columns
y1 and on the
p parameter. Whenever
< p*nrow(y1) least-squares regression is used, otherwise
method = c("pcr", "plsr"). If ever a least-squares regression
fails due to co-linearity then one of the other
"factor" method always involves an OLS regression
on (a subset of) the first
p columns of
methods require a scheme for estimating the amount of
variability explained by increasing the numbers of coefficients
(or principal components) in the model.
Towards this end, the pls and lars packages support
10-fold cross validation (CV) or leave-one-out (LOO) CV estimates of
root mean squared error. See pls and lars for
CV in all cases except when
nrow(y1) <= 10, in which case CV fails and
LOO is used. Whenever
nrow(y1) <= 3
plsr is used instead.
quiet = FALSE then a
is given whenever the first choice for a regression fails.
For pls methods, RMSEs are calculated for a number of
NULL value for
ncomp.max it is replaced with
ncomp.max <- min(ncomp.max, ncol(y2), nrow(y1)-1)
which is the max allowed by the pls package.
Simple heuristics are used to select a small number of components
ncomp for pls), or number of coefficients (for
lars), which explains a large amount of the variability (RMSE).
The lars methods use a “one-standard error rule” outlined
in Section 7.10, page 216 of HTF below. The
pls package does not currently support the calculation of
standard errors for CV estimates of RMSE, so a simple linear penalty
ncomp is used instead. The ridge constant
lm.ridge is set using the
optimize function on the
Based on the ML
ncol(y1)+1 regression coefficients (including
intercept) obtained for each of the
y2, and on the corresponding
residual sum of squares, and on the previous
and rows/cols of the covariance matrix, the
j:(j+k) entries and
rows/cols can be filled in as described by Little and Rubin, section 7.4.3.
Once every column has been processed, the entries of the mean vector, and rows/cols of the covariance matrix are re-arranged into their original order.
monomvn returns an object of class
"monomvn", which is a
list containing a subset of the components below.
a copy of the function call as used
estimated mean vector with columns corresponding to the
estimated covariance matrix with rows and columns
corresponding to the columns of
method of regression used on each column, or
number of components in a
The CV in plsr and lars are random in nature, and so
can be dependent on the random seed. Use
deterministic (but slower) result.
method = "factor" in the current version of
the package, the factors in the first
y must also obey the monotone pattern, and,
have no more
NA entries than the other columns of
Be warned that the lars implementation of
"forward.stagewise" can sometimes get stuck in
(what seems like) an infinite loop.
This is not a bug in the
the bug has been reported to the authors of lars
Robert B. Gramacy firstname.lastname@example.org
Robert B. Gramacy, Joo Hee Lee, and Ricardo Silva (2007).
On estimating covariances between many assets with histories
of highly variable length.
Preprint available on arXiv:0710.5837: http://arxiv.org/abs/0710.5837
Roderick J.A. Little and Donald B. Rubin (2002). Statistical Analysis with Missing Data, Second Edition. Wilely.
Bjorn-Helge Mevik and Ron Wehrens (2007). The pls Package: Principal Component and Partial Least Squares Regression in R. Journal of Statistical Software 18(2)
Bradley Efron, Trevor Hastie, Ian Johnstone and Robert Tibshirani
Least Angle Regression (with discussion).
Annals of Statistics 32(2); see also
Trevor Hastie, Robert Tibshirani and Jerome Friedman (2002). Elements of Statistical Learning. Springer, NY. [HTF]
Some of the code for
monomvn, and its subroutines, was inspired
by code found on the world wide web, written by Daniel Heitjan.
Search for “fcn.q”
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## standard usage, duplicating the results in ## Little and Rubin, section 7.4.3 -- try adding ## verb=3 argument for a step-by-step breakdown data(cement.miss) out <- monomvn(cement.miss) out out$mu out$S ## ## A bigger example, comparing the various methods ## ## generate N=100 samples from a 10-d random MVN xmuS <- randmvn(100, 20) ## randomly impose monotone missingness xmiss <- rmono(xmuS$x) ## plsr oplsr <- monomvn(xmiss, obs=TRUE) oplsr Ellik.norm(oplsr$mu, oplsr$S, xmuS$mu, xmuS$S) ## calculate the complete and observed RMSEs n <- nrow(xmiss) - max(oplsr$na) x.c <- xmiss[1:n,] mu.c <- apply(x.c, 2, mean) S.c <- cov(x.c)*(n-1)/n Ellik.norm(mu.c, S.c, xmuS$mu, xmuS$S) Ellik.norm(oplsr$mu.obs, oplsr$S.obs, xmuS$mu, xmuS$S) ## plcr opcr <- monomvn(xmiss, method="pcr") Ellik.norm(opcr$mu, opcr$S, xmuS$mu, xmuS$S) ## ridge regression oridge <- monomvn(xmiss, method="ridge") Ellik.norm(oridge$mu, oridge$S, xmuS$mu, xmuS$S) ## lasso olasso <- monomvn(xmiss, method="lasso") Ellik.norm(olasso$mu, olasso$S, xmuS$mu, xmuS$S) ## lar olar <- monomvn(xmiss, method="lar") Ellik.norm(olar$mu, olar$S, xmuS$mu, xmuS$S) ## forward.stagewise ofs <- monomvn(xmiss, method="forward.stagewise") Ellik.norm(ofs$mu, ofs$S, xmuS$mu, xmuS$S) ## stepwise ostep <- monomvn(xmiss, method="stepwise") Ellik.norm(ostep$mu, ostep$S, xmuS$mu, xmuS$S)
Loading required package: pls Attaching package: 'pls' The following object is masked from 'package:stats': loadings Loading required package: lars Loaded lars 1.2 Loading required package: MASS Call: monomvn(y = cement.miss) Methods used (p=0.9): 2 complete 3 lsr [,1] x1 6.655166 x2 49.965259 x3 11.769231 x4 27.047089 y 95.423077 x1 x2 x3 x4 y x1 23.80287 22.26922 -26.97542 -12.28292 50.86578 x2 22.26922 258.86275 -17.13549 -273.80147 211.90393 x3 -26.97542 -17.13549 41.02564 -10.39915 -51.51923 x4 -12.28292 -273.80147 -10.39915 319.38777 -206.48170 y 50.86578 211.90393 -51.51923 -206.48170 226.31359 Call: monomvn(y = xmiss, obs = TRUE) Methods used (p=0.9): 1 complete 17 lsr 2 plsr-CV, ncomp range: [4,7]  -24423.72  NA  NA  -24421.08  -24422.77  -24947.67  -24421.34  -24421.53  -24423.06
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.