The PEMM function uses a penalized EM algorithm to estimate the mean and covariance of multivariate Gaussian data under an ignorable or abundance-dependent missing-data mechanism. In proteomics data, for example, the lower the abundance of a protein, the more likely it is to go undetected in the experiment. The PEMM function incorporates the known or estimated abundance-dependent missing-data mechanism into the penalized likelihood and obtains the maximum penalized likelihood estimates of the multivariate mean and covariance via the PEMM algorithm.
X |
an n-by-p matrix containing the multivariate incomplete data. Each row is a sample and each column is a feature (e.g. a protein). |
phi |
the parameter of the missing-data mechanism, a non-negative number. The abundance-dependent missing-data mechanism is modelled as P(missing Xi)=constant*exp(-phi*Xi) if Xi is strictly positive, or P(missing Xi)=constant*exp(-phi*Xi^2) if Xi can be positive or negative. When phi is set to 0, the data are treated as missing at random (MAR) and a penalized EM algorithm is run without modelling the missing-data mechanism. When phi is greater than zero, the PEMM algorithm is used. |
lambda |
the tuning parameter in the Inverse-Wishart penalty function. An appropriate value of lambda helps bound the eigenvalues of the covariance matrix away from zero and thus ensures that the estimated covariance is positive definite. |
K |
the second tuning parameter in the Inverse-Wishart penalty function. An appropriate choice of K helps stabilize the estimation of the covariance matrix. We suggest a small positive integer (K=5), which seems to work well when the dimensionality and sample size are both moderate (<100). |
pos |
a logical value. If pos=TRUE, all X are one-sided and the abundance-dependent missing-data mechanism is given by P(missing Xi)=constant*exp(-phi*Xi). If pos=FALSE, X can be two-sided. The abundance-dependent missing-data mechanism is given by P(missing Xi)=constant*exp(-phi*Xi^2). |
tol |
the tolerance used to assess convergence of the iterative algorithm. The default is tol=0.001, i.e. the algorithm stops when the change in log-likelihood between the current and the previous iteration is less than 0.1%. |
maxIter |
the maximum number of iterations allowed. The default is maxIter=100, i.e. if the algorithm has still not converged after 100 iterations, it is forced to stop and the user is given a warning. |
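The stopping rule described for tol and maxIter can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `iterate_until_converged` and `toy_step` are hypothetical names that are not part of the PEMM package, and the relative-change test is one plausible reading of "change in log-likelihood less than 0.1%", not necessarily the exact check used inside PEMM_fun.

```r
# Hypothetical helper, not part of the PEMM package.
# `step` performs one update and returns a list with a numeric `loglik` element.
iterate_until_converged <- function(step, state, tol = 0.001, maxIter = 100) {
  loglik_old <- -Inf
  converged <- FALSE
  iterations <- 0
  for (iter in seq_len(maxIter)) {
    state <- step(state)
    iterations <- iter
    loglik_new <- state$loglik
    # Relative change in log-likelihood versus the previous iteration;
    # skipped on the first pass, where there is nothing to compare against.
    if (is.finite(loglik_old) &&
        abs(loglik_new - loglik_old) < tol * abs(loglik_old)) {
      converged <- TRUE
      break
    }
    loglik_old <- loglik_new
  }
  if (!converged) {
    warning("no convergence after ", maxIter, " iterations")
  }
  list(state = state, iterations = iterations, converged = converged)
}

# Toy update (an assumption for illustration): move x halfway towards 2;
# the toy "log-likelihood" peaks at x = 2.
toy_step <- function(state) {
  x <- state$x + 0.5 * (2 - state$x)
  list(x = x, loglik = -(x - 2)^2 - 10)
}

res <- iterate_until_converged(toy_step, list(x = 0))
res$converged
res$iterations
```

Lowering maxIter (for example to 2) makes the toy run stop early and triggers the warning branch, which mirrors the documented behaviour of warning the user when the iteration limit is reached.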
The algorithm is motivated by the characteristics of proteomics data, which contain substantial abundance-dependent missing values. It calculates the maximum penalized likelihood estimates of the mean and covariance of multivariate incomplete data, with the abundance-dependent missing-data mechanism incorporated.
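To make the abundance-dependent mechanism concrete, the following base R sketch evaluates the two missingness probabilities given in the argument descriptions. The function name `miss_prob` and its `const` argument are hypothetical; the documentation leaves the constant unspecified, and 0.5 is used here only because the package examples simulate missingness with that value.

```r
# Hypothetical helper, not part of the PEMM package.
# `const` stands in for the unspecified "constant" of the documented mechanism.
miss_prob <- function(x, phi, pos = FALSE, const = 0.5) {
  stopifnot(phi >= 0, const >= 0, const <= 1)
  p <- if (pos) {
    const * exp(-phi * x)      # one-sided data: low abundance, more missing
  } else {
    const * exp(-phi * x^2)    # two-sided data: values near zero, more missing
  }
  pmin(p, 1)                   # guard so the result stays a valid probability
}

x <- c(0.1, 1, 3)
miss_prob(x, phi = 1, pos = TRUE)   # decreases as abundance grows
miss_prob(c(-2, 0, 2), phi = 1)     # symmetric around zero
miss_prob(x, phi = 0)               # constant: missingness ignores the value
```

With phi = 0 the probability no longer depends on the value itself, which corresponds to the case where the mechanism is not modelled.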
The algorithm will return a list of estimated mean and covariance, and the imputed missing-values from the last iteration.
mu |
the estimated mean of the multivariate incomplete data |
Sigma |
the estimated covariance matrix of the multivariate incomplete data |
Xhat |
the "imputed" missing-values from the last iteration. |
Motivated by characteristics and the abundance-dependent missing-data mechanism in proteomics data, the current function takes the known or estimated abundance-dependent missing-data parameter as input and estimates the mean and covariance of multivariate incomplete data.
When the missing-data mechanism is MAR or MCAR, one can set phi to zero so that a penalized EM algorithm is performed without modelling the missing-data mechanism.
When the missing data are non-ignorable and the missing-data mechanism is unknown, one should first examine whether the data follow an abundance-dependent pattern. If so, one can estimate the abundance-dependent parameters and perform the analysis with the PEMM algorithm. Otherwise, if the missing data are not abundance-dependent, alternative approaches are needed.
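One informal way to look for an abundance-dependent pattern is to compare each feature's missing rate with its mean observed value. The sketch below does this on simulated one-sided data in base R. It is a rough diagnostic under stated assumptions, not the procedure of the package or the reference: the simulation settings (n, p, feature means, the constant 0.9) and all variable names are made up for illustration.

```r
set.seed(1)

# Simulated positive data (assumption): 200 samples, 30 features whose
# mean abundance increases from 1 to 4.
n <- 200
p <- 30
mu <- seq(1, 4, length.out = p)
X <- matrix(rnorm(n * p, mean = rep(mu, each = n), sd = 0.2), n, p)

# Impose abundance-dependent missingness: P(missing) = 0.9 * exp(-phi * x)
phi <- 1
prob <- pmin(0.9 * exp(-phi * X), 1)
X_obs <- X
X_obs[rbinom(length(X), 1, prob) == 1] <- NA

# Per-feature missing rate versus mean observed abundance
miss_rate <- colMeans(is.na(X_obs))
obs_mean <- colMeans(X_obs, na.rm = TRUE)

# A clearly negative rank correlation suggests that low-abundance
# features are missing more often.
rho <- cor(miss_rate, obs_mean, method = "spearman")
rho
```

On real data the same two summaries can be plotted against each other; a flat relationship would suggest that the missingness is not abundance-dependent and that the PEMM mechanism may not apply.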
Lin S. Chen and Pei Wang
Maintainer: Lin S. Chen <lchen@health.bsd.uchicago.edu>
Lin S. Chen, Ross Prentice and Pei Wang. (2014) A penalized EM algorithm incorporating missing data mechanism for Gaussian parameter estimation. Biometrics, in revision.
set.seed(111)
library(PEMM)
data(sim_dat)
X.mar=sim_dat
X.mar[sample(1:length(X.mar),round(length(X.mar)*0.2))]<-NA
## If the data are MAR or MCAR, specifying only phi=0 runs
## a penalized EM algorithm with the default settings.
PEM.result = PEMM_fun(X.mar, phi=0)
## Specifying phi=0, lambda=0, K=0 runs a plain EM algorithm.
## Note that EM may not converge when n is small.
EM.result = PEMM_fun(X.mar, phi=0, lambda=0, K=0)
## Generate data with non-ignorable missingness -- observations with
## lower absolute values are more likely to be missing
phi=1
prob <- 0.5*exp(-phi*(sim_dat)^2)
X.mnar=sim_dat
X.mnar[which(rbinom(length(X.mnar),1,prob)==1)] <- NA
mean(is.na(X.mnar)) ## proportion of data being missing
## Getting the estimated results
PEMM.result.mnar = PEMM_fun(X.mnar, phi=1)
PEM.result.mnar = PEMM_fun(X.mnar, phi=0) ## ignoring MNAR mechanism
EM.result.mnar = PEMM_fun(X.mnar, phi=0, lambda=0, K=0) ## ignoring MNAR
## Compare the mean estimates for data with MNAR from different methods
## complete data
colMeans(sim_dat)
## EM results ignoring MNAR mechanism
EM.result.mnar$mu
## PEMM estimates
PEMM.result.mnar$mu
cor(colMeans(sim_dat),PEMM.result.mnar$mu)
cor(colMeans(sim_dat),EM.result.mnar$mu)