Imputation | R Documentation |
Functions to impute missing data in omics data sets.
meanModeImputer(X)
samplingImputer(X)
X |
A numeric matrix, where the columns represent independent observations (patients or samples) and the rows represent measured features (genes, proteins, clinical variables, etc). |
We recommend imputing small amounts of missing data in the input data
sets when using the plasma
package. The underlying issue is
that the PLS models we use for individual omics data sets will not be
able to make predictions on a sample if even one data point is
missing. As a result, if a sample is missing at least one data point in
every omics data set, then it will be impossible to use that sample at
all.
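As a rough illustration of the point above, the following sketch lists the samples that have at least one missing value in every data set. It is not part of the plasma package: the helper name unusableSamples is hypothetical, and it assumes a list of numeric matrices that all have samples as named columns and share the same sample names.

```r
# Hypothetical helper, not a plasma function.
# Assumes every matrix in dataList has samples as named columns
# and that the same sample names appear in each matrix.
unusableSamples <- function(dataList) {
  # For each data set, find the samples with at least one missing value.
  incomplete <- lapply(dataList, function(X) {
    colnames(X)[colSums(is.na(X)) > 0]
  })
  # A sample is unusable only if it is incomplete in every data set.
  Reduce(intersect, incomplete)
}

# Toy example with two small "omics" matrices (features x samples).
A <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("a1", "a2"), c("s1", "s2", "s3")))
B <- matrix(c(NA, 2, 3, 4, 5, NA), nrow = 2,
            dimnames = list(c("b1", "b2"), c("s1", "s2", "s3")))
unusableSamples(list(A = A, B = B))
```

Samples returned by such a check are the ones that imputation would rescue.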
For a range of available imputation methods and R packages, consult
the CRAN Task
View on Missing Data. We also recommend the
R-miss-tastic web site on
missing data. Their simulations suggest that, for purposes of
producing predictive models from omics data, the imputation method is
not particularly important. Because of the latter finding, we have
only implemented two simple imputation methods in the plasma
package:
The meanModeImputer
function will replace any missing data by the
mean value of the observed data if there are more than five
distinct values; otherwise, it will replace missing data by the
mode. This approach works relatively well for both continuous
data and for binary or small categorical data.
The samplingImputer
function replaces missing values by sampling
randomly from the observed data distribution.
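The two rules described above can be sketched in a few lines of R. This is a minimal illustration, not the code used inside plasma: the names sketchMeanMode and sketchSampling are hypothetical, and the sketch assumes that features are rows, samples are columns, and that imputation is done separately for each feature.

```r
# Hypothetical sketches; not the plasma implementations.
# Assumption: rows are features, columns are samples.

sketchMeanMode <- function(X) {
  Y <- X
  for (i in seq_len(nrow(X))) {
    miss <- is.na(X[i, ])
    obs <- X[i, !miss]
    if (!any(miss) || length(obs) == 0) next  # nothing to do, or nothing to learn from
    ux <- unique(obs)
    if (length(ux) > 5) {
      fill <- mean(obs)                       # treat as continuous: use the mean
    } else {
      # few distinct values: use the mode (most frequent observed value)
      fill <- ux[which.max(tabulate(match(obs, ux)))]
    }
    Y[i, miss] <- fill
  }
  Y
}

sketchSampling <- function(X) {
  Y <- X
  for (i in seq_len(nrow(X))) {
    miss <- is.na(X[i, ])
    obs <- X[i, !miss]
    if (!any(miss) || length(obs) == 0) next
    # draw replacements, with replacement, from the observed values of this feature
    idx <- sample.int(length(obs), sum(miss), replace = TRUE)
    Y[i, miss] <- obs[idx]
  }
  Y
}

# Toy matrix: one continuous-like feature and one binary feature.
X <- matrix(c(1, 2, 3, 4, 5, 6, NA,
              0, 1, 1, NA, 1, 0, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("g1", "g2"), paste0("s", 1:7)))
meanModeSketch <- sketchMeanMode(X)
samplingSketch <- sketchSampling(X)
```

Because the sampling rule is random, call set.seed() beforehand if you need reproducible results.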
Both functions return a numeric matrix of the same size as the input matrix X, with the same row and column names.
Kevin R. Coombes krc@silicovore.com, Kyoko Yamaguchi kyoko.yamaguchi@osumc.edu
loadESCAdata()
imputed <- with(plasmaEnv, lapply(assemble, samplingImputer))
imputed <- with(plasmaEnv, lapply(assemble, meanModeImputer))