mv_imputation: Missing value imputation using different algorithms

View source: R/mv_imputation.R

mv_imputationR Documentation

Missing value imputation using different algorithms

Description

Missing values in metabolomics data sets occur widely and can originate from a number of sources, including technical and biological reasons.
Missing values imputation is applied to replace non-existing values with an estimated values while maintaining the data structure. A number of different methods are available as part of this function.

Usage

mv_imputation(
  df,
  method,
  k = 10,
  rowmax = 0.5,
  colmax = 0.5,
  maxp = NULL,
  check_df = TRUE
)

Arguments

df

A matrix-like (e.g. an ordinary matrix, a data frame) or RangedSummarizedExperiment-class object with all values of class numeric() or integer() of peak intensities, areas or other quantitative characteristic.

method

character(1), missing value imputation method. Supported methods are knn, rf, bpca, sv, 'mn' and 'md'.

k

numeric(1), for a given sample containing a missing value, the number of nearest neighbours to include to calculate a replacement value. Used only for method knn.

rowmax

numeric(1), the maximum percentage of missing data allowed in any row. For any rows exceeding given limit, missing values are imputed using the overall mean per sample. Used only for method knn.

colmax

numeric(1), the maximum percent missing data allowed in any column. If any column exceeds given limit, the function will report an error Used only for method knn.

maxp

integer(1), number of features to run on single core. If set to NULL will use total number of features.

check_df

logical(1), if set to TRUE will check if input data needs to be transposed, so that features are in rows.

Details

Supported missing value imputation methods are:

knn - K-nearest neighbour. For each feature in each sample, missing values are replaced by the mean average value (non-weighted) calculated from its k closest neighbours in multivariate space (default distance metric: euclidean distance);

rf - Random Forest. This method is a wrapper of missForest function. For each feature, missing values are iteratively imputed until a maximum number of iterations (10), or until the difference between consecutively-imputed matrices becomes positive. Trees per forest are set to 100, variables included per tree are calculate using formula sqrt(total number of variables);

bpca - Bayesian principal component analysis. This method is a wrapper of pca function. Missing values are replaced by the values obtained from principal component analysis regression with a Bayesian method. Therefore every imputed missing value does not occur multiple times, neither across the samples nor across the metabolite features;

sv - Small value. For each feature, replace missing values with half of the lowest value recorded in the entire data matrix;

'mn' - Mean. For each feature, replace missing values with the mean average (non-weighted) of all other non-missing values for that variable;

'md' - Median. For each feature, replace missing values with the median of all other non-missing values for that variable.

Value

Object of class SummarizedExperiment. If input data are a matrix-like (e.g. an ordinary matrix, a data frame) object, function returns the same R data structure as input with all value of data type numeric().

Examples

df <- MTBLS79[ ,MTBLS79$Batch == 1]
out <- mv_imputation(df=df, method='knn')


computational-metabolomics/pmp documentation built on March 9, 2024, 4:25 p.m.