impute_randomforest: Imputation of Missing Values Using Random Forest Imputation
In protti: Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools

Description

impute_randomforest imputes missing values in the data using the random forest-based method implemented in the missForest package.

Usage

impute_randomforest(
  data,
  sample,
  grouping,
  intensity_log2,
  retain_columns = NULL,
  ...
)

Arguments

`data`	A data frame that contains the input variables. This should include columns for the sample names, precursor or peptide identifiers, and intensity values.
`sample`	A character column in the `data` data frame that contains the sample names.
`grouping`	A character column in the `data` data frame that contains the precursor or peptide identifiers.
`intensity_log2`	A numeric column in the `data` data frame that contains the intensity values.
`retain_columns`	A character vector indicating which columns should be retained from the input data frame. These columns will be preserved in the output alongside the imputed values. By default, no additional columns are retained (`retain_columns = NULL`), but specific columns can be retained by providing their names as a vector.
`...`	Additional parameters to pass to the `missForest` function, for example the number of trees per forest (`ntree`) or the maximum number of iterations (`maxiter`).

Details

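The function imputes missing values with random forests: each missing value is predicted from the other values available in the dataset. For each variable with missing data, a random forest model is trained on the observations in which that variable is present, using the remaining variables as predictors, and is then used to predict the missing entries. missForest repeats this cycle over all variables until the imputed values stop improving between iterations or the maximum number of iterations is reached.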
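To make the mechanism concrete, the following minimal sketch calls missForest directly on a small numeric table. It is not the code that impute_randomforest runs internally; the iris measurements are only stand-in data, and the object names (complete, with_gaps, fit, was_missing) are chosen for illustration.

```r
library(missForest)

set.seed(123) # the forests and the placement of the gaps are both random

# Stand-in data: four numeric columns with 10% of the entries set to NA
complete <- iris[, 1:4]
with_gaps <- prodNA(complete, noNA = 0.1)

# Each column with gaps is predicted from the other columns; the cycle over
# columns is repeated for at most `maxiter` iterations
fit <- missForest(with_gaps, ntree = 100, maxiter = 10)

imputed <- fit$ximp # completed data, same shape as the input
fit$OOBerror        # out-of-bag estimate of the imputation error

# Logical mask of the filled-in entries, analogous to the `imputed` column
was_missing <- is.na(with_gaps)
sum(was_missing)
```

Only the entries that were NA are replaced; observed values are carried over unchanged.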

In addition to the imputed values, users can choose to retain additional columns from the original input data frame that were not part of the imputation process.
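As a minimal sketch of retaining extra columns, the call below keeps two annotation columns alongside the imputed values. It assumes that the synthetic data contains columns named protein and condition, and that retain_columns accepts their names as a character vector, as described under Arguments.

```r
library(protti)

set.seed(123) # makes the sketch reproducible

data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# Assumption: `protein` and `condition` are columns of the synthetic data
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  retain_columns = c("protein", "condition")
)

# The retained columns should appear next to the imputation columns
colnames(data_imputed)
```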

This function allows passing additional parameters to the underlying missForest function, such as controlling the number of trees used in the random forest models or specifying the stopping criteria. For a full list of parameters, refer to the missForest documentation.
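For illustration, the sketch below forwards two missForest parameters through `...`. The values 50 and 5 are arbitrary choices for a quick run, not recommendations.

```r
library(protti)

set.seed(123) # makes the sketch reproducible

data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# `ntree` and `maxiter` are not arguments of impute_randomforest itself;
# they are passed on to missForest via `...`
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  ntree = 50,  # trees grown per forest
  maxiter = 5  # upper limit on the number of imputation iterations
)
```

Fewer trees and iterations shorten the run time, possibly at the cost of less stable imputed values.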

To enable parallelisation, ensure that the doParallel package is installed and loaded:

install.packages("doParallel")
library(doParallel)

Then register the desired number of cores for parallel processing:

registerDoParallel(cores = 6)

To use parallelisation during the imputation, pass parallelize = "variables" to impute_randomforest; the argument is forwarded to the missForest function through `...`.
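The parallelisation steps can be combined into one short sketch. It assumes that doParallel is already installed and uses two cores purely as an example; adjust the number to your machine.

```r
library(protti)
library(doParallel)

set.seed(123) # fixes the synthetic data

data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# Register two workers for parallel processing
registerDoParallel(cores = 2)

# `parallelize` is passed on to missForest via `...`
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  parallelize = "variables"
)

# Release the workers started by registerDoParallel()
stopImplicitCluster()
```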

Value

A data frame that contains an imputed_intensity column with the imputed values and an imputed column indicating whether each value was imputed (TRUE) or not (FALSE), in addition to any columns retained via retain_columns.
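A quick way to inspect the returned columns is sketched below. It relies only on the imputed_intensity and imputed columns described above; the synthetic data mirrors the Examples section.

```r
library(protti)

set.seed(123) # makes the sketch reproducible

data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing
)

# How many values were imputed (TRUE) versus observed (FALSE)?
table(data_imputed$imputed)

# Check whether any missing values remain after imputation
anyNA(data_imputed$imputed_intensity)

# Summary of the values that were filled in
summary(data_imputed$imputed_intensity[data_imputed$imputed])
```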

Author(s)

Elena Krismer

References

Stekhoven, D.J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597

Examples

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

head(data, n = 24)

# Perform imputation
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing
)

head(data_imputed, n = 24)

protti documentation built on Jan. 14, 2026, 9:08 a.m.