clean: Dataframe cleaning for missing data handling


View source: R/clean.R

Description

clean converts coded missing values to NA, converts variable types, and removes rows and columns whose missingness exceeds pre-specified thresholds.

Usage

clean(
  X,
  var_remove = NULL,
  var_removal_threshold = 0.5,
  ind_removal_threshold = 1,
  missingness_coding = NA
)

Arguments

X

Original dataframe with samples in rows and variables as columns

var_remove

Variables to remove (e.g. ID), given as a character vector, e.g. c('ID', 'character_variable')

var_removal_threshold

Variable removal threshold, between 0 and 1 (default 0.5). Variables (columns) whose missingness fraction exceeds this threshold are removed during cleaning

ind_removal_threshold

Individual removal threshold, between 0 and 1 (default 1). Individuals (rows) whose missingness fraction exceeds this threshold are removed during cleaning

missingness_coding

Non-NA codes in the original dataframe that should be converted to NA (e.g. -9). Accepts a single value (missingness_coding = -9) or multiple values (missingness_coding = c(-9, -99, -999))
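
The recoding that missingness_coding describes can be illustrated in base R. This is a minimal sketch, not the package's implementation: the toy dataframe df and the vector codes are hypothetical and exist only for illustration.

```r
# Hypothetical toy data; -9 and -99 stand in for missing values
df <- data.frame(age = c(34, -9, 51), bmi = c(-99, 22.5, 27.1))

# Codes to treat as missing (same shape as a multi-value missingness_coding)
codes <- c(-9, -99)

# Replace every cell that matches one of the codes with NA, column by column
recoded <- df
recoded[] <- lapply(recoded, function(col) {
  col[col %in% codes] <- NA
  col
})

print(recoded)
```

Matching with %in% leaves existing NA values untouched, so the step can be applied to data that already contains some NAs.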

Details

Imputation performs better on a clean, filtered dataframe, because variables and samples with very high missingness fractions negatively impact most missing data imputation algorithms. This function cleans the original dataframe by removing rows (samples) and columns (variables) whose missingness exceeds the pre-specified thresholds. It also converts any non-standard missing data codes given in missingness_coding to NA. Note that all factor variables are coerced to numeric variables.
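
The threshold filtering described above can be sketched in base R. The helper filter_by_missingness is hypothetical and is not the package's code; it assumes columns are filtered before rows and that a value exactly at the threshold is kept, neither of which is stated in this page.

```r
# Hypothetical helper illustrating missingness-based filtering
filter_by_missingness <- function(X, var_threshold = 0.5, ind_threshold = 1) {
  # Fraction of NA values per column; keep columns at or below the threshold
  var_miss <- colMeans(is.na(X))
  X <- X[, var_miss <= var_threshold, drop = FALSE]

  # Fraction of NA values per row, computed on the remaining columns
  ind_miss <- rowMeans(is.na(X))
  X[ind_miss <= ind_threshold, , drop = FALSE]
}

# Toy data: column b is 75% missing, column c is 25% missing
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(NA, 2, 3, 4)
)

filtered <- filter_by_missingness(toy, var_threshold = 0.5, ind_threshold = 0.4)
print(filtered)
```

In this toy case column b is dropped first, after which the first row is half missing and is dropped by the row threshold of 0.4.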

Value

Clean dataset with NAs as missing values and rows/columns above the pre-specified missingness thresholds removed
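
Because factor variables are coerced to numeric, it is worth checking what the numeric values in the returned dataset mean. The base R sketch below shows the two common coercion routes; which one clean uses is not stated in this page, so treat the example as background rather than a description of the function.

```r
# A factor with text labels: as.numeric() returns the internal level codes,
# and levels are sorted alphabetically by default ("high" = 1, "low" = 2)
f <- factor(c("low", "high", "low"))
codes <- as.numeric(f)
print(codes)

# A factor whose labels are numbers: going through as.character() recovers
# the labelled values instead of the level codes
g <- factor(c("10", "20", "10"))
values <- as.numeric(as.character(g))
print(values)
```

If a factor in the original dataframe encodes unordered categories, the resulting numeric codes carry an ordering that the data did not have, which may matter for downstream imputation.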

Examples

# basic settings
cleaned <- clean(clindata_miss, missingness_coding = -9)

# setting very conservative removal thresholds
cleaned <- clean(clindata_miss,
                 var_removal_threshold = 0.10,
                 ind_removal_threshold = 0.9,
                 missingness_coding = -9)

missCompare documentation built on Dec. 1, 2020, 9:09 a.m.