process_data: Process Data

View source: R/process_data.R

process_data    R Documentation

Process Data

Description

A function called when a task is created. It uses the data provided to the task to process the covariates and to identify missingness in the outcome. See the Arguments and Details sections for more information.

Usage

process_data(data, nodes, column_names, flag = TRUE,
  drop_missing_outcome = FALSE)

Arguments

data

A data.table containing the analytic dataset. In creating the sl3_Task, the data passed to the task is supplied for this argument.

nodes

A list of character vectors for covariates, outcome, id, weights, and offset, which is generated when creating the sl3_Task if not already specified as an argument to make_sl3_Task.

column_names

A named list of column names in the data, which is generated when creating the sl3_Task if not already specified as an argument to make_sl3_Task.

flag

Logical (default TRUE) indicating whether to notify the user when there are outcomes that are missing, which can be modified when creating the sl3_Task by setting flag = FALSE.

drop_missing_outcome

Logical (default FALSE) indicating whether to drop observations with missing outcomes, which can be modified when creating the sl3_Task by setting drop_missing_outcome = TRUE.

Details

If the data provided to the task contains missing covariate values, three things happen.

First, each covariate whose proportion of missing values is greater than getOption("sl3.max_p_missing") is dropped. The default value of "sl3.max_p_missing" is 0.5; it can be changed to, say, 0.75 by setting options("sl3.max_p_missing" = 0.75).

Second, for each covariate with missing values that was not dropped, a so-called "missingness indicator" (named after the covariate, with the prefix "delta_") is added as an additional covariate. The missingness indicator takes a value of 0 if the covariate value was missing and 1 if not.

Third, each remaining covariate with missing values is imputed: continuous covariates with the median, and discrete covariates with the mode. The imputation removes the missing covariate values, while the missingness indicators preserve the pattern of missingness. To avoid this default imputation, users can impute their analytic dataset before supplying it to make_sl3_Task. We generally recommend adding the missingness indicators regardless of the imputation strategy, unless missingness is very rare.
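The covariate handling described above can be illustrated with a minimal base R sketch. This is not the sl3 implementation: the helper name impute_with_indicators is hypothetical, and treating every numeric column as continuous and every other column as discrete is a simplifying assumption made only for this example.

```r
# Illustrative sketch only; not the code used inside sl3.
# `impute_with_indicators` is a hypothetical helper name.
impute_with_indicators <- function(df, covariates,
                                   max_p_missing = getOption("sl3.max_p_missing", 0.5)) {
  kept <- character(0)
  for (cov in covariates) {
    x <- df[[cov]]
    p_missing <- mean(is.na(x))
    if (p_missing > max_p_missing) {
      # Too much missingness: drop the covariate entirely.
      df[[cov]] <- NULL
      next
    }
    kept <- c(kept, cov)
    if (p_missing > 0) {
      # Indicator is 1 where the value was observed and 0 where it was missing.
      delta_name <- paste0("delta_", cov)
      df[[delta_name]] <- as.numeric(!is.na(x))
      if (is.numeric(x)) {
        # Assumption: numeric columns are treated as continuous.
        fill <- median(x, na.rm = TRUE)
      } else {
        # Mode: the most frequent observed value (table() ignores NA).
        fill <- names(which.max(table(x)))
      }
      x[is.na(x)] <- fill
      df[[cov]] <- x
      kept <- c(kept, delta_name)
    }
  }
  list(data = df, covariates = kept)
}

df <- data.frame(
  age = c(30, NA, 50, 40),
  sex = factor(c("f", "m", NA, "f")),
  lab = c(NA, NA, NA, 1.2)
)
out <- impute_with_indicators(df, c("age", "sex", "lab"))
out$covariates
out$data
```

In this toy dataset, lab is missing in 75% of rows and is dropped under the default threshold, while age and sex are imputed and each gains a "delta_" indicator column.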

This function also converts any character covariates to factors, and one-hot encodes factor covariates.
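As a rough illustration of what this conversion looks like, the sketch below uses base R's factor() and model.matrix(). It is an assumption-laden example rather than the sl3 code path: the column names are invented, and this page does not say whether sl3 keeps an indicator for every level or drops a reference level, so the sketch simply keeps all levels.

```r
# Illustrative sketch only; column names are made up for the example.
df <- data.frame(
  region = c("north", "south", "east", "south"),
  x = c(1.5, 2.0, 0.3, 4.1),
  stringsAsFactors = FALSE
)

# Step 1: convert character columns to factors.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Step 2: one indicator column per factor level.
# Removing the intercept (`- 1`) keeps a column for every level.
dummies <- model.matrix(~ region - 1, data = df)
colnames(dummies)
```

Each row of dummies has exactly one entry equal to 1, marking the level observed for that row.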

Lastly, if an outcome is supplied when creating the sl3_Task and missing outcome values are detected in data, a warning is issued (unless flag = FALSE). If drop_missing_outcome = TRUE, observations with missing outcomes are dropped.
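The outcome check can be sketched in base R as follows. The helper handle_missing_outcome and its warning text are hypothetical; only the flag and drop_missing_outcome argument names and their defaults come from the Usage section, and the way they interact here is an assumption based on the argument descriptions.

```r
# Illustrative sketch only; not the code used inside sl3.
# `handle_missing_outcome` is a hypothetical helper name.
handle_missing_outcome <- function(df, outcome, flag = TRUE,
                                   drop_missing_outcome = FALSE) {
  missing_y <- is.na(df[[outcome]])
  if (any(missing_y)) {
    if (flag) {
      # Notify the user that some outcomes are missing.
      warning("Missing outcome data detected in ", sum(missing_y), " row(s).")
    }
    if (drop_missing_outcome) {
      # Keep only the rows with an observed outcome.
      df <- df[!missing_y, , drop = FALSE]
    }
  }
  df
}

df <- data.frame(y = c(1, NA, 0), x = c(0.2, 0.4, 0.6))

# Silent check, rows retained.
same <- handle_missing_outcome(df, "y", flag = FALSE)

# Rows with a missing outcome removed (warning suppressed for the demo).
kept <- suppressWarnings(
  handle_missing_outcome(df, "y", drop_missing_outcome = TRUE)
)
nrow(kept)
```

With the defaults, the sketch warns and returns the data unchanged, which mirrors the default values shown in the Usage section.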

Value

A list containing the processed data, nodes, and column names.


jeremyrcoyle/sl3 documentation built on April 30, 2024, 10:16 p.m.