post_sample_preprocessing: Pre-processing of input file (AFTER in-/out-of-sample split)


View source: R/random_rotation.R

Description

This function takes a data frame and performs the following tasks:
(1) For each numeric column, it creates a ranking function based only on in-sample data.
(2) It applies this function to all numeric columns and to both in- and out-of-sample data (could also be applied to online data).
(3) For each numeric column, it computes the median of only in-sample data.
(4) It imputes missing values in numeric columns with these in-sample medians (could also be applied to online data).
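The four steps can be sketched in base R as follows. This is only an illustration, not the package's implementation: the helper name sketch_rank_and_impute is hypothetical, stats::ecdf is assumed as the ranking function, and the median is assumed to be taken on the ranked scale because ranking is listed before imputation.

```r
# Hypothetical sketch, NOT the package's code.
# Assumption: the "ranking function" is an empirical CDF fitted on in-sample rows.
sketch_rank_and_impute <- function(X, r_train) {
  for (col in names(X)) {
    if (!is.numeric(X[[col]])) next              # non-numeric columns are left untouched
    in_sample <- X[[col]][r_train]
    in_sample <- in_sample[!is.na(in_sample)]
    if (length(in_sample) == 0) next             # nothing to learn from
    rank_fn <- stats::ecdf(in_sample)            # (1) ranking function from in-sample data only
    ranked  <- rank_fn(X[[col]])                 # (2) apply to every row; NA stays NA
    med     <- stats::median(rank_fn(in_sample)) # (3) in-sample median (on the ranked scale)
    ranked[is.na(ranked)] <- med                 # (4) impute missing values with that median
    X[[col]] <- ranked
  }
  X
}

# Toy data: rows 1 to 3 are in-sample, rows 4 and 5 are out-of-sample.
X <- data.frame(a = c(10, 20, NA, 40, 25), b = c("u", "v", "w", "x", "y"))
out <- sketch_rank_and_impute(X, r_train = 1:3)
```

Out-of-sample values above the in-sample maximum are capped at rank 1 in this sketch, which is a consequence of choosing an empirical CDF and may differ from what the package does.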

Usage

post_sample_preprocessing(Xpre, Ypre, r_train, r_test)

Arguments

Xpre

data frame

Ypre

data frame, currently unused (but will make it easier later to add processing that depends on it)

r_train

vector of in-sample indices into the data frames

r_test

vector of out-of-sample indices into the data frames, currently unused

Details

Additional pre-processing functions could be added here. For now, Ypre and r_test are not used.

Note that these steps occur AFTER the file is split into training and testing data. As long as only in-sample data is used to create the transformations, no bias is introduced when training a classifier.
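A small base R illustration of why the statistics have to come from in-sample rows only. The data and variable names here are made up for the example.

```r
# Toy column: rows 1 to 3 are in-sample, rows 4 and 5 are out-of-sample.
x <- c(1, 2, 3, 100, 200)
r_train <- 1:3

median_in_sample <- median(x[r_train])  # uses training rows only
median_all       <- median(x)           # also depends on the out-of-sample rows

# The two values differ, so imputing with median_all would let
# out-of-sample information influence the training data.
print(c(in_sample = median_in_sample, all_rows = median_all))
```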

Also see: pre_sample_preprocessing() for the counterpart that is called BEFORE the file is split into training and testing data.

The trade-off is speed versus bias: putting every step in this function leads to slower run times but removes any potential for bias, while moving steps into pre_sample_preprocessing() is faster but can introduce bias.

Examples

r_train <- generate_training_row_indices(nrow(df), 0.6)
# r_test is currently unused, so 0 is passed as a placeholder
res <- post_sample_preprocessing(dfX, dfY, r_train, 0)

randomrotation/random.rotation documentation built on Dec. 31, 2020, 2:15 a.m.