pipe_create_stats: Generic function for creating statistics on the response...


View source: R/feature_generation.R

Description

Generic function for creating statistics on the response column, based on custom columns.

Usage

pipe_create_stats(train,
  stat_cols = colnames(train)[purrr::map_lgl(train, is.character)],
  response, functions = list(mean = mean), interaction_level = 1,
  too_few_observations_cutoff = 30, quantile_trim_threshold = 0)

Arguments

train

The train dataset, as a data.frame or data.table. Data.tables may be changed by reference.

stat_cols

A character vector of column names. Please ensure that you only choose the names of non-numeric columns.

response

The column containing the response variable.

functions

A (named) list of functions used to generate statistics. Each function takes a vector and should return a scalar, e.g. mean or sd. If names are provided, the name is prepended to the generated column name. If they are not provided, gen<index of function>_ is prepended instead.

interaction_level

An integer of 1 or higher. With 1, statistics are gathered for single columns only; with higher values, statistics are also gathered for combinations of columns.

too_few_observations_cutoff

An integer denoting the minimum number of observations required for a combination of values in stat_cols to be used. If too few observations are present, the statistics are generated on the entire response column instead. Default: 30.

quantile_trim_threshold

Determines the quantile to which the generated statistics are trimmed. For instance, when this is set to 0.1, the generated statistics are capped at the 0.1 and 0.9 quantiles. Therefore, this should be a value between 0 and 0.5.
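
The interplay of functions, too_few_observations_cutoff and quantile_trim_threshold can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data, the output column names, the order of the fallback and trimming steps, and the choice to take quantiles over the generated statistics are all assumptions made for illustration.

```r
# Toy data: one categorical column and a numeric response (hypothetical).
train <- data.frame(
  city  = c(rep("A", 4), rep("B", 3), "C"),
  price = c(10, 12, 14, 16, 20, 22, 24, 100),
  stringsAsFactors = FALSE
)

functions <- list(mean = mean, sd = sd)
cutoff <- 3    # plays the role of too_few_observations_cutoff
trim   <- 0.1  # plays the role of quantile_trim_threshold

# Number of observations per category.
counts <- table(train$city)

stats <- data.frame(city = names(counts), stringsAsFactors = FALSE)
for (fname in names(functions)) {
  f <- functions[[fname]]
  overall   <- f(train$price)                       # statistic on the entire response
  per_group <- tapply(train$price, train$city, f)   # statistic per category
  values    <- as.numeric(per_group[stats$city])

  # Categories with fewer rows than the cutoff fall back to the overall statistic.
  too_few <- as.integer(counts[stats$city]) < cutoff
  values[too_few] <- overall

  # Cap the statistics at the trim and 1 - trim quantiles.
  bounds <- unname(quantile(values, c(trim, 1 - trim)))
  values <- pmin(pmax(values, bounds[1]), bounds[2])

  # Assumed naming scheme: function name prepended to the column description.
  stats[[paste0(fname, "_price_city")]] <- values
}

print(stats)
```

In this sketch, category "C" has a single observation, so its statistics are replaced by those of the whole price column before trimming is applied.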

Value

A list containing the transformed train dataset and a trained pipe.

Details

This function also generates default values for every generated column, computed on the entire response column. This ensures that generated columns contain no NA values when, for instance, a new category is encountered or a category is excluded by too_few_observations_cutoff.
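
The role of these default values can be sketched in base R as follows. The lookup table, the column names and the default value are hypothetical, and the lookup via match() is an assumption made for illustration, not necessarily how the trained pipe applies its statistics.

```r
# Hypothetical lookup table of generated statistics, plus a default
# computed on the entire response column.
stats <- data.frame(
  city = c("A", "B"),
  mean_price_city = c(13, 22),
  stringsAsFactors = FALSE
)
default_mean <- 27.25

# New data containing a category ("Z") that was not seen during training.
new_data <- data.frame(city = c("A", "Z", "B"), stringsAsFactors = FALSE)

# match() returns NA for unseen categories; those rows receive the default.
idx <- match(new_data$city, stats$city)
new_data$mean_price_city <- stats$mean_price_city[idx]
new_data$mean_price_city[is.na(idx)] <- default_mean

print(new_data)
```

Without the default, the row for "Z" would hold NA in the generated column.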


jeroenvdhoven/datapiper documentation built on July 14, 2019, 9:34 p.m.