create_stats: Calculates stats based on custom functions on the response...


View source: R/feature_generation.R

Description

Calculates statistics by applying custom functions to the response variable for each group defined by the columns in statistics_col.

Usage

create_stats(train, statistics_col, response, functions,
  too_few_observations_cutoff = 30, quantile_trim_threshold = 0)

Arguments

train

The train dataset, as a data.table

statistics_col

A character vector of column names. Only choose non-numeric columns, or numeric columns with few distinct values. Combinations that generate too few observations (fewer than too_few_observations_cutoff, 30 by default) fall back to statistics generated on the entire response column.

response

The column containing the response variable.

functions

A (named) list of functions used to generate statistics. Each function takes a vector and should return a scalar, e.g. mean / sd. If names are provided, the name will be prepended to the generated column. If they are not provided, gen<index of function>_ will be prepended.
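
As a rough illustration of what a valid functions argument looks like, the sketch below builds a partly named list and checks that every entry reduces a vector to a single number. The vector y and the prefixes variable are hypothetical and only mirror the naming rule described above; how the package assembles the full column name is not shown here.

```r
# A partly named list: two named functions and one unnamed function.
functions <- list(mean = mean, spread = sd, function(x) max(x) - min(x))

# Hypothetical response values, used only to check the functions.
y <- c(2, 4, 4, 4, 5, 5, 7, 9)

# vapply() with numeric(1) fails if any function does not return a scalar.
results <- vapply(functions, function(f) f(y), numeric(1))

# Mirror the documented rule: use the name when present,
# otherwise fall back to gen<index of function>_.
fn_names <- names(functions)
if (is.null(fn_names)) fn_names <- rep("", length(functions))
prefixes <- ifelse(fn_names == "",
                   paste0("gen", seq_along(functions), "_"),
                   fn_names)
```

Here the third function has no name, so its prefix is derived from its position in the list.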

too_few_observations_cutoff

An integer denoting the minimum required observations for a combination of values in statistics_col to be used. If not enough observations are present, the statistics will be generated on the entire response column. Default: 30.
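
The fallback behaviour can be sketched in base R. This is not the package's implementation: the toy data, the cutoff of 3, and the assumption that a group with exactly the cutoff number of observations is kept are all illustrative choices.

```r
# Toy data: group "c" has only 2 observations.
train <- data.frame(
  group = c(rep("a", 5), rep("b", 4), rep("c", 2)),
  y     = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 100, 200)
)
cutoff <- 3  # hypothetical, much smaller than the default of 30

# Per-group statistic and per-group observation counts.
group_mean <- tapply(train$y, train$group, mean)
group_n    <- tapply(train$y, train$group, length)

# Default value: the same statistic on the entire response column.
overall <- mean(train$y)

# Groups below the cutoff receive the default instead of their own value.
stats <- data.frame(
  group  = names(group_mean),
  mean_y = as.numeric(ifelse(group_n >= cutoff, group_mean, overall))
)
```

Groups "a" and "b" keep their own means, while group "c" receives the overall mean because it has fewer observations than the cutoff.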

quantile_trim_threshold

Determines the quantile to which the generated statistics are trimmed. For instance, when this is set to 0.1, the generated statistics are capped at their 0.1 and 0.9 quantiles. Therefore, this should be a value between 0 and 0.5. Default: 0.
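
A minimal sketch of this kind of trimming, assuming base R's default quantile() method; trim_to_quantiles is a hypothetical helper written for illustration and is not part of the package.

```r
# Cap values at the `threshold` and `1 - threshold` quantiles.
trim_to_quantiles <- function(x, threshold) {
  stopifnot(threshold >= 0, threshold <= 0.5)
  bounds <- quantile(x, probs = c(threshold, 1 - threshold),
                     na.rm = TRUE, names = FALSE)
  # Raise values below the lower bound, lower values above the upper bound.
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Hypothetical generated statistics with one extreme value.
generated <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000)
trimmed   <- trim_to_quantiles(generated, 0.1)
untouched <- trim_to_quantiles(generated, 0)
```

With a threshold of 0 the bounds are the minimum and maximum, so the values are left unchanged.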

Details

This function also generates a default value for every generated column, computed on the entire response column. This ensures that no NA values are present in the generated columns.

Value

A list containing the generated statistics tables and the defaults per column.


jeroenvdhoven/datapiper documentation built on July 14, 2019, 9:34 p.m.