balance_by: Balance a data.frame by a variable's values


View source: R/util.R

Description

Downsample a data.frame so that every value of a variable appears in the same number of rows. A common use case is training a classifier on data with imbalanced classes.

Usage

balance_by(data, var)

Arguments

data

A tbl or data.frame. Internally, this function uses dplyr verbs, so it will work for local tables and remote tables in the warehouse.

var

Unquoted variable name to use for balancing.

Details

Although you might most commonly use this function for binary outcomes, it will also work if 'var' has more than two values. In that case, the subset for each value of 'var' will be sampled down to match the number of rows in the least common value of 'var'.
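The downsampling described above can be sketched for a local data.frame as follows. This is an illustrative re-implementation, not the package's source: the name balance_sketch and the example data are hypothetical, and it assumes dplyr 1.0.0 or later (for slice_sample) and a local table only.

```r
library(dplyr)

# Hypothetical sketch of the balancing step for local data frames.
# Not the luketools implementation; remote tbls are not handled here.
balance_sketch <- function(data, var) {
  # Number of rows in the least common value of `var`
  n_min <- data %>%
    count({{ var }}) %>%
    pull(n) %>%
    min()

  # Sample every group down to that size
  data %>%
    group_by({{ var }}) %>%
    slice_sample(n = n_min) %>%
    ungroup()
}

set.seed(1)
df <- data.frame(
  label = c(rep("a", 50), rep("b", 10), rep("c", 5)),
  x = rnorm(65)
)

balanced <- balance_sketch(df, label)

# Each of "a", "b" and "c" now appears 5 times, the size of the rarest class
table(balanced$label)
```

Which rows survive in the larger groups depends on the random seed; only the group sizes are fixed.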

Note, however, that when working with remote tbls, the set of unique values of 'var' is pulled into local memory. Avoid balancing by a categorical variable with a very large number of distinct values.
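One way to check this before balancing is to count the distinct values first. The helper below is a hypothetical sketch (n_levels is not part of the package); it uses only dplyr verbs, so on a remote tbl the count should be computed in the database rather than by pulling the values themselves. It is shown here on the built-in mtcars data.

```r
library(dplyr)

# Hypothetical helper: number of distinct values of `var`.
# Only the single count is pulled into local memory.
n_levels <- function(data, var) {
  data %>%
    distinct({{ var }}) %>%
    summarise(n = n()) %>%
    pull(n)
}

# mtcars$cyl takes the values 4, 6 and 8
n_levels(mtcars, cyl)
```

If the count is large, consider collapsing rare values into broader categories before balancing.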

Value

'data' balanced by 'var': every value of 'var' appears in as many rows as its least common value.


lukerobert/luketools documentation built on Jan. 24, 2020, 2:15 a.m.