binCat: categorical data binning by collapsing

Description

Bins a categorical variable into a smaller number of categories. Useful when modeling with variables that have many small categories. The largest categories are kept as is and the smaller categories are collapsed into a new category named 'other'. There are two options for determining the number of bins:
1. Specify the exact number of bins desired (ncat)
2. Specify the share of your variable that will be represented by actual categories before everything else is named 'other' (maxp)
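
The two options can be illustrated with a minimal base R sketch. The helper `collapse_sketch` below is hypothetical and is not the Rsenal implementation. It assumes that ncat counts the named categories that are kept (with 'other' added on top), that maxp keeps the largest categories until their cumulative share first reaches maxp, and that exactly one of the two is supplied.

```r
# Hypothetical illustration only; binCat itself may differ in its details.
collapse_sketch <- function(x, ncat = NULL, maxp = NULL) {
  # This sketch accepts exactly one of the two criteria.
  stopifnot(xor(is.null(ncat), is.null(maxp)))
  x <- as.character(x)
  # Category frequencies, largest first (table() drops NAs by default).
  freq <- sort(table(x), decreasing = TRUE)
  keep_n <- if (!is.null(ncat)) {
    # Option 1: keep a fixed number of the largest categories.
    min(ncat, length(freq))
  } else {
    # Option 2: keep categories until the cumulative share reaches maxp.
    which(cumsum(freq) / sum(freq) >= maxp)[1]
  }
  keep <- names(freq)[seq_len(keep_n)]
  # Everything that is not kept (and not missing) becomes 'other'.
  x[!is.na(x) & !(x %in% keep)] <- "other"
  x
}

x <- c(rep("a", 5), rep("b", 3), "c", "d")
table(collapse_sketch(x, ncat = 2))    # keeps a and b; c and d become 'other'
table(collapse_sketch(x, maxp = 0.5))  # keeps a only; the rest become 'other'
```

In this toy vector 'a' alone covers half of the data, so maxp = 0.5 keeps a single category, while ncat = 2 keeps the two largest regardless of their share.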

Usage

binCat(x, ncat = NULL, maxp = NULL, results = F, setNA = NA,
  keepNA = F)

Arguments

x

vector to bin. It is transformed to a character, so any type is acceptable

ncat

number, typically 0 to 100, although larger values are possible. Number of bins to collapse the data to

maxp

number 0 to 1. Proportion of the data that will be represented "as is" before the remaining categories are collapsed to "other"

results

logical TRUE or FALSE. If TRUE, prints a frequency table of the new categories.

setNA

value to set NAs to. The default is to keep NA. Can be set to a character string to make NAs a category

keepNA

logical. TRUE keeps NAs as their own category. FALSE bundles NAs into the 'other' category.

Details

It is advisable to use only one of the ncat and maxp parameters. When both are used together, the function returns whichever criterion yields the smaller number of bins.
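
As a rough illustration of "the smaller number of bins", the base R sketch below computes how many categories each criterion would keep on a toy vector and takes the minimum. The variable names are hypothetical, and the way each count is derived is an assumption for illustration, not binCat's actual code.

```r
# Toy data: 'a' is half of the observations, 'b' is 30%, 'c' and 'd' 10% each.
x <- c(rep("a", 5), rep("b", 3), "c", "d")
freq <- sort(table(x), decreasing = TRUE)
share <- cumsum(freq) / sum(freq)

# Assumed reading of each criterion (hypothetical, for illustration only).
n_by_ncat <- min(3, length(freq))            # as if ncat = 3
n_by_maxp <- unname(which(share >= 0.5)[1])  # as if maxp = 0.5

# The stricter criterion determines how many categories stay "as is".
n_keep <- min(n_by_ncat, n_by_maxp)
n_keep
```

Here maxp = 0.5 is already satisfied by the single largest category, so it is the binding criterion even though ncat = 3 would allow more categories.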
Behavior may be unexpected when setNA=NA and keepNA=T. To keep NAs as a standalone category, set setNA to something that is not NA.
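
One reason a non-NA label helps can be seen in base R alone: `table()` drops NA by default, whereas a labeled string is counted like any other category. The snippet below is only a sketch of that general R behavior and does not show binCat's internals; the label 'missing' and the variable names are arbitrary choices.

```r
ff <- c("a", "a", "b", NA, "c", NA)

# NA is silently excluded from the counts by default.
table(ff)

# Recoding NA to a string turns it into an ordinary category.
ff_labeled <- ifelse(is.na(ff), "missing", ff)
table(ff_labeled)
```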

Value

vector of binned data

Examples

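# Simulate a categorical variable with an uneven distribution over the letters
d <- rpois(1000, 20)
d[d > 26] <- sample(1:26, length(d[d > 26]), replace = TRUE)
dl <- letters[d]
barplot(table(dl))

# Bin to a fixed number of categories, or by the share of data kept as is
table(binCat(dl, results = FALSE, ncat = 5))
table(binCat(dl, results = FALSE, maxp = 0.5))
table(binCat(dl, results = FALSE, maxp = 0.9))

## With missings
ff <- sample(letters[1:15], 100, replace = TRUE)
ff[sample(100, 10)] <- NA
binCat(ff, ncat = 7, setNA = 'missing')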

brooksandrew/Rsenal documentation built on May 13, 2019, 7:50 a.m.