factor_: Fast Factor Generation
In jackwasey/icd9: Tools for Working with ICD-9 Codes, and Finding Comorbidities


This function generates factors more quickly than base R's factor by using fastmatch::fmatch for the level lookup. The speed increase for ICD-9 codes is about 33%.
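To illustrate the idea, here is a minimal sketch in base R of building a factor from match-style integer codes. The helper `match_factor` and its `matcher` argument are hypothetical names introduced for this example; this is not the code of `factor_` itself. Base `match` is used so the sketch runs without extra packages.

```r
# Hypothetical helper (not part of icd9): build a factor by matching each
# value against the level set and attaching the factor attributes by hand.
match_factor <- function(x, levels = sort(unique.default(x)), matcher = match) {
  codes <- matcher(x, levels)                    # integer position of each value in levels
  attr(codes, "levels") <- as.character(levels)  # as.character() also drops stray attributes
  class(codes) <- "factor"
  codes
}

x <- c("4011", "25000", "V10", "25000")
f <- match_factor(x)

# Same result as base R's factor() for this input
stopifnot(identical(f, factor(x)))
```

If the fastmatch package is installed, `fastmatch::fmatch` takes the same first two arguments as `match` and could be passed as `matcher`.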

factor_(x, levels = NULL, labels = levels, na.last = NA)

factor_nosort(x, levels = NULL, labels = levels)

`x`	An object of atomic type `integer`, `numeric`, `character` or `logical`.
`levels`	An optional character vector of levels, coerced to the same type as `x`. By default, the levels are computed as `sort(unique.default(x))`.
`labels`	A set of labels used to rename the levels, if desired.
`na.last`	If `TRUE` and there are missing values, the last level is set as `NA`; otherwise, they are removed.
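The two outcomes described for `na.last` can be pictured with a base R analogue. This sketch uses only `factor` and `addNA` from base R and assumes that `na.last = TRUE` gives a result comparable to `addNA`. It does not call `factor_`.

```r
x <- c("250", NA, "4011", "250")

f_drop <- factor(x)         # NA is not a level; the element itself stays missing
f_keep <- addNA(factor(x))  # NA is appended as the last level

levels(f_drop)      # "250" "4011"
levels(f_keep)      # "250" "4011" NA
as.integer(f_keep)  # 1 3 2 1
```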

NaNs are converted to NA when used on numerics. Extracted from https://github.com/kevinushey/Kmisc.git

These features of base R's factor are missing: exclude = NA, ordered = is.ordered(x), nmax = NA

I don't think there is any requirement for factor levels to be sorted in advance, especially not for ICD-9 codes where a simple alphanumeric sorting will likely be completely wrong.
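As a rough illustration of unsorted levels, the sketch below uses base R's `factor` with explicit levels taken in order of first appearance. Whether `factor_nosort` orders its levels exactly this way is an assumption, and the codes are made-up sample values.

```r
codes <- c("V10", "4011", "E8000", "4011", "25000")

# Levels in order of first appearance, with no sorting step
lev_seen <- unique.default(codes)
f <- factor(codes, levels = lev_seen)

levels(f)      # "V10" "4011" "E8000" "25000"
as.integer(f)  # 1 2 3 2 4

# The values are unchanged by the level order; only the integer codes differ
stopifnot(identical(as.character(f), as.character(factor(codes))))
```

Skipping the sort avoids both the sorting cost and any suggestion that the level order carries meaning for the codes.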

Kevin Ushey, adapted by Jack Wasey

## Not run: 
pts <- icd9:::randomUnorderedPatients(1e7)
u <- unique.default(pts$icd9)
# this shows that stringr (which uses stringi) sort takes 50% longer than
# built-in R sort.
microbenchmark::microbenchmark(sort(u), stringr::str_sort(u))

# this shows that factor_ is about 50% faster than factor for
# big vectors of strings, and that skipping the sort is much faster still:
microbenchmark::microbenchmark(factor(pts$icd9),
                               factor_(pts$icd9),
                               factor_nosort(pts$icd9),
                               times=25)

## End(Not run)