Nothing
#' Build efficient features from high-cardinality, multiple-membership
#' factors
#'
#' @param d Data frame to use in models, at desired grain. Has id and outcome
#' @param longsheet Data frame containing multiple observations per grain. Has
#' id and groups
#' @param id Name of identifier column, unquoted. Must be present and identical
#' in both tables
#' @param groups Name of grouping column, unquoted
#' @param outcome Name of outcome column, unquoted
#' @param n_levels Number of levels to return, default = 100. An attempt is made
#' to return half levels positively associated with the outcome and half
#' negatively. If n_levels is greater than the number present, all levels will
#' be returned
#' @param min_obs Minimum number of observations a level must be found in in
#' order to be considered. Defaults to one, but larger values are often useful
#' because a level present in only a few observation will rarely be a useful.
#' @param positive_class If classification model, the positive class of the
#' outcome, default = "Y"; ignored if regression
#' @param cohesion_weight For classification problems only, how much to value a
#' level being consistently associated with an outcome relative to its being
#' present in many observations. Default = 2; equal weight is 1. Note that
#' this is a parameter that could potentially be tuned over.
#' @param levels Use this argument when add_best_levels was used in training and
#' you want to add the same columns for deployment. You can pass the model
#' trained on the data frame from \code{add_best_levels}, the data frame from
#' \code{add_best_levels}, or a character vector of levels to add.
#' @param fill Passed to \code{\link{pivot}}. Column to be used to fill the
#' values of cells in the output, perhaps after aggregation by \code{fun}. If
#' \code{fill} is not provided, counts will be used, as though a fill column
#' of 1s had been provided.
#' @param fun Passed to \code{\link{pivot}}. Function for aggregation, defaults
#' to \code{sum}. Custom functions can be used with the same syntax as the
#' apply family of functions, e.g. \code{fun = function(x)
#' some_function(another_fun(x))}.
#' @param missing_fill Passed to \code{\link{pivot}}. Value to fill for
#' combinations of grain and spread that are not present. Defaults to NA, but
#' 0 may be useful as well.
#'
#' @return For \code{add_best_levels}, d with new columns for the best levels
#' added and best_levels attribute containing a named list of levels added.
#' For \code{get_best_levels}, a character vector of the best levels.
#' @export
#' @seealso \code{\link{pivot}}
#' @description In healthcare, we are often faced with high cardinality
#' variables, where each observation may have zero, one, or more levels, e.g.
#' medications for a model at the patient grain. In these cases, creating a
#' feature variable for each level (each medication) as in one-hot encoding
#' can be prohibitively computationally intensive and can hurt performance by
#' diminishing the signal-to-noise ratio. \code{get_best_levels} identifies a
#' subset of categories that are likely to be valuable features, and
#' \code{add_best_levels} adds them to a model data frame.
#'
#' \code{get_best_levels} finds levels of \code{groups} that are likely to be
#' useful predictors in \code{d} and returns them as a character vector.
#' \code{add_best_levels} does the same and adds them, pivoted, to \code{d}.
#' The function attempts to find both positive and negative predictors of
#' \code{outcome}.
#'
#' \code{add_best_levels} stores the identified best levels and passes them
#' through model training so that in deployment, the same columns created in
#' training are again created (see the final example).
#'
#' \code{add_best_levels} accepts arguments to \code{\link{pivot}} so that
#' values associated with the levels (e.g. doses of medications) can be used
#' in the new features. However, note that these are not used in determining
#' the best levels. I.e. \code{get_best_levels} determines which levels are
#' likely to be good predictors looking only at outcomes where the levels are
#' present or abssent; it does not use \code{fill} or \code{fun} in this
#' determination. See \code{details} for more info about how levels are
#' selected.
#'
#' @details Here is how \code{get_best_levels} determines the levels of
#' \code{groups} that are likely to be good predictors.
#'
#' \itemize{
#' \item{For regression: For each group, the difference of the group-mean from the
#' grand-mean is divided by the standard deviation of the group as a sample
#' (i.e. centered_mean(group) / sqrt(var(group) / n(group))), and the groups
#' with the largest absolute values of that statistic are retained.}
#'
#' \item{For classification: For each group, two "log-loss-like" statistics are
#' calculated. One is the log of the fraction of observations in which the
#' group does not appear, which captures how ubiquitous the group is: more
#' common groups are more useful as predictors. The other captures how far the
#' group is from being always associated with the same outcome: groups that
#' are consistently assoicated with either outcome are more useful as
#' predictors. This is calculated as the log of the proportion of outcomes
#' that are not all the same outcome (e.g. if 4/5 observations are positive
#' class, this statistic is log(.2)). This value is then raised to the
#' \code{cohesion_weight} power. To ensure retainment of both positive- and
#' negative-predictors, the all-same-outcome that is used as the comparison is
#' determined by which side of the median proportion of positive_class the
#' group falls on.}
#' }
#'
#' @examples
#' set.seed(45796)
#'
#' # We have two tables we want to use in our models:
#' # - df is the model table. It has the outcomes (survived), and we want one
#' # prediction for each row in df
#' # - meds has detailed information on each row (patient) in df. Each patient
#' # may have zero, one, or more observations (drugs) in meds, and meds may
#' # have associated values (doses).
#'
#' df <- tibble::tibble(
#' patient = paste0("Z", sample(10, 5)),
#' age = sample(20:80, 5),
#' survived = sample(c("N", "Y"), 5, replace = TRUE, prob = c(1, 2))
#' )
#'
#' meds <- tibble::tibble(
#' patient = sample(df$patient, 10, replace = TRUE),
#' drug = sample(c("Quinapril", "Vancomycin", "Ibuprofen",
#' "Paclitaxel", "Epinephrine", "Dexamethasone"),
#' 10, replace = TRUE),
#' dose = sample(c(100, 250), 10, replace = TRUE)
#' )
#'
#' # Identify three drugs likely to be good predictors of survival
#'
#' get_best_levels(d = df,
#' longsheet = meds,
#' id = patient,
#' groups = drug,
#' outcome = survived,
#' n_levels = 3)
#'
#' # Identify four drugs likely to make good features and add them to df.
#' # The "fill", "fun", and "missing_fill" arguments are passed to
#' # `pivot`, which allows us to use the total doses of each drug given to the
#' # patient as our new features
#'
#' new_df <- add_best_levels(d = df,
#' longsheet = meds,
#' id = patient,
#' groups = drug,
#' outcome = survived,
#' n_levels = 4,
#' fill = dose,
#' fun = sum,
#' missing_fill = 0)
#' new_df
#'
#' # The names of the medications that were added to df in new_df are stored in the
#' # best_levels attribute of new_df so that the same columns can be added in
#' # deployment. This is useful because you need to have the same columns to make
#' # predictions as you had in model training. When you are ready to add levels to
#' # a deployment data frame, you can pass to the "levels" argument of
#' # add_best_levels either the models trained on new_df, new_df itself, or the
#' # character vector of levels to add.
#'
#' deployment_df <- tibble::tibble(
#' patient = "p6",
#' age = 30
#' )
#' deployment_meds <- tibble::tibble(
#' patient = rep("p6", 2),
#' drug = rep("Vancomycin", 2),
#' dose = c(100, 250)
#' )
#'
#' # Now, even though Vancomycin is the only drug that appears in
#' # deployment_meds, because we pass new_df to "levels", we get all the columns
#' # needed to make predictions on a model trained on new_df
#'
#' add_best_levels(d = deployment_df,
#' longsheet = deployment_meds,
#' id = patient,
#' groups = drug,
#' levels = new_df,
#' fill = dose,
#' missing_fill = 0)
add_best_levels <- function(d, longsheet, id, groups, outcome, n_levels = 100,
min_obs = 1, positive_class = "Y", cohesion_weight = 2,
levels = NULL, fill, fun = sum, missing_fill = NA) {
id <- rlang::enquo(id)
groups <- rlang::enquo(groups)
outcome <- rlang::enquo(outcome)
fill <- rlang::enquo(fill)
add_as_empty <- character()
if (is.null(levels)) {
to_add <- get_best_levels(d, longsheet, !!id, !!groups, !!outcome,
n_levels, min_obs, positive_class)
add_as_empty <- character()
} else {
levels_name <- paste0(rlang::quo_name(groups), "_levels")
# If a data frame or model_list was passed to levels, pull levels from it
if (is.model_list(levels) || is.data.frame(levels)) {
levels <- attr(levels, "best_levels")[[levels_name]]
if (is.null(levels))
stop("Looked for ", levels_name, " as an attribute of ",
match.call()$levels, " but it was NULL.")
# If the best_levels list was passed in pull the appropriate element out
} else if (is.list(levels)) {
levels <- levels[[levels_name]]
} else if (!is.character(levels)) {
stop("You passed a ", class(levels), " to levels. It should be a data frame ",
"returned from add_best_levels, a model_list trained on such a data frame ",
"the 'best_levels' attribute from such a data frame, or a character ",
"vector of levels to use.")
}
present_levels <- unique(dplyr::pull(longsheet, !!groups))
to_add <- levels[levels %in% present_levels]
add_as_empty <- levels[!levels %in% present_levels]
}
if (purrr::is_empty(to_add)) {
simplified_id <-
d %>%
dplyr::select(!!id)
empty_col_names <- paste0(quo_name(eval(groups)), "_", add_as_empty)
class(missing_fill) <- class(d[[rlang::quo_name(fill)]])
pivoted <-
matrix(missing_fill,
nrow = length(simplified_id), ncol = length(empty_col_names),
dimnames = list(NULL, empty_col_names)) %>%
tibble::as_tibble() %>%
bind_cols(simplified_id, .)
} else {
longsheet <- dplyr::filter(longsheet, (!!groups) %in% to_add)
pivot_args <- list(
d = longsheet,
grain = eval(id),
spread = eval(groups),
fun = fun,
missing_fill = missing_fill
)
if (!missing(fill))
pivot_args$fill <- eval(fill)
if (length(add_as_empty))
pivot_args$extra_cols <- add_as_empty
pivoted <- do.call(pivot, pivot_args)
}
pivoted <-
pivoted %>%
dplyr::left_join(d, ., by = rlang::quo_name(id))
# Replace any rows not found in the pivot table in the join with missing_fill
new_cols <- setdiff(names(pivoted), names(d))
if (!is.na(missing_fill)) {
pivoted[new_cols][is.na(pivoted[new_cols])] <- missing_fill
}
# Add new best_levels to any that came in on d
attr(pivoted, "best_levels") <-
c(attr(d, "best_levels"),
setNames(list(dplyr::union(to_add, add_as_empty)), paste0(rlang::quo_name(groups), "_levels")))
return(pivoted)
}
#' @export
#' @rdname add_best_levels
get_best_levels <- function(d, longsheet, id, groups, outcome, n_levels = 100,
min_obs = 1, positive_class = "Y",
cohesion_weight = 2) {
id <- rlang::enquo(id)
groups <- rlang::enquo(groups)
outcome <- rlang::enquo(outcome)
missing_check(d, outcome)
if (!is.numeric(n_levels) || !is.numeric(min_obs))
stop("n_levels and min_obs should both be integers")
# Check for multiple observations per unit of observation in d
id_ft <- dplyr::count(d, !!id) %>% dplyr::filter(n > 1)
if (nrow(id_ft))
stop("d can have only one row per observation. The following ID(s) had more ",
"than one observation: ", list_variables(dplyr::pull(id_ft, !!id)))
if (isTRUE(all.equal(d, longsheet))) {
# Remove duplicate columns that muck up joined table
longsheet <- longsheet %>% select_not(outcome)
d <- select_not(d, groups)
}
tomodel <-
longsheet %>%
# Don't want to count the same outcome twice, so only allow one combo of grain x grouper
dplyr::distinct(!!id, !!groups) %>%
dplyr::filter(!is.na(!!groups)) %>%
dplyr::inner_join(d, ., by = rlang::quo_name(id)) %>%
# Filter any level present in only one grain
group_by(!!groups) %>%
filter(n_distinct(!!sym(quo_name(id))) >= min_obs) %>%
ungroup()
if (!nrow(tomodel)) {
warning("No levels present in at least ", min_obs, " observations")
return(character())
}
if (is.numeric(dplyr::pull(tomodel, !!outcome))) {
# Regression
# Use the distance from the grand-mean divided by the sample SD to rank predictors
# Groups with no variance in outcome rise to the top even if they're very small
tomodel <-
tomodel %>%
mutate(!!rlang::quo_name(outcome) := !!outcome - mean(!!outcome)) %>%
group_by(!!groups) %>%
summarize(mean_ssd = mean(!!outcome) / sqrt(stats::var(!!outcome) / n())) %>%
arrange(desc(abs(mean_ssd)))
tozip <-
split(tomodel, sign(tomodel$mean_ssd)) %>%
purrr::map(~ pull(.x, !!groups))
} else {
# Classification
# Using basically the log-loss from being present in all observations and
# from perfect separation of outcomes. Epislon for present in all is 1/2 a
# record; for perfect separation is 1/2 an observation.
total_observations <- n_distinct(dplyr::pull(tomodel, !!id))
levs <-
tomodel %>%
group_by(!!groups) %>%
summarize(fraction_positive = mean(!!outcome == positive_class),
# If perfect separation of outcomes, say it got 1/2 of one wrong
fraction_positive = dplyr::case_when(
fraction_positive == 1 ~ 1 - (.5 / total_observations),
fraction_positive == 0 ~ .5 / total_observations,
TRUE ~ fraction_positive),
# If level present in every observation, call it every one minus one-half
present_in = ifelse(n_distinct(rlang::quo_name(id)) == total_observations,
total_observations - .5, n_distinct(rlang::quo_name(id))),
log_dist_from_in_all = -log(present_in / total_observations)) %>%
dplyr::select(-present_in)
median_positive <- stats::median(levs$fraction_positive)
levs <-
levs %>%
mutate(predictor_of = as.integer(fraction_positive > stats::median(fraction_positive)),
log_loss = - (predictor_of * log(fraction_positive) + (1 - predictor_of) * log(1 - fraction_positive)),
badness = log_loss ^ cohesion_weight * log_dist_from_in_all) %>%
arrange(badness)
tozip <-
split(levs, levs$predictor_of) %>%
purrr::map(~ pull(.x, !!groups))
}
out <-
if (length(tozip) == 1) {
tozip[[1]]
} else {
zip_vectors(tozip[[1]], tozip[[2]])
}
if (length(out) > n_levels)
out <- out[seq_len(n_levels)]
return(out)
}
#' Create one vector from two, with each vectors' first element in the first and
#' second positions, the second elements third and fourth, etc.
#' @noRd
zip_vectors <- function(v1, v2) {
ll <- list(v1, v2)
lengths <- purrr::map_int(ll, length)
zipped <-
lapply(seq_len(min(lengths)), function(i) {
c(ll[[1]][i], ll[[2]][i])
}) %>%
unlist()
# if they had different lengths, add trailing part of the longer vector
if (length(unique(lengths)) > 1)
zipped <- c(zipped, ll[[which.max(lengths)]][(min(lengths) + 1):max(lengths)])
return(zipped)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.