R/the_feature_engineer.R

Defines functions the_feature_engineer

Documented in the_feature_engineer

#' @title Suggest variable interactions and composite features for random forest models
#' @description Suggests candidate variable interactions and composite features able to improve predictive accuracy over data not used to train the model via spatial cross-validation with [rf_evaluate()]. For a pair of predictors `a` and `b`, interactions are build via multiplication (`a * b`), while composite features are built by extracting the first factor of a principal component analysis performed with [pca()], after rescaling `a` and `b` between 1 and 100. Interactions and composite features are named `a..x..b` and `a..pca..b` respectively.
#'
#'Candidate variables `a` and `b` are selected from those predictors in `predictor.variable.names` with a variable importance above `importance.threshold` (set by default to the median of the importance scores).
#'
#' For each interaction and composite feature, a model including all the predictors plus the interaction or composite feature is fitted, and it's R squared (or AUC if the response is binary) computed via spatial cross-validation (see [rf_evaluate()]) is compared with the R squared of the model without interactions or composite features.
#'
#'From all the potential interactions screened, only those with a positive increase in R squared (or AUC when the response is binomial) of the model, a variable importance above the median, and a maximum correlation among themselves and with the predictors in `predictor.variable.names` not higher than `cor.threshold` (set to 0.5 by default) are selected. Such a restrictive set of rules ensures that the selected interactions can be used right away for modeling purposes without increasing model complexity unnecessarily. However, the suggested variable interactions might not make sense from a domain expertise standpoint, so please, examine them with care.
#'
#'The function returns the criteria used to select the interactions, and the data required to use these interactions a model.
#'
#' @param data Data frame with a response variable and a set of predictors. Default: `NULL`
#' @param dependent.variable.name Character string with the name of the response variable. Must be in the column names of `data`. If the dependent variable is binary with values 1 and 0, the argument `case.weights` of `ranger` is populated by the function [case_weights()]. Default: `NULL`
#' @param predictor.variable.names Character vector with the names of the predictive variables, or object of class `"variable_selection"` produced by [auto_vif()] and/or [auto_cor()]. Every element of this vector must be in the column names of `data`. Default: `NULL`
#' @param xy Data frame or matrix with two columns containing coordinates and named "x" and "y". If not provided, the comparison between models with and without variable interactions is not done.
#' @param ranger.arguments Named list with \link[ranger]{ranger} arguments (other arguments of this function can also go here). All \link[ranger]{ranger} arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of \link[ranger]{ranger} if you are not familiar with the arguments of this function.
#' @param repetitions Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: `30`
#' @param training.fraction Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: `0.75`
#' @param importance.threshold Numeric between 0 and 1, quantile of variable importance scores over which to select individual predictors to explore interactions among them. Larger values reduce the number of potential interactions explored. Default: `0.75`
#' @param cor.threshold Numeric, maximum Pearson correlation between any pair of the selected interactions, and between any interaction and the predictors in `predictor.variable.names`. Default: `0.75`
#' @param point.color Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. `viridis::viridis(100)`). Default: `viridis::viridis(100, option = "F", alpha = 0.8)`
#' @param seed Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: `NULL`
#' @param verbose Logical. If `TRUE`, messages and plots generated during the execution of the function are displayed. Default: `TRUE`
#' @param n.cores Integer, number of cores to use for parallel execution. Creates a socket cluster with `parallel::makeCluster()`, runs operations in parallel with `foreach` and `%dopar%`, and stops the cluster with `parallel::clusterStop()` when the job is done. Default: `parallel::detectCores() - 1`
#' @param cluster A cluster definition generated with `parallel::makeCluster()`. If provided, overrides `n.cores`. When `cluster = NULL` (default value), and `model` is provided, the cluster in `model`, if any, is used instead. If this cluster is `NULL`, then the function uses `n.cores` instead. The function does not stop a provided cluster, so it should be stopped with `parallel::stopCluster()` afterwards. The cluster definition is stored in the output list under the name "cluster" so it can be passed to other functions via the `model` argument, or using the `%>%` pipe. Default: `NULL`
#' @return A list with seven slots:
#' \itemize{
#'   \item `screening`: Data frame with selection scores of all the interactions considered.
#'   \item `selected`: Data frame with selection scores of the selected interactions.
#'   \item `df`: Data frame with the computed interactions.
#'   \item `plot`: List of plots of the selected interactions versus the response variable. The output list can be plotted all at once with `patchwork::wrap_plots(p)` or `cowplot::plot_grid(plotlist = p)`, or one by one by extracting each plot from the list.
#'   \item `data`: Data frame with the response variable, the predictors, and the selected interactions, ready to be used as `data` argument in the package functions.
#'   \item `dependent.variable.name`: Character, name of the response.
#'   \item `predictor.variable.names`: Character vector with the names of the predictors and the selected interactions.
#' }
#'
#'
#' @examples
#' if(interactive()){
#'
#'  #load example data
#'  data(plant_richness_df)
#'
#'  new.features <- the_feature_engineer(
#'    data = plant_richness_df,
#'    dependent.variable.name = "richness_species_vascular",
#'    predictor.variable.names = colnames(plant_richness_df)[5:21],
#'    n.cores = 1,
#'    verbose = TRUE
#'  )
#'
#'  new.features$screening
#'  new.features$selected
#'  new.features$columns
#'
#' }
#' @importFrom utils combn
#' @importFrom foreach %do%
#' @rdname the_feature_engineer
#' @export
the_feature_engineer <- function(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  importance.threshold = 0.75,
  cor.threshold = 0.75,
  point.color = viridis::viridis(
    100,
    option = "F",
    alpha = 0.8
  ),
  seed = NULL,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
){

  #coerce to data frame if tibble
  if(is.null(data)){
    stop("Argument 'data' is missing.")
  } else {
    if(inherits(data, "tbl_df") | inherits(data, "tbl")){
      data <- as.data.frame(data)
    }
  }

  if(is.null(xy)){
    stop("Argument 'xy' is missing")
  } else {
    if(inherits(xy, "tbl_df") | inherits(xy, "tbl")){
      xy <- as.data.frame(xy)
    }
  }

  #CLUSTER SETUP
  #cluster is provided
  if(!is.null(cluster)){

    #n.cores <- NULL
    n.cores <- NULL

    #flat to not stop cluster after execution
    stop.cluster <- FALSE

  } else {

    #creates and registers cluster
    cluster <- parallel::makeCluster(
      n.cores,
      type = "PSOCK"
    )

    #flag to stop cluster
    stop.cluster <- TRUE

  }

  #registering cluster
  doParallel::registerDoParallel(cl = cluster)

  #finding out if the response is binary
  is.binary <- is_binary(
    data = data,
    dependent.variable.name = dependent.variable.name
  )
  if(is.binary == TRUE){
    metric <- "auc"
  } else {
    metric <- "r.squared"
  }


  #declaring variables
  variable <- NULL
  y <- NULL

  #predictor.variable.names comes from auto_vif or auto_cor
  if(!is.null(predictor.variable.names)){
    if(inherits(predictor.variable.names, "variable_selection")){
      predictor.variable.names <- predictor.variable.names$selected.variables
    }
  }

  if(importance.threshold > 1){
    importance.threshold <- 0.99
  }
  if(importance.threshold < 0){
    importance.threshold <- 0.01
  }

  if(verbose == TRUE){
    message("Fitting and evaluating a model without interactions.")
  }

  #without
  model.without.interactions <- rf(
    data = data,
    dependent.variable.name = dependent.variable.name,
    predictor.variable.names = predictor.variable.names,
    ranger.arguments = ranger.arguments,
    seed = seed,
    verbose = FALSE
  )

  #evaluation
  model.without.interactions <- rf_evaluate(
    model = model.without.interactions,
    repetitions = repetitions,
    training.fraction = training.fraction,
    xy = xy,
    metrics = metric,
    seed = seed,
    verbose = FALSE,
    n.cores = n.cores,
    cluster = cluster
  )

  #metric
  model.without.interactions.evaluation <- model.without.interactions$evaluation$aggregated
  model.without.interactions.metric <- model.without.interactions.evaluation[
    model.without.interactions.evaluation$model == "Testing",
    "median"
  ]

  #ranger.arguments.i
  ranger.arguments.i <- ranger.arguments
  ranger.arguments.i$data <- NULL
  ranger.arguments.i$dependent.variable.name <- NULL
  ranger.arguments.i$predictor.variable.names <- NULL

  #setting quantile of the importance threshold
  importance.threshold <- quantile(
    x = model.without.interactions$importance$per.variable$importance,
    probs = importance.threshold
  )

  #selected variables
  variables.to.test <- model.without.interactions$importance$per.variable[
    model.without.interactions$importance$per.variable$importance >= importance.threshold,
    "variable"
  ]

  #pairs of variables
  variables.pairs <- as.data.frame(
    t(
      utils::combn(
        variables.to.test,
        2
      )
    ),
    stringsAsFactors = FALSE
  )

  if(verbose == TRUE){
    message(paste0("Testing ", nrow(variables.pairs), " candidate interactions."))
  }

  #testing interactions
  i <- NULL
  interaction.screening.1 <- foreach::foreach(
    i = seq(1, nrow(variables.pairs)),
    .combine = "rbind",
    .verbose = FALSE
  ) %dopar% {

    #get pair
    pair.i <- c(variables.pairs[i, 1], variables.pairs[i, 2])
    pair.i.name <- paste0(
      pair.i,
      collapse = "..x.."
    )

    #get interaction values
    pair.i.1 <- spatialRF::rescale_vector(
      x = data[, pair.i[1]],
      new.min = 1,
      new.max = 100
    )
    pair.i.2 <- spatialRF::rescale_vector(
      x = data[, pair.i[2]],
      new.min = 1,
      new.max = 100
    )

    #prepare data.i
    data.i <- data.frame(
      data,
      interaction = pair.i.1 * pair.i.2
    )
    colnames(data.i)[ncol(data.i)] <- pair.i.name

    #prepare predictor.variable.names.i
    predictor.variable.names.i <- c(
      predictor.variable.names,
      pair.i.name
    )

    #computing max correlation with predictors
    cor.out <- cor(data.i[, predictor.variable.names.i])
    diag(cor.out) <- NA
    max.cor <- max(cor.out[pair.i.name, ], na.rm = TRUE)

    #if the maximum correlation is lower than the threshold, fit model
    if(max.cor <= cor.threshold){

      #without
      model.i <- spatialRF::rf(
        data = data.i,
        dependent.variable.name = dependent.variable.name,
        predictor.variable.names = predictor.variable.names.i,
        ranger.arguments = ranger.arguments.i,
        seed = seed,
        verbose = FALSE
      )

      #evaluation
      model.i <- spatialRF::rf_evaluate(
        model = model.i,
        repetitions = repetitions,
        training.fraction = training.fraction,
        xy = xy,
        metrics = metric,
        seed = seed,
        verbose = FALSE,
        n.cores = 1
      )

      #importance data frame
      model.i.importance <- model.i$importance$per.variable

      #metric
      model.i.evaluation <- model.i$evaluation$aggregated
      model.i.metric <- model.i.evaluation[
        model.i.evaluation$model == "Testing",
        "median"
      ]

      #gathering results
      out.df <- data.frame(
        interaction.name = pair.i.name,
        interaction.importance = round((model.i.importance[model.i.importance$variable == pair.i.name, "importance"] * 100) / max(model.i.importance$importance), 3),
        interaction.metric.gain = model.i.metric - model.without.interactions.metric,
        max.cor.with.predictors = max.cor,
        variable.a.name = pair.i[1],
        variable.b.name = pair.i[2]
      )

    } else {

      #gathering results
      out.df <- data.frame(
        interaction.name = NA,
        interaction.importance = NA,
        interaction.metric.gain = NA,
        max.cor.with.predictors = NA,
        variable.a.name = NA,
        variable.b.name = NA
      )

    }

    return(out.df)

  }#end of parallelized loop

  interaction.screening.2 <- foreach::foreach(
    i = seq(1, nrow(variables.pairs)),
    .combine = "rbind",
    .verbose = FALSE
  ) %dopar% {

    #get pair
    pair.i <- c(variables.pairs[i, 1], variables.pairs[i, 2])
    pair.i.name <- paste0(
      pair.i[1],
      "..pca..",
      pair.i[2]
    )

    #get interaction values
    pair.i.1 <- spatialRF::rescale_vector(
      x = data[, pair.i[1]],
      new.min = 1,
      new.max = 100
    )
    pair.i.2 <- spatialRF::rescale_vector(
      x = data[, pair.i[2]],
      new.min = 1,
      new.max = 100
    )

    #prepare data.i
    data.i <- data.frame(
      data,
      interaction = spatialRF::pca(
        x = data.frame(
          pair.i.1,
          pair.i.2
        )
      )$pca_factor_1
    )
    colnames(data.i)[ncol(data.i)] <- pair.i.name

    #prepare predictor.variable.names.i
    predictor.variable.names.i <- c(
      predictor.variable.names,
      pair.i.name
    )

    #computing max correlation with predictors
    cor.out <- cor(data.i[, predictor.variable.names.i])
    diag(cor.out) <- NA
    max.cor <- max(cor.out[pair.i.name, ], na.rm = TRUE)

    #if the maximum correlation is lower than the threshold, fit model
    if(max.cor <= cor.threshold){

      #without
      model.i <- spatialRF::rf(
        data = data.i,
        dependent.variable.name = dependent.variable.name,
        predictor.variable.names = predictor.variable.names.i,
        ranger.arguments = ranger.arguments.i,
        seed = seed,
        verbose = FALSE
      )

      #evaluation
      model.i <- spatialRF::rf_evaluate(
        model = model.i,
        repetitions = repetitions,
        training.fraction = training.fraction,
        xy = xy,
        metrics = metric,
        seed = seed,
        verbose = FALSE,
        n.cores = 1
      )

      #importance data frame
      model.i.importance <- model.i$importance$per.variable

      #metric
      model.i.evaluation <- model.i$evaluation$aggregated
      model.i.metric <- model.i.evaluation[
        model.i.evaluation$model == "Testing",
        "median"
      ]

      #gathering results
      out.df <- data.frame(
        interaction.name = pair.i.name,
        interaction.importance = round((model.i.importance[model.i.importance$variable == pair.i.name, "importance"] * 100) / max(model.i.importance$importance), 3),
        interaction.metric.gain = model.i.metric - model.without.interactions.metric,
        max.cor.with.predictors = max.cor,
        variable.a.name = pair.i[1],
        variable.b.name = pair.i[2]
      )

    } else {

      #gathering results
      out.df <- data.frame(
        interaction.name = NA,
        interaction.importance = NA,
        interaction.metric.gain = NA,
        max.cor.with.predictors = NA,
        variable.a.name = NA,
        variable.b.name = NA
      )

    }

    return(out.df)

  }#end of parallelized loop

  interaction.screening <- rbind(
    interaction.screening.1,
    interaction.screening.2
  ) %>%
    na.omit()

  if(nrow(interaction.screening) == 0){
    message("No promising interactions found. \n")
    stop()
  }

  #adding column of selected interactions
  interaction.screening$selected <- ifelse(
    interaction.screening$interaction.metric.gain > 0.01 &
      interaction.screening$interaction.metric.gain > 0.01,
    TRUE,
    FALSE
  )

  if(sum(interaction.screening$selected) == 0){
    message("No promising interactions found. \n")
    return(NA)
  }

  #compute order
  interaction.screening$order <-
    spatialRF::rescale_vector(interaction.screening$interaction.importance) +
    spatialRF::rescale_vector(interaction.screening$interaction.metric.gain) -
    spatialRF::rescale_vector(1 - interaction.screening$max.cor.with.predictors)

  #arrange by gain
  interaction.screening <- dplyr::arrange(
    interaction.screening,
    dplyr::desc(order)
  )

  #remove order
  interaction.screening$order <- NULL

  #selected only
  interaction.screening.selected <- interaction.screening[interaction.screening$selected == TRUE, ]
  interaction.screening.selected <- interaction.screening.selected[, c(
    "interaction.name",
    "interaction.importance",
    "interaction.metric.gain",
    "max.cor.with.predictors",
    "variable.a.name",
    "variable.b.name"
  )]


  #preparing data frame of interactions
  interaction.df <- data.frame(
    dummy.column = rep(NA, nrow(data))
  )

  for(i in seq(1, nrow(interaction.screening.selected))){

    #get interaction values
    pair.i.1 <- rescale_vector(
      x = data[, interaction.screening.selected[i, "variable.a.name"]],
      new.min = 1,
      new.max = 100
    )
    pair.i.2 <- rescale_vector(
      x = data[, interaction.screening.selected[i, "variable.b.name"]],
      new.min = 1,
      new.max = 100
    )
    pair.i.name <- interaction.screening.selected[i, "interaction.name"]

    #find what type of interaction
    if(grepl("..pca..", pair.i.name, fixed = TRUE) == TRUE){
      interaction.df[, interaction.screening.selected[i, "interaction.name"]] <- spatialRF::pca(
        x = data.frame(
          pair.i.1,
          pair.i.2
        )
      )$pca_factor_1
    } else {
      interaction.df[, interaction.screening.selected[i, "interaction.name"]] <- pair.i.1 * pair.i.2
    }
  }
  interaction.df$dummy.column <- NULL

  #removing variable names
  interaction.screening.selected$variable.a.name <- NULL
  interaction.screening.selected$variable.b.name <- NULL

  #filtering out interactions by correlation among themselves
  if(nrow(interaction.screening.selected) > 1){

    interaction.selection <- auto_cor(
      x = interaction.df,
      preference.order = interaction.screening.selected$interaction.name,
      cor.threshold = cor.threshold,
      verbose = FALSE
    )

    interaction.screening.selected <- interaction.screening.selected[
      interaction.screening.selected$interaction.name %in% interaction.selection$selected.variables,
    ]

    interaction.df <- interaction.df[, interaction.selection$selected.variables, drop = FALSE]

  }

  if(verbose == TRUE){
    message(
      paste0(
        "Interactions identified: ",
        nrow(interaction.screening.selected)
      )
    )
  }


  #printing suggested interactions
  if(verbose == TRUE){

    x <- interaction.screening.selected
    if(metric == "r.squared"){
      colnames(x) <- c("Interaction", "Importance (% of max)", "R-squared improvement", "Max cor with predictors")

    }
    if(metric == "auc"){
      colnames(x) <- c("Interaction", "Importance (% of max)", "AUC improvement", "Max cor with predictors")
    }

    x.hux <- huxtable::hux(x) %>%
      huxtable::set_bold(
        row = 1,
        col = huxtable::everywhere,
        value = TRUE
      ) %>%
      huxtable::set_all_borders(TRUE)
    huxtable::number_format(x.hux)[2:nrow(x), 2] <- 1
    huxtable::number_format(x.hux)[2:nrow(x), 3] <- 3
    huxtable::number_format(x.hux)[2:nrow(x), 4] <- 2
    huxtable::print_screen(x.hux, colnames = FALSE)

  }

  #plot interactions
  plot.list <- list()
  for(variable in names(interaction.df)){

    #create plot data frame
    plot.df <- data.frame(
      y = data[, dependent.variable.name],
      x = interaction.df[, variable]
    )

    #save plot
    plot.list[[variable]] <- ggplot2::ggplot() +
      ggplot2::geom_point(
        data = plot.df,
        ggplot2::aes(
          x = x,
          y = y,
          color = y
        )
      ) +
      ggplot2::scale_color_gradientn(colors = point.color) +
      ggplot2::xlab(variable) +
      ggplot2::ylab(dependent.variable.name) +
      ggplot2::ggtitle(
        paste0(
          "+R2: ",
          round(
            interaction.screening.selected[
              interaction.screening.selected$interaction.name == variable,
              "interaction.metric.gain"
            ],
            3
          ),
          "; Imp. (%): ",
          round(
            interaction.screening.selected[
              interaction.screening.selected$interaction.name == variable,
              "interaction.importance"
            ],
            0
          ),
          "; Max cor: ",
          round(
            interaction.screening.selected[
              interaction.screening.selected$interaction.name == variable,
              "max.cor.with.predictors"],
            2
          )
        )
      ) +
      ggplot2::theme_bw() +
      ggplot2::theme(legend.position = "none") +
      ggplot2::theme(
        plot.title = ggplot2::element_text(hjust = 0.5),
      )

  }

  #generating training df
  training.df <- cbind(
    data,
    interaction.df
  )

  #preparing out list
  out.list <- list()
  out.list$screening <- interaction.screening
  out.list$selected <- interaction.screening.selected
  out.list$plot <- plot.list
  out.list$data <- training.df
  out.list$dependent.variable.name <- dependent.variable.name
  out.list$predictor.variable.names <- c(
    predictor.variable.names,
    colnames(interaction.df)
  )

  if(verbose == TRUE){
    message("Comparing models with and without interactions via spatial cross-validation.")
  }

  #fitting models with and without the selected interactions

  #with
  model.with.interactions <- spatialRF::rf(
    data = training.df,
    dependent.variable.name = dependent.variable.name,
    predictor.variable.names = c(
      predictor.variable.names,
      colnames(interaction.df)
    ),
    ranger.arguments = ranger.arguments,
    verbose = FALSE
  )

  #pick colors for comparison from the palette
  n.colors <- length(point.color)
  line.color = point.color[1]
  color.a <- point.color[floor(n.colors/4)]
  color.b <- point.color[floor((n.colors/4)*3)]

  #comparison
  comparison <- rf_compare(
    models = list(
      with.interactions = model.with.interactions,
      without.interactions = model.without.interactions
    ),
    repetitions = repetitions,
    xy = xy,
    metrics = metric,
    fill.color = c(color.a, color.b),
    line.color = line.color,
    seed = seed,
    verbose = FALSE,
    n.cores = n.cores,
    cluster = cluster
  )

  #adding it to the plot list
  plot.list[["comparison"]] <- comparison$plot

  #saving new elements into the output list
  out.list$comparison.df <- comparison$comparison.df
  out.list$plot <- plot.list

  if(verbose == TRUE){
    print(patchwork::wrap_plots(out.list$plot))
  }

  #stopping cluster
  if(stop.cluster == TRUE){
    parallel::stopCluster(cl = cluster)
  }


  out.list

}
BlasBenito/spatialRF documentation built on Sept. 1, 2022, 1:42 p.m.