inst/deprecated/R/LRCbestsubsets.R

##' Best subsets logistic regression classifier (LRC) with arbitrary loss function
##'
##' @description
##' This function uses the \code{\link{bestglm}}
##' function from the \pkg{bestglm}
##' package to fit a best subsets logistic regression model and then estimate an optimal
##' binary classification threshold using cross validation, where optimality is defined
##' by minimizing an arbitrary, user-specified discrete loss function.
##'
##' @details
##' For a given partition of the training data, cross validation is
##' performed to estimate the optimal value of
##' \eqn{\tau},
##' which is used to dichotomize the probability predictions of the
##' logistic regression model into binary outcomes.
##' (Specifically, if the probability that an observation
##' belongs to the second level of \code{truthLabels} exceeds \eqn{\tau}, it is
##' classified as belonging to that second level.)  In this case, optimality is defined
##' as the value of \eqn{\tau} that minimizes the loss function defined by
##' \code{lossMat}.
##'
##' \code{LRCbestsubsets} searches for the optimal value of \eqn{\tau} by
##' fitting the best subsets logistic regression model and then calculating
##' the expected loss for each \eqn{\tau} in \code{tauVec}.  The
##' value of \eqn{\tau} that results in the lowest risk designates the
##' optimal threshold for a given cross-validation partition of the data.
##'
##' This process is repeated \code{cvReps} times, where each time a different random
##' partition of the data is created using its own seed, resulting in another
##' 'optimal' estimate of \eqn{\tau}.  The final estimate of
##' \eqn{\tau} is the median of those estimates.
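##'
##' As a conceptual sketch of one replicate (using hypothetical objects:
##' \code{cvProbs}, the cross-validated probability predictions, and
##' \code{expectedLoss()}, which averages the \code{lossMat} penalties over the
##' held-out observations):
##' \preformatted{
##' risk <- sapply(tauVec, function(tau) {
##'   predicted <- levels(truthLabels)[(cvProbs > tau) + 1]
##'   expectedLoss(truthLabels, predicted, lossMat)
##' })
##' tauOpt <- tauVec[which.min(risk)]  # one replicate's estimate of tau
##' # ...repeated for each of the cvReps replicates, and then:
##' tauFinal <- median(tauOptAcrossReps)
##' }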
##'
##' The final best subsets logistic regression classifier is given by estimating
##' the best subsets regression
##' coefficients using all the training data and then using the optimal \eqn{\tau} to
##' dichotomize the probability predictions from the logistic regression model into
##' one of two binary categories.
##'
##' For leave-one-out (LOO) cross-validation, which occurs when \code{cvFolds = NROW(predictors)},
##' the values of \code{cvReps}, \code{masterSeed}, and \code{nJobs} are ignored,
##' since there is only one way to create the folds for LOO. Because parallelization happens
##' at the cross-validation replicate level, parallelization for LOO isn't currently available.
##'
##' @author Landon Sego
##'
##' @rdname LRCbestsubsets
##'
##' @export
##'
##' @param truthLabels A factor with two levels containing the true labels for each
##' observation.
##' If it is more desirable to correctly predict one of the two classes over the other,
##' the second level of this factor should be the class you are most interested in
##' predicting correctly.
##'
##' @param predictors A matrix whose columns are the explanatory regression variables
##'
##' @param lossMat A loss matrix of class \code{lossMat}, produced by
##' \code{\link{lossMatrix}}, that specifies the penalties for classification errors.
##'
##' @param weight The observation weights that are passed to \code{\link{glm}}.
##' The default value is 1 for each observation. Refer to the \code{weights} argument of
##' \code{\link{glm}} for further information.
##'
##' @param tauVec A sequence of tau threshold values for the
##' logistic regression classifier.
##'
##' @param naFilter The maximum proportion of data observations that can be missing
##' for a given predictor
##' (a column in \code{predictors}). If, for a given predictor, the proportion of
##' sample observations which are NA is greater than \code{naFilter}, the predictor is
##' not included in the best subsets model fitting.
##'
##' @param cvFolds The number of cross validation folds, which must be between 2 and
##' \code{NROW(predictors)}.  Note that \code{cvFolds = NROW(predictors)}
##' gives leave-one-out (LOO) cross validation. See Details for more on LOO.
##'
##' @param cvReps The number of cross validation replicates, i.e., the number
##' of times to repeat the cross validation
##' by randomly repartitioning the data into folds.
##'
##' @param masterSeed The random seed used to generate unique seeds for
##' each cross validation replicate.
##'
##' @param nJobs The number of cores on the local host
##' to use in parallelizing the training.  Parallelization
##' takes place at the \code{cvReps} level, i.e., if \code{cvReps = 1}, parallelizing would
##' do no good, whereas if \code{cvReps = 2}, each rep would be run separately in its
##' own thread if \code{nJobs = 2}.
##' Parallelization is executed using \code{\link{parLapplyW}} from the
##' \pkg{Smisc} package.
##'
##' @param verbose A logical, set to \code{TRUE} to receive messages regarding
##' the progress of the training algorithm.
##'
##' @param \dots Additional named arguments to \code{\link{bestglm}} and \code{\link{glm}}.
##'
##' @return
##' Returns an object of class \code{LRCbestsubsets}, which
##' inherits from classes \code{glm} and \code{lm}.  It contains the
##' object returned by \code{\link{glm}} that has been fit to all the data using
##' the best predictors identified by \code{\link{bestglm}}. It also contains the values
##' of the optimal estimates of \eqn{\tau} from the individual cross validation replicates,
##' along with the median of those estimates, which constitutes the overall estimate of
##' the optimal \eqn{\tau}.
##'
##' Methods \code{print}, \code{summary}, \code{plot}, and \code{coef} are provided.
##' In addition, many of the S3 methods for \code{glm} and \code{lm} can be applied to
##' the object returned by \code{LRCbestsubsets}.  See \code{methods(class = "glm")} or
##' \code{methods(class = "lm")} to see a listing of these methods.
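##'
##' For reference, the key components of the returned object (here called
##' \code{fit}) can be accessed like this:
##' \preformatted{
##' fit$tau          # overall optimal threshold (median across CV replicates)
##' fit$optimalTaus  # per-replicate optimal taus and their expected losses
##' coef(fit)        # coefficients of the selected logistic regression model
##' }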
##'
##' @examples
##' # Load the Mojave data
##' data(Mojave)
##' str(Mojave)
##'
##' # Here we select the predictor variables (remove the location variables and the response)
##' predictors <- Mojave[,-c(1,2,11)]
##'
##' # Create a vector for the response (presence/absence of cheat grass)
##' cheat <- Mojave$cheatGrass
##'
##' # "0" is no cheatgrass observed and "1" is cheatgrass.  Note how "1" is the second level
##' # of the factor. This second level is the level for which the summary methods calculate
##' # the sensitivity. So the factor coding here is perfect, since we are most interested
##' # in predicting the presence of cheatgrass with greatest sensitivity (power).
##' levels(cheat)
##'
##' # Specify the loss matrix. In this example, we specify the penalty for missing
##' # cheatgrass as 2, while the penalty for predicting it falsely is 1.
##' lM <- lossMatrix(c("0","0","1","1"),
##'                  c("0","1","0","1"),
##'                  c(0,   1,  2,  0))
##' print(lM)
##'
##' # Train the best subsets logistic regression classifier (in particular, identify
##' # the optimal threshold, tau, that minimizes the loss).  As this takes some time,
##' # we'll skip this step
##' \dontrun{LRCbestsubsets_fit <- LRCbestsubsets(cheat, predictors, lM,
##'                                               cvReps = 100, cvFolds = 5,
##'                                               nJobs = max(1, parallel::detectCores() - 1))}
##'
##'
##' # We'll load the precalculated model fit instead
##' \donttest{data(LRCbestsubsets_fit)}
##'
##' \dontshow{
##' # Here is a call to LRCbestsubsets() that will run quickly for testing purposes
##' LRCbestsubsets_fit <- LRCbestsubsets(cheat, predictors, lM, cvReps = 3,
##'                                      cvFolds = 3, nJobs = max(1, parallel::detectCores() - 1),
##'                                      tauVec = c(0.4, 0.5))
##' }
##'
##'
##' # Demonstrate the various methods for LRCbestsubsets() output
##' # (print, summary, plot, coef)
##' print(LRCbestsubsets_fit)
##'
##' # Assigning the output of print extracts the set of optimal taus that
##' # were identified during the cross validation replicates
##' o <- print(LRCbestsubsets_fit)
##' o
##'
##' # Summarize the best fit LRC
##' summary(LRCbestsubsets_fit)
##'
##' # Plot the loss as a function of the optimal taus for each cross validation
##' # replicate.  Also plot the bestsubsets logistic regression glm object.
##' plot(LRCbestsubsets_fit)
##'
##' # Display the coefficients
##' coef(LRCbestsubsets_fit)
##'
##' # Make predictions using final model on the training data
##' out <- predict(LRCbestsubsets_fit, Mojave, truthCol = "cheatGrass")
##'
##' head(out)
##'
##' # Calculate the performance of predictions
##' summary(out)


## TODO:  Test with factor predictors

LRCbestsubsets <- function(truthLabels, predictors, lossMat,
                           weight = rep(1, NROW(predictors)),
                           tauVec = seq(0.1, 0.9, by = 0.05),
                           naFilter = 0.6,
                           cvFolds = 5,
                           cvReps = 10,
                           masterSeed = 1,
                           nJobs = 1,
                           verbose = FALSE,
                           ...) {

  # Checks on inputs
  stopifnot(is.factor(truthLabels),
#            is.matrix(predictors), I don't think a matrix is necessary for GLM
#            is.numeric(predictors),
            length(levels(truthLabels)) == 2,
            NCOL(predictors) > 1,
            length(truthLabels) == NROW(predictors),
            inherits(lossMat, "lossMat"),
            is.numeric(tauVec),
            all(tauVec < 1) & all(tauVec > 0),
            length(naFilter) == 1,
            is.numeric(naFilter),
            naFilter < 1 & naFilter > 0,
            length(cvFolds) == 1,
            length(cvReps) == 1,
            is.numeric(cvFolds),
            is.numeric(cvReps),
            cvFolds %% 1 == 0,
            cvFolds >= 2,
            cvFolds <= NROW(predictors),
            cvReps %% 1 == 0,
            cvReps > 0,
            is.numeric(masterSeed),
            is.numeric(nJobs),
            nJobs >= 1,
            is.logical(verbose))

  # Force the evaluation of the weight object immediately--this is IMPORTANT
  # because of R's lazy evaluation
  force(weight)

  ################################################################################
  # Data preparation
  ################################################################################

  # Get the data and filter missing values as necessary
  d <- dataPrep(truthLabels, predictors, weight, naFilter, verbose)

  # number of observations after filtering for NAs
  n <- length(d$truthLabels)

  # Report the number of observations
  if (verbose) {
    cat(n, "observations are available for fitting the LRCbestsubsets model\n")
  }

  # Get the name of the truthLabels
  truthLabelName <- deparse(substitute(truthLabels))

  # If it's not a data frame, change it
  if (is.matrix(d$predictors)) {
    d$predictors <- as.data.frame(d$predictors)
  }

  # Create a combined data frame in the form required by bestglm(): the
  # predictor columns followed by the response in the last column
  dm <- cbind(d$predictors, d$truthLabels)

  # Create column names for data frame
  colnames(dm) <- c(colnames(d$predictors), truthLabelName)

  # Create the weight column
  weight <- d$weight

  # Adjust the cross-validation parameters based on the number of observations
  # remaining in the data frame (some may have been removed due to missing data)
  if (cvFolds > n) {

    cvFolds <- n

    warning("After removing missing data, the largest possible number of 'cvFolds' is now ",
            n, ",\ncorresponding to the number of non-missing observations.  This ",
            "value will be used and will result in leave-one-out cross validation.")
  }


  # For leave-one-out CV, there is only one possible assignment of the folds,
  # so a single replicate suffices and parallelizing over replicates gains nothing
  if (n == cvFolds) {
    cvReps <- 1
    nJobs <- 1
  }

  # A wrapper function for calling single_LRCbestsubsets via parLapply
  trainWrapper <- function(seed) {

    single_LRCbestsubsets(dm,
                          lossMat,
                          weight,
                          tauVec,
                          cvFolds,
                          seed,
                          n,
                          verbose,
                          ...)

  } # trainWrapper

  # Get the vector of seeds that will ensure repeatability across threads
  seedVec <- createSeeds(masterSeed, cvReps)

  ################################################################################
  # Set up the cluster and run CV in parallel
  ################################################################################

  if (nJobs > 1) {

    # Execute the training in parallel
    pe <- Smisc::parLapplyW(seedVec, trainWrapper, njobs = nJobs,
                            expr = expression({library(bestglm)}),
                            varlist = c("dm",
                                        "tauVec",
                                        "n",
                                        "cvFolds",
                                        "lossMat",
                                        "weight",
                                        "verbose"))


    # Combine results into a data frame
    parmEstimates <- Smisc::list2df(pe, row.names = 1:cvReps)

  }

  ################################################################################
  # Single thread
  ################################################################################

  else {

    parmEstimates <- Smisc::list2df(lapply(seedVec, trainWrapper),
                                    row.names = 1:cvReps)

  }

  ################################################################################
  # Create the final model using the median tau value
  ################################################################################

  # Fit the best subsets model to all the (filtered) data.  bestglm() returns a
  # list whose 'BestModel' element is the glm fit over the selected predictors
  bestsubsetsFinal <- bestglm::bestglm(dm, weights = weight, family = binomial, ...)$BestModel

  # Attach the overall optimal threshold (the median across the CV replicates)
  bestsubsetsFinal$tau <- median(parmEstimates$tau)

  # Attach the per-replicate optimal taus (used by the print and plot methods)
  bestsubsetsFinal$optimalTaus <- parmEstimates

  # Add in the class names of the response (truth label)
  bestsubsetsFinal$classnames <- levels(truthLabels)

  # Assign the class
  class(bestsubsetsFinal) <- c("LRCbestsubsets", class(bestsubsetsFinal))

  return(bestsubsetsFinal)

} # LRCbestsubsets

##' @method print LRCbestsubsets
##'
##' @describeIn LRCbestsubsets Displays the overall optimized value of
##' \eqn{\tau} and prints (using \code{print.glm}) the best logistic regression model.
##' Invisibly returns the threshold estimate for each cross validation replicate as a
##' data frame.
##'
##' @param LRCbestsubsets_object An object of class \code{LRCbestsubsets},
##' returned by \code{LRCbestsubsets},
##' which contains the best subsets logistic regression model and the optimal threshold,
##' \eqn{\tau}.
##'
##' @export

print.LRCbestsubsets <- function(LRCbestsubsets_object, ...) {

  # Print the optimal parms
  cat("The optimal threshold (tau) for the best subsets logistic regression fit: \n")
  tau <- LRCbestsubsets_object$tau
  Smisc::pvar(tau)

  # Print the underlying glm object by dispatching on its native class
  # (avoids calling the unexported stats:::print.glm, which CRAN disallows)
  out <- LRCbestsubsets_object
  class(out) <- setdiff(class(LRCbestsubsets_object), "LRCbestsubsets")
  print(out)

  # Invisibly return the matrix of optimal tau values
  invisible(LRCbestsubsets_object$optimalTaus)

} # print.LRCbestsubsets


##' @method summary LRCbestsubsets
##'
##' @describeIn LRCbestsubsets Displays the overall optimized value of
##' \eqn{\tau} and the summary (using \code{summary.glm}) of the best logistic regression
##' model.
##'
##' @export

summary.LRCbestsubsets <- function(LRCbestsubsets_object, ...) {

  # Print the optimal parms
  cat("The optimal threshold (tau) for the best subsets logistic regression fit: \n")
  tau <- LRCbestsubsets_object$tau
  Smisc::pvar(tau)

  # Summarize the underlying glm object by dispatching on its native class
  # (avoids calling the unexported stats:::summary.glm)
  out <- LRCbestsubsets_object
  class(out) <- setdiff(class(LRCbestsubsets_object), "LRCbestsubsets")
  return(summary(out))

} # summary.LRCbestsubsets


##' @method plot LRCbestsubsets
##'
##' @describeIn LRCbestsubsets Produces a scatter plot of \eqn{\tau} vs. the expected
##' loss for each of the Monte Carlo replicates of the
##' cross-validation partitioning.  This can provide a sense of how stable the estimates
##' are across the cross-validation replicates. Also plots the underlying \code{glm}
##' object using its native \code{plot} method.
##'
##' @param \dots For \code{plot}, additional arguments passed to \code{\link{symbols}}.
##' @export

plot.LRCbestsubsets <- function(LRCbestsubsets_object, ...){

  # Prepare for the symbols plot
  d <- LRCbestsubsets_object$optimalTaus
  da <- aggregate(d, by = list(d$tau, d$ExpectedLoss), length)[,1:3]
  colnames(da) <- c("tau", "expectedLoss", "N")

  # Plot the taus and expected loss
  symbols(da$tau, da$expectedLoss, circles = da$N / nrow(d),
          xlab = expression(tau), ylab = "Expected Loss", ...)

  # Plot the glm object
  out <- LRCbestsubsets_object
  class(out) <- setdiff(class(LRCbestsubsets_object), "LRCbestsubsets")
  plot(out)

  invisible(NULL)

} # plot.LRCbestsubsets

##' @method coef LRCbestsubsets
##'
##' @describeIn LRCbestsubsets Calls the \code{\link[stats]{coef}} method
##' on the underlying best subsets \code{glm} fit and returns the logistic
##' regression coefficients.
##'
##' @export

coef.LRCbestsubsets <- function(LRCbestsubsets_object, ...) {

  out <- LRCbestsubsets_object
  class(out) <- setdiff(class(LRCbestsubsets_object), "LRCbestsubsets")
  return(coef(out))

} # coef.LRCbestsubsets
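

## For reference only: a minimal, hypothetical sketch of how prediction with the
## fitted classifier could work (this is NOT the package's predict method), using
## the 'tau' and 'classnames' elements attached by LRCbestsubsets()

predictClassSketch <- function(object, newdata) {

  # Dispatch on the underlying glm class to get the probability predictions
  out <- object
  class(out) <- setdiff(class(object), "LRCbestsubsets")
  p <- predict(out, newdata = newdata, type = "response")

  # Classify as the second level whenever the probability exceeds tau
  factor(object$classnames[(p > object$tau) + 1], levels = object$classnames)

} # predictClassSketch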