R/predBMA.R

Defines functions predBMA

Documented in predBMA

#' Predictive accuracy measurement using Bayesian Model Averaging
#'
#' @description This function is used for predictive accuracy measurement of
#' the selected models using Bayesian Model Averaging methodology. The Occam's
#' window with cut out threshold of \code{thr} is used. That means only models
#' that have posterior probability of at least \code{thr} * posteior
#' probability of the highest posterior probability model are considered in
#' model averaging. For survival time response datasets, the predictive Area
#' Under Curve (AUC) at each given time point is computed as the output. In
#' this case, the predictive AUC is obtained using Uno's method for
#' observations in the test set. For binary outcome data, only one AUC is
#' reported which is from the ROC computed on the test set. The training set is
#' used to find the selected model and relevant probabilities.
#' 
#' @param bvsobj An object that is generated by \code{\link{bvs}} function. It
#' is the output of the Bayesian variable selection procedure.
#' @param X The same \code{n} times \code{p} data frame initially used in
#' \code{bvs} function. This is the full data frame containing all test and
#' train data.
#' @param prep A boolean variable determining if the preprocessing step has
#' been done on the original data frame in \code{bvs} function. This should
#' have the same value as \code{prep} variable in the \code{bvs} function.
#' @param logT A boolean variable determining if log transform should has been
#' done on continuous columns of the data frame in \code{bvs} function. This
#' should have the same value as \code{logT} variable in the \code{bvs}
#' function.
#' @param resp For logistic regression models, this variable is the binary
#' response vector. For the Cox proportional hazard models this is a two column
#' matrix where the first column contains survival times and the second column
#' is the censoring status for each observation. Note that for survival times,
#' the time section of this variable should be in the same scale and unit
#' (year, days, etc.) as \code{times} variable for which the AUC has to be
#' computed.
#' @param nlptype Determines the type of nonlocal prior that is used in the
#' analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for
#' product moment prior. The default is set to piMOM prior.
#' @param train_idx An integer vector containing the indices of the training
#' set.
#' @param test_idx An integer vector containing the indices of the test set.
#' The set of observations that prediction will be performed on.
#' @param thr The threshold used for Occam's window as explained in the
#' description. The default value for this variable is 0.05.
#' @param times A vector of times at which predictive AUC is to be computed.
#' This input is only used for prediction in survival data analysis.
#' @param family Determines the type of data analysis. \code{logistic} is for
#' binary outcome and logistic regression model whereas, \code{survival}
#' represents survival outcomes and the Cox proportional hazard model.
#' @return The output is different based on the family for the anlysis of data
#' \strong{1) } \code{family = logistic}
#' The output is a list with the two following objects:
#' \item{auc}{This is the area under the ROC curve after Bayesian model
#' averaging is used to obtain ROC for the test data.}
#' \item{roc_curve}{This is a two column matrix representing points on the ROC
#' curve and can be used to plot the curve. The first column is FPR and the
#' second column is TPR which represent x-axis and y-axis in the ROC curve,
#' respectively.}
#' \strong{2) } \code{family = survival}
#' \item{auc}{A vector with the same length as \code{times} variable showing
#' predictive area under the curve at each given time point using Bayesian
#' Model averaging.}
#' @author Amir Nikooienejad
#' @references Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian
#' model averaging for linear regression models. Journal of the American
#' Statistical Association, 92(437), 179-191.\cr\cr
#' Nikooienejad, A., Wang, W., & Johnson, V. E. (2020). Bayesian variable
#' selection for survival data using inverse moment priors. Annals of Applied
#' Statistics, 14(2), 809-828. \cr\cr
#' Uno, H., Cai, T., Tian, L., & Wei, L. J. (2007). Evaluating
#' prediction rules for t-year survivors with censored regression models.
#' Journal of the American Statistical Association, 102(478), 527-537.
#' @examples
#' ### Simulating Logistic Regression Data
#'n <- 200
#'p <- 40
#'set.seed(123)
#'Sigma <- diag(p)
#'full <- matrix(c(rep(0.5, p*p)), ncol=p)
#'Sigma <- full + 0.5*Sigma
#'cholS <- chol(Sigma)
#'Beta <- c(-1.7,1.8,2.5)
#'X <- matrix(rnorm(n*p), ncol=p)
#'X <- X%*%cholS
#'colnames(X) <- c(paste("gene_",c(1:p),sep=""))
#'beta <- numeric(p)
#'beta[c(1:length(Beta))] <- Beta
#'Xout <- PreProcess(X)
#'X <- Xout$X
#'XB <- X%*%beta
#'probs <- as.vector(exp(XB)/(1+exp(XB)))
#'y <- rbinom(n,1,probs)
#'X <- as.data.frame(X)
#'train_idx <- sample(1:n,0.8*n)
#'test_idx <- setdiff(1:n,train_idx)
#'X_train <- X[train_idx,]
#'y_train <- y[train_idx]
#'bout <- bvs(X_train, y_train, prep=FALSE, family = "logistic",
#'            mod_prior = "beta",niter = 50)
#'BMAout <- predBMA(bout, X, y, prep = FALSE, logT = FALSE, 
#'                  train_idx = train_idx, test_idx = test_idx,
#'                  family="logistic")
#' ### AUC for the prediction:
#'BMAout$auc
#'
#'### Plotting ROC Curve
#'roc <- BMAout$roc_curve
#'plot(roc,lwd=2,type='l',col='blue')
predBMA <- function(bvsobj, X, resp, prep, logT, nlptype = "piMOM", train_idx,
                    test_idx, thr = 0.05, times = NULL,
                    family = c("logistic", "survival")){
  
  if(!class(X)=="data.frame") stop("input X should be a data frame!") 
  X <- predmat(X)

  X_tr <- bvsobj$des_mat
  gn <- bvsobj$gene_names
  xname <- colnames(X)
  ind <- match(gn,xname)
  X <- X[,ind]
  X_te <- X[test_idx,]
  
  if(prep){
  Xout <- PreProcess(X_te, logT)
  X_te <- Xout$X
  }

  r <- bvsobj$r
  tau <- bvsobj$tau
  probs <- bvsobj$max_prob_vec
  models <- bvsobj$max_models
  maxprob <- bvsobj$max_prob
  thresh <- maxprob + log(thr)
  passinds <- which(probs >= thresh)
  
  k <- length(passinds)
  oprobs <- probs[passinds]; sh <- ceiling(max(oprobs)); oprobs <- exp(oprobs - sh);
  oprobs <- oprobs/sum(oprobs)
  omodels <- models[passinds]
  if(nlptype=="piMOM") nlptype_int <- 0
  if(nlptype=="pMOM") nlptype_int <- 1  
  # =======================================
  
  if(family=="logistic"){
    y <- resp
    y_tr <- y[train_idx]
    y_te <- y[test_idx]
    X_tr <- cbind(rep(1,length(y_tr)),X_tr); X_te <- cbind(rep(1,length(y_te)),X_te);
    
    aucout <- aucBMA_logistic(X_tr,y_tr,X_te,y_te,tau,r,nlptype_int,oprobs,omodels,k)
    roc <- aucout$roc; colnames(roc) <- c("TPR","FPR")
    return(list(auc = aucout$auc, roc_curve = roc))
  }
  
  # ====================
  
  if(family=="survival"){
    TS <- resp;
    TS_tr <- TS[train_idx,]
    TS_te <- TS[test_idx,]
    if (!length(times)) stop("No times vector is set!")
    aucout <- aucBMA_survival(X_tr,TS_tr,X_te,TS_te,tau,r,nlptype_int,times,oprobs,omodels,k)
    return(aucout)
  }
}

Try the BVSNLP package in your browser

Any scripts or data that you put into this service are public.

BVSNLP documentation built on Aug. 28, 2020, 5:08 p.m.