R/NesPrInDT.R
In PrInDT: Prediction and Interpretation in Decision Trees for Classification and Regression

Documented in NesPrInDT

#' Nested \code{\link{PrInDT}} with additional undersampling of a factor with two unbalanced levels 
#'
#' @description Function for additional undersampling of the factor 'nesvar' with two unbalanced levels to avoid dominance of the level with higher frequency.
#' The factor 'nesvar' is allowed not be part of the input data frame 'datain'. The data of this factor is given in the vector 'nesunder'. 
#' The observations in 'nesunder' have to represent the same cases as in 'datain' in the same ordering.\cr
#' \code{\link{PrInDT}} is called 'repin' times with subsamples of the original data so that the level with the larger frequency in the vector 'nesunder' has 
#' approximately the same number of values as the level with the smaller frequency.\cr
#' Only the arguments 'nesvar', 'nesunder', and 'repin' relate to the additional undersampling, all the other arguments relate to the standard 
#' \code{\link{PrInDT}} procedure. \cr As in \code{\link{PrInDT}}, the aim is to optimally model the relationship between the two-class factor variable 'classname' and all other factor and  
#' numerical variables in the data frame 'datain' by means of 'N' repetitions of undersampling. The trees generated by \code{\link{PrInDT}} can be
#' restricted by excluding unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.\cr
#' The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).\cr
#' Undersampling may be stratified in two ways by the feature 'strat'.\cr
#' The results are evaluated on the full sample and on the subsamples of 'nesunder'.
#' The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.\cr
#'
#' \strong{Reference} \cr Weihs, C., Buschfeld, S. 2021b. NesPrInDT: Nested undersampling in PrInDT. 
#'	arXiv:2103.14931
#'
#' @usage NesPrInDT(datain,classname,ctestv=NA,N,plarge,psmall=1.0,conf.level=0.95,
#'        thres=0.5,stratvers=0,strat=NA,seedl=TRUE,nesvar,nesunder,repin,
#'        minsplit=NA,minbucket=NA)
#'
#' @param datain Input data frame with class factor variable 'classname' and the\cr
#'    influential variables, which need to be factors or numericals (transform logicals and character variables to factors) 
#' @param classname Name of class variable (character)
#' @param ctestv Vector of character strings of forbidden split results;\cr
#'     (see function \code{\link{PrInDT}} for details.)\cr
#'     If no restrictions exist, the default = NA is used.
#' @param N Number of repetitions (integer > 0)
#' @param plarge Undersampling percentage of larger class (numerical, > 0 and <= 1)
#' @param psmall Undersampling percentage of smaller class (numerical, > 0 and <= 1);\cr
#'     default = 1
#' @param conf.level (1 - significance level) in function \code{ctree} (numerical, > 0 and <= 1);\cr
#'     default = 0.95
#' @param thres Probability threshold for prediction of smaller class; default = 0.5
#' @param stratvers Version of stratification;\cr
#'     = 0: none (default),\cr
#'     = 1: stratification according to the percentages of the values of the factor variable 'strat',\cr
#'     > 1: stratification with minimum number 'stratvers' of observations per value of 'strat'
#' @param strat Name of one (!) stratification variable for undersampling (character);\cr
#'     default = NA (no stratification)
#' @param seedl Should the seed for random numbers be set (TRUE / FALSE)?\cr
#'     default = TRUE
#' @param nesvar Name of factor to be undersampled (character)
#' @param nesunder Data of factor to be undersampled (integer)
#' @param repin Number of repetitions (integer) for undersampling of 'nesvar'
#' @param minsplit Minimum number of elements in a node to be splitted;\cr
#'     default = 20
#' @param minbucket Minimum number of elements in a node;\cr
#'     default = 7
#'
#' @return
#' \describe{  
#' \item{undba}{balanced accuracies on undersamples}
#' \item{imax}{indices of best trees on undersamples}
#' \item{undba3en}{balanced accuracies of ensembles of 3 best trees on undersamples}
#' \item{accF}{balanced accuracies on full sample}
#' \item{accE}{balanced accuracy on full sample of best ensemble of 3 trees from undersampling}
#' \item{maxt}{indices of best trees on full sample}
#' \item{treesb}{3 best trees of all undersamples of 'nesunder'; refer to an individual tree as \code{treesb[[k]]}, k = 1, ..., 3*repin}
#' }
#' 
#' @details
#' Standard output can be produced by means of \code{print(name)} or just \code{ name } as well as \code{plot(name)}  where 'name' is the output data 
#' frame of the function.\cr
#' The plot function will produce a series of more than one plot. If you use R, you might want to specify \code{windows(record=TRUE)} before 
#' \code{plot(name)} to save the whole series of plots. In R-Studio this functionality is provided automatically.
#'
#' @export NesPrInDT
#' @import MASS
#' @import party
#'
#' @examples
#' # data input and preparation --> data frame with 
#' #   class variable, factors, and numericals (no character variables)!!
#' data <- PrInDT::data_speaker
#' data <- na.omit(data)
#' nesvar <- "SPEAKER"
#' N <- 49  # no. of repetitions in inner loop
#' plarge <- 0.06 # sampling percentage for larger class in nesunder-subsample
#' psmall <- 1 # sampling percentage for smaller class in nesunder-subsample
#' nesunder <- data$SPEAKER
#' data[,nesvar] <- list(NULL)
#' outNes <- NesPrInDT(data,"class",ctestv=NA,N,plarge,psmall,conf.level=0.95,nesvar=nesvar,
#'   nesunder=nesunder,repin=5)
#' outNes
#' plot(outNes)
#' hist(outNes$undba,main=" ",xlab = "balanced accuracies of 3 best trees of all undersamples")
#'
NesPrInDT <- function(datain,classname,ctestv=NA,N,plarge=NA,psmall=1.0,conf.level=0.95,thres=0.5,stratvers=0,strat=NA,seedl=TRUE,nesvar,nesunder,repin,
                                     minsplit=NA,minbucket=NA){
  ## input check
  if (typeof(datain) != "list" || typeof(classname) != "character" || !(typeof(ctestv) %in% c("logical", "character")) || N <= 0 ||
      !((0 < plarge & plarge <= 1) | typeof(plarge)=="logical") || !(0 < psmall & psmall <= 1) || !(0 < conf.level & conf.level <= 1) || !(0 <= thres & thres <= 1) ||
      !(0 <= stratvers) || !(typeof(strat) %in% c("logical", "character")) || typeof(seedl) != "logical" || typeof(nesvar) != "character" || 
      typeof(nesunder) != "integer" || typeof(repin) != "double" || !(typeof(minsplit) %in% c("logical","double")) || 
      !(typeof(minbucket) %in% c("logical", "double"))  ) {
    stop("irregular input")
  }
  if ((is.na(minsplit) == TRUE) & (is.na(minbucket) == TRUE)){
    minsplit <- 20
    minbucket <- 7
  }
  if (!(is.na(minsplit) == TRUE) & (is.na(minbucket) == TRUE)){
    minbucket <- minsplit / 3
  }
  if ((is.na(minsplit) == TRUE) & !(is.na(minbucket) == TRUE)){
    minsplit <- minbucket * 3
  }
  ##
  if (seedl == TRUE){
    set.seed(87654321)  # seed of random numbers
  }
  ###
  data <- datain
  names(data)[names(data)==classname] <- "class"
  n_class1 <- table(data$class)[1] # no. of elements of class 1
  n_class2 <- table(data$class)[2] # no. of elements of class 2
  if (min(n_class1,n_class2) == 0){
   stop("irregular input: only 1 class")
  }
  if (n_class1 < n_class2){
    # relevel of classes if smaller class first
    data$class <- stats::relevel(data$class, levels(data$class)[2]) # larger class now first
    n_class1 <- table(data$class)[1] # no. of elements of larger class 1
    n_class2 <- table(data$class)[2] # no. of elements of smaller class 2
  }
  n <- n_class1 + n_class2 # overall no. of observations in data
  ##
  n_classind1 <- table(nesunder)[1] # no. of elements of class 1 in nesunder
  n_classind2 <- table(nesunder)[2] # no. of elements of class 2 in nesunder
  if (n_classind1 < n_classind2){
    # relevel of classes if smaller class first
    nesunder <- stats::relevel(nesunder, levels(nesunder)[2]) # larger class now first
    n_classind1 <- table(nesunder)[1] # no. of elements of larger class 1 in nesunder
    n_classind2 <- table(nesunder)[2] # no. of elements of smaller class 2 in nesunder
  }
  n_ind <- n_classind1 + n_classind2 # no. of elements in nesunder
  if (n_ind != n){
    stop("Independent and dependent variables do not have the same length")
  }
  ### initialisation
  undba <- matrix(0,ncol=3,nrow=repin)
  undba3en <- rep(0,repin)
  index <- matrix(0,ncol=table(nesunder)[2],nrow=repin)
  ##
  # Start of predictor resampling loop
  for (j in 1:repin){
    message("repetition:", j)
    order_class2 <- order(as.numeric(nesunder),decreasing=TRUE)
    x <- data[order_class2,] # data now reordered: class2 first in nesunder
    index[j,] <- sample.int(n_classind1,n_classind2) + n_classind2 # class2 fully included
    x <- x[index[j,],]
    y <- rbind(data[as.integer(nesunder) == 2,],x) # data is reordered: class2 first (with indund resampling)
    y <- as.data.frame(y)
    ## call of PrInDT (without re-setting seed for random numbers)
    outde <- PrInDT(y,classname,ctestv,N,percl=plarge,percs=psmall,conf.level,seedl=FALSE,minsplit=minsplit,minbucket=minbucket)
    ##
    undba[j,] <- c(outde$ba1st[3],outde$ba2nd[3],outde$ba3rd[3])
    undba3en[j] <- outde$baen[2,3]
    if (j == 1){
      treesb <- c(outde$tree1st,outde$tree2nd,outde$tree3rd)
    } else {
      treesb <- c(treesb,outde$tree1st,outde$tree2nd,outde$tree3rd)
    }
  }
  ###
  ## evaluation
  ###  indices of trees with maximum balanced accuracy on undersamples
  imax <- which(t(undba) == max(undba)) # output
  ## balanced accuracies on full sample
  accF <- rep(0,length(treesb))
  for (i in 1:length(treesb)){
    ct <- treesb[[i]]
    ctpreds <- as.integer(predict(ct,newdata=data))
    preds1 <- ctpreds[as.integer(data$class) == 1]
    preds2 <- ctpreds[as.integer(data$class) == 2]
    accF1 <- sum (preds1 == 1) / n_class1
    accF2 <- sum (preds2 == 2) / n_class2
    accF[i] <- (accF1 + accF2) / 2
  }
  mbaF <- max(accF) # output
  maxt <- which(accF == mbaF) # output: indices of trees with maximum balanced accuracy on full sample
  ##
  ### ensemble of the 3 best trees on undersamples
  modus <- function(x){which.max(tabulate(x))} # definition of Mode function
  three <- sort(undba,decreasing=TRUE,index.return=TRUE)$ix[1:3]
  treesE <- treesb[three]
  preds_orderedE <- matrix(0,nrow=dim(data)[1],ncol=length(treesE))
  for (i in 1:length(treesE)){
    preds_orderedE[,i] <- predict(treesE[[i]],newdata=data)
  }
  classE <- as.integer(data$class)
  n_class1 <- table(data$class)[1] # no. of elements of larger class 1
  n_class2 <- table(data$class)[2] # no. of elements of smaller class 2
  preds1 <- preds_orderedE[classE == 1,]
  preds2 <- preds_orderedE[classE == 2,]
  accE1 <- sum (apply(preds1,1,modus) == 1) / n_class1
  accE2 <- sum (apply(preds2,1,modus) == 2) / n_class2
  accE <- (accE1 + accE2) / 2
### results
  message("\n")
  result <- list( undba = undba, imax = imax, undba3en = undba3en, accF = accF, accE = accE, maxt = maxt, treesb = treesb) 
  class(result) <- "NesPrInDT"
  result
}
#' @export
print.NesPrInDT <- function(x, ...){
cat("\n","Evaluation of nested PrInDT","\n\n","************************","\n")
  ## balanced accuracies on undersamples of nesunder
  cat(" results for undersamples","\n","************************","\n")
  mba <- max(x$undba) # output
  cat("best balanced accuracy of undersampled trees","\n")
  cat(unname(mba))
  cat("\n","table of balanced accuracies (with no. of occurences) of 3 best trees on all undersamples","\n")
  print(table(unname(x$undba)))
  cat("\n","Best trees on undersamples","\n")
  for (k in 1:length(x$imax)){
    print(x$treesb[[x$imax[k]]])
  }
  ## balanced accuracies on full sample
  cat("\n","***************************","\n")
  cat(" results for the full sample","\n","***************************","\n")
  cat("balanced accuracy of best undersampled tree on full sample","\n")
  cat(unname(x$accF[round(x$imax[1],digits=6)]))
  cat("\n","table of balanced accuracies (with no. of occurences) of 3 best trees of all replications on full sample")
  print(table(unname(round(x$accF,digits=6))))
  cat("\n","Best trees on full sample","\n")
  for (k in 1:length(x$maxt)){
    print(x$treesb[[x$maxt[k]]])
  }
  ##
  cat("\n","*************************************","\n","results for ensembles of 3 best trees",
      "\n","*************************************","\n")
  ### ensemble of the 3 best trees on undersamples
  cat("balanced accuracy on full sample of ensemble of best 3 trees from undersampling","\n")
  cat(unname(x$accE))
  ### best ensemble of 3 trees from undersampling
  cat("\n")
  cat("best balanced accuracy on undersample of ensemble of best 3 trees from undersampling","\n")
  cat(unname(max(x$undba3en)),"\n")
}
#' @export
plot.NesPrInDT <- function(x, ...){
## Best trees on undersamples
if (length(x$imax) > 1){
  for (k in 1:length(x$imax)){
    plot(x$treesb[[x$imax[k]]],main=paste0("One of the ",length(x$imax)," best trees on undersamples"))
  }
}
else {
  plot(x$treesb[[x$imax[1]]],main="The best tree on undersamples")
}  
## Best trees on full sample
if (length(x$maxt) > 1){
  for (k in 1:length(x$maxt)){
    plot(x$treesb[[x$maxt[k]]],main=paste0("One of the ",length(x$maxt)," best trees on full sample"))
  }
}
else {
  plot(x$treesb[[x$maxt[1]]],main="The best tree on full sample")
} 
}

Any scripts or data that you put into this service are public.

PrInDT documentation built on Sept. 11, 2025, 5:11 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

PrInDT
Prediction and Interpretation in Decision Trees for Classification and Regression

R/NesPrInDT.R
In PrInDT: Prediction and Interpretation in Decision Trees for Classification and Regression

Defines functions plot.NesPrInDT print.NesPrInDT NesPrInDT

Documented in NesPrInDT

Try the PrInDT package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

PrInDT Prediction and Interpretation in Decision Trees for Classification and Regression

R/NesPrInDT.R In PrInDT: Prediction and Interpretation in Decision Trees for Classification and Regression

Defines functions plot.NesPrInDT print.NesPrInDT NesPrInDT

Documented in NesPrInDT

Try the PrInDT package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

PrInDT
Prediction and Interpretation in Decision Trees for Classification and Regression

R/NesPrInDT.R
In PrInDT: Prediction and Interpretation in Decision Trees for Classification and Regression