trimTrees: Trimmed Opinion Pools of Trees in Random Forest


This function creates point and probability forecasts from the trees in a random forest using the trimmed opinion pool of Jose et al. (2014), a trimmed average of the trees' empirical cumulative distribution functions (cdfs). For tuning purposes, the user can set the trimming level used in this trimmed average and then compare the scores of the trimmed and untrimmed opinion pools, or ensembles.
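As a rough illustration of what "an average of the trees' empirical cdfs" means, the base-R sketch below builds the untrimmed (linear) opinion pool for a single test row. The leaf values in `tree_values` are made up for illustration, and the code is not taken from the package; it only mirrors the idea described above.

```r
# Hypothetical leaf values from three trees for one test row (made-up data)
tree_values <- list(c(1, 2, 2), c(2, 4), c(1, 4, 4, 7))

# Shared support: the sorted unique values across all trees
support <- sort(unique(unlist(tree_values)))

# One empirical cdf per tree, evaluated on the shared support
# (rows = support points, columns = trees)
tree_cdfs <- sapply(tree_values, function(v) ecdf(v)(support))

# Linear (untrimmed) opinion pool: the plain average across trees
untrimmed_cdf <- rowMeans(tree_cdfs)

print(cbind(support, untrimmed_cdf))
```

A trimmed pool replaces the plain average in the last step with a trimmed average, controlled by the `trim` argument.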

trimTrees(xtrain, ytrain, xtest, ytest = NULL, ntree = 500, 
          mtry = if (!is.null(ytrain) && !is.factor(ytrain)) 
          max(floor(ncol(xtrain)/3), 1) else floor(sqrt(ncol(xtrain))), 
          nodesize = if (!is.null(ytrain) && !is.factor(ytrain)) 5 else 1, 
          trim = 0, trimIsExterior = TRUE, 
          uQuantiles = seq(0.05, 0.95, 0.05), methodIsCDF = TRUE)

`xtrain`	A data frame or a matrix of predictors for the training set.
`ytrain`	A response vector for the training set. If a factor, classification is assumed, otherwise regression is assumed.
`xtest`	A data frame or a matrix of predictors for the testing set.
`ytest`	A response vector for the testing set. If no testing set response is passed, probability integral transform (PIT) values and scores will be returned as `NA`s.
`ntree`	Number of trees to grow.
`mtry`	Number of variables randomly sampled as candidates at each split.
`nodesize`	Minimum size of terminal nodes.
`trim`	The trimming level used in the trimmed average of the trees' empirical cdfs. For the cdf approach, the trimming level is the fraction of cdf values to be trimmed from each end of the ordered vector of cdf values (for each support point) before the average is computed. For the moment approach, the trees' means are computed, ordered, and trimmed. The trimmed opinion pool using the moment approach is an average of the remaining trees.
`trimIsExterior`	If `TRUE`, the trimming is done exteriorly, or from the ends of the ordered vector. If `FALSE`, the trimming is done interiorly, or from the middle of the ordered vector.
`uQuantiles`	A vector of probabilities in strictly increasing order, each between 0 and 1. For instance, if `uQuantiles=c(0.25,0.75)`, then the 0.25-quantile and the 0.75-quantile of the trimmed and untrimmed ensembles are scored.
`methodIsCDF`	If `TRUE`, the method for forming the trimmed opinion pool is according to the cdf approach in Jose et al. (2014). If `FALSE`, the moment approach is used.
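The interplay of `trim`, `trimIsExterior`, and `methodIsCDF` can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper names (`kept_positions`, `trim_pool_cdf`, `trim_pool_moment`) are hypothetical, the number of trimmed values per end is assumed to be `floor(trim * ntree)`, and interior trimming is assumed to drop that many values twice over from the middle of the ordered vector.

```r
# Positions in an ordered vector of length n that survive trimming.
# Assumption: k = floor(trim * n) values are trimmed per end (exterior),
# or 2k values are removed from the middle (interior).
kept_positions <- function(n, trim = 0, exterior = TRUE) {
  k <- floor(trim * n)
  if (k == 0) return(seq_len(n))
  if (2 * k >= n) stop("trim removes every value")
  if (exterior) {
    (k + 1):(n - k)
  } else {
    lo <- floor((n - 2 * k) / 2)
    setdiff(seq_len(n), (lo + 1):(lo + 2 * k))
  }
}

# cdf approach: at each support point, sort the trees' cdf values,
# trim, and average what remains.
trim_pool_cdf <- function(tree_cdfs, trim = 0, exterior = TRUE) {
  keep <- kept_positions(ncol(tree_cdfs), trim, exterior)
  apply(tree_cdfs, 1, function(v) mean(sort(v)[keep]))
}

# Moment approach: order the trees by their means, trim whole trees,
# and average the cdfs of the remaining trees.
trim_pool_moment <- function(tree_cdfs, tree_means, trim = 0, exterior = TRUE) {
  keep <- kept_positions(length(tree_means), trim, exterior)
  rowMeans(tree_cdfs[, order(tree_means)[keep], drop = FALSE])
}

# Toy input: 3 support points (rows) by 5 trees (columns), made-up values
tree_cdfs <- rbind(c(0.1, 0.2, 0.3, 0.4, 1.0),
                   c(0.5, 0.6, 0.7, 0.8, 1.0),
                   c(1.0, 1.0, 1.0, 1.0, 1.0))
tree_means <- c(5, 4, 3, 2, 1)

exterior_cdf <- trim_pool_cdf(tree_cdfs, trim = 0.2, exterior = TRUE)
interior_cdf <- trim_pool_cdf(tree_cdfs, trim = 0.2, exterior = FALSE)
moment_cdf   <- trim_pool_moment(tree_cdfs, tree_means, trim = 0.2)
```

In this toy input, exterior trimming discards the one tree that puts nearly all of its mass on the lowest support point, while interior trimming keeps it and discards two middle trees instead.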

An object of class trimTrees, which is a list with the following components:

`forestSupport`	Possible points of support for the trees and ensembles.
`treeValues`	For the last testing set row, each tree's `ytrain` values that are both inbag and in the same terminal node as that `xtest` row. These `ytrain` values are not necessarily unique. This component is an `ntrain`-by-`ntree` matrix where `ntrain` is the number of rows in the training set.
`treeCounts`	For the last testing set row, each tree's counts of `treeValues`, listed by unique value. This component is an `nSupport`-by-`ntree` matrix, where `nSupport` is the number of unique `ytrain` values, or support points of the forest.
`treeCumCounts`	Cumulative tally of `treeCounts` of dimension `nSupport+1`-by-`ntree`.
`treeCDFs`	Each tree's empirical cdf based on `treeCumCounts` for the last testing set row only. This component is an `nSupport+1`-by-`ntree` matrix. Note that the first row in this matrix is all zeros.
`treePMFs`	Each tree's empirical probability mass function (pmf) for the last testing set row. This component is an `nSupport`-by-`ntree` matrix.
`treeMeans`	For each testing set row, each tree's mean according to its empirical pmf. This component is an `ntest`-by-`ntree` matrix where `ntest` is the number of rows in the testing set.
`treeVars`	For each testing set row, each tree's variance according to its empirical pmf. This component is an `ntest`-by-`ntree` matrix.
`treePITs`	For each testing set row, each tree's probability integral transform (PIT), the empirical cdf evaluated at the realized `ytest` value. This component is an `ntest`-by-`ntree` matrix. If `ytest` is `NULL`, `NA`s are returned.
`treeQuantiles`	For the last testing set row, each tree's quantiles – one for each element in `uQuantiles`. This component is an `ntree`-by-`nQuantile` matrix where `nQuantile` is the number of elements in `uQuantiles`.
`treeFirstPMFValues`	For each testing set row, this component outputs the pmf value on the minimum (or first) support point in the forest. For binary classification, this corresponds to the probability that the minimum (or first) support point will occur. This component's dimension is `ntest`-by-`ntree`. It is useful for generating calibration curves (stated probabilities in bins vs. their observed frequencies) for binary classification.
`bracketingRate`	For each testing set row, the bracketing rate from Larrick et al. (2012) is computed as `2p(1-p)` where `p` is the fraction of trees' means above the `ytest` value. If `ytest` is `NULL`, `NA`s are returned.
`bracketingRateAllPairs`	The average bracketing rate across all testing set rows for each pair of trees. This component is a symmetric `ntree`-by-`ntree` matrix. If `ytest` is `NULL`, `NA`s are returned.
`trimmedEnsembleCDFs`	For each testing set row, the trimmed ensemble's forecast of `ytest` in the form of a cdf. This component is an `ntest`-by-`nSupport + 1` matrix. `nSupport` is the number of unique `ytrain` values, or support points of the forest.
`trimmedEnsemblePMFs`	For each testing set row, the trimmed ensemble's pmf. This component is an `ntest`-by-`nSupport` matrix.
`trimmedEnsembleMeans`	For each testing set row, the trimmed ensemble's mean. This component is a vector of length `ntest`.
`trimmedEnsembleVars`	For each testing set row, the trimmed ensemble's variance.
`trimmedEnsemblePITs`	For each testing set row, the trimmed ensemble's probability integral transform (PIT), the empirical cdf evaluated at the realized `ytest` value. If `ytest` is `NULL`, `NA`s are returned.
`trimmedEnsembleQuantiles`	For the last testing set row, the trimmed ensemble's quantiles – one for each element in `uQuantiles`.
`trimmedEnsembleComponentScores`	For the last testing set row, the components of the trimmed ensemble's linear and log quantile scores. If `ytest` is `NULL`, `NA`s are returned.
`trimmedEnsembleScores`	For each testing set row, the trimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. See Jose and Winkler (2009) for a description of the linear and log quantile scores. See Gneiting and Raftery (2007) for a description of the ranked probability score. The two-moment score is the score in Equation 27 of Gneiting and Raftery (2007). If `ytest` is `NULL`, `NA`s are returned.
`untrimmedEnsembleCDFs`	For each testing set row, the linear opinion pool's, or untrimmed ensemble's, forecast of `ytest` in the form of a cdf.
`untrimmedEnsemblePMFs`	For each testing set row, the untrimmed ensemble's pmf.
`untrimmedEnsembleMeans`	For each testing set row, the untrimmed ensemble's mean.
`untrimmedEnsembleVars`	For each testing set row, the untrimmed ensemble's variance.
`untrimmedEnsemblePITs`	For each testing set row, the untrimmed ensemble's probability integral transform (PIT), the empirical cdf evaluated at the realized `ytest` value. If `ytest` is `NULL`, `NA`s are returned.
`untrimmedEnsembleQuantiles`	For the last testing set row, the untrimmed ensemble's quantiles – one for each element in `uQuantiles`.
`untrimmedEnsembleComponentScores`	For the last testing set row, the components of the untrimmed ensemble's linear and log quantile scores. If `ytest` is `NULL`, `NA`s are returned.
`untrimmedEnsembleScores`	For each testing set row, the untrimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. If `ytest` is `NULL`, `NA`s are returned.
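Several of the components above are simple functions of an empirical cdf or of the trees' means. The base-R sketch below shows one way to compute a PIT value, a quantile, and the bracketing rate `2p(1-p)` for a single test row. The helper names and the toy numbers are hypothetical, and the quantile convention (smallest support point whose cdf reaches `u`) is an assumption rather than a statement about the package's internals.

```r
# Toy empirical pmf and cdf on a small support (made-up values)
support <- c(1, 2, 4, 7)
pmf <- c(0.1, 0.4, 0.3, 0.2)
cdf <- cumsum(pmf)

# PIT: the empirical cdf evaluated at the realized value y
pit_value <- function(support, cdf, y) {
  idx <- which(support <= y)
  if (length(idx) == 0) 0 else cdf[max(idx)]
}

# Quantile (assumed convention): smallest support point whose cdf reaches u;
# the small tolerance guards against floating-point error in cumsum()
quantile_from_cdf <- function(support, cdf, u) {
  support[which(cdf >= u - 1e-12)[1]]
}

# Bracketing rate: 2p(1-p), with p the fraction of tree means above y
bracketing_rate <- function(tree_means, y) {
  p <- mean(tree_means > y)
  2 * p * (1 - p)
}

pit <- pit_value(support, cdf, y = 4)
q50 <- quantile_from_cdf(support, cdf, u = 0.5)
br  <- bracketing_rate(tree_means = c(1, 2, 3, 4), y = 2.5)
```

The bracketing rate reaches its maximum of 0.5 when half of the trees' means fall on each side of the realized value, and is 0 when all of them fall on the same side.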

Yael Grushka-Cockayne, Victor Richmond R. Jose, Kenneth C. Lichtendahl Jr., and Huanghui Zeng.

Gneiting T, Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 359-378.

Jose VRR, Grushka-Cockayne Y, Lichtendahl KC Jr. (2014). Trimmed opinion pools and the crowd's calibration problem. Management Science 60 463-475.

Jose VRR, Winkler RL (2009). Evaluating quantile assessments. Operations Research 57 1287-1297.

Grushka-Cockayne Y, Jose VRR, Lichtendahl KC Jr. (2014). Ensembles of overfit and overconfident forecasts, working paper.

Larrick RP, Mannes AE, Soll JB (2012). The social psychology of the wisdom of crowds. In J.I. Krueger, ed., Frontiers in Social Psychology: Social Judgment and Decision Making. New York: Psychology Press, 227-242.

hitRate, cinbag

# Load the packages (mlbench provides mlbench.friedman1)
library(trimTrees)
library(mlbench)

# Load the data
set.seed(201) # Can be removed; useful for replication
data <- as.data.frame(mlbench.friedman1(500, sd=1))
summary(data)

# Prepare data for trimming
train <- data[1:400, ]
test <- data[401:500, ]
xtrain <- train[,-11]  
ytrain <- train[,11]
xtest <- test[,-11]
ytest <- test[,11]
      
# Option 1. Run trimTrees with responses in testing set.
set.seed(201) # Can be removed; useful for replication
tt1 <- trimTrees(xtrain, ytrain, xtest, ytest, trim=0.15)

# Some outputs from trimTrees: scores, hit rates, PIT densities.
colMeans(tt1$trimmedEnsembleScores)
colMeans(tt1$untrimmedEnsembleScores)
mean(hitRate(tt1$treePITs))
hitRate(tt1$trimmedEnsemblePITs)
hitRate(tt1$untrimmedEnsemblePITs)
hist(tt1$trimmedEnsemblePITs, prob=TRUE)
hist(tt1$untrimmedEnsemblePITs, prob=TRUE)

# Option 2. Run trimTrees without responses in testing set. 
# In this case, scores, PITs, or hit rates will not be available.
set.seed(201) # Can be removed; useful for replication
tt2 <- trimTrees(xtrain, ytrain, xtest, trim=0.15)

# Some outputs from trimTrees: cdfs for last test value.
plot(tt2$trimmedEnsembleCDFs[100,],type="l",col="red",ylab="cdf",xlab="y") 
lines(tt2$untrimmedEnsembleCDFs[100,])
legend(275,0.2,c("trimmed", "untrimmed"),col=c("red","black"),lty = c(1, 1))
title("CDFs of Trimmed and Untrimmed Ensembles")

# Compare the CDF and moment approaches to trimming the trees.
ttCDF <- trimTrees(xtrain, ytrain, xtest, trim=0.15, methodIsCDF=TRUE)
ttMA <- trimTrees(xtrain, ytrain, xtest, trim=0.15, methodIsCDF=FALSE)
plot(ttCDF$trimmedEnsembleCDFs[100,], type="l", col="red", ylab="cdf", xlab="y") 
lines(ttMA$trimmedEnsembleCDFs[100,])
legend(275,0.2,c("CDF Approach", "Moment Approach"), col=c("red","black"),lty = c(1, 1))
title("CDFs of Trimmed Ensembles")