knitr::opts_chunk$set( collapse = TRUE, fig.width = 7 )

This guide is meant for users who are familiar with regression trees, random forests, and the spline projection method but who are unfamiliar with the specifics of using the `splinetree`

package. The guide walks through examples of building and evaluating a spline forest. This guide builds off of the vignettes *Introduction to splinetree* and *Tree Building with splinetree*. The data used for these examples comes from the National Longitudinal Survey of Youth (NLSY), 1979. In the example, the trajectory of interest is body mass index (BMI) across age, and the split variables are baseline variables related to socioeconomic status and family background.

For users familiar with building trees in `splinetree`

(see *Introduction to Tree Building with splinetree*), building a forest is straightforward. The majority of the parameters needed to use the `splineForest()`

function are identical to those used in the `splineTree()`

function. The process used to project the longitudinal data onto smoothed trajectories and then make splits is identical. There are just two additional parameters for the `splineForest`

function.

The `nTree`

parameter specifies the number of trees in the forest. The default value is $50$. Large forests provide additional stability over smaller forests, but on large datasets building a large spline forest may take several minutes. The `prob`

parameter specifies the probability that a variable will be in consideration as a split variable at a given node. To avoid a situation where no variables are considered at a certain node, we recommend that that `prob`

is relatively large when the number of split variables is small. If only 6 split variables are in consideration, then setting `prob=1/3`

leaves around an 8\% chance $\left( (2/3)^6 \right)$ that no variable will be selected for a split at the root node. Increasing `prob`

to `1/2`

reduces this probability to under 2\%. The `bootstrap`

parameter specifies whether the data subsample used for each tree will be a bootstrap sample (drawn with replacement, size equal to original dataset) or a random sample of 63.2% of the original dataset drawn without replacement. The choice of 63.2% was made because this is how much of the data will be included in a bootstrap sample, on average. Following the work of @strobl2007bias, who show that sampling without replacement is preferable to the traditional random forest bootstrap sampling, `bootstrap=FALSE`

is the default setting.

We will build a forest of 50 trees using a probability of 0.5. As split variables, we will include indicators for the subject's race and sex as well as the subject's number of siblings and the highest grade completed (HGC) by the subject's mother and father. We will use a spline basis with 1 degree and 1 internal knot; this choice is based on the knowledge that adult BMI trajectories tend to be steeper in early adulthood before flattening out in late adulthood (@clarke2008social). We will build both a forest with an intercept for the sake of demonstrating a few functions (such as prediction responses) that are not available for forests without an intercept.

library(splinetree) split_formula <- ~HISP+WHITE+BLACK+SEX+Num_sibs+HGC_FATHER+HGC_MOTHER tformula <- BMI~AGE

set.seed(123) forest <- splineForest(split_formula, tformula, idvar="ID", data=nlsySample, degree=1, df=3, intercept=TRUE,ntree=50, prob=0.5, cp=0.005)

This newly built spline forest is a named list with 15 components.

```
names(forest)
```

A few of these 15 components have unsurprising values. For example, `forest$data`

holds a copy of the original dataset, and `forest$formula`

, `forest$idvar`

, `forest$yvar`

, and `forest$tvar`

each hold pieces of information about the call to the `splineTree()`

function.

forest$formula forest$idvar forest$yvar forest$tvar

Other components allow us reconstruct the spline basis that was used to build the tree. These are useful for converting coefficients into smoothed trajectories without using built-in functions. The attribute `forest$flat_data`

contains the flattened dataset used to build the trees, including the individually projected coefficients for each individual. We can access the individually projected coefficients in `forest$flat_data$Ydata`

. We can use the average of all the individual coefficients as well as the information that is stored about the spline basis in `forest$innerKnots`

, `forest$degree`

, `forest$intercept`

, and `forest$boundaryKnots`

to reconstruct the population average trajectory.

mean_coeffs <- apply(forest$flat_data$Ydata, 2, mean) times <- sort(unique(forest$data[[forest$tvar]])) basisMatrix <- bs(times, degree=forest$degree, Boundary.knots = forest$boundaryKnots, knots = forest$innerKnots) if (forest$intercept) { basisMatrix <- cbind(1, basisMatrix) } preds <- basisMatrix %*% mean_coeffs plot(times, preds, type='l', main="Population Average Trajectory")

Apart from all of the information that is stored about the function call and the spline basis, the `splineforest`

model contains information about the forest itself. Most important of all is the component `forest$Trees`

, which stores a list of `rpart`

objects. This ensemble of trees is the fucntional part of the forest model. We can view any individual tree using `stPrint()`

.

stPrint(forest$Trees[[17]])

Although we can print any tree in the forest using `stPrint()`

, it is important to note that these trees are not identical to trees returned by `splineTree()`

. These `rpart`

trees do not store all of the extra information that is stored in a typical `splineTree()`

model. For example, we can note the difference between a single tree and `forest$Trees[[17]]`

.

sample_tree <- splineTree(split_formula, tformula, idvar="ID", data=nlsySample, degree=1, df=2, intercept=TRUE, cp=0.005) ### Try to evaluate both trees yR2(sample_tree) test <- try(yR2(forest$Trees[[17]]), silent=TRUE) class(test) ### Try to access additional information from both trees sample_tree$parms$degree forest$Trees[[17]]$parms$degree

The remaining components of `forest`

include `forest$index`

, `forest$oob_indices`

, and `forest$splits`

. The `index`

component let's us access the data that was used to build each tree. For example, `forest$index[[17]]`

stores the indices of each row in `forest$flat_data`

that was used to build the 17th tree. On the other hand, `forest$oob_indices[[17]]`

stores a list of indices that correpond to rows in `forest$flat_data`

that were NOT used to build tree 17 (these rows were "out of the bag", or "oob"). Storing both of these pieces of information may seem redundant, but both are accessed frequently in the forest prediction and evaluation functions. If we want to uncover datapoints that are "in the bag" and "out of the bag" for tree 17, we can use:

itb17 <- forest$flat_data[forest$index[[17]],] oob17 <- forest$flat_data[forest$oob_indices[[17]],]

The `in the bag`

dataset is around twice as large as the `out of bag`

dataset since each tree has 63.2% of the data `in the bag`

.

The final component of a forest is accessed in `forest$splits`

. This very long vector stores the string name of every variable selected as a split throughout the entire forest. It might be useful to compare the frequency with which different variables are selected. It is important to note that these frequecies should $not$ be used as a variable importance metric. The barplot below shows that HGC_FATHER, HGC_MOTHER, and Num_Sibs are the most frequently selected varaibles throughout the forest. These three variables are the only numeric variables in the dataset; the rest are binary, meaning that they can never be used consecutively in the same branch of a tree. More appropriate measures of variable importance will be discussed below.

freqs <- table(forest$splits)/sum(table(forest$splits)) par(las = 2) barplot(freqs, cex.names = 0.5)

The bias in splits towards variables with more unique values is exacerbated in forests where each tree is very large; in forests such as these, numeric variables with many unique values will be repeatedly used for successive splits while binary variables are limited to one split per tree branch. It is sometimes useful to check the size of the average tree in the forest and, if necessary, prune each tree in the forest to reduce the average size.

avSize(forest) avSize(pruneForest(forest, cp=0.01))

Making a prediction using a spline forest involves averaging predictions over individual trees in the forest. When making predictions for a datapoint that was not in the training set, the only reasonable option is to use all the trees in the forest to make the prediction. Using this method, either coefficients or response values (assuming `forest$intercept==TRUE`

) can be predicted. In predicting coefficients, no value is required for the "AGE" variable, but in predicting responses ages must be specified.

newData <- data.frame("WHITE" = 0, "BLACK"=1, "HISP"=0, "Num_sibs"=3, "HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1) preds <- predictCoeffsForest(forest, testdata = newData)

AGE <- c(16,18,20,22,25,30,37,39,50) newData2 <- cbind(AGE, newData) predictions <- predictYForest(forest, testdata=newData2) plot(AGE, predictions, type='l')

When predicting coefficients or responses for a datapoint that was in the training set, we have the option to use one of three different methods, specified by the "methods" parameter in `predictYForest`

and `predictCoeffsForest`

. For a given datapoint, we can either average its prediction over all trees in the forest (`method = "all"`

), over only trees in the forest for which this datapoint was not in the random subsample (`method="oob"`

), or over only trees in the forest for which this datapoint was in the random subsample (`method="itb"`

). The `oob`

method is preferred, as it gives a sense of out-of-sample performance and avoids overfitting the training data. We can compare response predictions for the tree methods in terms of how closely they match the actual responses. As expected, the `itb`

predictions match the actual values much more closely.

cor(nlsySample$BMI, predictYForest(forest, method="oob")) cor(nlsySample$BMI, predictYForest(forest, method="all")) cor(nlsySample$BMI, predictYForest(forest, method="itb"))

As with a single tree, we can evaluate a forest with respect to how well it predicts actual responses or with respect to how well it predicts individually-smoothed trajectories. In many cases, it makes more sense to look at how well the forest is predicting actual responses; a forest that is excellent at predicting individually projected trajectories may be mostly useless if the individual trajectories do not approximate the actual responses (due to a poorly chosen basis). However, there are also advantages to evaluating a forest with respect to projected trajectories. The projectedion-based metrics can be used whether or not the forest includes an intercept. Furthermore, they do not give the forest credit for the portion of variation that is explained simply by the population average trend of BMI and AGE; they only give the forest credit for explaining additional variation through covariates. Furthermore, when used on a no-intercept forest or when used with the arguement `removeIntercept=TRUE`

, these metrics tell us how well we explain variation in shape of trajectory, regardless of what is happening with level. When our goal is to explain variation in shape, this metric may be more useful than the prediction metric.

Apart from the response vs. projected response choice, we can also measure forest performance using `out-of-the-bag`

, `in-the-bag`

, or all tree predictions. Out of the bag performance metrics are often preferred because they provide a more realistic assesment of how the tree will perform out of sample. Combining these different dimensions of choice, there are 9 different ways that we could evaluate the performance of one single forest with an intercept! We can compare the 9 metrics.

The response-based $R^2$ measure is the most straightforward. Unsuprisingly, performance is worst using `oob`

predictions and best using `itb`

predictions.

yR2Forest(forest, method="oob") yR2Forest(forest, method="all") yR2Forest(forest, method="itb")

The projection based $R^2$ metrics follow the same pattern of `oob`

measures performing the worst, but overall performance is much lower because the forest no longer gets credit for explaining variation that could be explained with the population average BMI vs. AGE trajectory.

projectedR2Forest(forest, method="oob", removeIntercept = FALSE) projectedR2Forest(forest, method="all", removeIntercept = FALSE) projectedR2Forest(forest, method="itb", removeIntercept = FALSE)

Finally, when we exclude the intercept, the same patterns hold but we see that even less total variation is explained by our model with we do not give the model credit for explaining variation in starting level. These metrics suggest that, moving out of sample, we will only be able to explain aroudn 3\% of variation in shape of BMI trajectory.

projectedR2Forest(forest, method="oob", removeIntercept = TRUE) projectedR2Forest(forest, method="all", removeIntercept = TRUE) projectedR2Forest(forest, method="itb", removeIntercept = TRUE)

Our primary motivation in building forests of spline trees is to obtain stable measures of variable importance using a permutation importance metric related to that described in @breiman2001random and @liaw2002classification.

For every tree in the forest, tree performance is measured on out-of-the-bag datapoints. Then, the values of variable $V$ are randomly permuted, and tree performance is re-measured. The average difference in performance over all trees in forest becomes the variable importance score for variable $V$. The `splinetree`

package provides scores using the absolute difference in performance, the percent difference in importance, and standardized difference in importance (differences divided by their standard deviation). In most cases, these three metrics will rank the variables in the same way, and so the choice is a matter of preference.

The "tree performance" metric involved in the permuation importance could be response-based or projection-based, and the latter metric (in the case of forests that include an intercept) can be modified to either include or exclude the intercept. The function `varImpY()`

implements response-based projection where the tree performance metric used is the Mean Squared Prediction Error (MSE), as in @liaw2002classification. Alternatively, the `varImpCoeff()`

funcion uses projection sum of squares, and has an additional arguement `removeIntercept`

to specify whether or not the intercept should be included in the projection sum or squares. As with performance metrics, the variable importance metrics can be calculated according to `oob`

, `all`

, or `itb`

frameworks. The `oob`

method is recommended, because in measuring variable importance it is important to consider which variables are most likely to be associated with the outcome outside of our training sample.

We can create three different variable importance matrices using our sample forest. We can then compare the response-based importance to the projection-based importances (both including and ignoring the intercept). Each of these importance matrix contains three columns, corresponding to absolute differences in performance, percent differences in performance, and standardized differences in importance. We will use the third column.

Y_imps <- varImpY(forest, method="oob") coeff_imps <- varImpCoeff(forest, method="oob", removeIntercept=FALSE) shape_imps <- varImpCoeff(forest, method="oob", removeIntercept=TRUE)

Y_imps = importance[[1]] shape_imps = importance[[3]] coeff_imps = importance[[2]]

par(mfrow=c(1,3)) plotImp(Y_imps[,3], main="Response") plotImp(coeff_imps[,3], main = "Coeff w/ Intercept") plotImp(shape_imps[,3], main = "Coeff w/out Intercept")

The first two panels of the plot look relatively similar. While based on different performance metrics, both take into account variability explained by the forest with respect to level of the outcome and shape of the trajectory. The high level of agreement between the first two panels suggests that the projected trajectories are reasonable approximations of the true response; variables that are important in explaining the projected trajectories are also important in explaining the actual responses. The third panel shows more stark differences, as this metric defines performance only in terms of explaining the shape of the trajectory. Sex is far less important when level is ignored, suggesting that sex impacts the level of an individual's BMI but has almost no bearing on trajectory of BMI over time.

Spline forests are useful tools for understanding which variables may be associated with heterogeneity in longitudninal trajectories in a population.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.