Introduction to splinetree

knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 7
)

This vignette describes the motivation for the splinetree package and introduces the algorithms used by the package. For a more practical guide to working with data using the functionalities of the splinetree package, see the vignettes Tree Building with splinetree and Forest Building with splinetree.

Motivation

Longitudinal studies, where an outcome of interest is repeatedly measured in the same subjects over time, play a key role in research across a variety of disciplines, including medicine, epidemiology, and the social sciences. In these studies, the trajectory of a longitudinal outcome over time can look very different in different members of the population. It is often useful to find factors that help explain this variation in growth patterns. We propose using longitudinal regression trees, specifically splinetree, to find population subgroups where group members follow similar longitudinal trajectory patterns and share common baseline covariate values. We also propose using spline forests to understand which baseline covariates can explain the most variation in trajectories.

Regression trees, which recursively partition a population into homogeneous subgroups (nodes), are a promising approach to this problem, but many regression tree algorithms cannot be used for studying longitudinal trajectories. Longitudinal regression trees were first proposed by @segal1992tree, who suggested minimizing the within-node Mahalanobis distance of the longitudinal response vectors. This method requires that all individuals in the sample have the same number of equally spaced measurements of the outcome, which is rarely the case in studies involving human subjects. @yu1999fitting eliminated this need for balanced data by first reducing the dimension of the outcome vector using smoothing splines and then maximizing within-node homogeneity of the smoothed trajectories. The work of Yu and Lambert is the basis for the algorithm employed by splinetree.

Algorithm

The method for building a splinetree model, which is invoked with every call to the function splineTree(), consists of two steps. First, the longitudinal response vector for each individual in the dataset is projected onto a spline basis, yielding a set of coefficients for each individual. Second, the coefficients from the projection are used in constructing a regression tree that maximizes the within-node homogeneity of the projected trajectories.

Spline Projection

Yu and Lambert suggested treating each response vector as a functional curve $$Y_i(t) = f_i(t) + \epsilon_i(t)$$ where $f_i(t) = \sum_{k = 1}^q\beta_{ik} X_k(t)$ for a set of basis functions $X_k(t)$ and coefficient vector $\boldsymbol{\beta}_i = (\beta_{i1},...,\beta_{iq})^T$, and where $\epsilon_i(t)$ is a white noise process with mean zero and constant variance. The first step of splineTree() transforms the user-provided longitudinal dataset (in long format) into a wide format dataset where each individual is defined by their coefficient vector $\hat{\boldsymbol{\beta}}_i = (\hat{\beta}_{i1},...,\hat{\beta}_{iq})^T$. In the final regression tree, terminal nodes are labeled with average coefficient vectors $\bar{\boldsymbol{\beta}} = (\bar{\beta}_{1},...,\bar{\beta}_{q})^T$ over the individuals in the node. These average coefficients, along with the tree-wide basis functions $X_k(t)$, describe the average trajectory for individuals whose covariate values place them in this node.

The splinetree package uses B-splines (built using splines::bs()), rather than the smoothing splines originally suggested by Yu and Lambert, for the basis functions $X_k(t)$, and obtains the coefficients by regressing each individual's longitudinal response on these common basis functions. Whereas the original dataset may have an irregular number of observations per individual, the transformed dataset is balanced; each individual has the same number of estimated coefficients, all of which correspond to a common set of basis functions.
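
As a concrete illustration of this projection step (a minimal sketch, not the package's internal code), the following regresses each individual's responses on a shared B-spline basis. The data frame dat and its ID, AGE, and BMI columns are stand-ins for whatever long-format dataset would be passed to splineTree(), and the basis settings are illustrative.

library(splines)

boundary <- range(dat$AGE)  # common boundary knots shared by every individual

# One regression per individual on the shared basis; the fitted coefficients
# become that individual's row of the flattened (wide-format) dataset.
coef_list <- lapply(split(dat, dat$ID), function(d) {
  coef(lm(BMI ~ bs(AGE, degree = 1, df = 2, Boundary.knots = boundary), data = d))
})
flat <- do.call(rbind, coef_list)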

A consequence of flattening the data in this way is that baseline covariates used for splitting the tree cannot be time-varying. When splineTree() is called, it expects a dataset in long format where the covariates it can split on take the same value in every row corresponding to a given individual. To include information about time-varying covariates in the tree, summary measures can be defined and included as split variables (see the sketch below). For example, instead of including time-varying yearly income as a split variable, we could define a new time-constant variable for each individual, such as starting income, average income, or average yearly increase in income. Alternatively, if a covariate such as income follows a non-linear trajectory over time, one could smooth it with a spline basis and use the resulting coefficients as split variables (at the expense of interpretability).
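
For instance, a hypothetical time-varying income column could be collapsed into time-constant summaries before fitting. Assuming a long-format data frame dat whose rows are ordered by time within each individual, something along these lines would work; the column names are illustrative.

# starting income and average income per individual (ID)
start_income <- aggregate(income ~ ID, data = dat, FUN = function(x) x[1])
avg_income   <- aggregate(income ~ ID, data = dat, FUN = mean)
names(start_income)[2] <- "start_income"
names(avg_income)[2]   <- "avg_income"
dat <- merge(merge(dat, start_income, by = "ID"), avg_income, by = "ID")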

One unique feature of the splineTree() function is that the user can specify whether or not the set of basis functions $X_k(t)$ includes the intercept function. This allows the user to explore relationships with shape, separate from level. When the intercept is included, the splits of the tree take into account both the shape and the level of the longitudinal response trajectory, and the terminal nodes of the tree describe full trajectories that can be used for prediction. When the intercept is excluded, the splits of the tree take into account only the shape of the trajectories. In this case, the predicted coefficients for terminal nodes describe trajectories that are equal to $0$ when $t=0$, which will typically not provide reasonable prediction values. To obtain rough predictions from intercept-less trajectories, the mean starting response for all individuals in the node can be added onto the trajectory after the fact as an estimated intercept.
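
A rough sketch of that adjustment, using purely illustrative objects: node_coeffs holds a terminal node's average coefficients, basis_matrix holds the intercept-less basis evaluated on a time grid, and mean_baseline is the node's average response at the earliest observation time.

shape_only <- basis_matrix %*% node_coeffs  # equals 0 at the earliest time point by construction
rough_pred <- shape_only + mean_baseline    # shift the shape up by the estimated intercept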

The set of basis functions $X_1(t), ..., X_q(t)$ is determined by parameters to the splineTree() function, which are passed forward to the bs() function. The degree parameter, with a default of 3 for a cubic spline, specifies the degree of the polynomials in the spline basis. The boundary knots for the spline basis are set to the minimum and maximum values of the time variable in the dataset, and by default no internal knots are included. In order to accommodate more complex trajectory patterns, a vector of internal knot locations can be provided through the knots parameter. Alternatively, internal knots can be added using the df parameter. If provided, the df parameter sets q (the number of basis functions), and so the number of internal knots is set to df - degree - intercept. In this case, the appropriate number of knots is placed automatically at percentiles of the time variable so as to break the data into equally sized groups. The parameters knots and df should not be provided simultaneously; one or the other should be used to add internal knots.
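
The following sketch shows the two ways of adding an internal knot directly with splines::bs(), using the AGE variable from the NLSY sample introduced later in this vignette; the specific knot location is illustrative.

library(splines)
ages <- nlsySample$AGE

# Explicit internal knot at age 40 via the knots parameter:
basis_knots <- bs(ages, degree = 1, knots = c(40), Boundary.knots = range(ages))

# Or let df place the knot: with degree = 1 and no intercept column,
# df = 2 implies df - degree = 1 internal knot, placed at the median of AGE.
basis_df <- bs(ages, degree = 1, df = 2, Boundary.knots = range(ages))
attr(basis_df, "knots")  # where the internal knot landed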

The tree-building process seeks to maximize homogeneity of smoothed trajectories within nodes; if the projected trajectories poorly approximate the real data, the tree will not be useful. Therefore, the choice of parameters such as degree and df should be guided by knowledge of the trajectories being modeled. If the trajectories are suspected to be non-monotonic, a linear basis with no knots is a poor choice. Although more complex bases allow the individual projections to more closely model the individual response vectors, overly complex bases should be avoided. If the number of coefficients ($q$) exceeds the number of observations for a certain individual, that individual will be assigned NA coefficient values and ignored during the tree-building process. To avoid throwing out data, $q$ should be less than the minimum number of observations per individual. Even an individual with many observations may be assigned NA coefficients if they have no observations to one side of an internal knot, and individuals with only one observation to one side of an internal knot can be assigned extremely large coefficients that distort the tree-building process. For these reasons, before choosing a basis it is important to ensure that individuals in the dataset have enough observations to justify the spline basis, and that these observations are spread out enough to support the placement of internal knots.
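
A quick pre-check along these lines (assuming a long-format data frame dat with an ID column and an AGE time variable; names are illustrative) can catch individuals who would be dropped or poorly fit:

obs_per_id <- table(dat$ID)
min(obs_per_id)   # q (the number of basis functions) should stay below this
range(dat$AGE)    # observations should span the boundary knots and any internal knots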

Split Criteria

Once the data has been transformed into a matrix of spline coefficients, a regression tree is built. The tree-building process employed by splinetree uses the same exhaustive search algorithm as CART, from @brieman1984classification, but with a modified measure of node purity. CART is implemented in the popular rpart package (@therneau2018rpart). The splinetree package is built on top of rpart using the custom split function framework explained in @therneau2018user, and so a splinetree model is actually an object of class rpart.

As in CART, at each node the splinetree algorithm computes a split score for all possible ways to separate the data based on the split covariates. The score is based entirely on the smoothed response curves described by the spline coefficients; the actual response data is never seen by the split function. The split score measures the reduction in the sum of squared errors of the smoothed trajectories in the node around the mean smoothed trajectory in the node, with the errors evaluated at a set of fixed grid points. If $N$ denotes the data found in the current node, $L$ denotes the data found to the left of the split, and $R$ denotes the data to the right of the split, the split score is $SS_N - SS_L - SS_R$. The value $SS_S$ for a set $S$ is defined as

$$ SS_S = \sum_{i \in S}(\mathbf{X}\hat{\boldsymbol{\beta}}_{i} - \mathbf{X}\bar{\boldsymbol{\beta}}_{S})^T(\mathbf{X}\hat{\boldsymbol{\beta}}_{i} - \mathbf{X}\bar{\boldsymbol{\beta}}_{S}) = \sum_{i \in S}(\hat{\boldsymbol{\beta}}_{i} - \bar{\boldsymbol{\beta}}_{S})^T\mathbf{X}^T\mathbf{X}(\hat{\boldsymbol{\beta}}_{i} - \bar{\boldsymbol{\beta}}_{S})$$

where the matrix $\mathbf{X}$ contains the values of the basis functions $X_1(t), ..., X_q(t)$ evaluated at the chosen fixed time points on a grid $(t_1,...,t_m)$, the $\hat{\boldsymbol{\beta}}_i$ are the spline coefficients for individual $i$, and $\bar{\boldsymbol{\beta}}_S$ is the mean coefficient vector over all individuals in set $S$. The grid points can be specified using the parameter nGrid, which gives the number of grid points to be placed automatically at percentiles of the time variable, or the parameter gridPoints, which provides a vector of specific time points.
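
As a sketch of the quantity being optimized (not the package's internal code), the projection sum of squares for a set of individuals can be computed from their coefficient matrix and the basis evaluated on the grid; coeffs (an n x q matrix of per-individual spline coefficients), grid, and left are illustrative names.

library(splines)
X <- bs(grid, degree = 1, df = 2, Boundary.knots = range(grid))  # m x q basis matrix

ss <- function(coeffs, X) {
  centered <- sweep(coeffs, 2, colMeans(coeffs))  # beta_i - beta_bar for each row
  sum((X %*% t(centered))^2)                      # squared distances evaluated on the grid
}

# Split score for sending the rows indexed by `left` to one child node:
# ss(coeffs, X) - ss(coeffs[left, , drop = FALSE], X) - ss(coeffs[-left, , drop = FALSE], X)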

At the end of the exhaustive search, the split with the highest split score (meaning that it provides the largest decrease in the projection sum of squared errors) is selected as the split for the current node. The process is then repeated recursively. For computational efficiency, not every possible partition of categories is considered for a categorical variable. The categories are first placed into an approximate order based on the magnitude of the average coefficients of the data belonging to each category, and then only binary splits along this ordering are considered.

Apart from the custom split score, the rest of the tree-building process mimics that of rpart. For example, missing covariate values are handled with surrogate splits (see the rpart documentation for more details). The size of a splinetree can be controlled with the complexity parameter, cp, which is passed forward to rpart. A large splinetree can also be pruned with the rpart::prune() method, as shown below. Since the rpart tree is built using a flattened version of the original data, the rows of an rpart-created attribute such as tree$where correspond to the rows of the flattened data, not the original data. For users who are familiar with the rpart package and wish to use features such as tree$where, the flattened dataset that indexes the tree$where vector can be found in tree$parms$flat_data.
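
For example, a tree grown with a small cp can be cut back afterwards with rpart's pruning function; fitted_tree is a placeholder for any splinetree model.

pruned_tree <- rpart::prune(fitted_tree, cp = 0.01)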

Although a splinetree model is an rpart object, not all rpart functions will have the expected behavior when applied to a splinetree model. Numerous splinetree functions have been provided to make the models easier to work with. For example, print.rpart() will not print out the full vector of predicted coefficients for each node of a splinetree model, and so the splinetree::stPrint() method is preferable. Similarly, predict.rpart() will fail to return the full vector of coefficients associated with a node, and so splinetree::predictCoeffs() is preferable.

Example

In this section, we will show how to use splineTree() to build a tree. More details on customizing, visualizing, and evaluating splinetree models can be found in the vignette Tree Building with splinetree.

This example uses data taken from the National Longitudinal Survey of Youth, 1979 (NLSY). The longitudinal trajectory of interest is body mass index (BMI) over time. We randomly sample 1,000 individuals from the NLSY from among those with non-missing BMI data at 10 or more time points spread over at least 20 years. We are interested in the relationship between BMI trajectories and time-constant variables such as HISP, WHITE, and BLACK (indicator variables for the subject's reported race), SEX (indicator for the subject's reported sex), Num_sibs (number of siblings), and HGC_FATHER and HGC_MOTHER (reported highest grade completed by the subject's father and mother).

Previous research by @clarke2008social suggests that adult BMIs tend to increase steadily throughout early adulthood and flatten out in later adulthood. This type of trajectory can be modeled with a piecewise linear function, so we define our spline basis with degree = 1 and one internal knot. Since we do not have a particular location in mind for the internal knot, we use the df parameter rather than the knots parameter, which places the knot at the median age. If we include an intercept, we need df = 3 to add one internal knot, but if we do not include an intercept, df = 2 suffices. In this example, we build one tree with an intercept and one tree without an intercept so as to compare the two. The default value in the rpart package for the cp parameter, which controls the size of the tree, is 0.01. Here, we set cp slightly lower to create trees that are large enough to be interesting but small enough to view in a plot.

library(splinetree)
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tformula <- BMI ~ AGE
sample_tree <- splineTree(split_formula, tformula, idvar = "ID", 
                          data = nlsySample, degree = 1, df = 2, 
                          intercept = FALSE, cp = 0.005)
sample_tree_intercept <- splineTree(split_formula, tformula, idvar = "ID", 
                                    data = nlsySample, degree = 1, df = 3, 
                                    intercept = TRUE, cp = 0.005)

After building these two trees, we can view a printed summary or a plot of each. Beginning with the no-intercept tree, we see that non-white individuals are predicted to have more rapid growth in body mass index than white individuals. Among white individuals, those whose fathers did not complete more than 8.5 years of schooling show more rapid BMI growth. Since this tree was built without an intercept, the average starting BMI is added to each average trajectory in the plot.

stPrint(sample_tree)
stPlot(sample_tree)

In the tree that is built with an intercept, each node is associated with three coefficients instead of two. While we see two of the same variables that were in the no-intercept tree included here, we now also see sex playing a role. This suggests that sex may impact the level of BMI but not the shape of the trajectory.

stPrint(sample_tree_intercept)
stPlot(sample_tree_intercept, colors=c("red", "orange", "yellow", "blue", "cyan"))

Forests

The examples above suggest that the variables WHITE and HGC_FATHER are most associated with the shape of a BMI trajectory, and that these same variables, with the addition of SEX, are associated with the BMI level. However, a single regression tree gives a poor overview of the true importance of covariates. If there is a close tie between the "goodness" of two possible splits at the top level of the tree, then a new tree built on a slightly perturbed dataset could include an entirely different set of covariates. If two covariates that are associated with the trajectories are highly correlated, it is likely that only one will appear in the tree.

The rpart package has a built-in variable importance metric based on the improvements in goodness of fit provided by both primary split variables and surrogate split variables (see the rpart documentation for more information). While this measure helps capture the importance of second-best or correlated covariates that do not appear in the tree itself, it can show bias towards variables with more unique values; a very important binary variable that is used at the top level of a tree is not available for surrogate splits later in the tree, and so its importance according to this metric may be smaller than that of a less important numeric variable that is repeatedly used as a surrogate split throughout the tree. A consequence of this bias is that the rpart variable importance rankings depend heavily on the size to which a tree is pruned; in larger trees, variables with many unique values can appear over and over again as primary or surrogate splits, which inflates their importance. This bias is illustrated in the plot below, which displays the rpart variable importance metric for the sample_tree_intercept from above alongside an identical tree that is allowed to grow much larger. While SEX is rated as the most important variable in the small tree, when the tree is allowed to grow larger the importance of all three non-binary variables surpasses that of SEX. This is an example of a general trend: as trees grow larger, the rpart importance of variables with many unique values is inflated relative to variables with fewer unique values.

extra_large_tree <- splineTree(split_formula, tformula, idvar = "ID", 
                               data = nlsySample, degree = 1, df = 3, 
                               intercept = TRUE, cp = 0.0005)
vars <- attr(terms(split_formula), "term.labels")
imp_large <- extra_large_tree$variable.importance
imp_small <- sample_tree_intercept$variable.importance

# Collect the rpart importance of each split variable in each tree,
# filling in 0 for variables that never appear as a primary or surrogate split.
all_imps <- matrix(0, nrow = length(vars), ncol = 2,
                   dimnames = list(vars, c("Large Tree", "Small Tree")))
for (var in vars) {
  if (var %in% names(imp_large)) all_imps[var, 1] <- imp_large[var]
  if (var %in% names(imp_small)) all_imps[var, 2] <- imp_small[var]
}
all_imps <- data.frame(all_imps, check.names = FALSE)

par(mfrow = c(1, 2))
par(las = 2)              # make label text perpendicular to the axis
par(mar = c(1, 7, 3, 1))  # increase the y-axis margin
barplot(all_imps[, 2] / sum(all_imps[, 2]), horiz = TRUE, names.arg = row.names(all_imps),
        cex.names = 0.7, main = "Small Tree", axes = FALSE)
barplot(all_imps[, 1] / sum(all_imps[, 1]), horiz = TRUE, names.arg = row.names(all_imps),
        cex.names = 0.7, main = "Large Tree", axes = FALSE)

In the univariate setting, Random Forests and the associated permutation importance measure are popular ways to obtain more stable measures of variable importance (see @breiman2001random or @liaw2002classification for more). Through random data subsampling and random variable selection, Random Forests give variables that are not locally optimal in a single tree a chance to appear in a tree. The splineForest() method allows users to build ensembles of spline trees, combining the projection splitting framework of splineTree() with the random subsampling and random variable selection principles of a Random Forest.

Spline Forest Implementation

Once a user is familiar with the splineTree() function, using the splineForest() function is straightforward; most of the parameters are identical. Although a spline forest is an ensemble of spline trees, a call to splineForest() does not involve repeated calls to splineTree(). The projection process happens only once in a forest, and the random subsampling occurs on the level of the flattened data, not the individual responses. This ensures that full trajectories are kept together in the forest building process. Following the work of @strobl2007bias, we perform random subsampling without replacement, and we subsample 63.5% of the data for each tree (this matches the proportion of data that we would expect to be used for each tree if we used the more traditional practice of bootstrap sampling). @strobl2007bias show that sampling without replacement reduces bias in the permutation importance metric when compared to bootstrap sampling. By setting the parameter bootstrap=TRUE, users can choose to use bootstrap sampling instead.

The ntree parameter specifies the number of trees in the ensemble; ntree random samples are drawn from the flattened data and a tree is built on each of these samples. Each tree uses the same split criteria as the splineTree() function but incorporates a random component into the split selection. The prob parameter determines the probability that any one variable is put into contention to be the split variable at a node. An appropriate value for the prob parameter depends on the total number of split variables. If there are only three split variables and prob = 1/2, then there is a $1/8$ chance that no variables will be considered for a split and the tree building will terminate prematurely. With a large number of split variables, however, a probability of around $1/3$ is appropriate (see the quick calculation below). This method for incorporating random variable selection is different from the method implemented in the randomForest package (@liaw2002classification), where the number of variables to be considered at each node is specified rather than left up to chance. The choice to use a probability arose from the desire to stay within the rpart custom-split framework; the probability version could be implemented without modifying the base rpart code.
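
The $1/8$ figure, and the reasoning behind a smaller prob when there are more split variables, follow from the probability that no variable is offered at a node, which is (1 - prob)^p for p candidate variables:

(1 - 0.5)^3   # 3 split variables with prob = 1/2: no variable considered 12.5% of the time
(1 - 1/3)^7   # 7 split variables (as in this vignette) with prob = 1/3: roughly 6% of the time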

Example

We build an example spline forest using the same dataset introduced above, using a linear spline basis with one internal knot and an intercept.

forest <- splineForest(split_formula, tformula, idvar = "ID", 
                       data = nlsySample, degree = 1, df = 3, 
                       intercept = TRUE, ntree = 50, prob = 0.5, cp = 0.005)

The model returned by this function is a named list with 15 components. The individual trees in the forest are saved in forest$Trees. Looking at forest$Trees[[1]], we see that, due to the random subsampling and random variable selection, variables appear that did not show up in the single trees built above.

stPrint(forest$Trees[[1]])

While each tree within forest$Trees is an rpart object that can be printed with stPrint(), these trees are not full splinetree models. A single splinetree model stores information about the spline basis used and the flattened dataset within it. To eliminate redundancy in already large splineForest models, this information is not stored in every tree in a forest; it is saved only once at the forest level (see, for example, forest$innerKnots and forest$flat_data). Many splinetree functions, including stPlot(), expect this additional information to be stored at the tree level, and therefore stPlot(forest$Trees[[1]]) will result in an error.

To determine which data points were used in building the first tree in the forest, we can use forest$index[[1]]. The indices found in this vector refer to rows of forest$flat_data. Knowing which data points were used to build each tree matters because it determines how predictions are made from the forest. When obtaining predicted responses for a data point that was in the training sample, we can average the predictions from every tree in the forest, or we can average only over trees for which this data point was "out of the bag," meaning that it was not in the tree's sample. "Out-of-the-bag" (or "oob") predictions are suggested by @breiman2001random because they provide a sense of the out-of-sample performance of the forest without using a distinct test set. In the splinetree package, we can predict BMIs for each individual in the NLSY sample using "all" trees for each data point, only trees for which the data point was "out of the bag" ("oob"), or only trees for which the data point was "in the bag" ("itb").

fullpreds <- predictYForest(forest, method = "all")
oobpreds <- predictYForest(forest, method = "oob")
itbpreds <- predictYForest(forest, method = "itb")

If we look at these predictions, we see that the "oob" predictions correlate less strongly with the observed BMIs, but this lower performance is more representative of how our forest might perform on a new test set.

cor(fullpreds, nlsySample$BMI)
cor(itbpreds, nlsySample$BMI)
cor(oobpreds, nlsySample$BMI)

Variable Importance

Our key motivation for building a spline forest was to figure out which variables are most strongly associated with heterogeneity in BMI trajectories. The splinetree package implements a variable importance metric closely related to that of @breiman2001random and @liaw2002classification.

For every tree in the forest, tree performance is measured on out-of-the-bag data points. Then the values of variable $V$ are randomly permuted and tree performance is re-measured. The average difference in performance over all trees in the forest becomes the variable importance score for variable $V$. The splinetree package provides scores using the absolute difference in performance, the percent difference in performance, and the standardized difference in performance (differences divided by their standard deviation). In most cases, these three metrics rank the variables in the same way, and so the choice is a matter of preference. A variable importance score reported as negative should be interpreted as a variable importance score of $0$.
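
Conceptually, one step of this loop looks like the sketch below; tree_performance() and oob_data are hypothetical placeholders, and the package functions varImpY() and varImpCoeff() carry out the full loop over trees and variables for you.

baseline <- tree_performance(tree, oob_data)        # e.g. MSE on the out-of-bag rows
shuffled <- oob_data
shuffled$HGC_FATHER <- sample(shuffled$HGC_FATHER)  # break the variable's association
importance_one_tree <- tree_performance(tree, shuffled) - baseline
# the forest-level score averages this difference over all trees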

Two different definitions of tree "performance" can be used. If the forest's spline basis includes the intercept, variable importance can be measured with respect to predicting the response outcome. In this case, the tree performance metric used is the mean squared prediction error (MSE), as in @liaw2002classification. This version is implemented in the function varImpY().

Alternatively, we can measure tree performance using the projection sum of squared errors. The advantage of this metric is that it can be used whether or not the forest includes an intercept. When used on a non-intercept forest, it measures the importance of each variable in determining the shape of a trajectory. When used on a forest with an intercept, there is an option to ignore the intercept in the variable importance calculations. Therefore, even though the sample forest we have been working with includes an intercept, we can still calculate a measure of shape-based variable importance.

We can create three different variable importance matrices using our sample forest and compare the response-based importances to the projection-based importances (both including and ignoring the intercept). Each of these importance matrices contains three columns, corresponding to absolute differences in performance, percent differences in performance, and standardized differences in performance. We will use the third column.

Y_imps <- varImpY(forest) # on the slow side
coeff_imps <- varImpCoeff(forest, removeIntercept = FALSE)
shape_imps <- varImpCoeff(forest, removeIntercept = TRUE)
par(mfrow = c(1,3))
plotImp(Y_imps[,3], main = "Response")
plotImp(coeff_imps[,3], main ="Coeff w/ Intercept")
plotImp(shape_imps[,3], main = "Coeff w/out Intercept")

Using a splineForest, some new variables have relatively high importance. For example, while HGC_MOTHER did not appear in a splineTree, it has a similar overall importance to HGC_FATHER. These two variables may be unlikely to appear in the same tree due to their correlation (0.667 among complete cases), but the permutation importance metric shows that both variables are associated with the level and the shape of the outcome. The first two panels of the graph are quite similar because they both take into account level and shape; the difference is that the first panel looks at how influential the variables are for approximating BMI values, whereas the second panel looks at how influential they are for approximating the smoothed BMI trajectories. The fact that the panels look similar is reassuring because it suggests that the smoothed trajectories are reasonable approximations of true BMI.

The third panel of the graph shows more differences. When the intercept of the smoothed trajectories is ignored, the SEX variable loses all importance. This mirrors the observation from our splineTree models, where SEX appeared in the intercept tree but not in the no-intercept tree. Now that we are using a forest, we can more confidently assert that reported sex is associated with the level of BMI but not with the shape of the trajectory over time.

Conclusion

The splinetree package allows users to build regression trees and random forests for longitudinal trajectories. The spline basis method is flexible enough to accommodate complex trajectory shapes, and the models can be used to study just the shape of the trajectories, or to study shape and level simultaneously. For more details on the functionalities of the package, see the vignettes Tree Building with splinetree and Forest Building with splinetree.

References


