Fits a boosted ensemble of historical regression trees to longitudinal data.
htb(x, time, id, yindx, ntrees, method, nsplit, lambda, family,
    cv.fold, cv.rep, historical, nsamp, vh, vc, delta, control)
x
A data frame containing the response and predictors.
time
A vector of observation times (length equal to the number of rows of x).
id
A vector of subject identifiers (length equal to the number of rows of x).
yindx
Name of the response variable, or alternatively its column number in x.
ntrees
Number of trees in the ensemble.
method
Historical summary method; one of "freq", "frac", "mean0", "freqw", "fracw" or "meanw0" (see Details).
nsplit
Number of splits in each regression tree.
lambda
Shrinkage parameter applied to each tree.
family
Either "gaussian" (default) or "bernoulli".
cv.fold
Number of cross-validation folds; if cv.fold>1, cross-validation is performed.
cv.rep
Number of times to repeat the cross-validation. If not given, set to cv.fold.
historical
If TRUE, historical splitting is performed; if FALSE, only concurrent predictors are used.
nsamp
Number of sampled split points when splitting on a historical predictor (see Details).
vh
Optional vector of indexes giving the historical predictors.
vc
Optional vector of indexes giving strictly concurrent effects.
delta
A vector of history lags to be used (see Details). If not supplied, determined from the differences in time between successive observations in the data.
control
A list of control parameters (see Details). All arguments, except those describing the data (x, yindx, time and id), can also be set in control.
The htb function fits a boosted tree ensemble to longitudinal data. Data are assumed to be of the form
z_{ij}=(y_{ij},t_{ij},x_{ij}) for i=1,...,n and j=1,...,n_i, with y_{ij} being the response for the i-th subject
at the j-th observation time t_{ij}. The predictors at time t_{ij} are x_{ij}. The number of observations can vary across
subjects, and the sampling in time can be irregular.
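As a small illustration of this layout (a hypothetical toy data set, not from the package), the data frame passed as x holds one row per subject-time observation, with id and time supplied as separate vectors:

```r
## Toy longitudinal data in the layout htb expects: one row per
## (subject, observation time), with response and predictors together.
dat <- data.frame(
  id   = c(1, 1, 1, 2, 2),            # subject identifiers (2 subjects)
  time = c(0.0, 1.0, 2.5, 0.0, 2.0),  # irregular observation times
  y    = c(2.1, 2.5, 3.0, 1.2, 1.4),  # response y_ij
  x1   = c(0.5, 0.7, 0.6, 0.1, 0.3)   # a predictor
)
## The corresponding call would be (column names here are illustrative):
## fit <- htb(x = dat, id = dat$id, time = dat$time, yindx = "y")
```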
htb estimates a model for the response y_{ij} using both (t_{ij},x_{ij}) (the observations concurrent with y_{ij}) and all preceding observations of the i-th subject up to (but not including) time t_{ij}. The model is fit using historical
regression (alternatively, classification) trees. Here a predictor is one of two types, either concurrent
or historical. The concurrent predictors for y_{ij} are the elements of the vector (t_{ij},x_{ij}), while a
historical predictor is the set of all preceding values (preceding time t_{ij}) of a given element of (y_{ij},t_{ij},x_{ij}), for subject i. In
a historical regression tree, node splitting on a concurrent
predictor follows the approach in standard regression (classification) trees. For historical
predictors the splitting is modified since, associated with each observed response y_{ij}, the set of observations of a historical
predictor varies with i and j. For these, the splitting is done by first transforming the preceding values of a predictor using a summary function. This summary is
invertible, in the sense that knowledge of it is equivalent to knowledge of the covariate history. Letting \bar{z}_{ijk} denote the set of historical values of the
k-th element of z_{ij}, the summary function is denoted s(η;\bar{z}_{ijk}), where η is the argument vector of the summary
function. Node splitting based on a historical predictor is done by solving
\mbox{argmin} ∑_{(ij)\in Node} (y_{ij} - μ_L I(s(η;\bar{z}_{ijk}) < c) - μ_R I(s(η;\bar{z}_{ijk}) ≥ c))^2,
where the minimization is over the vector (k, μ_L, μ_R, c, η). Each node of a historical
regression tree is split using the best split
among all splits of concurrent and historical predictors.
Different summary functions are available in htb, specified by the argument method. Setting method="freq"
corresponds to the summary
s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}-η_1 ≤ t_{ih} < t_{ij}} I(z_{ihk} < η_2);
method="frac":
s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}-η_1 ≤ t_{ih} < t_{ij}} I(z_{ihk} < η_2)/n_{ij}(η);
method="mean0":
s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}-η_1 ≤ t_{ih} < t_{ij}} z_{ihk}/n_{ij}(η);
method="freqw":
s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}-η_1 ≤ t_{ih} < t_{ij}-η_2} I(z_{ihk} < η_3);
method="fracw":
s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}-η_1 ≤ t_{ih} < t_{ij}-η_2} I(z_{ihk} < η_3)/n_{ij}(η);
method="meanw0":
s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}-η_1 ≤ t_{ih} < t_{ij}-η_2} z_{ihk}/n_{ij}(η).
Here n_{ij}(η) denotes the number of observations of subject i in the
time window [t_{ij}-η_1, t_{ij}-η_2). In the case n_{ij}(η)=0, the summary function is set to zero, i.e. s(η;\bar{z}_{ijk})=0. The default is method="freqw".
The possible values of η_1 in the summary function can be set by the argument delta. If not supplied, the set of possible values of η_1 is
determined by the differences in time between successive observations in the data. When a split is attempted on a historical predictor, a sample of this set is
taken, from which the best split is selected. The size of this sample is given by the nsamp
argument. See the description of control below for further arguments governing the
historical splitting.
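To make the windowed summaries concrete, the following base-R sketch (an illustration, not the package internals) computes the default "freqw" summary for a single response: it counts how many past values of a covariate, observed in the window [t-η_1, t-η_2), fall below the threshold η_3.

```r
## "freqw" summary for one observation: count past covariate values
## in the window [t - eta1, t - eta2) that fall below eta3.
freqw_summary <- function(t_hist, z_hist, t, eta1, eta2, eta3) {
  in_window <- (t_hist >= t - eta1) & (t_hist < t - eta2)
  sum(z_hist[in_window] < eta3)
}

## Toy history for one subject: observation times and covariate values
t_hist <- c(1, 2, 3, 4)
z_hist <- c(0.2, 0.8, 0.5, 0.9)

## Summary at time t = 5 over the window [2, 4), threshold 0.6:
## times 2 and 3 lie in the window, and only z = 0.5 is below 0.6.
freqw_summary(t_hist, z_hist, t = 5, eta1 = 3, eta2 = 1, eta3 = 0.6)  # 1
```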
When cv.fold>1, cross-validation is performed. In subsequent summaries of the model, such as the partial dependence plots from partdep_htb,
these cross-validation model fits are used to estimate the standard error. This is done using the delete-d jackknife estimator (where deletion refers to
subjects, rather than rows of the training data). Each cross-validation model fit is performed by removing a random sample of 1/cv.fold
of the subjects. The number of cross-validation runs is determined by cv.rep, which defaults to the value of cv.fold.
All arguments (except those describing the data: x, yindx, time and id) can be set in the control
list. The arguments supplied in control are used if both are supplied, so if ntrees=300
and control=list(ntrees=500), then 500 trees are fit. Besides the arguments
described above, a number of other parameters can be set in control. These are: nodesize,
giving the minimum number of training observations in a terminal node;
sample_fraction, giving the fraction of the data sampled to train each tree; dtry,
the number of sampled delta values used when splitting a variable
based on a covariate history (note this is alternatively controlled by the argument nsamp above);
ndelta, the number of delta values to use if delta
is not supplied; these are taken as quantiles of the
distribution of observed delta values in the data; qtry,
the number of sampled values of η_2 for method="freq"/"frac", and of η_3 for
method="freqw"/"fracw"; quantiles,
a vector of probabilities, used when method="frac"
or method="fracw", i.e. when covariate histories are split on their windowed empirical distributions; splitting is restricted to these quantiles. The fold size for cross-validation can be set by control$cvfold,
and the number of repetitions by control$nfold.
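Putting these together, a control list might be assembled as follows. The parameter names are those documented above; the values are arbitrary and purely illustrative.

```r
## Illustrative control list for htb; values are arbitrary examples.
control <- list(
  ntrees          = 300,   # number of boosting iterations
  lambda          = 0.1,   # shrinkage applied to each tree
  nsplit          = 3,     # number of splits per tree
  nodesize        = 5,     # minimum training observations in a terminal node
  sample_fraction = 0.5,   # fraction of data used to train each tree
  ndelta          = 20,    # size of the delta grid when 'delta' is not supplied
  cvfold          = 10,    # cross-validation folds
  nfold           = 10     # number of cross-validation repetitions
)
## fit <- htb(x = dat, id = dat$id, time = dat$time, yindx = "y",
##            control = control)
```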
Returns a list whose more important elements are: trees, giving the tree ensemble; cv
(if cv.fold>1), the cross-validation model estimates; cv_error,
the cross-validated error (MSE when family="gaussian",
negative log-likelihood when family="bernoulli"); control,
a list controlling the fit; x, id, time, giving the training data.
Joe Sexton <joesexton0@gmail.com>
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
predict_htb, partdep_htb, varimp_htb.
## Not run:
# ----------------------------------------------------------------------------------------- ##
# Mother's stress on child illness:
# Investigate whether mother's stress is (Granger) causal of child illness
# 'htb' model is fit using previous observations of mother's stress to predict
# child's illness at given time point, but not mother's stress at that time point
#
# Predictor variables are classified into "historical" and "concurrent"
#
# A predictor is "historical" if its prior realizations can be used to predict
# the outcome.
#
# A predictor is "concurrent" if its realization at the same timepoint as the outcome
# can be used to predict the outcome at that timepoint
#
# A predictor can be both "concurrent" and "historical"; the status of the predictors
# can be set by the 'vh' and 'vc' arguments of 'htb'.
# (if not set, these are determined automatically)
#
# ------------------------------------------------------------------------------------------- ##
data(mscm)
mscm=as.data.frame(na.omit(mscm))
# -- set concurrent and historical predictors
historical_predictors=match(c("illness","stress"),names(mscm))
concurrent_predictors=which(names(mscm)!="stress")
control=list(vh=historical_predictors,vc=concurrent_predictors,
ntrees=200,family="bernoulli",cvfold=10,lambda=.1)
# logistic regression
ff=htb(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
# cross-validated negative log-likelihood
plot(1:ff$ntrees,ff$cv_error,type="l",
xlab="iterations",ylab="cross-validation error")
# -- variable importance table
vi=varimp_htb(ff)
vi
# -- plot partial dependence (with +/-2 approximate standard errors)
par(mfrow=c(2,2))
for(k in 1:4)
partdep_htb(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
# -- Standard errors are based on cross-validation runs (using delete-d
# (subjects) jackknife estimator, d="number-of-subjects"/cvfold)
# -- increasing nfold (which defaults to equal cvfold) gives less
# noisy standard error estimates ...
control$nfold=50
ff=htb(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
par(mfrow=c(2,2))
for(k in 1:4)
partdep_htb(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
# -------------------------------- #
# Data w/irregular observation times
# ------------------------------- #
data(cd4)
control=list(cvfold=10,lambda=.1,nsplit=3,ntrees=200)
ff=htb(x=cd4,id=cd4$id,time=cd4$time,yindx="count",control=control)
vi=varimp_htb(ff)
# -- partial dependence for top 4 predictors (with +/-2 SE estimates)
par(mfrow=c(2,2))
for(k in 1:4)
pd=partdep_htb(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
# -- cv error
plot(1:ff$control$ntrees,ff$cv_error,xlab="boosting iteration",
ylab="cross-validated mse",type="l")
# estimated boosting iteration
abline(v=ff$nboost_cv,lty=2)
## by default, the number of delta values (parameter 'eta_1' above) is 20
## can set this using 'ndelta'
control$ndelta=50
ff=htb(x=cd4,id=cd4$id,time=cd4$time,yindx="count",control=control)
points(1:ff$control$ntrees,ff$cv_error,type="l",lty=2)
## note: differences in cv_error can be due (in part) to randomness
## in out-of-sample selection
# the grid of delta values
ff$control$delta
# ------------------------------------------ #
# Boston Housing data (not longitudinal)
# ------------------------------------------ #
# library(htree)
library(mlbench)
data(BostonHousing)
dat=as.data.frame(na.omit(BostonHousing))
# omitting arguments 'time' and 'id' assumes rows are iid
control=list(cvfold=10,ntrees=500)
h=htb(x=dat,yindx="medv",control=control)
# -- plot cross-validated Mean-squared error --- #
plot(1:h$ntrees,h$cv_error,type="l",xlab="Boosting iterations",
ylab="MSE",main="Cross-validated Mean-squared error")
# -- variable importance table
vi=varimp_htb(h,nperm=20)
vi
# -- partial dependence of top 4 predictors with +/-2 S.E.
par(mfrow=c(2,2))
for(k in 1:4)
partdep_htb(h,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
## End(Not run)