Description:
Fits a random forest of historical regression trees to longitudinal data.
Usage:

hrf(x, time, id, yindx, ntrees, method, mtry, se, B, R,
    nsamp, historical, vh, vc, delta, classify, control)
Arguments:

x: A data frame containing the response and predictors.
time: A vector of observation times associated with the rows of x.
id: Subject identifier; if omitted, rows are assumed to be independent observations.
yindx: Name of the response variable, alternatively its column number in x.
ntrees: Number of trees in the ensemble.
method: Historical summary method; one of "freq", "frac", "mean0", "freqw", "fracw" or "meanw0" (see Details).
mtry: Number of predictors sampled at each split.
se: If TRUE, standard errors are estimated (see Details).
B: Only used if se=TRUE; the number of bootstrap samples.
R: Only used if se=TRUE; the number of trees per bootstrap ensemble.
nsamp: Number of sampled delta values used when splitting on a historical predictor (see Details).
historical: If TRUE, historical predictors are used in splitting; otherwise only concurrent predictors are used.
vh: Optional vector of variable names to be used as historical predictors; can be a numeric vector giving column numbers of x.
vc: Optional vector of variable names to be used as concurrent predictors; can be a numeric vector giving column numbers of x.
delta: An optional vector of time-lags to be used (see Details).
classify: If TRUE, classification trees are built (Gini-impurity node splitting).
control: A list of control parameters (see Details). All arguments, except those describing the data, can also be set in control.
Details:

The hrf function fits a random forest model to longitudinal data. Data are assumed to be of the form z_{ij}=(y_{ij},t_{ij},x_{ij}) for i=1,..,n and j=1,..,n_i, with y_{ij} being the response for the i-th subject at the j-th observation time t_{ij}, and x_{ij} the vector of predictors at time t_{ij}. The number of observations can vary across subjects, and the sampling in time can be irregular.
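For concreteness, the following is a minimal sketch of the assumed data layout and a corresponding call; the data frame dat and its column names (y, x1) are hypothetical, not part of the package:

library(htree)
set.seed(1)
# toy longitudinal data: 20 subjects, 5 observation times each
dat=data.frame(id=rep(1:20,each=5),time=rep(1:5,times=20),x1=rnorm(100))
dat$y=dat$x1+rnorm(100)
# row j of subject i holds z_ij=(y_ij,t_ij,x_ij)
ff=hrf(x=dat,id=dat$id,time=dat$time,yindx="y",ntrees=50)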
hrf estimates a model for the response y_{ij} using both (t_{ij},x_{ij}) (the observations concurrent with y_{ij}) and all preceding observations of the i-th subject up to (but not including) time t_{ij}. The model is fit using historical regression (alternatively, classification) trees. Here a predictor is one of two types: concurrent or historical. The concurrent predictors for y_{ij} are the elements of the vector (t_{ij},x_{ij}), while a historical predictor is the set of all preceding values (i.e. prior to time t_{ij}) of a given element of (y_{ij},t_{ij},x_{ij}) for subject i. In a historical regression tree, node splitting on a concurrent predictor follows the approach in standard regression (classification) trees. For historical predictors the splitting is modified since, associated with each observed response y_{ij}, the number (and observation times) of observations of a historical predictor will vary with i and j. For these, the splitting is done by first transforming the preceding values of a predictor using a summary function. This summary is invertible, in the sense that knowledge of it is equivalent to knowledge of the covariate history. Letting \bar{z}_{ijk} denote the set of historical values of the k-th element of z_{ij}, the summary function is denoted s(η;\bar{z}_{ijk}), where η is the argument vector of the summary function. Node splitting based on a historical predictor is done by solving

argmin ∑_{(ij)∈Node} (y_{ij} − μ_L I(s(η;\bar{z}_{ijk}) < c) − μ_R I(s(η;\bar{z}_{ijk}) ≥ c))^2,

where the minimization is over the vector (k,μ_L,μ_R,c,η). Each node of a historical regression tree is split using the best split among all splits of concurrent and historical predictors.
Different summary functions are available in hrf, specified by the argument method. If method="freq", the summary function summarizing a covariate history is

s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}−η_1 ≤ t_{ih} < t_{ij}} I(z_{ihk} < η_2);

method="frac":

s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}−η_1 ≤ t_{ih} < t_{ij}} I(z_{ihk} < η_2)/n_{ij}(η);

method="mean0":

s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}−η_1 ≤ t_{ih} < t_{ij}} z_{ihk}/n_{ij}(η);

method="freqw":

s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}−η_1 ≤ t_{ih} < t_{ij}−η_2} I(z_{ihk} < η_3);

method="fracw":

s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}−η_1 ≤ t_{ih} < t_{ij}−η_2} I(z_{ihk} < η_3)/n_{ij}(η);

method="meanw0":

s(η;\bar{z}_{ijk}) = ∑_{h: t_{ij}−η_1 ≤ t_{ih} < t_{ij}−η_2} z_{ihk}/n_{ij}(η).

Here n_{ij}(η) denotes the number of observations of subject i in the time window [t_{ij}−η_1, t_{ij}−η_2). In the case n_{ij}(η)=0, the summary function is set to zero, i.e. s(η;\bar{z}_{ijk})=0. The default is method="freq"
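As an illustration only (this is not the package's internal code), the "freq" and "freqw" summaries of a single subject's covariate history can be written directly in R:

# zh: past values of the k-th predictor for one subject
# th: their observation times; t0: current observation time t_ij
s_freq=function(eta,zh,th,t0){
  # window [t0-eta_1, t0): count past values below threshold eta_2
  w=th>=t0-eta[1] & th<t0
  sum(zh[w]<eta[2])
}
s_freqw=function(eta,zh,th,t0){
  # window [t0-eta_1, t0-eta_2): threshold eta_3
  w=th>=t0-eta[1] & th<t0-eta[2]
  sum(zh[w]<eta[3])
}
# example: history of 4 observations, window of length 2 ending at t0
s_freq(c(2,0.5),zh=c(0.2,0.9,0.4,0.7),th=1:4,t0=5)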
. The possible values of η_1 in the summary function can be set by the argument delta. If not supplied, the set of possible values of η_1 is determined by the differences in time between successive within-subject observations in the data. When a split is attempted on a historical predictor, a sample from this set is taken, and the best split among the sampled values is selected; the size of this sample is given by the nsamp argument. See the discussion of control below for further arguments governing the historical splitting.
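For example, a user-chosen lag grid can be passed via delta, with nsamp controlling how many lags are sampled at each historical split (dat as in the hypothetical sketch above):

# supply candidate values of eta_1 directly, sample 2 per split
ff=hrf(x=dat,id=dat$id,time=dat$time,yindx="y",delta=c(1,2,4),nsamp=2)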
Setting se=TRUE performs standard error estimation. The number of bootstrap samples (sampling subjects with replacement) is determined by B. For each bootstrap sample a random forest with R trees is built (defaulting to R=10). The bias induced by using smaller bootstrap ensemble sizes is corrected for in the estimate. Using se=TRUE will influence summaries from the fitted model, such as providing approximate confidence intervals for partial dependence plots (when running partdep_hrf), and giving standard errors for predictions when predict_hrf is used.
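A minimal sketch of standard error estimation (again using the hypothetical dat from above):

# 50 bootstrap samples of subjects, each fit with a 10-tree forest
ff=hrf(x=dat,id=dat$id,time=dat$time,yindx="y",se=TRUE,B=50,R=10)
p=predict_hrf(ff)   # predictions with standard errors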
All arguments (except those describing the data: x, yindx, time and id) can also be set in the control list. If an argument is supplied both directly and in control, the value in control is used; so if ntrees=300 and control=list(ntrees=500), then 500 trees are fit (see the sketch following this list). Besides the arguments described above, a number of other parameters can be set in control:

nodesize: the minimum number of training observations in a terminal node.
sample_fraction: the fraction of data sampled to train each tree.
dtry: the number of sampled delta values used when splitting on a historical variable (note this is alternatively controlled by the argument nsamp above).
ndelta: the number of delta values to use if delta is not supplied; these are taken as quantiles of the distribution of observed delta values in the data.
qtry: the number of sampled values of η_2 for method="freq"/"frac", and of η_3 for method="freqw"/"fracw".
quantiles: a vector of probabilities, used when method="frac" or method="fracw", i.e. when covariate histories are split on their windowed empirical distributions; splitting is restricted to these quantiles.
Value:

Returns a list, the most important elements being: h, a list returned by parLapply from fitting the trees in parallel; error, the out-of-bag (OOB) error (mean squared error for regression, misclassification rate for classification); control, a list containing the control arguments; boot, the bootstrapped model estimates (if se=TRUE); and x, id, time, the training data.
Author(s):

Joe Sexton (joesexton0@gmail.com)
References:

L. Breiman (2001). “Random Forests.” Machine Learning 45(1): 5-32.

Zhang and Singer (2010). Recursive Partitioning and Applications. Springer.

Sexton and Laake (2009). “Standard errors for bagged and random forest estimators.” Computational Statistics and Data Analysis.
See Also:

predict_hrf, partdep_hrf, varimp_hrf.
Examples:

## Not run:
# ----------------------------------------------------------------------------------------- ##
# Mother's stress on child illness:
# Investigate whether mother's stress is (Granger) causal for child illness.
# The 'hrf' model is fit using previous observations of mother's stress to predict
# child's illness at a given time point, but not mother's stress at that time point.
#
# Predictor variables are classified into "historical" and "concurrent".
#
# A predictor is "historical" if its prior realizations can be used to predict
# the outcome.
#
# A predictor is "concurrent" if its realization at the same time point as the outcome
# can be used to predict the outcome at that time point.
#
# A predictor can be both "concurrent" and "historical"; the status of the predictors
# can be set by the 'vh' and 'vc' arguments of 'hrf'
# (if not set, these are determined automatically).
#
# ------------------------------------------------------------------------------------------- ##
data(mscm)
mscm=as.data.frame(na.omit(mscm))
# -- set concurrent and historical predictors
historical_predictors=match(c("stress","illness"),names(mscm))
concurrent_predictors=which(names(mscm)!="stress")
control=list(vh=historical_predictors,vc=concurrent_predictors)
# -- fit historical random forest
# (NOTE: response is 0/1 so a regression tree is
# the same as a classification tree with Gini-index splitting)
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
# out-of-bag error
plot(1:length(ff$error),ff$error,type="l",main="OOB error",xlab="forest size",ylab="mse")
# .. larger nodesize works slightly better
control$nodesize=20
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
points(1:length(ff$error),ff$error,type="l",col="blue")
# -- variable importance table
vi=varimp_hrf(ff)
vi
# -- fit historical random forest with 'se=TRUE'
control$se=TRUE
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
# -- partial dependence for top 4 predictors (with +/-2 SE estimates)
par(mfrow=c(2,2))
for(k in 1:4)
pd=partdep_hrf(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
## -- Classification trees
## setting classify=TRUE builds classification tree (gini-impurity node splitting)
control$classify=TRUE
## ... standard error estimation not implemented .. turn off bootstrapping
control$se=FALSE
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
# -- plot oob classification error
plot(1:length(ff$error),ff$error,type="l",xlab="forest size",ylab="oob classification error")
abline(mean(mscm$illness),0,lty=2) ## error of constant model
p=predict_hrf(ff)
## variable importance table (model error measured by gini-impurity)
vi=varimp_hrf(ff)
vi
# -------------------------------- #
# Data w/irregular observation times
# ------------------------------- #
data(cd4)
control=list(se=TRUE)
ff=hrf(x=cd4,id=cd4$id,time=cd4$time,yindx="count",control=control)
vi=varimp_hrf(ff)
# -- partial dependence for top 4 predictors (with +/-2 SE estimates)
par(mfrow=c(2,2))
for(k in 1:4)
pd=partdep_hrf(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
plot(1:length(ff$error),ff$error,xlab="forest size",ylab="oob mse",type="l")
## by default, the number of delta values (parameter 'eta_1' above) is 20
## can set this using 'ndelta'
control$ndelta=50
control$se=FALSE # -- turning off bootstrapping ..
ff=hrf(x=cd4,id=cd4$id,time=cd4$time,yindx="count",control=control)
points(1:length(ff$error),ff$error,type="l",lty=2)
# the grid of delta values
ff$control$delta
# --------------------------------------- ##
# Boston Housing data (not longitudinal)
# --------------------------------------- ##
# library(htree)
library(mlbench)
library(randomForest)
data(BostonHousing)
dat=as.data.frame(na.omit(BostonHousing))
## omitting arguments time/id assumes rows are iid
control=list(ntrees=500,sample_fraction=.5,nodesize=1)
h=hrf(x=dat,yindx="medv",control=control)
## randomForest comparison
## (by default, randomForest samples with replacement, while hrf samples without)
r=randomForest(medv~.,data=dat,replace=FALSE,sampsize=ceiling(.5*nrow(dat)),nodesize=1)
## plot oob-error for both
plot(1:length(r$mse),r$mse,type="l",ylim=c(min(r$mse,h$error),max(r$mse,h$error)),
main="BostonHousing",xlab="forest size",ylab="out-of-bag mse")
points(1:length(h$error),h$error,type="l",col="blue")
## -- variable importance table
vi=varimp_hrf(h)
vi
## -- partial dependence plots with approximate 95% confidence intervals
control$se=TRUE
h=hrf(x=dat,yindx="medv",control=control)
par(mfrow=c(2,2))
for(k in 1:4)
pd=partdep_hrf(h,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
## End(Not run)