| boostmtree | R Documentation |
Fit multivariate boosted tree models for repeated-measures outcomes. The method combines gradient boosting, subject-level tree partitioning, and smooth time effects to estimate subject-specific mean trajectories or class probabilities over time. Supports continuous, binary, nominal, and ordinal longitudinal data, as well as univariate responses.
boostmtree(
x,
tm = NULL,
id = NULL,
y,
family = c("continuous", "binary", "nominal", "ordinal"),
y.reference = NULL,
M = 200,
nu = 0.05,
na.action = c("na.omit", "na.impute")[2],
k = 5,
mtry = NULL,
n.knots = 10,
d = 3,
pen.ord = 3,
lambda = NULL,
rho = NULL,
lambda.max = 1e6,
lambda.iter = 2,
svd.tol = 1e-6,
verbose = TRUE,
cv.flag = FALSE,
eps = 1e-5,
mod.grad = TRUE,
nr.iter = 3,
control = boostmtree.control()
)
x |
A data frame or matrix of covariates. For longitudinal fitting,
rows must be repeated so that the covariate row for subject |
tm |
A vector of observation times with one entry per row of |
id |
A subject identifier with one entry per row of |
y |
Observed responses with one entry per row of |
family |
Response type. Choose one of |
y.reference |
Reference level used when |
M |
Number of boosting iterations. Larger values allow a more flexible
fit, while |
nu |
Boosting step size. Supply either a scalar or a length-two vector.
The first element is used for the intercept part of the time basis and the
second for the remaining time-basis coefficients. All values must lie in
|
na.action |
How to handle missing values in the covariates |
k |
Requested number of terminal nodes in the tree base learner. |
mtry |
Number of candidate covariates sampled at each split. If
|
n.knots |
Number of internal knots used for the B-spline time basis. |
d |
Degree of the B-spline basis. Values smaller than 1 remove the smooth time component and leave an intercept-only time design. |
pen.ord |
Differencing order used in the P-spline penalty. Larger values encourage smoother time profiles. |
lambda |
Smoothing parameter for the time-basis penalty. If
|
rho |
Working within-subject correlation parameter. If |
lambda.max |
Upper bound used during adaptive estimation of
|
lambda.iter |
Number of updates used when |
svd.tol |
Tolerance used in the singular-value decomposition of the penalty matrix. |
verbose |
Logical; should progress output be printed? |
cv.flag |
Logical; should out-of-bag cross-validation be tracked to
select an optimal stopping iteration? When |
eps |
Tolerance used when selecting the stopping iteration from the cross-validation error path. |
mod.grad |
Logical; controls the pseudo-response used to grow
each tree base learner. During tree growing, the working gradient
is first computed for each subject and then passed to |
nr.iter |
Number of Newton–Raphson updates used inside the terminal node coefficient solver for non-continuous families. |
control |
A control object created by |
boostmtree() is designed for longitudinal data in which subject
i is measured at times t_{ij} and is described by a baseline
covariate vector x_i. The goal is to estimate how the conditional mean
response changes over time and across subjects with different covariate
profiles.
At a high level, the method combines three ideas.
First, it uses gradient boosting. Starting from an initial fit, the
algorithm repeatedly adds small updates in the direction that improves the fit.
If \eta_i(t) denotes the current linear predictor, then boosting updates
it according to
\eta_i^{(m)}(t) = \eta_i^{(m-1)}(t) + \nu b_m(x_i, t),
where b_m is the base learner fitted at iteration m and
\nu is the step size.
Second, the base learner is a multivariate tree. The tree partitions subjects into terminal nodes based on their covariates. Within each terminal node, the time profile is represented by coefficients of a smooth basis in time. In this way the model can capture nonlinear covariate effects, interactions, and different trajectory shapes for different groups of subjects.
Third, the time effect is modeled by a penalized B-spline. When
d >= 1, the observed times are expanded into a spline basis and the
terminal-node coefficients for that basis are penalized to avoid unstable or
overly wiggly trajectories. The smoothing parameter lambda controls the
strength of that penalty. If lambda is not supplied, the function can
estimate it adaptively.
The fitted mean is linked to the response scale through
g\{\mu_i(t)\} = \eta_i(t),
where the link g depends on the family:
For "continuous", the identity link is used and the method
models the mean trajectory directly.
For "binary", the logit link is used so that the fitted values
represent estimated probabilities for the non-reference class.
For "nominal", the response is decomposed into one-vs-reference
logit submodels, one for each non-reference class. The final class
probabilities are then reconstructed from those submodels.
For "ordinal", the response is modeled through cumulative logit
submodels. The final class probabilities are recovered from the fitted
cumulative probabilities, with a monotonicity correction applied so that the
fitted cumulative curves remain ordered.
For longitudinal data, boostmtree uses a working covariance model to
describe the dependence among repeated measurements from the same subject after
the mean trajectory has been fitted. For subject i with n_i
observations, and for response component q, the working covariance has
compound-symmetry form
V_{iq} = \phi_q \{ (1 - \rho_q) I_{n_i} + \rho_q J_{n_i} \},
where I_{n_i} is the n_i \times n_i identity matrix and
J_{n_i} is the n_i \times n_i matrix of ones.
The roles of phi and rho are different:
phi is a scale parameter. It controls the overall size of the
residual variation. Larger values of phi mean that, after accounting
for the fitted mean structure, the repeated measurements still show greater
variability.
rho is a within-subject correlation parameter. It controls how
strongly repeated measurements from the same subject move together after
accounting for the fitted mean structure. When rho = 0, the working
residuals are uncorrelated within subject. Positive values indicate positive
within-subject association.
Thus, phi answers the question "how much variation remains?", whereas
rho answers the question "how is that remaining variation shared across
measurements from the same subject?".
These parameters should be distinguished from lambda. The parameter
lambda penalizes roughness in the time-basis coefficients, whereas
phi and rho describe the residual covariance structure.
For continuous families there is a single pair (\phi, \rho). For the
binary, nominal, and ordinal families, the method is built from one or more
working submodels, so the fitted object may contain a separate path
(\phi_q, \rho_q) for each response component q.
Out-of-bag tracking via cv.flag = TRUE records a standardized error path
and selects an optimal stopping iteration m.opt. This package uses
out-of-bag subjects in two places: for the cross-validation path stored
when cv.flag = TRUE, and for grow-object variable importance returned by
vimp.boostmtree(). The default control object uses
bootstrap = "by.root", which naturally creates out-of-bag subjects.
By contrast, bootstrap = "none" produces a deterministic in-bag fit
with no out-of-bag data. That can still be useful for ordinary fitting and for
prediction-object analyses based on a separate test set, but it is not suitable
for out-of-bag CV or grow-object variable importance.
Note that when cv.flag = TRUE, boostmtree() checks the requested
resampling rule before fitting. If the requested control object would
yield no out-of-bag subjects, the function either adjusts the resampling
rule to the default out-of-bag scheme or stops with a clear message when
a user-supplied sample matrix is incompatible with OOB CV.
When tm and id are omitted, the function switches to univariate
mode. In that case there is no longitudinal time basis, and the method acts as
a boosted tree model for a single response per row.
An object of class c("boostmtree", "grow", ...) with components:
Fitted tree base learners, indexed by boosted subproblem and boosting iteration.
The normalized control object used for fitting.
Logical; whether cross-validation tracking was requested.
Degree of the time basis used in the fitted model.
Standardized cross-validation error summaries. For single-response fits this is
typically a matrix with columns "l1" and "l2"; for multi-subproblem
fits it is stored by subproblem. NULL when cv.flag = FALSE.
The fitted response family.
Terminal-node coefficient summaries by subproblem and boosting iteration.
Cross-validated terminal-node coefficient summaries when
cv.flag = TRUE; otherwise NULL.
The sorted long-format subject identifier.
Unique subject identifiers in subject order.
Requested number of terminal nodes.
Path of smoothing-parameter values across boosting iterations and
subproblems, or NULL in univariate mode.
Number of boosting iterations.
Selected stopping iteration for each subproblem when
cv.flag = TRUE; otherwise NULL.
Terminal-node memberships by subproblem and boosting iteration.
Fitted mean trajectories. For continuous and binary families this is a subject-level list of fitted trajectories. For nominal and ordinal families this is indexed first by boosted subproblem and then by subject.
Number of subjects.
Number of boosted subproblems.
Missing-data rule applied to x.
Number of observations per subject.
Number of trees requested through control. The current implementation
requires ntree = 1.
Boosting step-size vector used internally.
Logical flag indicating whether the fitted resampling path produced out-of-bag subjects at each boosting iteration.
Number of out-of-bag subjects at each boosting iteration. For single-response
fits this is typically a vector of length M; for multi-subproblem fits
it may be stored by subproblem.
Differencing order used in the P-spline penalty.
Estimated path of the residual scale parameter. In the working covariance
model, phi determines the marginal variance of repeated measurements
after accounting for the fitted mean structure.
Class probabilities by subject for non-continuous families;
NULL for the continuous family.
Threshold levels (ordinal) or non-reference levels (binary or nominal) used to define the boosted subproblems.
Total number of observed response levels for non-continuous families.
Estimated path of the within-subject correlation parameter. In the working
covariance model, rho determines how strongly repeated measurements
from the same subject are associated after accounting for the fitted mean
structure. Under compound symmetry, the off-diagonal covariance is
\rho \phi.
Standardized root mean squared error evaluated at m.opt,
or NULL when cv.flag = FALSE.
A list of time vectors, one per subject.
Subject-specific time-design matrices.
Sorted unique observation times used to build the time basis.
Logical; whether the model was fit in univariate mode.
Subject-level covariate data with one row per subject.
Time-basis design matrix defined on time.unique.
Covariate names.
Observed responses split by subject.
Observed response levels for non-continuous families.
NA for the continuous family.
Overall response mean used for standardization.
Observed responses represented at the boosted-subproblem level. For non-continuous families this is the encoded binary or cumulative response used by the corresponding submodel.
Reference response level for the nominal family.
NULL otherwise.
Overall response standard deviation used for standardization.
Amol Pande, Udaya B. Kogalur and Hemant Ishwaran
Friedman J.H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.
Friedman J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378.
Pande A., Li L., Rajeswaran J., Ehrlinger J., Kogalur U.B., Blackstone E.H., Ishwaran H. (2017). Boosted multivariate trees for longitudinal data. Machine Learning, 106(2):277–305.
Pande A., Ishwaran H., Blackstone E.H., Rajeswaran J., and Gillanov M. (2022). Application of gradient boosting in evaluating surgical ablation for atrial fibrillation. SN Computer Science, 3:466.
Pande A., Ishwaran H., and Blackstone E.H. (2022). Boosting for multivariate longitudinal responses. SN Computer Science, 3:186.
Pande A. (2017). Boosting for longitudinal data. Ph.D. dissertation, Miller School of Medicine, University of Miami.
AF,
boostmtree.control,
marginal.plot.boostmtree,
partial.plot.boostmtree,
plot.boostmtree,
plot.vimp.boostmtree,
predict.boostmtree,
print.boostmtree,
simLong,
spirometry,
vimp.boostmtree
## -------------------------------------------------------------
## Simulated continuous longitudinal response.
## Use fixed lambda and rho for a fast, example run.
## -------------------------------------------------------------
set.seed(7)
sim.obj <- simLong(n = 20, n.time = 4, rho = 0.5, model = 1,
family = "continuous")
dta <- sim.obj$data.list
fit <- boostmtree(
x = dta$features,
tm = dta$time,
id = dta$id,
y = dta$y,
family = "continuous",
M = 10,
lambda = 0,
rho = 0,
verbose = FALSE,
control = boostmtree.control(seed = 7)
)
print(fit)
plot.data <- plot(fit, output = "data")
str(plot.data[[1]], max.level = 1)
## -------------------------------------------------------------
## Univariate regression: omit tm and id.
## -------------------------------------------------------------
x.uni <- mtcars[, setdiff(names(mtcars), "mpg")]
y.uni <- mtcars$mpg
fit.uni <- boostmtree(
x = x.uni,
y = y.uni,
family = "continuous",
M = 10,
lambda = 0,
verbose = FALSE
)
print(fit.uni)
## -------------------------------------------------------------
## Simulated binary longitudinal response.
## This example uses a fast, deterministic settings.
## -------------------------------------------------------------
set.seed(17)
sim.bin <- simLong(n = 20, n.time = 4, rho = 0.5, model = 2,
family = "binary")
dta.bin <- sim.bin$data.list
fit.bin <- boostmtree(
x = dta.bin$features,
tm = dta.bin$time,
id = dta.bin$id,
y = dta.bin$y,
family = "binary",
M = 10,
lambda = 0,
rho = 0,
verbose = FALSE
)
print(fit.bin)
plot(fit.bin)
## -------------------------------------------------------------
## AF data illustration. Parameters selected for a fast run.
## -------------------------------------------------------------
data(AF, package = "boostmtree")
fit.af <- boostmtree(
x = AF$features,
tm = AF$time,
id = AF$id,
y = AF$y,
family = "binary",
M = 10,
k = 15,
nu = .025,
n.knots = 15,
verbose = TRUE
)
print(fit.af)
## -------------------------------------------------------------
## Ordinal response illustration.
## A latent continuous response is grouped into three ordered levels.
## -------------------------------------------------------------
set.seed(31)
sim.ord <- simLong(n = 25, n.time = 4, family = "continuous")
dta.ord <- sim.ord$data.list
rank.y <- rank(dta.ord$y, ties.method = "first")
brks <- c(0, floor(length(rank.y) / 3), floor(2 * length(rank.y) / 3), length(rank.y))
y.ord <- cut(
rank.y,
breaks = brks,
labels = c(0, 1, 2),
include.lowest = TRUE,
ordered_result = TRUE
)
fit.ord <- boostmtree(
x = dta.ord$features,
tm = dta.ord$time,
id = dta.ord$id,
y = y.ord,
family = "ordinal",
M = 50,
verbose = FALSE
)
print(fit.ord)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.