boostmtree: Boosted multivariate trees for longitudinal data

View source: R/boostmtree.R

boostmtreeR Documentation

Boosted multivariate trees for longitudinal data

Description

Fit multivariate boosted tree models for repeated-measures outcomes. The method combines gradient boosting, subject-level tree partitioning, and smooth time effects to estimate subject-specific mean trajectories or class probabilities over time. Supports continuous, binary, nominal, and ordinal longitudinal data, as well as univariate responses.

Usage

boostmtree(
  x,
  tm = NULL,
  id = NULL,
  y,
  family = c("continuous", "binary", "nominal", "ordinal"),
  y.reference = NULL,
  M = 200,
  nu = 0.05,
  na.action = c("na.omit", "na.impute")[2],
  k = 5,
  mtry = NULL,
  n.knots = 10,
  d = 3,
  pen.ord = 3,
  lambda = NULL,
  rho = NULL,
  lambda.max = 1e6,
  lambda.iter = 2,
  svd.tol = 1e-6,
  verbose = TRUE,
  cv.flag = FALSE,
  eps = 1e-5,
  mod.grad = TRUE,
  nr.iter = 3,
  control = boostmtree.control()
)

Arguments

x

A data frame or matrix of covariates. For longitudinal fitting, rows must be repeated so that the covariate row for subject i appears once for each observation time recorded for that subject.

tm

A vector of observation times with one entry per row of x. If tm and id are both omitted, the function switches to univariate mode and treats each row of x as a single observation.

id

A subject identifier with one entry per row of x. Must be supplied whenever tm is supplied.

y

Observed responses with one entry per row of x.

family

Response type. Choose one of "continuous", "binary", "nominal", or "ordinal".

y.reference

Reference level used when family = "nominal". If NULL, the first observed response level is used. Ignored for the other families.

M

Number of boosting iterations. Larger values allow a more flexible fit, while cv.flag = TRUE can be used to select a stopping iteration automatically.

nu

Boosting step size. Supply either a scalar or a length-two vector. The first element is used for the intercept part of the time basis and the second for the remaining time-basis coefficients. All values must lie in (0, 1]. Smaller values generally require larger M.

na.action

How to handle missing values in the covariates x. Use "na.omit" to remove subjects with incomplete covariates, or "na.impute" to keep subjects unless all covariates are missing. Missing values in tm, id, or y are not allowed.

k

Requested number of terminal nodes in the tree base learner.

mtry

Number of candidate covariates sampled at each split. If NULL, all covariates are considered.

n.knots

Number of internal knots used for the B-spline time basis.

d

Degree of the B-spline basis. Values smaller than 1 remove the smooth time component and leave an intercept-only time design.

pen.ord

Differencing order used in the P-spline penalty. Larger values encourage smoother time profiles.

lambda

Smoothing parameter for the time-basis penalty. If NULL, the function estimates lambda adaptively when the time basis has sufficient degrees of freedom. A scalar is recycled across subproblems; otherwise the supplied vector must have length n.q.

rho

Working within-subject correlation parameter. If NULL, it is updated during fitting. A scalar is recycled across subproblems; otherwise the supplied vector must have length n.q.

lambda.max

Upper bound used during adaptive estimation of lambda.

lambda.iter

Number of updates used when lambda is estimated adaptively.

svd.tol

Tolerance used in the singular-value decomposition of the penalty matrix.

verbose

Logical; should progress output be printed?

cv.flag

Logical; should out-of-bag cross-validation be tracked to select an optimal stopping iteration? When cv.flag = TRUE, the fit must have out-of-bag subjects at each boosting iteration. The default control setting control = boostmtree.control() already provides this through bootstrap = "by.root". If a no-OOB scheme such as bootstrap = "none" is requested, boostmtree() switches back to "by.root" with a warning.

eps

Tolerance used when selecting the stopping iteration from the cross-validation error path.

mod.grad

Logical; controls the pseudo-response used to grow each tree base learner. During tree growing, the working gradient is first computed for each subject and then passed to rfsrc() as the multivariate pseudo-response used to determine tree splits. If mod.grad = FALSE, the inverse of the current working covariance matrix is used to compute the gradient, otherwise the inverse covariance factor is omitted. Operationally, mod.grad changes how candidate splits are evaluated, but it does not remove covariance weighting from the later terminal-node coefficient update. After the tree structure has been chosen, the node-specific coefficient estimates are still obtained from score and Hessian updates that use the inverse working covariance matrix.

nr.iter

Number of Newton–Raphson updates used inside the terminal node coefficient solver for non-continuous families.

control

A control object created by boostmtree.control().

Details

boostmtree() is designed for longitudinal data in which subject i is measured at times t_{ij} and is described by a baseline covariate vector x_i. The goal is to estimate how the conditional mean response changes over time and across subjects with different covariate profiles.

At a high level, the method combines three ideas.

First, it uses gradient boosting. Starting from an initial fit, the algorithm repeatedly adds small updates in the direction that improves the fit. If \eta_i(t) denotes the current linear predictor, then boosting updates it according to

\eta_i^{(m)}(t) = \eta_i^{(m-1)}(t) + \nu b_m(x_i, t),

where b_m is the base learner fitted at iteration m and \nu is the step size.

Second, the base learner is a multivariate tree. The tree partitions subjects into terminal nodes based on their covariates. Within each terminal node, the time profile is represented by coefficients of a smooth basis in time. In this way the model can capture nonlinear covariate effects, interactions, and different trajectory shapes for different groups of subjects.

Third, the time effect is modeled by a penalized B-spline. When d >= 1, the observed times are expanded into a spline basis and the terminal-node coefficients for that basis are penalized to avoid unstable or overly wiggly trajectories. The smoothing parameter lambda controls the strength of that penalty. If lambda is not supplied, the function can estimate it adaptively.

The fitted mean is linked to the response scale through

g\{\mu_i(t)\} = \eta_i(t),

where the link g depends on the family:

  • For "continuous", the identity link is used and the method models the mean trajectory directly.

  • For "binary", the logit link is used so that the fitted values represent estimated probabilities for the non-reference class.

  • For "nominal", the response is decomposed into one-vs-reference logit submodels, one for each non-reference class. The final class probabilities are then reconstructed from those submodels.

  • For "ordinal", the response is modeled through cumulative logit submodels. The final class probabilities are recovered from the fitted cumulative probabilities, with a monotonicity correction applied so that the fitted cumulative curves remain ordered.

For longitudinal data, boostmtree uses a working covariance model to describe the dependence among repeated measurements from the same subject after the mean trajectory has been fitted. For subject i with n_i observations, and for response component q, the working covariance has compound-symmetry form

V_{iq} = \phi_q \{ (1 - \rho_q) I_{n_i} + \rho_q J_{n_i} \},

where I_{n_i} is the n_i \times n_i identity matrix and J_{n_i} is the n_i \times n_i matrix of ones.

The roles of phi and rho are different:

  • phi is a scale parameter. It controls the overall size of the residual variation. Larger values of phi mean that, after accounting for the fitted mean structure, the repeated measurements still show greater variability.

  • rho is a within-subject correlation parameter. It controls how strongly repeated measurements from the same subject move together after accounting for the fitted mean structure. When rho = 0, the working residuals are uncorrelated within subject. Positive values indicate positive within-subject association.

Thus, phi answers the question "how much variation remains?", whereas rho answers the question "how is that remaining variation shared across measurements from the same subject?".

These parameters should be distinguished from lambda. The parameter lambda penalizes roughness in the time-basis coefficients, whereas phi and rho describe the residual covariance structure.

For continuous families there is a single pair (\phi, \rho). For the binary, nominal, and ordinal families, the method is built from one or more working submodels, so the fitted object may contain a separate path (\phi_q, \rho_q) for each response component q.

Out-of-bag tracking via cv.flag = TRUE records a standardized error path and selects an optimal stopping iteration m.opt. This package uses out-of-bag subjects in two places: for the cross-validation path stored when cv.flag = TRUE, and for grow-object variable importance returned by vimp.boostmtree(). The default control object uses bootstrap = "by.root", which naturally creates out-of-bag subjects. By contrast, bootstrap = "none" produces a deterministic in-bag fit with no out-of-bag data. That can still be useful for ordinary fitting and for prediction-object analyses based on a separate test set, but it is not suitable for out-of-bag CV or grow-object variable importance.

Note that when cv.flag = TRUE, boostmtree() checks the requested resampling rule before fitting. If the requested control object would yield no out-of-bag subjects, the function either adjusts the resampling rule to the default out-of-bag scheme or stops with a clear message when a user-supplied sample matrix is incompatible with OOB CV.

When tm and id are omitted, the function switches to univariate mode. In that case there is no longitudinal time basis, and the method acts as a boosted tree model for a single response per row.

Value

An object of class c("boostmtree", "grow", ...) with components:

base.learner

Fitted tree base learners, indexed by boosted subproblem and boosting iteration.

control

The normalized control object used for fitting.

cv.flag

Logical; whether cross-validation tracking was requested.

d

Degree of the time basis used in the fitted model.

err.rate

Standardized cross-validation error summaries. For single-response fits this is typically a matrix with columns "l1" and "l2"; for multi-subproblem fits it is stored by subproblem. NULL when cv.flag = FALSE.

family

The fitted response family.

gamma

Terminal-node coefficient summaries by subproblem and boosting iteration.

gamma.i.list

Cross-validated terminal-node coefficient summaries when cv.flag = TRUE; otherwise NULL.

id

The sorted long-format subject identifier.

id.unique

Unique subject identifiers in subject order.

k

Requested number of terminal nodes.

lambda

Path of smoothing-parameter values across boosting iterations and subproblems, or NULL in univariate mode.

M

Number of boosting iterations.

m.opt

Selected stopping iteration for each subproblem when cv.flag = TRUE; otherwise NULL.

membership

Terminal-node memberships by subproblem and boosting iteration.

mu

Fitted mean trajectories. For continuous and binary families this is a subject-level list of fitted trajectories. For nominal and ordinal families this is indexed first by boosted subproblem and then by subject.

n

Number of subjects.

n.q

Number of boosted subproblems.

na.action

Missing-data rule applied to x.

ni

Number of observations per subject.

ntree

Number of trees requested through control. The current implementation requires ntree = 1.

nu

Boosting step-size vector used internally.

oob.available

Logical flag indicating whether the fitted resampling path produced out-of-bag subjects at each boosting iteration.

oob.subject.count

Number of out-of-bag subjects at each boosting iteration. For single-response fits this is typically a vector of length M; for multi-subproblem fits it may be stored by subproblem.

pen.ord

Differencing order used in the P-spline penalty.

phi

Estimated path of the residual scale parameter. In the working covariance model, phi determines the marginal variance of repeated measurements after accounting for the fitted mean structure.

prob.class

Class probabilities by subject for non-continuous families; NULL for the continuous family.

q.set

Threshold levels (ordinal) or non-reference levels (binary or nominal) used to define the boosted subproblems.

q.total

Total number of observed response levels for non-continuous families.

rho

Estimated path of the within-subject correlation parameter. In the working covariance model, rho determines how strongly repeated measurements from the same subject are associated after accounting for the fitted mean structure. Under compound symmetry, the off-diagonal covariance is \rho \phi.

rmse

Standardized root mean squared error evaluated at m.opt, or NULL when cv.flag = FALSE.

time

A list of time vectors, one per subject.

time.design

Subject-specific time-design matrices.

time.unique

Sorted unique observation times used to build the time basis.

univariate

Logical; whether the model was fit in univariate mode.

x

Subject-level covariate data with one row per subject.

x.tm

Time-basis design matrix defined on time.unique.

x.var.names

Covariate names.

y

Observed responses split by subject.

y.levels

Observed response levels for non-continuous families. NA for the continuous family.

y.mean

Overall response mean used for standardization.

y.org

Observed responses represented at the boosted-subproblem level. For non-continuous families this is the encoded binary or cumulative response used by the corresponding submodel.

y.reference

Reference response level for the nominal family. NULL otherwise.

y.sd

Overall response standard deviation used for standardization.

Author(s)

Amol Pande, Udaya B. Kogalur and Hemant Ishwaran

References

Friedman J.H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Friedman J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378.

Pande A., Li L., Rajeswaran J., Ehrlinger J., Kogalur U.B., Blackstone E.H., Ishwaran H. (2017). Boosted multivariate trees for longitudinal data. Machine Learning, 106(2):277–305.

Pande A., Ishwaran H., Blackstone E.H., Rajeswaran J., and Gillanov M. (2022). Application of gradient boosting in evaluating surgical ablation for atrial fibrillation. SN Computer Science, 3:466.

Pande A., Ishwaran H., and Blackstone E.H. (2022). Boosting for multivariate longitudinal responses. SN Computer Science, 3:186.

Pande A. (2017). Boosting for longitudinal data. Ph.D. dissertation, Miller School of Medicine, University of Miami.

See Also

AF, boostmtree.control, marginal.plot.boostmtree, partial.plot.boostmtree, plot.boostmtree, plot.vimp.boostmtree, predict.boostmtree, print.boostmtree, simLong, spirometry, vimp.boostmtree

Examples

## -------------------------------------------------------------
## Simulated continuous longitudinal response.
## Use fixed lambda and rho for a fast, example run.
## -------------------------------------------------------------

set.seed(7)
sim.obj <- simLong(n = 20, n.time = 4, rho = 0.5, model = 1,
                   family = "continuous")
dta <- sim.obj$data.list

fit <- boostmtree(
  x = dta$features,
  tm = dta$time,
  id = dta$id,
  y = dta$y,
  family = "continuous",
  M = 10,
  lambda = 0,
  rho = 0,
  verbose = FALSE,
  control = boostmtree.control(seed = 7)
)

print(fit)
plot.data <- plot(fit, output = "data")
str(plot.data[[1]], max.level = 1)

## -------------------------------------------------------------
## Univariate regression: omit tm and id.
## -------------------------------------------------------------

x.uni <- mtcars[, setdiff(names(mtcars), "mpg")]
y.uni <- mtcars$mpg

fit.uni <- boostmtree(
  x = x.uni,
  y = y.uni,
  family = "continuous",
  M = 10,
  lambda = 0,
  verbose = FALSE
)

print(fit.uni)


## -------------------------------------------------------------
## Simulated binary longitudinal response.
## This example uses a fast, deterministic settings.
## -------------------------------------------------------------

set.seed(17)
sim.bin <- simLong(n = 20, n.time = 4, rho = 0.5, model = 2,
                   family = "binary")
dta.bin <- sim.bin$data.list

fit.bin <- boostmtree(
  x = dta.bin$features,
  tm = dta.bin$time,
  id = dta.bin$id,
  y = dta.bin$y,
  family = "binary",
  M = 10,
  lambda = 0,
  rho = 0,
  verbose = FALSE
)

print(fit.bin)
plot(fit.bin)

## -------------------------------------------------------------
## AF data illustration.  Parameters selected for a fast run.
## -------------------------------------------------------------

data(AF, package = "boostmtree")

fit.af <- boostmtree(
  x = AF$features,
  tm = AF$time,
  id = AF$id,
  y = AF$y,
  family = "binary",
  M = 10,
  k = 15,
  nu = .025,
  n.knots = 15,
  verbose = TRUE
)


print(fit.af)


## -------------------------------------------------------------
## Ordinal response illustration.
## A latent continuous response is grouped into three ordered levels.
## -------------------------------------------------------------

set.seed(31)
sim.ord <- simLong(n = 25, n.time = 4, family = "continuous")
dta.ord <- sim.ord$data.list
rank.y <- rank(dta.ord$y, ties.method = "first")
brks <- c(0, floor(length(rank.y) / 3), floor(2 * length(rank.y) / 3), length(rank.y))
y.ord <- cut(
  rank.y,
  breaks = brks,
  labels = c(0, 1, 2),
  include.lowest = TRUE,
  ordered_result = TRUE
)

fit.ord <- boostmtree(
  x = dta.ord$features,
  tm = dta.ord$time,
  id = dta.ord$id,
  y = y.ord,
  family = "ordinal",
  M = 50,
  verbose = FALSE
)

print(fit.ord)



boostmtree documentation built on April 10, 2026, 9:10 a.m.