MST: Multivariate Survival Trees

Description Usage Arguments Details Value Warning Note Author(s) References See Also Examples

View source: R/MST.R

Description

Constructs trees for multivariate survival data using marginal and frailty models. A wrapper function that grows a large initial tree, prunes the tree, and selects the best sized tree.

Usage

1
2
3
4
5
6
7
MST(formula, data, test = NULL, weights_data, weights_test, subset,
  method = c("marginal", "gamma.frailty", "exp.frailty", "stratified", "independence"),
  minsplit = 20, minevents = 3, minbucket = round(minsplit/3), maxdepth = 10,
  mtry = NULL, distinct = TRUE, delta = 0.05, nCutPoints = 50,
  selection.method = c("test.sample", "bootstrap"),
  B = 30, LeBlanc = TRUE, min.boot.tree.size = 1,
  plot.Ga = TRUE, filename = NULL, horizontal = TRUE, details = FALSE, sortTrees = TRUE)

Arguments

formula

A linear survival model with the response on the left of a ~ operator and the predictors, separated by + operators, on the right. Cluster (or id) variable is distinguished by a vertical bar | (e.g. Surv(time,status) ~ x1 + x2 | id). Categorical predictors must be treated as a factor.

data

Data to grow and prune the tree

test

Test sample if available

weights_data

An optional vector of weights to grow the tree

weights_test

An optional vector of weights to select the best-sized tree

subset

An optional vector specifying a subset of observations to be used to grow the tree

method

Indicates method of handling correlation: must be either "marginal", "gamma.frailty", "exp.frailty", "stratified", or "independence"

minsplit

Number: Controls the minimum node size

minevents

Number: Controls the minimum number of uncensored event times

minbucket

Number: Controls the minimum number of observations in any terminal node

maxdepth

Number: Maximum depth of tree

mtry

Number of variables considered at each split. The default is to consider all variables

distinct

Logical: Indicates if all distinct cutpoints or only percentiles considered

delta

Consider cutpoints from delta to 1 - delta. Only used when distinct = TRUE

nCutPoints

Number of cutpoints (percentiles) considered. Only used when distinct = TRUE

selection.method

Indicates method of selecting the best-sized subtree: "test.sample" or "bootstrap"

B

Number of bootstrap samples. Only used if selection.method = "bootstrap"

LeBlanc

Logical: Indicates if entire sample used (alternative is out-of-bag sample). Only used if selection.method = "bootstrap"

min.boot.tree.size

Number: Minimum size of tree grown at each bootstrap

plot.Ga

Logical: Indicates if goodness-of-fit vs. tree size should be plotted

filename

Name of the file plotted

horizontal

Logical: Indicates if plot should be landscape

details

Logical: Indicates if detailed information on the construction should be printed

sortTrees

Logical: Indicates if trees should be sorted such that each split to the left has lower risk of failure

Details

Marginal and frailty models are the two main ways to analyze correlated failure times. Let X_{ij} represent the covariate vector for the jth member in the ith cluster.

The marginal model uses the Cox (1972) proportional hazards model:

λ_{ij}(t|X_{ij})=λ_{0}(t) \exp(β \cdot I(X_{ij} ≤q c))

where λ_{0}(t) is an unspecified baseline hazard function and I(\cdot) is the indicator function.

The gamma frailty model uses the proportional hazards model:

λ_{ij}(t|X_{ij}, w_{i})=λ_{0}(t) \exp(β \cdot I(X_{ij} ≤q c)) w_{i}

where λ_{0}(t) is an unspecified baseline hazard function, I(\cdot) is the indicator function, and w_{i} is the frailty term for the ith cluster.

The exponential frailty model uses the proportional hazards model:

λ_{ij}(t|X_{ij}, w_{i})=\exp(β_{0} + β_{1} \cdot I(X_{ij} ≤q c)) w_{i}

where I(\cdot) is the indicator function and w_{i} is the frailty term for the ith cluster.

For the marginal model, a robust logrank statistic is calculated for each covariate X and possible cutpoint c. The estimate of the score function and likelihood of β can be obtained assuming independence. However, the variance-covariance structure adjusts for the dependence using a sandwich-type estimator. The best split is the one with the largest robust logrank statistic.

For the frailty models, a score test statistic is calculated from the maximum integrated log likelihood for each covariate X and possible cutpoint c. The frailty term must follow some known positive distribution; one common choice is w_{i} \sim Γ(1/ν, 1/ν) where ν represents an unknown variance. Note, the exponential frailty model replaces the baseline hazard function with a constant, yielding different score test statistics and typically computationally faster splits. The best split is the one with the largest score test statistic.

Stratified model grows a tree by minimizing the within-strata variation. This method should be used with care because the tree will not split on variables with a fixed value within each stratum. The independence model ignores the dependence and uses the logrank statistic as the splitting rule.

For continuous variables with many distinct cutpoints, the number of cutpoints considered can be reduced to percentiles. Using percentiles increases efficiency at the expense of less accuracy.

Growing the initial tree is done by splitting nodes (as described above) reiteratively until the maximum depth of the tree is reached or a small number of observations remain at terminal node. However, as the final tree model can be any subtree of the initial tree, the number of subtrees can become massive. A goodness-of-fit with an added penalty for the number of internal nodes is used to prune the trees (i.e. reduce the number of subtrees considered). The best-sized tree is selected by the largest goodness-of-fit with the added penalty using either the test sample or bootstrap samples.

Value

tree0

The initial tree. Tree listed as constparty object

prunining.info

Trees pruned and considered in the best tree selection

best.tree.size

The best tree size based on the penalty used

best.tree.structure

The best tree structure based on the penalty used. Tree listed as constparty object

Note, the constparty object requires a constant fit from each terminal node. Thus, the predict and plot functions ignore the dependence, so users are recommended to fit their own model when making predictions (see example)

Warning

Error messages in the gamma frailty models sometimes occur when using the bootstrap method. Increasing minsplit may help fix these errors. The exponential frailty model can have problems for large, extremely unbalanced designs. Currently weights can only be applied to marginal and gamma frailty models.

Note

Code may take awhile to implement large datasets. To decrease computation time, user should use test sample (selection.method = "test.sample"). User can also split continuous variables based on percentiles (distinct = FALSE) at the expense of slightly less accuracy. Gamma frailty models are more computationally intensive

Author(s)

Xiaogang Su, Peter Calhoun, and Juanjuan Fan

References

Calhoun P., Su X., Nunn M., Fan J. (2018) Constructing Multivariate Survival Trees: The MST Package for R. Journal of Statistical Software, 83(12), 1–21.

Cox D.R. (1972) Regression models and life-tables (with discussion). Journal of the Royal Statistical Society Series B, 34(2), 187–220.

Fan J., Su X., Levine R., Nunn M., LeBlanc M. (2006) Trees for Correlated Survival Data by Goodness of Split, With Applications to Tooth Prognosis. Journal of American Statistical Association, 101(475), 959–967.

Fan J., Nunn M., Su X. (2009) Multivariate exponential survival trees and their application to tooth prognosis. Computational Statistics and Data Analysis, 53(4), 1110–1121.

Su X., Fan J. (2004) Multivariate Survival Trees: A Maximum Likelihood Approach Based on Frailty Models. Biometrics, 60(1), 93–99.

See Also

rpart

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
set.seed(186117)
data <- rmultime(N = 200, K = 4, beta = c(-1, 0.8, 0.8, 0, 0), cutoff = c(0.5, 0.3, 0, 0),
    model = "marginal.multivariate.exponential", rho = 0.65)$dat
test <- rmultime(N = 100, K = 4, beta = c(-1, 0.8, 0.8, 0, 0), cutoff = c(0.5, 0.3, 0, 0),
    model = "marginal.multivariate.exponential", rho = 0.65)$dat

#Construct Multivariate Survival Tree:
fit <- MST(formula = Surv(time, status) ~ x1 + x2 + x3 + x4 | id, data, test,
    method = "marginal", minsplit = 100, minevents = 20, selection.method = "test.sample")

(tree_final <- getTree(fit, 4))
plot(tree_final)

#Fit a model from the final tree
data$term_nodes <- as.factor(predict(tree_final, newdata = data, type = 'node'))
coxph(Surv(time, status) ~ term_nodes + cluster(id), data = data)

Example output

Loading required package: survival

Model formula:
Surv(time, status) ~ x1 + x2 + x3 + x4 + id

Fitted party:
[1] root
|   [2] x2 > 0.3
|   |   [3] x1 > 0.5: 1.694 (n = 208)
|   |   [4] x1 <= 0.5: 0.702 (n = 292)
|   [5] x2 <= 0.3
|   |   [6] x1 > 0.5: 0.977 (n = 112)
|   |   [7] x1 <= 0.5: 0.334 (n = 188)

Number of inner nodes:    3
Number of terminal nodes: 4
Call:
coxph(formula = Surv(time, status) ~ term_nodes + cluster(id), 
    data = data)

              coef exp(coef) se(coef) robust se     z        p
term_nodes4 1.0337    2.8113   0.1391    0.1934 5.346 9.01e-08
term_nodes6 0.9485    2.5819   0.1782    0.1736 5.465 4.64e-08
term_nodes7 1.6916    5.4281   0.1660    0.2234 7.573 3.65e-14

Likelihood ratio test=114.8  on 3 df, p=< 2.2e-16
n= 800, number of events= 410 

MST documentation built on April 14, 2020, 6:14 p.m.