```r
knitr::opts_chunk$set(echo = TRUE)
```
This vignette documents fitting multivariate adaptive regression splines using generalized estimating equations (MARGE) via the `marge` R-package. We also fit some additional models and compare them with MARGE. In the second example we compare the predictive squared error (PSE) on test data and report the computational time for each model. For specific details on the MARGE algorithm, see:

Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, pp. 245--253.

Also, see `?marge` for further details on the fitting function, its input arguments and output.
We first load the `marge`, `mgcv` and `BayesTree` R-packages.

```r
suppressMessages(library("marge"))
suppressMessages(library("mgcv"))
suppressMessages(library("BayesTree"))
```
Our first example uses the longitudinal tree growth multivariate data from Wood (2006). Fourteen trees ($N$ = 14) had their heights measured across six ages ($n$ = 6). The response variable is tree height and we use age as a predictor. We also take logarithms of the response variable and assume the distribution is Gaussian. Below, we fit a GAMM (a cubic mixed-effects model) and MARGE models with several different working correlation structures.
```r
N <- 14                           # Sample size (number of clusters).
n <- 6                            # Cluster size.
id <- factor(rep(1:N, each = n))  # The ID of each cluster.
Y0 <- matrix(log(Loblolly$height), ncol = n, byrow = TRUE)  # Response variable.
Y <- c(t(Y0))
X_pred0 <- matrix(log(Loblolly$age), ncol = n, byrow = TRUE)  # Predictor variable.
X_pred <- scale(c(t(X_pred0)))    # Standardize the predictor.
colnames(X_pred) <- c("age")
dat1 <- as.data.frame(cbind(X_pred, c(t(Y)), id))
colnames(dat1) <- c("age", "Y", "id")
dat1$age1 <- c(t(X_pred0))
dat1$id <- factor(dat1$id)
```
```r
family <- "gaussian"   # The selected "exponential" family for the GLM/GEE.
is.gee <- TRUE         # Is the model a GEE?
nb <- FALSE            # Is this a negative binomial model?
tols_score <- 0.00001  # Tolerance for stopping condition in forward pass for GEE.
M <- 21                # Max. no. of terms.
pen <- 2               # Penalty to be used in GCV.
minspan <- NULL        # A set minimum span value.
print.disp <- FALSE    # Print ALL the output?
```
```r
# GAMM/cubic mixed effect model.
dat2 <- groupedData(Y ~ age | id, data = dat1)
model_gamm <- lme(Y ~ age + I(age^2) + I(age^3), data = dat2,
                  correlation = corAR1(form = ~ age | id), random = ~ 1)

# MARGE.
corstr <- "independence"  # Independent working correlation structure.
model_marge_ind <- marge(X_pred, Y, N, n, id, family, corstr, pen,
                         tols_score, M, minspan, print.disp, nb, is.gee)

corstr <- "ar1"           # AR1 working correlation structure.
model_marge_ar1 <- marge(X_pred, Y, N, n, id, family, corstr, pen,
                         tols_score, M, minspan, print.disp, nb, is.gee)

corstr <- "exchangeable"  # Exchangeable working correlation structure.
model_marge_exch <- marge(X_pred, Y, N, n, id, family, corstr, pen,
                          tols_score, M, minspan, print.disp, nb, is.gee)
```
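To inspect the fits (a suggestion, not part of the original vignette): the mixed-model fit can be summarized directly, and the top-level components of a fitted `marge` object can be listed with `str()`; see `?marge` for what each component contains.

```r
summary(model_gamm)  # Fixed effects, AR(1) parameter and random-intercept variance.
str(model_marge_ind, max.level = 1)  # Top-level components of the MARGE fit.
```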
\clearpage
Our second example uses the `leptrine` data set, which is provided within the `marge` R-package and analyzed in Stoklosa and Warton (2018). These data are binary (presence-absence) data collected on plant species for 8,678 sites located in the Blue Mountains region, and contain nine environmental predictor variables collected at each site. Our objective is to predict the presence of the species *Leptospermum trinervium* using the environmental predictor variables. There are 1,751 absences and 6,927 presences. The `leptrine` data consist of training and test sets, such that $N_{train}$ = 4,339. The nine environmental predictor variables have been standardized. See `?leptrine` for further details.
```r
data("leptrine")
dat1 <- leptrine[[1]]     # Training data.
dat1_t <- leptrine[[2]]   # Test data.
Y <- dat1$Y               # Response variable.
N <- length(Y)            # Sample size (number of clusters).
n <- 1                    # Cluster size.
id <- rep(1:N, each = n)  # The ID of each cluster.
X_pred <- dat1[, -10]     # A matrix of the nine environmental predictor variables
                          # (note that we remove the response variable).
X_predt <- dat1_t[, -11]  # The provided test data already contains a column of
                          # ones (for the intercept term).
```
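As a quick sanity check (not in the original vignette), the class balance of the training set can be inspected with:

```r
table(Y)  # Counts of absences (0) and presences (1) among the N_train = 4,339 training sites.
```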
Below, we fit a MARS model, a generalized additive model (GAM), a Bayesian additive regression tree (BART) model and MARGE (with two different penalties, $\lambda = 2$ and $\lambda = \log(N)$). We record the computational times (in seconds) and calculate the predictive squared errors (PSE) on the test data.
```r
family <- "binomial"      # The selected "exponential" family for the GLM/GEE.
corstr <- "independence"  # The selected "working correlation" structure (used for GEEs).
is.gee <- FALSE           # Is the model a GEE?
nb <- FALSE               # Is this a negative binomial model?
tols <- 0.0001            # A set tolerance (stopping condition) in the forward pass for MARS.
tols_score <- 0.0001      # A set tolerance (stopping condition) in forward pass for MARGE.
M <- 21                   # A set threshold for the maximum no. of basis functions to be used.
pen <- 2                  # Penalty to be used in GCV (required for MARS).
minspan <- NULL           # A set minimum span value.
print.disp <- FALSE       # Print ALL the output?
```
### Fit models and record the computational times.
Model fitting can take ~12 minutes to complete (this depends on your computer speed/memory).
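The R chunks for this step did not survive rendering, so the following is only a hedged sketch, not the authors' original code: the MARGE calls reuse the `marge()` signature from the first example, the GAM is fit with `mgcv::gam()` (smooth terms built programmatically, since the predictor names are not listed here), BART with `BayesTree::bart()`, and MARS via the `earth` package as a stand-in for whichever MARS fitter the vignette actually used.

```r
# Hedged reconstruction of the model-fitting step; timings via system.time().

# MARS -- fit with earth::earth() as a stand-in (an assumption; the vignette's
# original MARS fitting function is unknown).
t_mars <- system.time(
  model_mars <- earth::earth(Y ~ ., data = dat1, nk = M, penalty = pen,
                             glm = list(family = "binomial"))
)["elapsed"]

# GAM -- a smooth term for each predictor, with the formula built from the
# training-data column names.
gam_form <- as.formula(paste("Y ~", paste0("s(", colnames(X_pred), ")",
                                           collapse = " + ")))
t_gam <- system.time(
  model_gam <- gam(gam_form, data = dat1, family = "binomial")
)["elapsed"]

# BART -- BayesTree::bart() handles a binary response via a probit link.
# Note: X_predt retains its intercept column of ones here (no splits are
# possible on a constant column).
t_bart <- system.time(
  model_bart <- bart(x.train = as.matrix(X_pred), y.train = Y,
                     x.test = as.matrix(X_predt), verbose = FALSE)
)["elapsed"]

# MARGE with penalty lambda = 2.
pen <- 2
t_marge_2 <- system.time(
  model_marge_2 <- marge(as.matrix(X_pred), Y, N, n, id, family, corstr, pen,
                         tols_score, M, minspan, print.disp, nb, is.gee)
)["elapsed"]

# MARGE with penalty lambda = log(N).
pen <- log(N)
t_marge_logN <- system.time(
  model_marge_logN <- marge(as.matrix(X_pred), Y, N, n, id, family, corstr, pen,
                            tols_score, M, minspan, print.disp, nb, is.gee)
)["elapsed"]
```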
### Calculate the predictive squared error (PSE) on test data using each fitted model.
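The corresponding chunk is also missing from the rendered page. Below is a minimal sketch, assuming the fitted models from the previous step are stored as `model_mars`, `model_gam` and `model_bart` (assumed names) and that the test-set response is available as `dat1_t$Y` (also an assumption). Predictions from the MARGE fits depend on the structure of the returned `marge` object and are therefore not sketched here; see `?marge`.

```r
# Predictive squared error: mean squared difference between the observed 0/1
# response and the predicted presence probability.
pse <- function(y, p) mean((y - p)^2)

Y_t <- dat1_t$Y  # Assumed name of the test-set response column.

p_mars <- as.vector(predict(model_mars, newdata = dat1_t, type = "response"))
p_gam  <- as.vector(predict(model_gam, newdata = dat1_t, type = "response"))
p_bart <- colMeans(pnorm(model_bart$yhat.test))  # Posterior mean probability.

pse_mars <- pse(Y_t, p_mars)
pse_gam  <- pse(Y_t, p_gam)
pse_bart <- pse(Y_t, p_bart)
```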
### Display results - these PSE values should match the results given in Table 3 of Stoklosa and Warton (2018), except for BART (which should be similar).
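One possible way to tabulate the results, assuming the PSE values and elapsed times from the previous steps are stored in variables named as below (assumed names, since the original chunk was lost):

```r
# Collect PSE values and computational times (seconds) into one table.
results <- rbind(
  PSE  = c(MARS = pse_mars, GAM = pse_gam, BART = pse_bart),
  Time = c(MARS = t_mars,  GAM = t_gam,  BART = t_bart)
)
round(results, 4)
```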