premixed: Fitting a mixed-effects prediction rule ensemble
In marjoleinF/premixed: Derives mixed-effects prediction rule ensembles


Experimental function for fitting mixed-effects prediction rule ensembles. It estimates a random intercept in addition to a prediction rule ensemble, which allows for analysing datasets with a clustered or multilevel structure, as well as longitudinal datasets. The function is experimental, so use at your own risk.


premixed(formula, cluster = NULL, data, penalty.par.val = "lambda.min",
  learnrate = 0, use.grad = FALSE, conv.thresh = 0.001,
  family = "gaussian", ridge.ranef = FALSE, max.iter = 1000, ...)

`formula`	a formula with a three-part right-hand side, like `y ~ 1 | cluster | x1 + x2 + x3`; or with a one-part right-hand side, like `y ~ x1 + x2 + x3`. In the latter case, the cluster indicator must be specified through the `cluster` argument.
`cluster`	optional character string supplying the name of the cluster indicator. If specified, `formula` should not involve random effects (e.g., `y ~ x1 + x2 + x3`). If `cluster` is specified, random effects will not be estimated during tree induction. This will substantially speed up computations, but may yield a less accurate model, depending on the magnitude of the random effects.
`data`	data frame containing the variables specified in `formula`.
`penalty.par.val`	as usual.
`learnrate`	as usual.
`use.grad`	as usual.
`conv.thresh`	numeric vector of length 1, specifies the convergence criterion for estimation of the model. If `ridge.ranef = FALSE`, it specifies the maximum difference in log-likelihoods of the random-effects model from two consecutive iterations for estimation to converge. If `ridge.ranef = TRUE`, it specifies the maximum absolute difference in random-effects predictions from two consecutive iterations for estimation to converge.
`family`	as usual. Note: should be a character vector!
`ridge.ranef`	logical vector of length 1. Should random effects be estimated through a ridge regression? If set to `TRUE`, random effects will be estimated through fitting a ridge regression model using function `cv.glmnet`. If set to `FALSE`, random effects will be estimated through fitting a mixed-effects regression model using function `lmer` or `glmer`.
`max.iter`	numeric vector of length 1. Maximum number of iterations performed to re-estimate fixed and random effects parameters.
`...`	further arguments to be passed to `pre`.

Function premixed() allows for taking a random intercept into account in (I) rule induction and/or (II) coefficient estimation. To take the random intercept into account in both rule induction and coefficient estimation, see Example 1 below. To take the random intercept into account in coefficient estimation only, see Example 2 below. Alternatively, it has been suggested that random effects do not need to be modelled explicitly, but can be accounted for by employing a blocked (cluster) bootstrap or subsampling approach; see Examples 3a and 3b below.

Note that approaches 1 and 2 can be combined with the third approach (Examples 3a and 3b). However, whether a cluster bootstrap or cluster subsampling approach is actually sufficient to take the clustered structure into account is a question that still needs to be addressed.
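To make the cluster-level resampling idea concrete, here is a minimal base-R sketch on a toy cluster indicator. The data and the helper name `sample_clusters` are hypothetical and not part of premixed or pre; the sketch only illustrates that whole clusters, rather than single observations, are drawn.

```r
# Toy cluster indicator: three clusters of unequal size (hypothetical data)
cluster <- c("a", "a", "b", "b", "b", "c")

# Draw cluster labels with replacement, then collect the row indices of
# every observation in each drawn cluster. A cluster that is drawn twice
# contributes all of its rows twice.
sample_clusters <- function(cluster) {
  drawn <- sample(unique(cluster), replace = TRUE)
  unlist(lapply(drawn, function(k) which(cluster == k)))
}

set.seed(1)
idx <- sample_clusters(cluster)

# Check: each cluster appears in the sample a whole number of times,
# i.e. its sampled row count is a multiple of its size
sampled <- table(factor(cluster[idx], levels = unique(cluster)))
sizes <- table(factor(cluster, levels = unique(cluster)))
all(sampled %% sizes == 0)
```

Because clusters differ in size, the number of sampled rows generally varies from one bootstrap draw to the next.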

Note that only random-intercept models are currently supported. That is, random slopes cannot currently be specified.

An object of class 'premixed'.

## Example 1: Take into account clustered structure in rule induction
## as well as coefficient estimation:
set.seed(42)
airq <- airquality[complete.cases(airquality),]
airq.ens1 <- premixed(Ozone ~ 1 | Month | Solar.R + Wind + Temp + Day, data = airq, ntrees = 10)
airq.ens1



## Example 2: Take into account clustered structure in coefficient estimation
## only:
set.seed(42)
airq <- airquality[complete.cases(airquality),]
airq.ens2 <- premixed(Ozone ~ Solar.R + Wind + Temp + Day, cluster = "Month", data = airq, 
  ntrees = 10)
airq.ens2



## Example 3a: Take into account clustered structure in rule induction through
## cluster bootstrap sampling or cluster subsampling:

## Create a sampling function that bootstrap samples whole clusters:
bb_sampfunc <- function(cluster = airq$Month) {
  result <- c()
  for(i in sample(unique(cluster), replace = TRUE)) {
    result <- c(result, which(cluster == i))
  }
  result
}
## Employ blocked bootstrap sampling function in fitting PRE:
library(pre)
set.seed(42)
airq.ens3a.bs <- pre(Ozone ~ Solar.R + Wind + Temp + Day, data = airq, sampfrac = bb_sampfunc)
airq.ens3a.bs

## Create a sampling function that subsamples ~75% of the clusters: 
ss_sampfunc <- function(cluster = airq$Month, sampfrac = .75) {
  result <- c()
  n_clusters <- round(length(unique(cluster)) * sampfrac)
  for(i in sample(unique(cluster), size = n_clusters, replace = FALSE)) {
    result <- c(result, which(cluster == i))
  }
  result
}
## Employ cluster subsampling in fitting PRE:
library(pre)
set.seed(42)
airq.ens3a.ss <- pre(Ozone ~ Solar.R + Wind + Temp + Day, data = airq, sampfrac = ss_sampfunc)
airq.ens3a.ss



## Example 3b: Take into account clustered structure in both rule induction and
## coefficient estimation:

## Generate fold ids:
airq <- airquality[complete.cases(airquality),]
foldids <- vector("numeric", length = nrow(airq))
counter <- 0
for (i in unique(airq$Month)) {
  counter <- counter + 1
  foldids[airq$Month == i] <- counter
}
foldids

## Employ cluster subsampling function for rule induction, as well as
## cluster-specific fold ids for estimating coefficients:
set.seed(42)
airq.ens3b.ss <- pre(Ozone ~ Solar.R + Wind + Temp + Day, data = airq, sampfrac = ss_sampfunc, 
  foldid = foldids)
airq.ens3b.ss
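
The fold ids in Example 3b assign all observations of a cluster to the same fold. As a hedged alternative to the loop, the same ids can be derived in one vectorised step with base R's `match()`. The object name `foldids_vec` is introduced here for illustration only.

```r
airq <- airquality[complete.cases(airquality), ]

# Position of each observation's month among the unique months, in order
# of first appearance; observations from the same month share a fold id
foldids_vec <- match(airq$Month, unique(airq$Month))

# Cross-tabulate to confirm that each month maps to exactly one fold
table(fold = foldids_vec, month = airq$Month)
```

The resulting vector can be used wherever `foldids` is used in Example 3b.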