gendat.glmm: Simulates datasets based on a Generalized Linear Mixed Model...
In emitanaka/rpql: Regularized PQL for Joint Selection in GLMMs

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/auxilaryfunctions.R

Datasets are simulated from a GLMM given a series of inputs including: model matrices X and Z for the fixed and random effects respectively, a set of true fixed effect coefficients beta, a list of true random effect covariance matrices D, the family of response, and some other nusiance parameters if appropriate.

 
gendat.glmm(id, X, beta, Z, D, trial.size = 1, family = gaussian(), 
  phi = NULL, shape = NULL, zeroprob = NULL, upper.count = Inf)

`id`	A list with each element being a vector of IDs that reference the model matrix in the corresponding element in the list `Z`. Each vector of IDs must be integers (but not factors).
`X`	A model matrix of corresponding to the fixed effects. A column of ones should be included if a fixed intercept is to be included in the model.
`beta`	A vector of true fixed effect parameters, with the same length as the number of columns in `X`.
`Z`	A list with each element being a model matrix for a set of random effects. Each element of `Z` is referenced by a vector of IDs given by the corresponding element in the list `id`. Each model matrix (element of `Z`) should have the same number of rows as the length of `y`.
`D`	A list with each element being a symmetric random effects covariance matrix which is used to generate random effects. These random effects are then applied to the corresponding element in the list `Z`, and are referenced by the corresponding element in the list `id`.
`trial.size`	The trial size if `family = binomial()`. Either takes a single non-zero value or a vector of non-zero values with length the same as the number of rows in `X`. The latter allows for differing trial sizes across responses. Defaults to 1.
`family`	The distribution for the responses in GLMM. The argument must be applied as a object of class "family". Currently supported arguments include: `gaussian()`, `poisson()`, `binomial()`, `Gamma()`, `nb2()` for negative binomial, `LOGNO()` for log-normal, and `ZIP()` for zero-inflated Poisson.
`phi`	A non-zero value for the true variance parameter σ^2 if `family = gaussian()`, the true variance parameter σ^2 on the log scale if `family = LOGNO()`, or the overdispersion parameter if `family = nb2()`, where the negative binomial variance is parameterized as V = μ + φμ^2. Defaults to NULL.
`shape`	A non-zero value for the shape parameter a if `family = Gamma()`, where the variance is parameterized as V = μ^2/a. Defaults to NULL.
`zeroprob`	A value between 0 and 1 for the probability of a structural zero if `family = ZIP()` for zero-inflated Poisson. Defaults to NULL.
`upper.count`	A non-zero integer which allows the user to control the maximum value of the counts generates for datasets when `family = poisson()` or `nb2()`. When the responses are simulated, a `while` loop is run to ensure that all responses generated are less than or equal to `upper.count`. Default to `Inf`.

The relationship between the mean of the responses and covariates in a GLMM is given as follows: For i = 1,…,n, where n is, equivalently, the number of rows in X, the length of each element in id, and the number of rows in each element of Z, we have

g(μ_{i}) = \bm{x}^T_i \bm{β} + \bm{z}^T_{i1} \bm{b}_{i1} + \bm{z}^T_{i2} \bm{b}_{i2} + …,

where g(\cdot) is the link function, μ_i is the mean of the distribution for observation i, \bm{x}_i is row i of the fixed effects model matrix X, and \bm{β} is the fixed effects coefficients. For the random effects, \bm{z}_{i1} is row i of the random effects model matrix in the first element of Z, while \bm{b}_{i1} is the vector of random effects generated for observation i based on the first element of D. The remaining parameters \bm{z}_{i2}, \bm{b}_{i2} and so on, are defined similarly.

Having lists for id, Z, and D allows for multiple sets of random effects to be included in the true GLMM. This is analogous to the lme4 package, where multiple random effects are permitted in the formula, e.g., (1|creek) + (1|creek:sample). If the true GLMM contains only one set of random effects, e.g., in longitudinal data, then the three lists will all contain only one element. Cases with multiple sets of random effects include nested and crossed designs, in which case id, Z, and D will have two or more elements.

It is recommended that the user think through and design these lists carefully to ensure that they are actually constructing a true GLMM that they want to simulated data from. Yes it takes some getting use too, and we apologize for this =( Please see examples below for some ideas.

Finally, note that some of the elements of beta can be zero, i.e. truly unimportant fixed effects. Likewise, each element of D can be a random effects covariance matrix containing zero rows and columns, i.e. truly unimportant random effects.

A list containing the following elements

`y`	The vector simulated responses.
`b`	A list with each element being a matrix of random effects simulated from a multivariate normal distribution with mean zero and covariance matrix equal to the corresponding element in the list `D`. For each element in `b`, the number of columns of the matrix equals the dimension of corresponding covariance matrix element in `D`, while the number of rows equals to the number of unique IDs in the corresponding element of the list `id`.
`id, X, Z, beta, D, phi, shape, zeroprob, trial.size, family`	Some of the arguments entered into `gendat.glmm`.
`nonzero.beta`	A vector indexing the non-zero values of `beta`, i.e. the truly important fixed effects.
`nonzero.b`	A list with each element being a vector indexing the non-zero diagonal variances in the corresponding element of the list `D`, i.e. the truly important random effects.

NA

Maintainer: NA

Schielzeth, H., & Nakagawa, S. (2013). Nested by design: model fitting and interpretation in a mixed model era. Methods in Ecology and Evolution, 4, 14-24.

rpql for fitting and performing model selection in GLMMs using regularized PQL.

##################
## Example 1: Linear Mixed Models 
## Independent cluster model with 50 clusters
## Nine covariates including a fixed and random intercept
library(mvtnorm)
library(lme4)

n <- 50; m <- 10; p <- 8; 
## Generate rows of a model matrix from a multivariate normal distribution with 
## AR1 covariance structure. 

H <- abs(outer(1:p, 1:p, "-")) 
X <- cbind(1,rmvnorm(n*m,rep(0,p),sigma=0.5^H)); 

Z <- X 
true_betas <- c(1,3,2,1.5,-1,0,0,0,0) ## 5 important fixed effects 
true_D <- matrix(0,p+1,p+1) ## 3 important random effects
true_D[1:3,1:3] <- matrix(c(9,4.8,0.6,4.8,4,1,0.6,1,1),3,3,byrow=TRUE)

simy <- gendat.glmm(id = list(cluster = rep(1:n,each=m)), X = X, beta = true_betas, 
	Z = list(cluster = Z), D = list(cluster = true_D), phi = 1, family = gaussian()) 
## Notice how id, Z, and D all are lists with one element, and that 
## the name of the first element (a generic name "cluster") is the 
## same for all three lists. 
## id is where the action takes place. In particular, id$cluster is 
## designed so that the first 12 elements correspond to cluster 1, 
## the second 12 elements correspond to cluster 2, and so forth. 
## In turn, the first 12 rows of X and Z$cluster correspond 
## to cluster 1, and so on. 

## Not run: 
dat <- data.frame(y = simy$y, simy$X, simy$Z$cluster, simy$id)
fit_satlme4 <- lmer(y ~ X - 1 + (Z - 1 | cluster), data = dat,
	REML = FALSE)
fit_sat <- build.start.fit(fit_satlme4, gamma = 2)


lambda_seq <- lseq(1e-4,1,length=100)
fit <- rpqlseq(y = simy$y, X = simy$X, Z = simy$Z, id = simy$id, 
	family = gaussian(), lambda = lambda_seq, pen.type = "adl", 
	pen.weights = fit_sat$pen.weights, start = fit_sat)

summary(fit$best.fit[[3]])  
# apply(fit$collect.ics, 2, which.min) ## Look at best fit chosen by different ICs

## End(Not run)


##################
## Example 2: Bernoulli GLMMs on simulated data
## Nested data with 200 observations in total: split into 10 creeks, 
## 5 samples nested within each creek

mn <- 200; 
X <- as.matrix(rep(1,mn)); 
ids <- list(samples = rep(1:50,each=4), creek = rep(1:10,each=20)) 
## We have two sets of random intercepts only, one for creek and one 
## 	for samples nested within creek.
Zs <- list(samples = X, creek = X) 

true_betas <- -0.1 
## Please ensure each element of true_D is a matrix
true_D <- list(samples = as.matrix(0.001), creek = as.matrix(1)) 

simy <- gendat.glmm(id = ids, X = X, beta = true_betas, Z = Zs, D = true_D, 
	trial.size = 1, family = binomial())

## Not run: 

## Construct a solution path use adaptive LASSO for selection
## Here is another way of constructing the adaptive weights:
## Use the fact that rpql can do a final fit based on maximum likelihood
## to obtain a good saturated fit.
fit_sat <- rpql(y = simy$y, X = simy$X, Z = simy$Z, id = simy$id, 
	family = binomial(), lambda = 0, hybrid = TRUE)
fit_sat <- build.start.fit(fit_sat$hybrid, gamma = 2)
	
lambda_seq <- lseq(1e-6,1,length=100)
fit <- rpqlseq(y = simy$y, X = simy$X, Z = simy$Z, id = simy$id, 
	family = binomial(), lambda = lambda_seq, pen.type = "adl", 
	pen.weights = fit_sat$pen.weights, start = fit_sat)

summary(fit$best.fit[[3]])  
# apply(fit$collect.ics, 2, which.min) ## Look at best fit chosen by different ICs

## End(Not run)