simulation.study: Run simulations allowing for between-sample heterogeneity


Description

simulation.study implements a simulation framework sampling repeatedly from linear regression models and GLMs, allowing for between-sample heterogeneity. The purpose is to allow the study of AIC and related statistics in the context of model selection, with prediction quality as target.

Usage

simulation.study(type = "lm", nsims = 1000, nsamples = c(20, 50, 100, 200,
  500, 1000, 2000, 5000, 10000), alpha = 0, beta.x = 1, nX = 10, nZ = 5,
  meanX = 0, meanZ = 0, XZCov = diag(nX + nZ), varmeanX = 0,
  varmeanZ = 0, simulate.from.data = FALSE, X = NULL, Y = NULL,
  var.res = 1, var.RE.Intercept = 0, var.RE.X = 0, rho = NULL,
  epsilon = NULL, corsim.var = NULL, noise.epsilon = NULL,
  step.k = qchisq(0.05, 1, lower.tail = FALSE), keep.dredge = FALSE,
  Xin.or.out = rep(TRUE, nX), glm.family = NULL, glm.offset = NULL,
  binomial.n = 1, filename = "results")

Arguments

type

Character string determining what type of model to fit. At present, available model types are "lm" and "glm", with the former the default.

nsims

Number of simulated data sets to analyse for each sample size

nsamples

Vector of integers containing the sample sizes

alpha

Intercept for simulation model

beta.x

Either: vector of slopes for the X covariates; or a single numeric value for a constant slope for all X's

nX

Number of "real" covariates

nZ

Number of "spurious" covariates

meanX

Either: vector of means for the X covariates; or a single numeric value for a constant mean across all X's

meanZ

As for meanX but for the Z covariates

XZCov

Covariance matrix of the X's and Z's. Must be of dimension (nX+nZ) by (nX+nZ). Ignored if simulate.from.data==TRUE or if is.null(rho)==FALSE
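As a minimal sketch in base R (not taken from the package), a matrix of the required dimension can be built and checked before it is passed as XZCov. The compound-symmetry structure and the covariance value 0.3 are illustrative assumptions.

```r
nX <- 10
nZ <- 5
p <- nX + nZ

# Illustrative assumption: unit variances and a common covariance of 0.3
XZCov <- matrix(0.3, nrow = p, ncol = p)
diag(XZCov) <- 1

# Basic checks: (nX+nZ) by (nX+nZ), symmetric, positive definite
stopifnot(all(dim(XZCov) == c(p, p)))
stopifnot(isSymmetric(XZCov))
stopifnot(all(eigen(XZCov, symmetric = TRUE, only.values = TRUE)$values > 0))
```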

varmeanX

Either: vector of variances for the means of the X covariates; or a single numeric value for a constant variance across all X's. Non-zero values will produce a different set of covariate means for each individual simulated data set

varmeanZ

As for varmeanX but for the Z covariates
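The effect of a non-zero varmeanX or varmeanZ can be illustrated with a small base R sketch. It assumes the per-data-set means are drawn from a normal distribution, which is an assumption made here for illustration and may not match what the package does internally. The values of nX, nsims and varmeanX are hypothetical.

```r
set.seed(42)
nX <- 3
meanX <- 0
varmeanX <- 0.5   # hypothetical non-zero variance of the covariate means
nsims <- 4

# One row of covariate means per simulated data set
# (normal draws are an assumption, not confirmed package behaviour)
means_by_dataset <- matrix(
  rnorm(nsims * nX, mean = meanX, sd = sqrt(varmeanX)),
  nrow = nsims, ncol = nX
)

# With a variance of zero, every data set shares the same means
fixed_means <- matrix(
  rnorm(nsims * nX, mean = meanX, sd = 0),
  nrow = nsims, ncol = nX
)
```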

simulate.from.data

Logical. If TRUE, function takes actual covariate data to use as the basis of simulations; if FALSE (the default) the function uses the distributions defined by the model parameters given as input to the function

X

Matrix of "real" covariates; only used if simulate.from.data==TRUE

Y

Vector of "real" response variables; only used if simulate.from.data==TRUE, and if given the values of alpha and beta above will be ignored, but instead derived from a regression model of Y against X

var.res

Residual variance of the simulation model

var.RE.Intercept

Random effect variance for the intercept

var.RE.X

Either: vector of random effect variances for the X covariate slopes; or a single numeric value for a constant random slope variance across all X's (the default, 0, gives no random slopes in the X's)

rho

A numeric constant specifying the mean correlation between the X's and the Z's

epsilon

A numeric constant specifying the level of variability around the mean correlation rho; note that a necessary condition is max(abs(rho+epsilon))<=1, so combinations of rho and epsilon which break this constraint will cause an error (as the simcor function would produce correlations outside the range [-1,1]). If epsilon is not supplied but rho is (is.null(rho)==FALSE), then epsilon is set to zero
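The necessary condition quoted above can be checked before starting a long run. The helper name rho_epsilon_ok is hypothetical and not part of the package; it only evaluates the stated inequality.

```r
# Hypothetical helper, not part of ICsims: evaluates the condition
# max(abs(rho + epsilon)) <= 1 from the argument description
rho_epsilon_ok <- function(rho, epsilon = 0) {
  max(abs(rho + epsilon)) <= 1
}

rho_epsilon_ok(0.6, 0.3)  # 0.6 + 0.3 stays within the bound
rho_epsilon_ok(0.8, 0.3)  # 0.8 + 0.3 exceeds 1, so this combination fails
```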

corsim.var

If generating the covariance matrices using rho and epsilon, we need to specify the variances (which otherwise are in the leading diagonal of XZCov)

noise.epsilon

A numeric constant used to specify whether XZCov is to vary from sample to sample. Higher values indicate more variability; note that this cannot be greater than 1 minus the largest absolute value of (off-diagonal) correlations in the corresponding correlation matrix
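The upper bound on noise.epsilon described above can be computed from a covariance matrix with base R's cov2cor. The 3 by 3 matrix below is a made-up example, and the variable names are hypothetical.

```r
# Made-up covariance matrix for illustration
XZCov <- matrix(c(4.0,  1.2,  0.0,
                  1.2,  1.0, -0.5,
                  0.0, -0.5,  2.25),
                nrow = 3, byrow = TRUE)

# Convert to the corresponding correlation matrix
R <- cov2cor(XZCov)

# Largest absolute off-diagonal correlation
max_offdiag <- max(abs(R[upper.tri(R)]))

# noise.epsilon cannot exceed this value
noise_epsilon_max <- 1 - max_offdiag
noise_epsilon_max
```

Here the largest absolute correlation is 0.6, so the bound is 0.4.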

step.k

Numeric value of the AIC criterion in the stepwise analysis; defaults to about 3.84, corresponding to a p-value of 0.05 for both adding and removing variables
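The default can be reproduced directly, which also shows how to derive a penalty for a different significance level. The variable names other than step.k are hypothetical.

```r
# Default penalty: upper 5% point of a chi-squared distribution with 1 df
step.k <- qchisq(0.05, df = 1, lower.tail = FALSE)
step.k  # approximately 3.84

# Going the other way recovers the p-value implied by a given penalty
implied_p <- pchisq(step.k, df = 1, lower.tail = FALSE)

# A stricter threshold, e.g. p = 0.01, gives a larger penalty
step.k.strict <- qchisq(0.01, df = 1, lower.tail = FALSE)
```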

keep.dredge

Logical constant on whether to keep the dredge outputs (TRUE==yes); required if simulate.from.data==TRUE

Xin.or.out

Vector of length nX (or nrow(X)) of logicals, specifying whether an X is made available as data (TRUE for yes; FALSE for no)

glm.family

If a GLM is to be fitted, the error distribution must be supplied (to the standard family argument to glm).

glm.offset

An (optional) offset can be supplied if fitting a GLM. (Not currently implemented.)

binomial.n

If fitting a binomial GLM, the number of trials per sample. Must be either a scalar (in which case the same number of trials is used for each sample) or a vector of length nsamples. (Default is 1.)

filename

Character string providing the root for the output files. Intermediate files are saved as "filenameX.RData" where X is an incremental count from 1 to length(nsamples). The final output is in "filename.RData".
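Assuming the naming scheme described above is applied literally, the expected file names for a given call can be listed in advance. This is a sketch based on the description only, and the shortened nsamples vector is illustrative.

```r
filename <- "results"
nsamples <- c(20, 50, 100)  # illustrative subset of sample sizes

# One intermediate file per sample size, numbered 1 to length(nsamples)
intermediate_files <- paste0(filename, seq_along(nsamples), ".RData")

# Final output file
final_file <- paste0(filename, ".RData")

intermediate_files
final_file
```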

Value

If keep.dredge==FALSE (the default), the output is a list of length equal to the length of nsamples, each element containing two matrices, reg.bias (prediction bias for each sample) and reg.rmse (root mean square error of prediction for each sample). Each of these two matrices has nsims rows and four columns, corresponding to model selection by AICc, AIC, BIC and stepwise regression. If keep.dredge==TRUE, then the output is a list of lists, with a top level list with length equal to the length of nsamples as before, and with the next level having length equal to nsims; this inner list contains the full model set output from dredge, converted to a matrix for storage efficiency.
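The sketch below shows one way to summarise output of the shape described for keep.dredge==FALSE. It builds a mock object from random numbers rather than calling simulation.study, so the values are not real results. The column labels are hypothetical, since the description gives only the column order.

```r
set.seed(1)
nsamples <- c(20, 50)
nsims <- 4
methods <- c("AICc", "AIC", "BIC", "step")  # hypothetical column labels

# Mock object mimicking the described structure (random numbers only)
mock <- lapply(nsamples, function(n) {
  list(
    reg.bias = matrix(rnorm(nsims * 4), nrow = nsims,
                      dimnames = list(NULL, methods)),
    reg.rmse = matrix(abs(rnorm(nsims * 4)), nrow = nsims,
                      dimnames = list(NULL, methods))
  )
})

# Mean RMSE per selection method, one row per sample size
mean_rmse <- t(sapply(mock, function(res) colMeans(res$reg.rmse)))
rownames(mean_rmse) <- nsamples
mean_rmse
```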


MarkJBrewer/ICsims documentation built on May 7, 2019, 3:34 p.m.