Simulation function for case-control study designs.

Share:

Description

Monte Carlo based evaluation of operating characteristics of the maximum likelihood estimator (MLE) for the coefficients of a logistic regression model, based on the case-control.

Usage

1
2
3
4
ccSim(B=1000, betaTruth, X, N,  expandX="all", etaTerms=NULL,
      nCC, r, refDesign=1, alpha=0.05,
      threshold=c(-Inf, Inf), digits=1, betaNames=NULL,
      monitor=NULL, returnRaw=FALSE)

Arguments

B

The number of datasets generated by the simulation.

betaTruth

Regression coefficients from the logistic regression model.

X

Design matrix for the logistic regression model. The first column should correspond to intercept. For each exposure, the baseline group should be coded as 0, the first level as 1, and so on.

N

A numeric vector providing the sample size for each row of the design matrix, X.

expandX

Character vector indicating which columns of X to expand as a series of dummy variables. Useful when at least one exposure is continuous (and should not be expanded). Default is ‘all’; other options include ‘none’ or character vector of column names. See Details, below.

etaTerms

Character vector indicating which columns of X are to be included in the model. See Details, below.

nCC

A numeric value indicating the total case-control sample size.

r

A numeric value indicating the control:case ratio in the case-control sample. If a vector is provided, separate simulations are run for each value.

refDesign

A numeric value indicating the control:case ratio for the referent design (for the relative uncertainty calculation).

alpha

Type I error rate assumed for the evaluation of coverage probabilities and power.

threshold

An interval that specifies truncation of the Monte Carlo sampling distribution of the MLE.

digits

Integer indicating the precision to be used for the output.

betaNames

An optional character vector of names for the regression coefficients, betaTruth.

monitor

Numeric value indicating how often ccSim() reports real-time progress on the simulation, as the B datasets are generated and evaluated. The default of NULL indicates no output.

returnRaw

Logical indicator of whether or not the raw coefficient and standard error estimates for each of the design/estimator combinations should be returned.

Details

A simulation study is performed to evaluate the operating characteristics of the MLE for betaTruth from a case-control design (Prentice and Pyke, 1979). The operating characteristics are evaluated using the Monte Carlo sampling distribution of the estimator. The latter is generated using the following steps:

  • (i) Specify the (joint) marginal exposure distribution of underlying population, using X and N.

  • (ii) Simulate outcomes for all sum(N) individuals in the population, based on an underlying logistic regression model specified via betaTruth.

  • (iii) Sample n0 controls and n1 cases, on the basis of nCC and r.

  • (iv) Evaluate the MLE estimator, its estimated standard error and store the results.

  • (v) Repeat steps (ii)-(iv) B times.

All case-control MLEs are evaluated using the generic glm function.

The correspondence between betaTruth and X, specifically the ordering of elements, is based on successive use of factor to each column of X which is expanded via the expandX argument. Each exposure that is expanded must conform to a 0, 1, 2, ... integer-based coding convention.

The etaTerms argument is useful when only certain columns in X are to be included in the model.

When evaluating operating characteristics of the MLE, some simulated datasets may result in unusually large or small estimates. Particularly, when the the case-control sample size, nCC, is small. In some settings, it may be desirable to truncate the Monte Carlo sampling distribution prior to evaluating operating characteristics. The threshold argument indicates the interval beyond which MLEs are ignored. The default is such that all B datasets are kept.

Value

ccSim() returns an object of class "ccSim", a list containing all the input arguments, as well list results with the following components:

betaMean

Mean of the Monte Carlo sampling distribution for each regression coefficient estimator.

betaMeanBias

Bias based on the mean, calculated as betaMean - betaTruth.

betaMeanPB

Percent bias based on mean, calculated as ((betaMean - betaTruth) / betaTruth) x 100. If a regression coefficient is zero, percent bias is not calculated and an NA is returned.

betaMedian

Median of the Monte Carlo sampling distribution for each regression coefficient estimator.

betaMedianBias

Bias based on the median, calculated as betaMedian - betaTruth.

betaMedianPB

Percent bias based on median, calculated as ((betaMedian - betaTruth) / betaTruth) x 100. If a regression coefficient is zero, median percent bias is not calculated and an NA is returned.

betaSD

Standard deviation of the Monte Carlo sampling distribution for each regression coefficient estimator.

betaMSE

Mean squared error of the Monte Carlo sampling distribution for each regression coefficient estimator.

seMean

Mean of the Monte Carlo sampling distribution for the standard error estimates reported by glm().

seRatio

Ratio of the mean reported standard error to the standard deviation of the Monte Carlo sampling distribution for each regression coefficient estimator. The ratio is multiplied by 100.

betaCP

Coverage probability for Wald-based confidence intervals, evaluated on the basis of an alpha type I error rate.

betaPower

Power against the null hypothesis that the regression coefficient is zero for a Wald-based test with an alpha type I error rate.

betaRU

The ratio of the standard deviation of the Monte Carlo sampling distribution for each estimator to the standard deviation of the Monte Carlo sampling distribution for the estimator corresponding to refDesign. The ratio is multiplied by 100.

Also returned is an object failed which is a vector consisting of the number of datasets excluded from the power calculations (i.e. set to NA), for each simulation performed. For the evaluation of general operating characteristics, the three reasons are: (1) lack of convergence indicated by NA point estimates returned by glm, (2) lack of convergence indicated by NA standard error point estimates returned by glm, (3) exclusion on the basis of the threshold argument.

Note

A generic print method provides formatted output of the results.

Author(s)

Sebastien Haneuse, Takumi Saegusa

References

Prentice, R. and Pyke, R. (1979) "Logistic disease incidence models and case-control studies." Biometrika 66:403-411.

Haneuse, S. and Saegusa, T. and Lumley, T. (2011) "osDesign: An R Package for the Analysis, Evaluation, and Design of Two-Phase and Case-Control Studies." Journal of Statistical Software, 43(11), 1-29.

See Also

plotPower.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
##
data(Ohio)

## 
XM   <- cbind(Int=1, Ohio[,1:3])
fitM <- glm(cbind(Death, N-Death) ~ factor(Age) + Sex + Race, data=Ohio,
            family=binomial)
betaNamesM <- c("Int", "Age1", "Age2", "Sex", "Race")

## Single case-control design
##
## Not run: 
ccResults1 <- ccSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N,
                    nCC=500, r=1, betaNames=betaNamesM, monitor=100)
ccResults1
## End(Not run)

## Examining unbalanced case-control designs
##
## Not run: 
ccResults2 <- ccSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N,
                    nCC=500, r=c(0.25, 0.33, 0.5, 1, 2, 3, 4),
                    betaNames=betaNamesM, monitor=100)
ccResults2
## End(Not run)