Monte Carlo based evaluation of operating characteristics for estimators of the components of a logistic regression model, based on the twophase and casecontrol study designs (Breslow and Chatterjee, 1999; Prentice and Pykle, 1979).
1 2 3 4 
B 
The number of datasets generated by the simulation. 
betaTruth 
Regression coefficients from the logistic regression model. 
X 
Design matrix for the logistic regression model. The first column should correspond to intercept. For each exposure, the baseline group should be coded as 0, the first level as 1, and so on. 
N 
A numeric vector providing the sample size for each row of the design matrix, 
strata 
A list of numeric vectors indicating which columns of the design matrix, 
expandX 
Character vector indicating which columns of 
etaTerms 
Character vector indicating which columns of 
nII0 
A numeric vector of sample sizes at phase II for controls. The length must correspond to the number of unique values for the 
nII1 
A numeric vector of sample sizes at phase II for cases. The length must correspond to the number of unique values for the 
nII 
A pair of numbers providing the sample sizes for controls and cases at phase II. This is only used when simulating all stratified twophase sampling schemes (i.e., 
nCC 
A pair of sample sizes at phase II for controls and cases in a casecontrol design. If left 
alpha 
Type I error rate assumed for the evaluation of coverage probabilities and power. 
threshold 
An interval that specifies truncation of the Monte Carlo sampling distribution of each estimator. 
digits 
Integer indicating the precision to be used for the output. 
betaNames 
An optional character vector of names for the regression coefficients,

referent 
An numeric value specifying which estimator is taken as the referent (denominator) for the relative uncertainty calculation. 1=CD, 2=CC, 3=WL, 4=PL, 5=ML (see Details below). 
monitor 
Numeric value indicating how often 
cohort 
Logical flag. TRUE indicates phase I is drawn as a cohort; FALSE indicates phase I is drawn as a casecontrol sample. 
NI 
A pair of integers providing the outcomespecific phase I sample sizes when the phase I data are drawn as a casecontrol sample. The first element corresponds to the controls and the second to the cases. 
returnRaw 
Logical indicator of whether or not the raw coefficient and standard error estimates for each of the design/estimator combinations should be returned. 
A simulation study is performed to evaluate the operating
characteristics of various designs/estimators for betaTruth
:
(a) complete data maximum likelihood (CD)
(b) casecontrol maximum likelihood (CC)
(c) twophase weighted likelihood (WL)
(d) twophase pseudo or profile likelihood (PL)
(e) twophase maximum likelihood (ML)
The operating characteristics are evaluated using the Monte Carlo sampling distribution of each estimator. The latter is generated using the following steps:
(i) Specify the (joint) marginal exposure distribution of underlying population, using X
and N
.
(ii) Simulate outcomes for all sum(N
) individuals in the population, based on an underlying logistic regression model specified via betaTruth
.
(iii) Evaluate the CD estimator on the basis of the complete data.
(iv) Sample either (a) ccDesign
controls and cases or (b) sum(n0
) controls and sum(n1
) cases, (without regard to the strata
variable) and evaluate the CC estimator.
(v) Stratify the population according to outcome and the strata
argument, to form the phase I data.
(vi) Sample n0
controls and n1
cases from their respective phase I strata.
(vii) Evaluate the WL, PL and ML estimators.
(viii) Repeat steps (ii)(vii) B
times.
Both the CD and CC estimators are evaluated using the generic glm
function. The three twophase estimators are based on the tps
function.
The correspondence between betaTruth
and X
, specifically the ordering of elements, is based on successive use of factor
to each column of X
which is expanded via the expandX
argument. Each exposure that is expanded must conform to a 0, 1, 2, ... integerbased coding convention.
The etaTerms
argument is useful when only certain columns in X
are to be included in the model. In the context of the twophase design, this might be the case if phase I stratifies on some surrogate exposure and a more detailed/accurate measure is to be included in the main model.
When evaluating operating characteristics, some simulated datasets may result in unusually large or small estimates. Particularly, when the the casecontrol/phase II sample sizes are small. In some settings, it may be desirable to truncate the Monte Carlo sampling distribution prior to evaluating operating characteristics. The threshold
argument indicates the interval beyond which point estimates are ignored. The default is such that all B
datasets are kept.
NOTE: In some settings, the current implementation of the ML estimator returns point estimates that do not satisfy the phase I and/or phase II constraints. If this is the case a warning is printed and the "fail" elements of the returned list is set to TRUE. An example of this is phenomenon is given the help file for tps
. When this occurs, tpsSim()
considers ML estimation for the particular dataset to have failed.
tpsSim()
returns an object of class "tpsSim", a list containing all the input arguments, as well list results
with the following components:
betaMean 
Mean of the Monte Carlo sampling distribution for each regression coefficient estimator. 
betaMeanBias 
Bias based on the mean, calculated as 
betaMeanPB 
Percent bias based on mean, calculated as (( 
betaMedian 
Median of the Monte Carlo sampling distribution for each regression coefficient estimator. 
betaMedianBias 
Bias based on the median, calculated as 
betaMedianPB 
Percent bias based on median, calculated as (( 
betaSD 
Standard deviation of the Monte Carlo sampling distribution for each regression coefficient estimator. 
betaMSE 
Mean squared error of the Monte Carlo sampling distribution for each regression coefficient estimator. 
seMean 
Mean of the Monte Carlo sampling distribution for the standard error estimates reported by glm(). 
seRatio 
Ratio of the mean reported standard error to the standard deviation of the Monte Carlo sampling distribution for each regression coefficient estimator. The ratio is multiplied by 100. 
betaCP 
Coverage probability for Waldbased confidence intervals, evaluated on the basis of an 
betaPower 
Power against the null hypothesis that the regression coefficient is zero for a Waldbased test with an 
betaRU 
The ratio of the standard deviation of the Monte Carlo sampling
distribution for each estimator to the standard deviation of the Monte Carlo sampling distribution
for the estimator corresponding to 
Also returned is an object failed
which is a vector consisting of the number of datasets excluded from the power calculations (i.e. set to NA
), for each simulation performed. For the evaluation of general operating characteristics, the four reasons are: (1) lack of convergence indicated by NA
point estimates returned by glm
or tps
; (2) lack of convergence indicated by NA
standard error point estimates returned by glm
or tps
; (3) exclusion on the basis of the threshold
argument; and (4) for the ML estimator only, the phase I and/or phase II constraints are not satisfied.
A generic print method provides formatted output of the results.
Sebastien Haneuse, Takumi Saegusa
Prentice, R. and Pyke, R. (1979) "Logistic disease incidence models and casecontrol studies." Biometrika 66:403411.
Breslow, N. and Chatterjee, N. (1999) "Design and analysis of two phase studies with binary outcome applied to Wilms tumour prognosis." Applied Statistics 48:457468.
Haneuse, S. and Saegusa, T. and Lumley, T. (2011) "osDesign: An R Package for the Analysis, Evaluation, and Design of TwoPhase and CaseControl Studies." Journal of Statistical Software, 43(11), 129.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98  ##
data(Ohio)
## Design matrix that forms the basis for model and
## phase I strata specification
##
XM < cbind(Int=1, Ohio[,1:3]) ## main effects only
XI < cbind(XM, SbyR=XM[,3]*XM[,4]) ## interaction between sex and race
## 'True' values for the underlying logistic model
##
fitM < glm(cbind(Death, NDeath) ~ factor(Age) + Sex + Race, data=Ohio,
family=binomial)
fitI < glm(cbind(Death, NDeath) ~ factor(Age) + Sex * Race, data=Ohio,
family=binomial)
##
betaNamesM < c("Int", "Age1", "Age2", "Sex", "Race")
betaNamesI < c("Int", "Age1", "Age2", "Sex", "Race", "SexRace")
## Twophase design stratified by age
## * sample 50 from each of 6 phase I strata
## * show primary output (% bias, 95% CP, relative uncertainty)
##
## Not run:
ocAge < tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N, strata=2,
nII0=c(50,50,50), nII1=c(50,50,50), betaNames=betaNamesM,
monitor=100)
ocAge
## End(Not run)
## All possible balanced twophase designs
## * 250 controls and 250 cases
## * only show the relative uncertainty output
##
## Not run:
ocAll < tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N, strata=0,
nII=c(250, 250), betaNames=betaNamesM, monitor=100)
ocAll
## End(Not run)
## Twophase design stratified by race
## * balanced solely on outcome
## * only show the relative uncertainty output
##
## Not run:
ocRace < tpsSim(B=1000, betaTruth=fitI$coef, X=XI, N=Ohio$N, strata=4,
nII0=c(200, 50), nII1=c(200, 50), betaNames=betaNamesI,
monitor=100)
ocRace
## End(Not run)
## Comparison of two casecontrol designs
## * 240 controls and 260 cases
## * 240 controls and 260 cases
## * only show the relative uncertainty output
##
## Not run:
ocCC < tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N, strata=1,
nII0=240, nII1= 260, ccDesign=c(200,300),
betaNames=betaNamesM, monitor=100)
ocCC
## End(Not run)
## Illustration of setting where one of the covariates is continuous
## * restrict to black and white children born in 2003
## * dichotomize smoking, mothers age, weight gain during pregnancy and weight weight
## * note the use of 'etaTerms' to restrict to specific variables
## (the majority of which are created)
## * note the use of 'strata=list(11,12)' to simultaneously investigate stratification by
##  11th column in XM: derived 'smoker' variable
##  12th column in XM: derived 'teen' variable
##
## Warning: takes a long time!
## Not run:
data(infants)
##
infants < infants[infants$year == 2003,]
##
infants$race[!is.element(infants$race, c(1,2))] < NA ## White/Black = 0/1
infants$race < infants$race  1
infants < na.omit(infants)
##
infants$smoker < as.numeric(infants$cignum > 0)
infants$teen < as.numeric(infants$mage < 20)
infants$lowgain < as.numeric(infants$gained < 20)
infants$lbw < as.numeric(infants$weight < 2500)
infants$weeks < (infants$weeks  36) / 4 ## estimate a 4week contrast
##
fitM < glm(death ~ smoker + teen + race + male + lowgain + lbw + weeks,
data=infants, family=binomial)
betaM < fitM$coef
XM < cbind(Int=1, infants)
etaM < c("Int", "smoker", "teen", "race", "male", "lowgain", "lbw", "weeks")
##
tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=rep(1, nrow(XM)), strata=list(11,12),
expand="none", etaTerms=etaM, nII=c(1000,1000),
threshold=c(20,20), monitor=100)
## End(Not run)

Questions? Problems? Suggestions? Tweet to @rdrrHQ or email at ian@mutexlabs.com.
Please suggest features or report bugs with the GitHub issue tracker.
All documentation is copyright its authors; we didn't write any of that.