tpsSim: Simulation function for two-phase study designs.
In osDesign: Design, Planning and Analysis of Observational Studies

Description Usage Arguments Details Value Note Author(s) References Examples

Monte Carlo based evaluation of operating characteristics for estimators of the components of a logistic regression model, based on the two-phase and case-control study designs (Breslow and Chatterjee, 1999; Prentice and Pykle, 1979).

tpsSim(B=1000, betaTruth, X, N, strata, expandX="all", etaTerms=NULL,
       nII0=NULL, nII1=NULL, nII=NULL, nCC=NULL,
       alpha=0.05, threshold=c(-Inf, Inf), digits=1, betaNames=NULL,
       referent=2, monitor=NULL, cohort=TRUE, NI=NULL, returnRaw=FALSE)

`B`	The number of datasets generated by the simulation.
`betaTruth`	Regression coefficients from the logistic regression model.
`X`	Design matrix for the logistic regression model. The first column should correspond to intercept. For each exposure, the baseline group should be coded as 0, the first level as 1, and so on.
`N`	A numeric vector providing the sample size for each row of the design matrix, `X`.
`strata`	A list of numeric vectors indicating which columns of the design matrix, `X`, are used to form the phase I stratification schemes. `strata=1` specifies the intercept and is, therefore, equivalent to a case-control study. `strata=0` indicates all possible stratified two-phase sampling schemes. `strata=list(2,3)` indicates 2 two-phase designs (that stratify on the 2nd and 3rd columns, separately) are to be considered.
`expandX`	Character vector indicating which columns of `X` to expand as a series of dummy variables. Useful when at least one exposure is continuous (and should not be expanded). Default is ‘all’; other options include ‘none’ or character vector of column names. See Details, below.
`etaTerms`	Character vector indicating which columns of `X` are to be included in the model. See Details, below.
`nII0`	A numeric vector of sample sizes at phase II for controls. The length must correspond to the number of unique values for the `strata` specification.
`nII1`	A numeric vector of sample sizes at phase II for cases. The length must correspond to the number of unique values for the `strata` specification.
`nII`	A pair of numbers providing the sample sizes for controls and cases at phase II. This is only used when simulating all stratified two-phase sampling schemes (i.e., `strata=0`).
`nCC`	A pair of sample sizes at phase II for controls and cases in a case-control design. If left `NULL`, the values case-control sample sizes are taken as the sums of `n1` and `n0`, respectively.
`alpha`	Type I error rate assumed for the evaluation of coverage probabilities and power.
`threshold`	An interval that specifies truncation of the Monte Carlo sampling distribution of each estimator.
`digits`	Integer indicating the precision to be used for the output.
`betaNames`	An optional character vector of names for the regression coefficients, `betaTruth`.
`referent`	An numeric value specifying which estimator is taken as the referent (denominator) for the relative uncertainty calculation. 1=CD, 2=CC, 3=WL, 4=PL, 5=ML (see Details below).
`monitor`	Numeric value indicating how often `tpsSim()` reports real-time progress on the simulation, as the `B` datasets are generated and evaluated. The default of `NULL` indicates no output.
`cohort`	Logical flag. TRUE indicates phase I is drawn as a cohort; FALSE indicates phase I is drawn as a case-control sample.
`NI`	A pair of integers providing the outcome-specific phase I sample sizes when the phase I data are drawn as a case-control sample. The first element corresponds to the controls and the second to the cases.
`returnRaw`	Logical indicator of whether or not the raw coefficient and standard error estimates for each of the design/estimator combinations should be returned.

A simulation study is performed to evaluate the operating characteristics of various designs/estimators for betaTruth:

(a) complete data maximum likelihood (CD)
(b) case-control maximum likelihood (CC)
(c) two-phase weighted likelihood (WL)
(d) two-phase pseudo- or profile likelihood (PL)
(e) two-phase maximum likelihood (ML)

The operating characteristics are evaluated using the Monte Carlo sampling distribution of each estimator. The latter is generated using the following steps:

(i) Specify the (joint) marginal exposure distribution of underlying population, using X and N.
(ii) Simulate outcomes for all sum(N) individuals in the population, based on an underlying logistic regression model specified via betaTruth.
(iii) Evaluate the CD estimator on the basis of the complete data.
(iv) Sample either (a) ccDesign controls and cases or (b) sum(n0) controls and sum(n1) cases, (without regard to the strata variable) and evaluate the CC estimator.
(v) Stratify the population according to outcome and the strata argument, to form the phase I data.
(vi) Sample n0 controls and n1 cases from their respective phase I strata.
(vii) Evaluate the WL, PL and ML estimators.
(viii) Repeat steps (ii)-(vii) B times.

Both the CD and CC estimators are evaluated using the generic glm function. The three two-phase estimators are based on the tps function.

The correspondence between betaTruth and X, specifically the ordering of elements, is based on successive use of factor to each column of X which is expanded via the expandX argument. Each exposure that is expanded must conform to a 0, 1, 2, ... integer-based coding convention.

The etaTerms argument is useful when only certain columns in X are to be included in the model. In the context of the two-phase design, this might be the case if phase I stratifies on some surrogate exposure and a more detailed/accurate measure is to be included in the main model.

When evaluating operating characteristics, some simulated datasets may result in unusually large or small estimates. Particularly, when the the case-control/phase II sample sizes are small. In some settings, it may be desirable to truncate the Monte Carlo sampling distribution prior to evaluating operating characteristics. The threshold argument indicates the interval beyond which point estimates are ignored. The default is such that all B datasets are kept.

NOTE: In some settings, the current implementation of the ML estimator returns point estimates that do not satisfy the phase I and/or phase II constraints. If this is the case a warning is printed and the 'fail' elements of the returned list is set to TRUE. An example of this is phenomenon is given the help file for tps. When this occurs, tpsSim() considers ML estimation for the particular dataset to have failed.

tpsSim() returns an object of class 'tpsSim', a list containing all the input arguments, as well list results with the following components:

`betaMean`	Mean of the Monte Carlo sampling distribution for each regression coefficient estimator.
`betaMeanBias`	Bias based on the mean, calculated as `betaMean` - `betaTruth`.
`betaMeanPB`	Percent bias based on mean, calculated as ((`betaMean` - `betaTruth`) / `betaTruth`) x 100. If a regression coefficient is zero, percent bias is not calculated and an NA is returned.
`betaMedian`	Median of the Monte Carlo sampling distribution for each regression coefficient estimator.
`betaMedianBias`	Bias based on the median, calculated as `betaMedian` - `betaTruth`.
`betaMedianPB`	Percent bias based on median, calculated as ((`betaMedian` - `betaTruth`) / `betaTruth`) x 100. If a regression coefficient is zero, median percent bias is not calculated and an NA is returned.
`betaSD`	Standard deviation of the Monte Carlo sampling distribution for each regression coefficient estimator.
`betaMSE`	Mean squared error of the Monte Carlo sampling distribution for each regression coefficient estimator.
`seMean`	Mean of the Monte Carlo sampling distribution for the standard error estimates reported by glm().
`seRatio`	Ratio of the mean reported standard error to the standard deviation of the Monte Carlo sampling distribution for each regression coefficient estimator. The ratio is multiplied by 100.
`betaCP`	Coverage probability for Wald-based confidence intervals, evaluated on the basis of an `alpha` type I error rate.
`betaPower`	Power against the null hypothesis that the regression coefficient is zero for a Wald-based test with an `alpha` type I error rate.
`betaRU`	The ratio of the standard deviation of the Monte Carlo sampling distribution for each estimator to the standard deviation of the Monte Carlo sampling distribution for the estimator corresponding to `refDesign`. The ratio is multiplied by 100.

Also returned is an object failed which is a vector consisting of the number of datasets excluded from the power calculations (i.e. set to NA), for each simulation performed. For the evaluation of general operating characteristics, the four reasons are: (1) lack of convergence indicated by NA point estimates returned by glm or tps; (2) lack of convergence indicated by NA standard error point estimates returned by glm or tps; (3) exclusion on the basis of the threshold argument; and (4) for the ML estimator only, the phase I and/or phase II constraints are not satisfied.

A generic print method provides formatted output of the results.

Sebastien Haneuse, Takumi Saegusa

Prentice, R. and Pyke, R. (1979) "Logistic disease incidence models and case-control studies." Biometrika 66:403-411.

Breslow, N. and Chatterjee, N. (1999) "Design and analysis of two phase studies with binary outcome applied to Wilms tumour prognosis." Applied Statistics 48:457-468.

Haneuse, S. and Saegusa, T. and Lumley, T. (2011) "osDesign: An R Package for the Analysis, Evaluation, and Design of Two-Phase and Case-Control Studies." Journal of Statistical Software, 43(11), 1-29.

##
data(Ohio)

## Design matrix that forms the basis for model and 
## phase I strata specification
##
XM <- cbind(Int=1, Ohio[,1:3])      ## main effects only
XI <- cbind(XM, SbyR=XM[,3]*XM[,4]) ## interaction between sex and race

## 'True' values for the underlying logistic model
##
fitM <- glm(cbind(Death, N-Death) ~ factor(Age) + Sex + Race, data=Ohio,
            family=binomial)
fitI <- glm(cbind(Death, N-Death) ~ factor(Age) + Sex * Race, data=Ohio,
            family=binomial)
##
betaNamesM <- c("Int", "Age1", "Age2", "Sex", "Race")
betaNamesI <- c("Int", "Age1", "Age2", "Sex", "Race", "SexRace")

## Two-phase design stratified by age
##   * sample 50 from each of 6 phase I strata
##   * show primary output (% bias, 95% CP, relative uncertainty)
##

ocAge <- tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N, strata=2,
                nII0=c(50,50,50), nII1=c(50,50,50), betaNames=betaNamesM,
                monitor=100)
ocAge

## All possible balanced two-phase designs
##  * 250 controls and 250 cases
##  * only show the relative uncertainty output
##

ocAll <- tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N, strata=0,
                nII=c(250, 250), betaNames=betaNamesM,
                monitor=100)
ocAll

## Two-phase design stratified by race
##  * balanced solely on outcome
##  * only show the relative uncertainty output
##

ocRace <- tpsSim(B=1000, betaTruth=fitI$coef, X=XI, N=Ohio$N, strata=4,
                 nII0=c(200, 50), nII1=c(200, 50), betaNames=betaNamesI,
                 monitor=100)
ocRace

## Comparison of two case-control designs
##  * 240 controls and 260 cases
##  * 240 controls and 260 cases
##  * only show the relative uncertainty output
##

ocCC <- tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=Ohio$N, strata=1,
               nII0=240, nII1= 260, nCC=c(200,300),
               betaNames=betaNamesM,
               monitor=100)
ocCC

## Illustration of setting where one of the covariates is continuous
##  * restrict to black and white children born in 2003
##  * dichotomize smoking, mothers age, weight gain during pregnancy and weight weight
##  * note the use of 'etaTerms' to restrict to specific variables (the majority of which
##    are created)
##  * note the use of 'strata=list(11,12)' to simultaneously investigate stratification by
##     - 11th column in XM: derived 'smoker' variable
##     - 12th column in XM: derived 'teen' variable
##
## Warning: takes a long time!
##

data(infants)
##
infants <- infants[infants$year == 2003,]
##
infants$race[!is.element(infants$race, c(1,2))] <- NA  ## White/Black = 0/1
infants$race <- infants$race - 1
infants <- na.omit(infants)
##
infants$smoker  <- as.numeric(infants$cignum > 0)
infants$teen    <- as.numeric(infants$mage < 20)
infants$lowgain <- as.numeric(infants$gained < 20)
infants$lbw     <- as.numeric(infants$weight < 2500)
infants$weeks   <- (infants$weeks - 36) / 4       ## estimate a 4-week contrast
##
fitM  <- glm(death ~ smoker + teen + race + male+ lowgain + lbw + weeks,
             data=infants,
             family=binomial)
betaM <- fitM$coef
XM    <- cbind(Int=1, infants)
etaM  <- c("Int", "smoker", "teen", "race", "male", "lowgain", "lbw", "weeks")
##
tpsSim(B=1000, betaTruth=fitM$coef, X=XM, N=rep(1, nrow(XM)), strata=list(11,12),
       expand="none", etaTerms=etaM, nII=c(1000,1000),
       threshold=c(-20,20),
       monitor=100)