data_generator: Functions for Simulating Data
In suzhesuzhe/GEM: Generated Effect Modifier


The following three data generators are used in simulations that investigate the properties of GEM. Each constructs a specific type of data set for the case of two treatment groups. For details, see E Petkova, T Tarpey, Z Su, and RT Ogden. Generated effect modifiers (GEMs) in randomized clinical trials. Biostatistics (first published online July 27, 2016). doi: 10.1093/biostatistics/kxw035.

data_generator1(d, R2, v2, n, co, beta1, inter)

data_generator2(n, co, R2, bet, inter)

data_generator3(n, co, bet, inter)

`d`	A scalar indicating the effect size of the GEM when the data is generated under a GEM model
`R2`	A scalar indicating the proportion of explained variance R^2 for the entire data set
`v2`	A scalar indicating the proportion of explained variance R^2 for the first treatment group
`n`	A scalar indicating the number of observations in each treatment group; both groups are assumed to have the same size
`co`	A p by p positive semidefinite matrix indicating the covariance matrix of the covariates
`beta1`	A vector of length p giving the regression coefficients for the first treatment group
`inter`	A vector of length 2 recording the intercepts β_{10},β_{20} for the two treatment groups respectively
`bet`	A list with two elements, each a vector of length p, giving the regression coefficients for the two treatment groups respectively

data_generator1 is used to create data where the outcome is a linear function of the covariates

y_j = β_{j0} + Xβ_j + ε, j = 1, 2,

and the coefficients of the covariates are proportional between the two treatment groups: β_2 = b * β_1. This type of data set matches the motivation of the GEM algorithm exactly. β_1 is passed as an argument of the function, while β_2 = b * β_1 is derived by controlling the R^2 of the whole data set and the effect size. For details, see Kraemer, H. C. (2013). Discovering, comparing, and combining moderators of treatment on outcome after randomized clinical trials: a parametric approach. Statistics in Medicine, 32(11), 1964-1973.
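The proportional-coefficient model above can be sketched in base R as follows. This is an illustration only, not the package's implementation: the proportionality constant `b`, the error standard deviation `sigma`, and the helper `draw_X` are assumptions chosen by hand, whereas data_generator1 derives them from `d`, `R2` and `v2`. The sketch also assumes `co` is positive definite so that `chol()` succeeds.

```r
# Illustrative sketch of the proportional-coefficient model (not GEM's own code).
set.seed(1)
p <- 5                                  # number of covariates
n <- 200                                # observations per treatment group
co <- matrix(0.2, p, p)
diag(co) <- 1                           # covariance matrix of the covariates
beta1 <- rep(1, p)                      # coefficients for the first group
b <- 0.5                                # assumed proportionality constant
beta2 <- b * beta1                      # coefficients for the second group
inter <- c(0, 0)                        # intercepts for the two groups
sigma <- 1                              # assumed standard deviation of the error

# Hypothetical helper: draw n rows from N(0, co) using the Cholesky factor of co.
draw_X <- function(n, co) matrix(rnorm(n * ncol(co)), n) %*% chol(co)

X1 <- draw_X(n, co)
X2 <- draw_X(n, co)
y1 <- inter[1] + drop(X1 %*% beta1) + rnorm(n, sd = sigma)
y2 <- inter[2] + drop(X2 %*% beta2) + rnorm(n, sd = sigma)

# Treatment index first, outcome second, covariates in the remaining columns.
sim <- data.frame(trt = rep(1:2, each = n), y = c(y1, y2), rbind(X1, X2))

# Population variance explained by the linear part in group 1, and the
# corresponding within-group R^2 (here 9 / (9 + 1) = 0.9).
signal1 <- drop(t(beta1) %*% co %*% beta1)
r2_group1 <- signal1 / (signal1 + sigma^2)
```

Fixing `b` and `sigma` directly keeps the sketch short; the point is only to show how the two coefficient vectors and the explained variance relate to each other.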

data_generator2 is similar to data_generator1 except that the coefficients of the covariates are not necessarily proportional. Hence both coefficient vectors, β_1 and β_2, must be specified as arguments of the function.

data_generator3 constructs a data set in which the outcome under each treatment condition is given for all subjects, and no error is added to the mean outcome. This generator is useful for obtaining the "true" value of a treatment decision. The model is similar to that of data_generator2, without the error term:

y_j = β_{j0} + Xβ_j, j = 1,2.
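To make the idea of a "true" value concrete, the sketch below builds noiseless outcomes under both treatments and averages them under three assignment schemes. It is a hedged illustration rather than the package's code: it assumes that larger outcomes are better, that `co` is positive definite, and the decision rule `rule_first` is purely hypothetical.

```r
# Illustrative sketch: noiseless potential outcomes and the value of a decision rule.
set.seed(2)
p <- 5
n <- 1000
co <- matrix(0.2, p, p)
diag(co) <- 1
bet <- list(rep(1, p), c(1, -1, 1, -1, 1))   # illustrative coefficient vectors
inter <- c(0, 0)

X <- matrix(rnorm(n * p), n) %*% chol(co)    # covariates drawn from N(0, co)
y0 <- inter[1] + drop(X %*% bet[[1]])        # outcome under the first treatment
y1 <- inter[2] + drop(X %*% bet[[2]])        # outcome under the second treatment

# Assumption: larger outcomes are better.
oracle <- mean(pmax(y0, y1))                 # everyone receives their better treatment
inv_oracle <- mean(pmin(y0, y1))             # everyone receives their worse treatment

# Hypothetical rule: give the first treatment when the second covariate is positive.
rule_first <- X[, 2] > 0
rule_value <- mean(ifelse(rule_first, y0, y1))
```

Because no error is added, `rule_value` is the exact average outcome of the rule on these subjects, and it always lies between `inv_oracle` and `oracle`.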

The three functions return different outputs:

For the function data_generator1

dat A data frame whose first and second columns are the treatment group index and the outcome, respectively; each remaining column is a covariate.
bet A list with two elements, each a vector of length p, giving the regression coefficients for the two treatment groups respectively
error_12 A vector of length three giving the standard deviation of ε and the variance explained by the linear part in the first and second treatment groups, respectively.

For the function data_generator2

dat A data frame whose first and second columns are the treatment group index and the outcome, respectively; each remaining column is a covariate.
bet A list with two elements, each a vector of length p, giving the regression coefficients for the two treatment groups respectively
error A scalar giving the standard deviation of ε

For the function data_generator3

y0 Outcome vector under the first treatment assignment
y1 Outcome vector under the second treatment assignment
X Design matrix for the covariates
oracle Average of the outcome if each subject takes the optimal treatment assignment
invOracle Average of the outcome if each subject does not take the optimal treatment assignment

# construct the covariance matrix: unit variances, pairwise covariance 0.2
co <- matrix(0.2, 30, 30)
diag(co) <- 1
dataEx <- data_generator1(d = 0.3, R2 = 0.5, v2 = 1, n = 3000, 
                           co = co, beta1 = rep(1,30),inter = c(0,0))
# check the R squared of the simulated data set
dat <- dataEx[[1]]
summary(lm(V2~factor(trt)*(V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14+V15+V16+
V17+V18+V19+V20+V21+V22+V23+V24+V25+V26+V27+V28+V29+V30+V31+V32),data=dat))

# noiseless outcomes under both treatments, reusing the coefficients returned above
bigData <- data_generator3(n = 10000, co = co, bet = dataEx[[2]], inter = c(0, 0))