A data set containing the binary outcome and 1028 predictor variables of 400 artificial AML patients.
A data frame with 400 rows and 1029 variables:
binary outcome representing refractory status.
4 binary variables representing variables with a known influence on the outcome.
5 continuous variables representing clinical variables.
19 binary variables representing mutations.
1000 continuous variables representing gene expression data.
We generated the data as follows. We took the empirical correlation matrix of 1028 variables from
315 AML patients and used it as the correlation matrix when generating 1028 multivariate
normally distributed variables with the R function rmvnorm. Because we didn't have a positive
definite matrix, we took the nearest positive definite matrix according to the function
The variables that should be binary were dichotomized so that their marginal probabilities matched
those of the original binary variables they were based on.
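The two steps above (drawing correlated normal variables, then dichotomizing some of them) can be illustrated with a minimal base R sketch. It is not the original generation code: the 3 x 3 correlation matrix and the target probability `p_target` are hypothetical toy values, a Cholesky factor stands in for rmvnorm, and cutting at a normal quantile is only one assumed way to reach a given marginal probability.

```r
set.seed(1)

# Toy correlation matrix (hypothetical values, already positive definite)
R <- matrix(c(1.0, 0.5, 0.3,
              0.5, 1.0, 0.2,
              0.3, 0.2, 1.0), nrow = 3, byrow = TRUE)

n <- 400

# Draw n rows from N(0, R): z has independent N(0, 1) entries, and
# z %*% chol(R) has covariance t(chol(R)) %*% chol(R) = R
z <- matrix(rnorm(n * ncol(R)), nrow = n)
x <- z %*% chol(R)

# Dichotomize the first column so that P(x_bin == 1) is roughly p_target
p_target <- 0.25  # hypothetical marginal probability
x_bin <- as.integer(x[, 1] > qnorm(1 - p_target))

# Empirical marginal probability, for comparison with p_target
mean(x_bin)
```

Each column of `x` has unit variance here, which is why the standard normal quantile `qnorm(1 - p_target)` can serve as the cut point.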
The coefficients were defined by
beta_b1 <- c(0.8, 0.8, 0.6, 0.6)
beta_b2 <- c(rep(0.5,3), rep(0,2))
beta_b3 <- c(rep(0.4, 4), rep(0,15))
beta_b4 <- c(rep(0.5, 5), rep(0.3, 5), rep(0,990)).
We included them in the vector
beta <- c(beta_b1, beta_b2, beta_b3, beta_b4) and calculated
the probability through
pi = exp(β*x) / (1 + exp(β*x))
where x denotes our data matrix
with 1028 predictor variables. Finally, we drew the outcome through
pl_out <- rbinom(400, size = 1, prob = pi).
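The outcome step can be sketched end to end as follows. The coefficient vectors are those given above; the data matrix `x` is a hypothetical stand-in of independent standard normal columns (the real matrix is correlated and partly binary), and the names `eta` and `prob_out` are introduced here only for illustration.

```r
set.seed(2)

# Coefficients as defined in the description
beta_b1 <- c(0.8, 0.8, 0.6, 0.6)
beta_b2 <- c(rep(0.5, 3), rep(0, 2))
beta_b3 <- c(rep(0.4, 4), rep(0, 15))
beta_b4 <- c(rep(0.5, 5), rep(0.3, 5), rep(0, 990))
beta <- c(beta_b1, beta_b2, beta_b3, beta_b4)

# Stand-in data matrix: 400 rows, one column per coefficient (assumption)
x <- matrix(rnorm(400 * length(beta)), nrow = 400)

# Linear predictor for each patient
eta <- drop(x %*% beta)

# Logistic transform; plogis(eta) equals exp(eta) / (1 + exp(eta))
prob_out <- plogis(eta)

# One Bernoulli draw per patient
pl_out <- rbinom(400, size = 1, prob = prob_out)

table(pl_out)
```

The probabilities are stored in `prob_out` rather than `pi` to avoid masking R's built-in constant `pi`.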