createData: Simulate Data
In ptycho: Bayesian Variable Selection with Hierarchical Priors

View source: R/gendata.R

createData

R Documentation

Simulate Data

Description

Create a data object suitable for ptycho.all.

Usage

createData(X, y, omega = NULL, beta = NULL)
createDataBayesModel(mode = c("exchange","pleiotropy","gene"), n, p, q,
                     nreps, tau.min, tau.max, G)
createPubData(mode = c("tinysim","ptychoIn",
                       "exchange","pleiotropy","gene",
                       "actualGeno","actualPheno","corTest",
                       "fixedOmega","uniformEffects"),
              X=NULL, y=NULL, var.detail=NULL, variants=NULL)

Arguments

`X`	Design matrix or alist specifying how to generate such a matrix. If a list, the first entry is a function name and the second is a list of arguments to the function. In `createPubData`, `X` is ignored unless `mode` is “actualGeno”, “actualPheno”, or `corTest`.
`y`	Numeric vector or matrix or list with the following components: `nreps` Number of replicates to simulate `q` Number of responses to generate for each replicate `sd` Standard deviation of the simulated noise In `createPubData`, `y` is ignored unless `mode` is “actualPheno”.
`omega`	Numeric vector or matrix or list specifying how to generate a list with the component `omega`; see Details for its meaning. If this is a list, the first entry is a function name and the second is a list of arguments to the function, which will be prepended by the number of rows in output `X` and the number of columns in output `y`. Only used if `y` is a list.
`beta`	List specifying how to generate a matrix of effect sizes. The first entry of the list is a function name and the second is a list of arguments to the function, which will be prepended by a matrix specifying the variables selected and `y$sd`. Only used if `y` is a list.
`n`	Number of observations to simulate
`p`	Number of covariates to simulate
`q`	Number of responses to simulate for each replicate
`nreps`	Number of replicates to simulate
`mode`	String specifying type of dataset to create: `tinysim` Simulated data included with this package; equivalent to mode `pleiotropy` except that the dataset is tiny, with `n=100`, `p=10`, `q=5`, and `nreps=10` `ptychoIn` Simulated data included with this package; equivalent to mode `gene` except that the dataset is tiny, with `n=3000`, `p=10`, `q=1`, and `nreps=1` `exchange` Create orthogonal `X` and exchangeable variants; `n=5000`, `p=50`, `q=5`, and `nreps=100` `pleiotropy` Create orthogonal `X`, and several variants have nonzero effects on multiple responses; `n=5000`, `p=50`, `q=5`, and `nreps=100` `gene` Create orthogonal `X`, and each group of variants typically has either several or no variants that effect a response; `n=5000`, `p=50`, `q=5`, and `nreps=100` `actualGeno` Simulate responses for input `X` `corTest` Simulate `q=2` responses for input `X`. There will be 10 replicates with the first variant in argument `variants` causal for both responses, 10 with the second variant causal, and 20 with variant `i` causal for response `i`. No other variant will be causal. `actualPheno` Put input `X` and `y` into data object `fixedOmega` Create orthogonal `X`, and each variant has a certain probability of a nonzero effect size `uniformEffects` Same as mode `fixedOmega` except that effect sizes are uniformly rather than normally distributed For `createDataBayesModel`, `mode` must be one of “exchange”, “pleiotropy”, or “gene”.
`tau.min`, `tau.max`	Endpoints of uniform distribution from which to draw `tau`
`G`	Number of groups of covariates; unused if `mode` is not “gene”
`var.detail`	Data frame with row names same as column names of `X`; must have columns “MAF” and “GENE”. Ignored unless `mode` is “actualGeno”.
`variants`	Character vector containing names of two columns of `X`; ignored unless `mode` is “corTest”.

Details

We describe createData and then describe its wrappers createDataBayesModel and createPubData.

Although createData can form the data object required by ptycho.all when X and y are input, it primarily exists to simplify simulating data from Y=X\beta+\epsilon, where \epsilon is normal with mean zero and specified standard deviation and \beta is sparse with entries simulated as specified.

The function generates a specified number of replicates, all of which use the same design matrix X. If this matrix is not input, then its argument must specify a function call to generate it. In either case, suppose X has n rows and p columns.

If the input y is numeric, then it will be used for the lone replicate. If it is a matrix, it must have n rows; let q be its number of columns. If input y is a numeric vector, it must have n entries and will be cast as a matrix with q=1 column. Otherwise, input y is a list specifying, along with the arguments omega and beta, how to simulate the response(s). Because it is useful in analysis of the estimation of the marginal posterior distribution, the returned object always contains, regardless of how X and y are specified, a matrix eta2 with (j,k) entry equal to \mathbf{x}_j^T \mathbf{y}_k / (n \mathbf{y}_k^T \mathbf{y}_k)

If y is to be simulated, the first step is to choose the probability that each covariate is associated with each reponse as specified by the input argument omega. If this argument is a matrix, it must have size p-by-q. If it is not a matrix but is numeric, it will be passed to matrix to create a matrix of the correct size. Otherwise, the matrix for each replicate will be generated by calling the function whose name is given by omega[[1]] with argument list (p, q, omega[[2]]). This function must return a list with component omega set to a p-by-q matrix; the list may also contain additional components. The package contains several functions whose names start with “createOmega” that might guide users in writing their own functions.

The next step is to draw a p-by-q matrix indic.var whose (j,k) entry is equal to one with probability omega[j,k] and zero otherwise. This matrix will be drawn until all column sums are positive.

For each entry in indic.var that is equal to one, the effect size must be drawn. This is done by calling the function whose name is given by beta[[1]] with argument list (indic.var, y$sd, beta[[2]]). This function must return a list with component beta set to a p-by-q matrix; the list may also contain additional components. If indic.var[j,k] is zero, then beta[j,k] should be zero. The package contains functions whose names start with “createBeta” that might guide users in writing their own functions.

Finally, an n-by-q matrix of noise is drawn from N(0,\sigma^2), where \sigma is the input noise.sd, and added to X\beta to obtain y. The column names of each response matrix generated will be y1, y2, and so forth.

The function createPubData generates the data sets used in Stell and Sabatti (2015). For mode equal to “exchange”, “pleiotropy”, or “geno”, it calls createData via createDataBayesModel; otherwise, it calls createData directly. These functions also serve as additional examples of the use of createData. For reproducibility, createPubData first sets the random seed to 1234, except that it is set to 4 when mode equals “ptychoIn” and it does not set it when mode equals “corTest”.

In createDataBayesModel, if mode is “exchange”, then one \omega \sim \mbox{Beta}(12,48) is drawn independently for each trait. If mode is “pleiotropy”, then one probability of association for a trait is drawn from Beta(16,55) for each data set, that probability is used to draw indic.grp for each variant, and then the probability of nonzero indic.var[j,k] is drawn from Beta(48,12) for each nonzero indic.grp[j]. Finally, if mode is “gene”, the process is analogous to pleiotropy except that each trait is simulated independently.

Value

List containing:

`X`	Design matrix
`q`	Number of columns in each response
`noise.sd`	Standard deviation of the simulated noise; `NULL` if input `y` is numeric
`omega`	Input `omega`
`beta`	Input `beta`
`replicates`	List of length `y$nreps` (length 1 if `y` is numeric), each entry of which is a list with the following components: `omega` Matrix containing probabilities of association between covariates and responses; row names are `colnames(X)` and column names are `colnames(y)`; `NULL` if input `y` is numeric `indic.var` Matrix containing ones for associations and zeros otherwise; row and column names are same as for `omega`; `NULL` if input `y` is numeric `beta` Matrix of effect sizes; row and column names are same as for `omega`; `NULL` if input `y` is numeric `y` Response matrix `eta2` Matrix with row names equal to `colnames(X)` and column names equal to `colnames(y)` For `createDataBayesModel` with `mode` that uses a second level of indicator variables, each entry in the `replicate` list also has components `omega.grp` and `indic.grp` containing the intermediate steps of drawing the second-level indicator variable before drawing `omega`. If the argument `beta` to `createData` is “createBetaNormal” (which it is when called by `createDataBayesModel`), then each replicate will also have a component `tau` giving the value drawn by a call to `runif(1, tau.min, tau.max)`.

Author(s)

Laurel Stell and Chiara Sabatti
Maintainer: Laurel Stell <lstell@stanford.edu>

References

Stell, L. and Sabatti, C. (2015) Genetic variant selection: learning across traits and sites, arXiv:1504.00946.

Examples

### EXAMPLE 1
data(tinysim)
# Data generated with mode equal to pleiotropy, so indic.grp exists and
# has an entry for each column in X.
colnames(tinysim$X)
tinysim$replicates[[5]]$indic.grp
# X4, X6, and X9 are associated with some responses.
tinysim$replicates[[5]]$indic.var

### EXAMPLE 2
# Generate miniature data set with information shared across covariates.
set.seed(1234)
tiny1 <- createDataBayesModel(mode="gene", n=100, p=10, q=5, nreps=10,
                              tau.min=0.045, tau.max=0.063, G=2)
# A covariate can only have indic.var=1 if the group it belongs to has
# indic.grp=1.  For example,indic.grp[1,4]=0 implies
# indic.var[groups$group2var[1],4]=0.
tiny1$replicates[[1]]$indic.grp
tiny1$omega[[2]]$groups$group2var[1]
tiny1$replicates[[1]]$indic.var

### EXAMPLE 3
# Alternatively, call createData directly
groups <- createGroupsSim(G=2, p=10)
omegaargs <- list(indic.grp.shape1=16, indic.grp.shape2=55,
                  shape1=48, shape2=12, groups=groups)
betaargs <- list(tau.min=0.045, tau.max=0.063)
set.seed(1234)
tiny2 <- createData(X=list("createOrthogonalX", list(n=100, p=10)),
                    y=list(nreps=10, q=5, sd=1),
                    omega=list("createOmegaCrossVars", omegaargs),
                    beta=list("createBetaNormal", betaargs))
identical(tiny1, tiny2)
### SEE THE CODE FOR createPubData FOR MORE EXAMPLES.

ptycho documentation built on April 12, 2025, 1:15 a.m.