simulateSNPglm: Simulation of SNP data
In scrime: Analysis of High-Dimensional Categorical Data Such as SNP Data

Description Usage Arguments Details Value Author(s) References See Also Examples

Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify either a binary or a quantitative response by a logistic or linear regression model, respectively.

1
2
3

simulateSNPglm(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, 
   beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, p.cutoff = 0.5, 
   err.fun = NULL, rand = NA, ...)

`n.obs`	number of observations that should be generated.
`n.snp`	number of SNPs that should be generated.
`list.ia`	a list consisting of numeric vectors (or values) specifying the genotypes of the SNPs that should be explanatory for the response. Each of these vectors must be composed of some of the numbers -3, -2, -1, 1, 2, 3, where 1 denotes the homozygous reference genotype, 2 the heterozygous genotype, and 3 the homozygous variant genotype, and a minus before these numbers means that the corresponding SNP should be not of this genotype. If, e.g., one of the vectors is given by `c(1, -1, -3)` and the corresponding vector in `list.snp` is `c(5, 7, 8)`, then the corresponding interaction used in the regression model to specify the response is `(SNP5 == 1) & (SNP7 != 1) & (SNP8 != 3)`. For more details, see Details. Must be specified if `list.snp` is specified. If both `list.ia` and `list.snp` are `NULL`, then the interactions shown in the Details section are used.
`list.snp`	a list consisting of numeric vectors specifying the SNPs that compose the interactions used in the regression model. Each of these vectors must have the same length as the corresponding vector in `list.ia`, and must consist of integers between 1 and `n.snp`, where the integer i corresponds to the ith column of the simulated SNP matrix. If `list.ia` is specified but not `list.snp`, then the first n SNPs are used to generate the interactions, where n is the total number of values in `list.ia`. For the case that both `list.ia` and `list.snp` are not specified, see Details.
`beta0`	a numeric value specifying the intercept of the regression model.
`beta`	a non-negative numeric value or vector of the same length as `list.ia` (i.e.\ one numeric value for each interaction) specifying the parameters in the regression model.
`maf`	either an integer, or a vector of length 2 or `n.snp` specifying the minor allele frequency. If an integer, all the SNPs will have the same minor allele frequency. If a vector of length `n.snp`, each SNP will have the minor allele frequency specified in the corresponding entry of `maf`. If length 2, then `maf` is interpreted as the range of the minor allele frequencies, and for each SNP, a minor allele frequency will be randomly drawn from a uniform distribution with the range given by `maf`.
`sample.y`	should the values of the response in the logistic regression model be randomly drawn using the probabilities of the respective observations for being a case? If `FALSE`, then the response value of an observation is 1 if its probability for being a case is larger than `p.cutoff`, and otherwise the observation is classified as a control (i.e.\ the value of the response is 0). Ignored if `err.fun` is specified.
`p.cutoff`	a probability, i.e.\ a numeric value between 0 and 1, naming the cutoff for an observation to be called a case if `sample.y = FALSE`. For details, see `sample.y`. Ignored if `sample.y = TRUE` or `err.fun` is specified.
`err.fun`	a function for generating the error that is added to the linear model to determine the value of the (quantitative) response. If `NULL`, a logistic regression model is fitted. If specified, a linear model is fitted. Therefore, this argument is used to differ between the two types of models. The specified function must have as first argument the number of values that should be generated and as output a vector consisting of these values. Further arguments can also be specified because of `...` in `simulateSNPglm`. If, e.g., `err.fun = rnorm`, then `rnorm(n.obs)` will be called to generate `n.obs` observations from a standard normal function.
`rand`	a numeric value for setting the random number generator in a reproducible state.
`...`	further arguments of the function specified by `err.fun`.

simulateSNPglm first simulates a matrix consisting of n.obs observations and n.snp SNPs, where the minor allele frequencies of these SNPs are given by maf.

Note that all SNPs are currently simulated independently of each other such that they are unlinked.

Afterwards, the response is determined by a regression model using the specifications of list.ia, list.snp, beta0 and beta. Depending on whether err.fun is specified or not, a linear or a logistic regression model is used, respectively, i.e.\ the response Y is continuous or binary.

By default, a logistic regression model

logit(Prob(Y = 1)) = beta0 + beta[1] * L1 + beta[2] * L2 + ...

is fitted, since err.fun = NULL.

If both list.ia and list.snp are NULL, then interactions similar to the one considered, e.g., in Nunkesser et al.\ (2007) or Schwender et al.\ (2007) are used, i.e.\

L1 = (SNP6 != 1) & (SNP7 == 1)\ \ \ and

L2 = (SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1),

by setting list.ia = list(c(-1, 1), c(1, 1, 1)) and list.snp = list(c(6, 7), c(3, 9, 10)).

Using the above model Prob(Y = 1) is computed for each observation, and its value of the response is determined either by a random draw from a Bernoulli distribution using this probability (if sample.y = TRUE), or by evaluating if Prob(Y = 1) > p.cutoff (if sample.y = FALSE).

If err.fun is specified, then the linear model

Y = beta0 + beta[1] * L1 + beta[2] * L2 + ... + error

is used to determine the values of the response Y, where the values for error are given by the output of a call of err.fun.

An object of class simSNPglm consisting of

`x`	a matrix with `n.obs` rows and `n.snp` columns containing the simulated SNP values.
`y`	a vector of length `n.obs` composed of the values of the response.
`beta0`	the value of the intercept.
`beta`	the vector of parameters.
`ia`	a character vector naming the explanatory interactions.
`maf`	a vector of length `n.snp` composed of the minor allele frequencies.
`prob`	a vector of length `n.obs` consisting of the values of Prob(Y = 1) (if `err.fun = NULL`).
`err`	a vector of length `n.obs` composed of the values of the error (if `err.fun` is specified).
`p.cutoff`	the value of `p.cutoff` (if `err.fun = NULL` and `sample.y = FALSE`).
`err.call`	a character string naming the call of the error function (if `err.fun` is specified).

Holger Schwender, holger.schwender@udo.edu

Nunkesser, R., Bernholt, T., Schwender, H., Ickstadt, K. and Wegener, I.\ (2007). Detecting High-Order Interactions of Single Nucleotide Polymorphisms Using Genetic Programming. Bioinformatics, 23, 3280-3288.

Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.

simulateSNPs, summary.simSNPglm, simulateSNPcatResponse

## Not run: 
# The simulated data set described in Details.

sim1 <- simulateSNPglm()
sim1

# A bit more information: Table of probabilities of being a case
# vs. numbers of cases and controls.

summary(sim1)

# Calling an observation a case if its probability of being
# a case is larger than 0.5 (the default for p.cutoff).

sim2 <- simulateSNPglm(sample.y = FALSE)
summary(sim2)

# If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3) and
# ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions
# (or variables) that are explanatory for the response,
# list.ia and list.snp are specified as follows.

list.ia <- list(c(-2, 1), 3, c(-1,3))
list.snp <- list(c(4, 3), 5, c(12,9))

# The binary response and the data set consisting of 
# 600 observations and 25 SNPs, where the minor allele
# frequency of each SNP is randomly drawn from a
# uniform distribution with minimum 0.1 and maximum 0.4,
# is then generated by

sim3 <- simulateSNPglm(n.obs = 600, n.snp = 25,
  list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4))
sim3

summary(sim3)
  
# If the response should be quantitative, err.fun has
# to be specified. To use a normal distribution with mean 0
# (default in rnorm) and a standard deviation of 2 
# as the distribution of the error, call

simulateSNPglm(err.fun = rnorm, sd = 2)
 


## End(Not run)