EBAM Analysis for Categorical Data

Description

Generates the required statistics for an Empirical Bayes Analysis of Microarrays (EBAM) of categorical data such as SNP data.

Should not be called directly, but via ebam(..., method = chisq.ebam).

This function replaces cat.ebam.

Usage

chisq.ebam(data, cl, approx = NULL, B = 100, n.split = 1, 
   check.for.NN = FALSE, lev = NULL, B.more = 0.1, B.max = 50000,
   n.subset = 10, fast = FALSE, n.interval = NULL, df.ratio = 3,
   df.dens = NULL, knots.mode = NULL, type.nclass = "wand",
   rand = NA)

Arguments

data

a matrix, data frame, or list. If a matrix or data frame, each row must correspond to a variable (e.g., a SNP), and each column to a sample (i.e., an observation). If the number of observations is very large, it is better to specify data as a list of matrices, where each matrix represents one group and summarizes how many observations in this group show which level at which variable. These matrices can be generated using the function rowTables from the package scrime. For details on how to specify this list, see the examples section on this man page, and the help for rowChisqMultiClass in the package scrime.
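
The list form of data can be pictured with a small base R sketch. It assumes that each group's matrix has one row per variable and one column per level; count_levels is a hypothetical helper written only for illustration, and in practice rowTables from scrime should be used.

```r
# Illustrative only: count_levels is a hypothetical stand-in for
# scrime::rowTables, assuming a variables x levels matrix of counts.
set.seed(1)
mat <- matrix(sample(3, 400, TRUE), 10)   # 10 variables, 40 observations
cl <- rep(1:2, each = 20)

count_levels <- function(m, levels = 1:3) {
  # For each row (variable), count how often each level occurs
  tab <- t(apply(m, 1, function(x) table(factor(x, levels = levels))))
  colnames(tab) <- levels
  tab
}

cases <- count_levels(mat[, cl == 1])
controls <- count_levels(mat[, cl == 2])
ltabs_sketch <- list(cases, controls)     # one count matrix per group
```

Each row of a group's matrix sums to the number of observations in that group.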

cl

a numeric vector of length ncol(data) indicating to which class a sample belongs. Must consist of the integers between 1 and c, where c is the number of different groups. Needs only to be specified if data is a matrix or a data frame.
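
As a minimal sketch of this requirement, arbitrary group labels can be recoded to the integers 1 to c in base R and checked before calling ebam. The helper check_cl and the example labels are hypothetical and not part of siggenes.

```r
# Hypothetical example labels, one per sample (column of data)
labels <- c("case", "control", "case", "control")

# factor() sorts the distinct labels; as.integer() maps them to 1, ..., c
cl <- as.integer(factor(labels))

# Hypothetical check: cl has one entry per column of data and uses
# exactly the integers 1, ..., c
check_cl <- function(cl, data) {
  n_groups <- length(unique(cl))
  length(cl) == ncol(data) && all(sort(unique(cl)) == seq_len(n_groups))
}

data <- matrix(sample(3, 20, TRUE), 5)    # 5 variables, 4 samples
check_cl(cl, data)
```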

approx

should the null distribution be approximated by a ChiSquare-distribution? Currently only available if data is a matrix or data frame. If not specified, approx = FALSE is used, and the null distribution is estimated by employing a permutation method.

B

the number of permutations used in the estimation of the null distribution, and hence, in the computation of the expected z-values.

n.split

number of chunks into which the variables are split in the computation of the values of the test statistic. Currently only available if approx = TRUE and data is a matrix or data frame. By default, the test scores of all variables are calculated simultaneously. If the number of variables or observations is large, setting n.split to a value larger than 1 can help to avoid memory problems.

check.for.NN

if TRUE, it is checked whether any of the genotypes is equal to "NN". Can be very time-consuming when the data set is high-dimensional.

lev

numeric or character vector specifying the codings of the levels of the variables/SNPs. Can only be specified if data is a matrix or a data frame. Needs to be specified only if the variables are not coded by the integers between 1 and the number of levels. Can also be a list. In this case, each element of this list must be a numeric or character vector specifying the codings, where all elements must have the same length.
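
One way to picture what lev conveys is a lookup from the original codings to the integers 1 to the number of levels. The following base R sketch uses made-up genotype codes and is not necessarily how the package handles lev internally.

```r
# Hypothetical codings of the three genotype levels
lev <- c("AA", "AB", "BB")

# Two variables (rows) observed on four samples (columns)
geno <- matrix(c("AA", "AB", "BB", "AB",
                 "BB", "BB", "AA", "AB"), nrow = 2, byrow = TRUE)

# match() returns the position of each genotype in lev, i.e. 1, 2, or 3
recoded <- matrix(match(geno, lev), nrow = nrow(geno))
recoded
```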

B.more

a numeric value. If the number of all possible permutations is smaller than or equal to (1+B.more)*B, full permutation will be done. Otherwise, B permutations are used.

B.max

a numeric value. If the number of all possible permutations is smaller than or equal to B.max, B randomly selected permutations will be used in the computation of the null distribution. Otherwise, B random draws of the group labels are used.
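
The interplay of B, B.more, and B.max can be sketched as a simple decision rule. This is only an illustration under the assumption that the number of possible permutations is the number of distinct assignments of the group labels to the samples (a multinomial coefficient); the functions n_perms and perm_strategy are hypothetical, and the package may count permutations differently.

```r
# Assumed count: n! / (n_1! * ... * n_c!) distinct label assignments
n_perms <- function(cl) {
  counts <- as.vector(table(cl))
  exp(lfactorial(length(cl)) - sum(lfactorial(counts)))
}

# Hypothetical decision rule following the descriptions of B.more and B.max
perm_strategy <- function(cl, B = 100, B.more = 0.1, B.max = 50000) {
  n <- n_perms(cl)
  if (n <= (1 + B.more) * B) {
    "full permutation"
  } else if (n <= B.max) {
    "B randomly selected permutations"
  } else {
    "B random draws of the group labels"
  }
}

perm_strategy(rep(1:2, each = 3))    # 20 possible assignments
perm_strategy(rep(1:2, each = 5))    # 252 possible assignments
perm_strategy(rep(1:2, each = 20))   # far more than B.max
```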

n.subset

a numeric value indicating into how many subsets the B permutations are divided when computing the permuted z-values. Please note that the meaning of n.subset differs between the SAM and the EBAM functions.

fast

if FALSE the exact number of permuted test scores that are more extreme than a particular observed test score is computed for each of the variables/SNPs. If TRUE, a crude estimate of this number is used.

n.interval

the number of intervals used in the logistic regression with repeated observations for estimating the ratio f0/f (if approx = FALSE), or in the Poisson regression used to estimate the density of the observed z-values (if approx = TRUE). If NULL, n.interval is set to 139 if approx = FALSE, and estimated by the method specified by type.nclass if approx = TRUE.

df.ratio

integer specifying the degrees of freedom of the natural cubic spline used in the logistic regression with repeated observations. Ignored if approx = TRUE.

df.dens

integer specifying the degrees of freedom of the natural cubic spline used in the Poisson regression to estimate the density of the observed z-values. Ignored if approx = FALSE. If NULL, df.dens is set to 3 if the degrees of freedom of the approximated null distribution, i.e., the ChiSquare-distribution, are less than or equal to 2, and otherwise df.dens is set to 5.

knots.mode

if TRUE, the df.dens - 1 knots are centered around the mode and not the median of the density when fitting the Poisson regression model. Ignored if approx = FALSE. If not specified, knots.mode is set to TRUE if the degrees of freedom of the approximated null distribution, i.e., the ChiSquare-distribution, are larger than or equal to 3, and otherwise knots.mode is set to FALSE. For details on this density estimation, see denspr.

type.nclass

character string specifying the procedure used to compute the number of cells of the histogram. Ignored if approx = FALSE or n.interval is specified. Can be either "wand" (default), "scott", or "FD". For details, see denspr.
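
To get a feeling for how such rules choose the number of histogram cells, the base R functions nclass.scott and nclass.FD from grDevices can be applied to simulated values. This is only a sketch: whether denspr uses exactly these functions for "scott" and "FD" is an assumption, and the "wand" rule is not shown here.

```r
# Simulated stand-in for observed z-values (illustrative data only)
set.seed(1)
z <- rchisq(1000, df = 2)

# Number of histogram cells suggested by Scott's rule
n_scott <- grDevices::nclass.scott(z)

# Number of histogram cells suggested by the Freedman-Diaconis rule
n_fd <- grDevices::nclass.FD(z)

c(scott = n_scott, FD = n_fd)
```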

rand

numeric value. If specified, i.e., not NA, the random number generator will be put into a reproducible state.
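
The effect of a reproducible state can be sketched with set.seed from base R. The helper draw_labels is hypothetical, and treating rand like a seed passed to set.seed is an assumption made for illustration.

```r
cl <- rep(1:2, each = 20)

# Hypothetical helper: seed the generator only if rand is not NA,
# then draw a random permutation of the group labels
draw_labels <- function(cl, rand = NA) {
  if (!is.na(rand)) set.seed(rand)
  sample(cl)
}

first <- draw_labels(cl, rand = 123)
second <- draw_labels(cl, rand = 123)
identical(first, second)   # TRUE: same seed, same permutation
```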

Details

For each variable, Pearson's Chi-Square statistic is computed to test if the distribution of the variable differs between several groups. Since only one null distribution is estimated for all variables as proposed in the original EBAM application of Efron et al. (2001), all variables must have the same number of levels/categories.
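
The statistic computed for each variable can be sketched in base R as follows. The helper row_chisq is hypothetical and written for clarity rather than speed; it is not the implementation used by the package.

```r
# Hypothetical helper: Pearson's chi-square statistic for one variable,
# given its values x and the class labels cl
row_chisq <- function(x, cl) {
  obs <- table(x, cl)                                     # levels x groups counts
  expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected under independence
  sum((obs - expct)^2 / expct)
}

set.seed(1)
mat <- matrix(sample(3, 400, TRUE), 10)   # 10 variables, 40 observations
cl <- rep(1:2, each = 20)

# One statistic per variable (row)
stats <- apply(mat, 1, row_chisq, cl = cl)

# Cross-check the first variable against chisq.test from the stats package
chisq_ref <- unname(suppressWarnings(
  chisq.test(table(mat[1, ], cl), correct = FALSE)$statistic))
all.equal(stats[1], chisq_ref)
```

With three levels and two groups, each statistic would be compared to a chi-square distribution with (3 - 1) * (2 - 1) = 2 degrees of freedom when approx = TRUE.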

Value

A list containing statistics required by ebam.

Warning

This procedure will only work correctly if all SNPs/variables have the same number of levels/categories.

Author(s)

Holger Schwender, holger.schw@gmx.de

References

Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2001). Empirical Bayes Analysis of a Microarray Experiment. JASA, 96, 1151-1160.

Schwender, H. and Ickstadt, K. (2008). Empirical Bayes Analysis of Single Nucleotide Polymorphisms. BMC Bioinformatics, 9, 144.

Schwender, H., Krause, A., and Ickstadt, K. (2003). Comparison of the Empirical Bayes and the Significance Analysis of Microarrays. Technical Report, SFB 475, University of Dortmund, Germany.

See Also

EBAM-class, ebam, chisq.stat

Examples

## Not run: 
  # Generate a random 1000 x 40 matrix consisting of the values
  # 1, 2, and 3, and representing 1000 variables and 40 observations.
  
  mat <- matrix(sample(3, 40000, TRUE), 1000)
  
  # Assume that the first 20 observations are cases, and the
  # remaining 20 are controls.
  
  cl <- rep(1:2, each=20)
  
  # Then an EBAM analysis for categorical data can be done by
  
  out <- ebam(mat, cl, method=chisq.ebam, approx=TRUE)
  out
  
  # approx is set to TRUE to approximate the null distribution
  # by the ChiSquare-distribution (usually, for such a small
  # number of observations this might not be a good idea
  # as the assumptions behind this approximation might not
  # be fulfilled).
  
  # The same results can also be obtained by employing
  # contingency tables, i.e. by specifying data as a list.
  # For this, we need to generate the tables summarizing
  # groupwise how many observations show which level at
  # which variable. These tables can be obtained by
  
  library(scrime)
  cases <- rowTables(mat[, cl==1])
  controls <- rowTables(mat[, cl==2])
  ltabs <- list(cases, controls)
  
  # And the same EBAM analysis as above can then be 
  # performed by 
  
  out2 <- ebam(ltabs, method=chisq.ebam, approx=TRUE)
  out2

## End(Not run)
