assocTest | R Documentation |
Method for performing a kernel-based association test given a genotype, VCF file, or kernel matrix
## S4 method for signature 'GenotypeMatrix,NullModel'
assocTest(Z, model, ranges,
kernel=c("linear.podkat", "localsim.podkat",
"quadratic.podkat", "linear.SKAT",
"localsim.SKAT", "quadratic.SKAT"),
width=1000, weights=NULL, weightFunc=betaWeights(),
method=NULL, adj=c("automatic", "none", "force"),
pValueLimit=0.05)
## S4 method for signature 'matrix,NullModel'
assocTest(Z, model, method=NULL,
adj=c("automatic", "none", "force"), pValueLimit=0.05)
## S4 method for signature 'TabixFile,NullModel'
assocTest(Z, model, ranges,
kernel=c("linear.podkat", "localsim.podkat",
"quadratic.podkat", "linear.SKAT",
"localsim.SKAT", "quadratic.SKAT"),
cl=NULL, nnodes=1, batchSize=20,
noIndels=TRUE, onlyPass=TRUE, na.limit=1, MAF.limit=1,
na.action=c("impute.major", "omit"),
MAF.action=c("invert", "omit", "ignore"),
sex=NULL, weightFunc=betaWeights(), width=1000,
method=NULL, adj=c("automatic", "none", "force"),
pValueLimit=(0.1 / length(ranges)), tmpdir=tempdir(),
displayProgress=TRUE)
## S4 method for signature 'character,NullModel'
assocTest(Z, model, ...)
Z |
an object of class |
model |
an object of class |
ranges |
an object with genomic regions to be tested; may be
an object of class |
kernel |
determines the kernel that should be used for association testing (see Subsection 9.2 of the package vignette for details) |
width |
tolerance radius parameter for position-dependent kernels “linear.podkat”, “quadratic.podkat”, and “localsim.podkat”; must be single positive numeric value; ignored for kernels “linear.SKAT”, “quadratic.SKAT”, and “localsim.SKAT” (see Subsection 9.2 of the package vignette for details) |
weights |
for the method with signature
|
weightFunc |
function for computing variant weights from minor allele
frequencies (MAFs); see |
method |
identifies the method for computing the p-values. If the
null model is of type “logistic” and small sample correction
is applied (see argument |
adj |
whether or not to use small sample correction for
logistic models (binary trait with covariates). The choice
“none” turns off small sample correction. If “force”
is chosen, small sample correction is turned on unconditionally.
If “automatic” is chosen (default), small sample correction
is turned on if the number of samples does not exceed 2,000. This
argument is ignored for any type of model except “logistic”
and small sample correction is switched off. For details how to
train a null model for small sample correction, see
|
pValueLimit |
if the null model is of type “bernoulli”,
|
cl |
if |
nnodes |
if |
batchSize |
parameter which determines how many
regions of |
noIndels |
if |
onlyPass |
if |
na.limit |
all variants with a missing value ratio above this
threshold in the VCF file |
MAF.limit |
all variants with a minor allele frequency (MAF) above this
threshold in the VCF file |
na.action |
if “impute.major”, all missing values will
be imputed by major alleles before association testing. If
“omit”, all columns containing missing values in the VCF file
|
MAF.action |
if “invert”, all columns with an MAF exceeding 0.5 will be inverted in the sense that all minor alleles will be replaced by major alleles and vice versa. If “omit”, all variants in the VCF file with an MAF greater than 0.5 are ignored. If “ignore”, no action is taken and MAFs greater than 0.5 are kept as they are. |
sex |
if |
tmpdir |
if computations are parallelized over multiple client
processes (see arguments |
displayProgress |
if |
... |
all other parameters are passed on to the |
The assocTest
method is the main function of the podkat
package. For a given genotype and a null model, it performs the actual
association test(s).
For null models of types “linear” and
“logistic” (see NullModel
and
nullModel
), a variance component score test is used
(see Subsection 9.1 of the package vignette for details).
The test relies on the choice of a particular kernel to measure the
pairwise similarities of genotypes. The choice of the kernel can be
made with the kernel
argument (see computeKernel
and Subsection 9.2 of the package vignette for more details).
For null models of type “linear”, the test statistic follows
a mixture of chi-squares distribution. For models of typ
“logistic”, the test statistic approximately follows a mixture
of chi-squares distribution. The computation of p-values for a given
mixture of chi-squares can be done according to Davies (1980) (which is
the default), according to Liu et al. (2009), or using a
modified method similar to the one suggested by
Liu et al. (2009) as implemented in the SKAT package,
too. Which method is used can be controlled using the method
argument. If method according to Davies (1980) fails,
assocTest
resorts to the method by Liu et al. (2009).
See also Subsection 9.1 of the package vignette for more details.
For null models of type “logistic”, the assocTest
method also offers the small sample correction suggested by
Lee et al. (2012). Whether small sample correction is applied,
is controlled by the adj
argument. The additional adjustment
of higher moments as suggested by Lee et al. (2012) is
performed whenever resampled null model residuals are available in
the null model model
(slot res.resampled.adj
, see
NullModel
). In this case, the method
argument controls how the excess kurtosis of test statistics sampled
from the null distribution are computed. The
default setting “unbiased” computes unbiased estimates by
using the exact expected value determined from the mixture components.
The settings “population” and
“sample” use almost unbiased and biased sample statistics,
respectively. The choice “SKAT” uses the same method as
implemented in the SKAT package. See Subsection 9.5 of the
package vignette for more details.
If the null model is of type “bernoulli”, the test statistic follows a mixture of Bernoulli distributions. In this case, an exact p-value is determined that is computed as the probability to observe a test statistic for random Bernoulli-distributed traits (under the null hypothesis) that is at least as large as the observed test statistic. For reasons of computational complexity, this option is limited to sample numbers not larger than 100. See Subsection 9.1 of the package vignette for more details.
The podkat package offers multiple interfaces for association
testing all of which require the second argument model
to be a
NullModel
object.
The simplest method is to call assocTest
for an object
of class GenotypeMatrix
as first argument
Z
. If the ranges
argument is not supplied, a single
association test is performed using the entire genotype matrix
contained in Z
and an object of class
AssocTestResult
is returned. In this case,
all variants need to reside on the same chromosome (compare with
computeKernel
). If the ranges
argument is
specified, each region in ranges
is tested separately and the
result is returned as an AssocTestResultRanges
object.
As said, the simplest method is to store the entire genotype in a
GenotypeMatrix
object and to call assocTest
as described above. This approach has the shortcoming that the entire
genotype must be read (e.g. from a VCF file) and kept in memory as a
whole. For large studies, in particular, whole-genome studies, this is
not feasible. In order to be able to cope with large studies, the
podkat package offers an interface that allows for reading from
a VCF file piece by piece without the need to read and store the entire
genotype at once. If Z
is a TabixFile
object or the name of a VCF file, assocTest
reads from the file
in batches of batchSize
regions, performs the association tests
for these regions, and returns the results as an
AssocTestResultRanges
object. This sequential
batch processing can also be parallelized. The user can either set up
a cluster him-/herself and pass the
SOCKcluster
object as
cl
argument. If the cl
is NULL
, users can
leave the setup of the cluster to assocTest
. In this case, the
only thing necessary is to determine the number of R client processes
by the nnodes
argument. The variant with the VCF interface
supports the same pre-processing and filter arguments as
readGenotypeMatrix
to control which variants are actually
taken into account and how to handle variants with MAFs greater than
50%.
If the argument Z
is a numeric matrix, Z
is
interpreted as a kernel matrix
\boldmath{K}
. Then a single association test is
performed as described above and the result is returned as an
AssocTestResult
object. This allows the user to
use a custom kernel not currently implemented in the podkat
package. The assocTest
function assumes that row and column
objects in the kernel matrix are in the same order. It does not
perform any check whether row and column names are the same or whether
the kernel matrix is actually positive semi-definite. Users should be
aware that running the function for invalid kernels matrices,
i.e. for a matrix that is not positive semi-definite, produces
meaningless results and may even lead to unexpected errors.
Finally, note that the samples in the null model model
and
in the genotype (GenotypeMatrix
object or VCF file)
need not be aligned to each other. If both the samples in model
and in the genotype are named (i.e. row names are defined
for Z
if it is a GenotypeMatrix
object; VCF files
always contain sample names anyway), assocTest
checks if all
samples in model
are present in the genotype. If so, it
selects only those samples from the genotype that occur in the null
model. If not, it quits with an error. If either the samples in the
null model or the genotypes are not named, assocTest
assumes
that the samples are aligned to each other. This applies only if
the number of samples in the null model and the number of genotypes
are the same or if the number of genotypes equals the number of
samples in the null model plus the number of samples that were omitted
from the null model when it was trained (see
NullModel
and nullModel
).
Otherwise, the function quits with an error.
An analogous procedure is applied if the kernel matrix interface is
used (signature matrix,NullModel
).
an object of class AssocTestResult
or
AssocTestResultRanges
(see details above)
Ulrich Bodenhofer
https://github.com/UBod/podkat
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82-93. DOI: \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.ajhg.2011.05.029")}.
Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J., Nickerson, D. A., NHLBI Exome Sequencing Project - ESP Lung Project Team, Christiani, D. C., Wurfel, M. M., and Lin, X. (2012) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 91, 224-237. DOI: \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.ajhg.2012.06.007")}.
Davies, R. B. (1980) The distribution of a linear combination of
\chi^2
random
variables. J. R. Stat. Soc. Ser. C-Appl. Stat. 29,
323-333.
Liu, H., Tang, Y., and Zhang, H. (2009) A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput. Stat. Data Anal. 53, 853-856.
AssocTestResult
,
AssocTestResultRanges
, nullModel
,
NullModel
, computeKernel
,
weightFuncs
, readGenotypeMatrix
,
GenotypeMatrix
,
plot
, qqplot
, p.adjust
,
filterResult
## load genome description
data(hgA)
## partition genome into overlapping windows
windows <- partitionRegions(hgA)
## load genotype data from VCF file
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
Z <- readGenotypeMatrix(vcfFile)
## read phenotype data from CSV file (continuous trait + covariates)
phenoFile <- system.file("examples/example1lin.csv", package="podkat")
pheno.c <- read.table(phenoFile, header=TRUE, sep=",")
## train null model with all covariates in data frame 'pheno'
model.c <- nullModel(y ~ ., pheno.c)
## perform association test
res <- assocTest(Z, model.c, windows)
print(res)
## perform association test using the VCF interface
res <- assocTest(vcfFile, model.c, windows, batchSize=100)
print(res)
## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")
## read phenotype data from CSV file (binary trait + covariates)
phenoFile <- system.file("examples/example1log.csv", package="podkat")
pheno.b <-read.table(phenoFile, header=TRUE, sep=",")
## train null model with all covariates in data frame 'pheno'
model.b <- nullModel(y ~ ., pheno.b)
## perform association test
res <- assocTest(Z, model.b, windows)
print(res)
## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.