sumSTAAR: variant-Set Test for Association using Annotation infoRmation...
In sumFREGAT: Fast Region-Based Association Tests on Summary Statistics

View source: R/sumSTAAR.R

sumSTAAR

R Documentation

variant-Set Test for Association using Annotation infoRmation on summary statistics

Description

STAAR procedure developed by Li et al. (2020) adapted to use on summary statistics, with extentions

Usage


sumSTAAR(score.file, gene.file, genes = 'all', cor.path = 'cor/',
tests = c('BT', 'SKAT', 'ACAT'), beta.par.matrix = rbind(c(1, 1), c(1, 25)), 
prob.causal = 'all', phred = TRUE, n = NA, mac.threshold = NA,
approximation = TRUE, write.file = FALSE, staar.output = TRUE, quiet = FALSE)

Arguments

`score.file`	name of data file generated by `prep.score.files()`.
`gene.file`	can be a name of a custom text file listing genes in refFlat format. Values "hg19" and "hg38" can be set to use default gene files containing protein-coding genes with start and stop positions from corresponding build. Mind that the same build should be used in `score.file` and `gene.file`.
`genes`	character vector of gene names to be analyzed. Can be "chr1", "chr2" etc. to analyze all genes on a corresponding chromosome. If not set, function will attempt to analyze all genes listed in `gene.file`.
`cor.path`	path to a folder with correlation matrix files (one file per each gene to be analyzed). Correlation matrices in text format are allowed, though ".RData" is preferable as computationally efficient. Each file should contain a square matrix with correlation coefficients (r) between genetic variants of a gene. An example of correlation file format: "snpname1" "snpname2" "snpname3" ... "snpname1" 1 0.018 -0.003 ... "snpname2" 0.018 1 0.081 ... "snpname3" -0.003 0.081 1 ... ... If genotypes are available, matrices can be generated as follows: `cor.matrix <- cor(g)` `save(cor.matrix, file = paste0(geneName, ".RData"))` where `g` is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2 (coding should be exactly the same that was used to generate GWAS statistics). If genotypes are not available, correlations can be approximated through reference samples. Reference matrices from 1KG European sample can be downloaded at http://mga.bionet.nsc.ru/sumFREGAT/. Names of correlation files should be constructed as "geneName.RData" (e.g. "ABCG1.RData", "ADAMTS1.RData", etc.) for ".RData" format or "geneName.txt" for text format. Example corfiles can be found as: system.file("testfiles/CFH.cor", package = "sumFREGAT") system.file("testfiles/CFH.RData", package = "sumFREGAT")
`tests`	a character vector with gene-based methods to be applied. By default, three methods are used: 'BT', 'SKAT', and 'ACAT'. Other weighted tests ('SKATO', 'PCA', 'FLM') can be included. Tests that do not imply weighting ('simpleM', 'minP', 'sumchi'), if listed, will be calculated once and combined along with other tests into the total sumSTAAR p-value. ACAT performs with `gen.var.weights = "af"` for consistency with STAAR.
`beta.par.matrix`	an (n x 2) matrix with rows corresponding to n `beta.par` values used sequentially. Default value `rbind(c(1, 1), c(1, 25))` means that `beta.par = c(1, 1)` and `beta.par = c(1, 25)` will be applied. `beta.par` are two positive numeric shape parameters in the beta distribution to assign weights for each genetic variant as a function of minor allele frequency (MAF).
`prob.causal`	a character vector to define a set of annotations to be used. Annotations should be initially passed to `prep.score.files` (see `prep.score.files()` function) as input file columns "PROB", "PROB1", "PROB2", "PROB3" etc. The same naming should be used for `prob.causal` argument. By default, all "PROB" columns will be used. Annotations are expected to be in PHRED scale, set `phred = FALSE` to use simple probabilities.
`phred`	a logical value indicating whether probabilities are in PHRED scale. If `TRUE` (default for `sumSTAAR()`), PHRED-to-probability transformation will be performed for values indicated in `prob.causal`.
`n`	size of the sample on which summary statistics were obtained. Should be assigned if 'PCA' or 'FLM' are included in `tests`.
`mac.threshold`	an integer number. In ACAT, scores with MAC <= 10 will be combined using Burden test. MACs are calculated from MAFs, `n` must be set to apply `mac.threshold`. The original STAAR procedure performs with `mac.threshold = 10`.
`approximation`	logical value indicating whether approximation should be used for SKAT, SKATO, PCA and FLM.
`write.file`	output file name. If specified, output for all tests (as it proceeds) will be written to corresponding files.
`staar.output`	logical value indicating whether extensive output format should be used (see Details).
`quiet`	`quiet = TRUE` suppresses excessive output from reading vcf file genewise.

Details

The STAAR procedure has been recently proposed by Li et al. (2020) and described in detail therein. It calculates a set of P values using a range of gene-based tests, beta distribution weights parameters, and weighting by each of 10 functional annotations. The P values are then combined using Cauchy method (see ACATO() function and Liu, Y. et al., (2019)).

Value

With staar.output = FALSE returns a data table with a single P value for each test (combinations of differently weighted and unweighted iterations of the same test) and a total sumSTAAR P value. ACAT-O method is used to combine P values (Liu, Y. et al., 2019).

If staar.output = TRUE the function returns a data frame of size (n.genes x (n.tests x n.beta.pars x (n.annotations + 1) + n.tests + 2)) containing P values for combined tests and all individual tests. The output is analogous to that of original STAAR procedure. For example, by default, the output will contain columns:

gene # gene symbol
BT.1.1.PROB0 # P value for burden test with beta.par = c(1, 1) and no annotations
BT.1.1.PROB1 # P value for burden test with beta.par = c(1, 1) and weighted by first annotation
BT.1.1.PROB2 # P value for burden test with beta.par = c(1, 1) and weighted by second annotation
...
BT.1.1.PROB10 # P value for burden test with beta.par = c(1, 1) and weighted by tenth annotation
BT.1.1.STAAR # combined P value of all burden tests with beta.par = c(1, 1)
BT.1.25.PROB0 # P value for burden test with beta.par = c(1, 25) and no annotations
BT.1.25.PROB1 # P value for burden test with beta.par = c(1, 25) and weighted by first annotation
...
BT.1.25.PROB10 # P value for burden test with beta.par = c(1, 25) and weighted by tenth annotation
BT.1.25.STAAR # combined P value of all burden tests with beta.par = c(1, 25)
SKAT.1.1.PROB0 # P value for SKAT with beta.par = c(1, 1) and no annotations
SKAT.1.1.PROB1 # P value for SKAT with beta.par = c(1, 1) and weighted by first annotation
...
SKAT.1.1.PROB10 # P value for SKAT with beta.par = c(1, 1) and weighted by tenth annotation
SKAT.1.1.STAAR # combined P value of all SKATs with beta.par = c(1, 1)
SKAT.1.25.PROB0 # P value for SKAT with beta.par = c(1, 25) and no annotations
...
SKAT.1.25.PROB10 # P value for SKAT with beta.par = c(1, 25) and weighted by tenth annotation
SKAT.1.25.STAAR # combined P value of all SKATs with beta.par = c(1, 25)
ACAT.1.1.PROB0 # P value for ACAT with beta.par = c(1, 1) and no annotations
ACAT.1.1.PROB1 # P value for ACAT with beta.par = c(1, 1) and weighted by first annotation
...
ACAT.1.1.PROB10 # P value for ACAT with beta.par = c(1, 1) and weighted by tenth annotation
ACAT.1.1.STAAR # combined P value of all ACATs with beta.par = c(1, 1)
ACAT.1.25.PROB0 # P value for ACAT with beta.par = c(1, 25) and no annotations
...
ACAT.1.25.PROB10 # P value for ACAT with beta.par = c(1, 25) and weighted by tenth annotation
ACAT.1.25.STAAR # combined P value of all ACATs with beta.par = c(1, 25)
sumSTAAR.ACAT_O # combined P value of all 'PROB0' gene-based tests (without weighting by annotations)
sumSTAAR.STAAR_O # combined P value of all gene-based tests

References

Belonogova et al. (2022) SumSTAAR: A flexible framework for gene-based association studies using GWAS summary statistics. PLOS Comp Biol. https://doi.org/10.1371/journal.pcbi.1010172 Li X., et al. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nature Genetics. DOI: 10.1038/s41588-020-0676-4.
Liu Y. et al. (2019) ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410-421.

Examples


cor.path <- system.file("testfiles/", package = "sumFREGAT")
score.file <- system.file("testfiles/CFH.prob.vcf.gz",
	package = "sumFREGAT")
sumSTAAR(score.file, prob.causal = "PROB", gene.file = "hg19",
	genes = "CFH", cor.path, quiet = TRUE)

## Not run: 

score.file <- system.file("testfiles/CFH.prob.phred.vcf.gz",
	package = "sumFREGAT")
res <- sumSTAAR(score.file, gene.file = "hg19", genes = "CFH",
	cor.path, quiet = TRUE)

## End(Not run)

sumFREGAT documentation built on June 7, 2022, 9:06 a.m.