Methods for genetic association, such as genome-wide association studies (GWAS), are wideley used to test for SNP level relationships for traits and disease. Accounting for multiple-comparisons of such techniques typically calls for conservative Bonferroni adjustements for the number of SNPs included in the analysis (often ~10^6). Commonly referred to as the Minimum P approach, a class level signal is often given by the minimum p-value of all SNPs within a class, where class is defined as as a meaningful segment of the genome, such as a protein coding gene. Therefore, only classes with a single strong SNP level signal are deemed significant. In this package we present tools for implementing an alternative method, known as Genetic Class Association Testing (GenCAT), which utilizing statistical signal from every SNP accross an entire class for each statistical test. Therefore, adjustments for multiple testing need only account for the number of classes used in analysis. This is performed by estimating the distribution of independent test statistics, via an eigen decomposition, using the SNP-level test statistics and genotypic correlation between SNPs. Class level test statistics are then calculated as the sum of the squared values of these independent signals. These test statistics follow a central $chi^2$-distribution, with degree of freedom equal to the number of independent signals within a class. More information on the GenCAT approach can be found at (Qian et al., 2015).
In this tutorial, we use SNP level results from the CARDIoGRAM consortium GWA meta-analysis (Schunkert et al., 2011). We define classes a protein coding genes, based on boundaries adapted from the KnownCanonical track from UCSC Genome Browser, build GRCh 37 (hg 19).
For the sake of time and memory, only select SNPs from chromosomes 13-15 are used.
library(GenCAT)
The class level approach is carried out by the GenCAT
function, which requires a six column data frame including information for SNP ID, effect allele (the allele which determines the direction of the test statistic from the SNP level association test), other allele, SNP level test statistic, chromosome number (1-22), and class label.
This package includes the function, map2class
, which can be used to assign class labels to SNPs based on their genomic coordinates. The first argument, coords
, is a data frame which defines the boundaries of each class in the genome. The second argument, SNPs
, defines the genomic coordinates of each SNP. A third argument, extend.boundary
, can also be used to specify class boundary extension. Refer to the help page (?GenCAT::map2class
) for specific formatting requirements.
In this example we map the CARDIoGRAM SNPs to protein coding genes, extending the class boundaries by 5000.
data('CardioData') data('coords') print(head(coords)) print(head(CardioData)) CardioMapped<-map2class(coords, CardioData, extend.boundary = 5000) print(head(CardioMapped))
GenCAT
In order to perform the class level test, genotype data must be provided from a reference population. If available, this can simply be the genotype data used in the initial SNP level analysis. Otherwise, the genotype data should come from a similar representative population. This data is be specified using the genoData
and snpInfo
arguments. The former must be a SnpMatrix
object, defined by the snpStats package. The latter defines the chromosome number, and allele assignments for each SNP. If using the read.plink
or read.pedfile
functions to read in genotype data, these objects can be obtained respectively from the genotype
and map
elements of the resulting list object.
It is recommended that adequate SNP and sample level filtering be carried out on genotype data. SNP pairs with high missingness (low call rate), can result in missing correlations. If this happens GenCAT
will return an error.
For this example, we use SNPwise filtered genotype data from 1000 Genomes from 99 individuals of northern and western european ancestry (CEU).
data('geno') genoData<-geno$genotypes snpInfo<-geno$map colnames(snpInfo)<-c('chr', 'SNP', 'gen.dist', 'position', 'A1', 'A2') GenCATtest <- GenCAT(CardioMapped, genoData=genoData, snpInfo = snpInfo)
The GenCAT
function is run in parallel using functionality from the doParallel package. This will support parallel processing in both Unix based and Windows machines. The number of processes to run can be specified by the workers
argument (set to 2 by default).
The pcCutoff
(set to 0.95 be default) argument specifies the proportion of the variability in the SNP-wise genotype correlation matrices for which to account. Using a value less than 1 will result in a dimension reduction, meaning the number of independent test statistics estimated by the eigendecomposition can be less than the total number of SNPs in a class. Refer to (Qian et al., 2015) for more information.
The GenCAT
function creates a list object of class GenCATtest
, which has five elements.
print(str(GenCATtest))
The GenCAT
element contains the results of the class level test, including test statistic(CsumT
), p-value(CsumP
), and number of independent signals after the dimension reduction(N_obs
). The Used
element contains information for SNPs included in class level testing. The notFound
and unMatched
elements contain information for SNPs which were removed because they either couldn't be found in reference population, or the there was one or more allele assignment that wasn't in the reference population's allele assignments for that SNP. The last element, TransStats
contains the independent test statistic estimates for each class.
The GenCATtest
object created by the GenCAT
function can be immediately used with the GenCAT_manhattan
function. This will plot the $-log_{10}(p-value)$ for each class. Giving a value to the sigThresh
argument will draw a horizontal line at $-log_{10}$ of this value. Aditionally, classes with strong signal can be highlighted and labeled by setting arguments, highlightPosi
and labelPosi
, to TRUE
.
GenCAT_manhattan(GenCATtest, sigThresh = (0.05/nrow(GenCATtest$GenCAT)), highlightPosi = TRUE, labelPosi = TRUE)
Qian J, Nunez S, Reed E, Reilly MP, Foulkes AS (2016) A Simple Test of Class-Level Genetic Association Can Reveal Novel Cardiometabolic Trait Loci. PLoS ONE 11(2): e0148218.
Rosenbloom, K. R., Armstrong, J., Barber, G. P., Casper, J., Clawson, H., Diekhans, M., Dreszer, T. R., Fujita, P. A., Guruvadoo, L., Haeussler, M., Harte, R. A., Heitner, S., Hickey, G., Hinrichs, A. S., Hubley, R., Karolchik, D., Learned, K., Lee, B. T., Li, C. H., Miga, K.H. , Nguyen, N., Paten, B., Raney, B.J., Smit, A. F., Speir, M. L., Zweig, A. S., Haussler, D., Kuhn, R. M., Kent, W.J. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 2015 Jan;43(Database issue):D670-81. PMID: 25428374; PMC: PMC4383971
Schunkert, H., Konig, I. R., Kathiresan, S., Reilly, M. P. and et. al. (2011). Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet. 43 333–338.
The 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56–65.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.