inst/markdown/methods/germline-colocalization.md

Title: Germline Colocalization

Descriptions: Gene expression and splice quantitative trait locus analysis, and Colocalization: We performed eQTL and sQTL analyses in TCGA and used summary statistics from the GTEx datasets to search for potential candidate genes. We excluded the HLA and IL17RA loci because SNPs at these loci are known eQTLs for genes that are part of the immune trait. For the significant and suggestive SNPs, we tested all genes within +/- 1MB for eQTLs and all transcripts within +/- 500KB for sQTLs. We used a shorter range for sQTLs on the assumption that SNPs affecting splicing are likely to act over a shorter distance.
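The window selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the tuple layout, and the choice to count a feature as "within" the window when its span overlaps it are all assumptions, since the text does not say how distance to a gene or transcript was measured.

```python
from typing import List, Tuple

# Window sizes taken from the text: +/- 1MB for eQTLs, +/- 500KB for sQTLs.
EQTL_WINDOW_BP = 1_000_000
SQTL_WINDOW_BP = 500_000

# Hypothetical feature record: (name, chromosome, start, end).
Feature = Tuple[str, str, int, int]


def features_in_window(snp_chrom: str, snp_pos: int,
                       features: List[Feature], window_bp: int) -> List[str]:
    """Return names of features on the SNP's chromosome that overlap
    the interval [snp_pos - window_bp, snp_pos + window_bp].

    Assumption: overlap of the feature span with the window counts as
    "within"; the source does not specify the distance definition.
    """
    lo, hi = snp_pos - window_bp, snp_pos + window_bp
    return [name for name, chrom, start, end in features
            if chrom == snp_chrom and start <= hi and end >= lo]
```

Calling the function once with `EQTL_WINDOW_BP` and once with `SQTL_WINDOW_BP` gives the two candidate sets for the same index SNP.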

TCGA dataset: RNA-seq gene expression and splicing data were downloaded from the NIH Genomic Data Commons (https://gdc.cancer.gov/about-data/publications/pancanatlas and https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Splicing-2018) (Kahles et al., 2018). For the sQTL analysis, we considered the following splicing categories: alternative 3’ and 5’ splice sites, exon skipping, intron retention, and mutually exclusive exon events, each quantified by the Percent Spliced In (PSI) (Kahles et al., 2018). Only splicing events with more than 800 non-missing observations (~10% of the total data) were considered. Association analyses between either gene expression or PSI and the imputed SNPs were performed by linear regression with age, sex, PC1-7, and cancer type as covariates. We calculated FDR for each SNP separately, under the assumption that the SNP was already either significant or suggestive, so we had to correct for each of the genes at the locus but not for all of the other SNPs (Table S5). We then selected the SNP-gene expression (eQTL) or SNP-gene splicing (sQTL) pairs with FDR p < 0.1 for further colocalization analysis.
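The per-SNP correction can be illustrated with a short sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the function names and the nested-dictionary input layout. The point of the example is only that the correction is applied within each SNP, across the genes or transcripts tested at its locus.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order.
    Assumption: BH is used; the source only says "FDR"."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_pairs(assoc: Dict[str, Dict[str, float]],
                 threshold: float = 0.1) -> List[Tuple[str, str]]:
    """assoc maps SNP -> {gene or transcript -> raw p-value}.
    Each SNP is corrected on its own, over the features at its locus;
    pairs with adjusted p below the threshold are kept."""
    selected = []
    for snp, feature_p in assoc.items():
        names = list(feature_p)
        adj = benjamini_hochberg([feature_p[n] for n in names])
        selected.extend((snp, n) for n, q in zip(names, adj) if q < threshold)
    return selected
```

Because the adjustment never pools p-values across SNPs, adding more index SNPs to `assoc` does not change the result for any existing SNP.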

GTEx dataset: We downloaded all summary statistics for expression quantitative trait loci (eQTL - GTEx_Analysis_v8_eQTL_all_associations) and splicing quantitative trait loci (sQTL - GTEx_Analysis_v8_sQTL_all_associations) from the GTEx project (https://console.cloud.google.com/storage/browser/gtex-resources), using the results from the latest version of the GTEx database (Version 8). For each SNP that had a genome-wide significant or suggestive association with one of the 33 immune traits by GWAS, we extracted all of the association statistics for eQTLs within +/- 1MB and for sQTLs within +/- 500KB from all tissues in the GTEx summary statistics dataset. We then calculated FDR for each SNP, correcting for all of the genes at the locus across all tissues, as we did for TCGA. For eQTLs and/or sQTLs that had FDR p < 0.1, we pursued colocalization as described below. TCGA GWAS summary statistics are annotated in Build 37, whereas GTEx QTL summary statistics are annotated in Build 38. Where appropriate, liftover coordinates from Build 38 to Build 37 are provided, generated with the R/Bioconductor packages AnnotationHub (v2.12.1) (AH14150 chain file) and rtracklayer (v1.40.6). In the GTEx summary file (Table S5) we annotated both Build 37 and Build 38 positions.

Colocalization analysis: We performed colocalization posterior probability (CLPP) analysis using eCAVIAR (Hormozdiari et al., 2016) on both TCGA and GTEx results. eCAVIAR computes a posterior probability of causality based on association data and LD structure for the eQTL/sQTL and for the trait GWAS, and then calculates the joint probability that a SNP is causal in both. It requires summary statistics from both the GWAS and the eQTL/sQTL analysis, as well as the LD matrix of the SNPs used in both analyses. For TCGA, we began with all SNPs that had FDR p < 0.1 with at least one gene and/or transcript and computed the eQTL and sQTL associations for the SNPs surrounding the index SNP for that same gene/transcript, using the same approach as outlined above. For GTEx, we began with SNP-gene expression or SNP-gene splicing pairs that met our FDR p < 0.1 criterion and extracted the eQTL and sQTL results for the surrounding SNPs from the summary results. For the GWAS and TCGA analyses, we calculated the genotype correlation (r) at each locus from the genotype data. For the GTEx analysis, we downloaded the individual genotype data from dbGAP for GTEx participants (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2) and calculated the genotypic correlation between SNPs in R. We then ran eCAVIAR separately for each FDR p < 0.1 eQTL and sQTL association from TCGA and GTEx, considering models at each locus that assume one or two causal variants. For each SNP-gene expression or SNP-gene splicing category pair, 200 SNPs (+/- 100 SNPs) around the index SNP were included in the colocalization analysis. We calculated the CLPP of each SNP being causal, and a regional CLPP as the sum of the CLPPs of all 201 SNPs (the index SNP plus 100 SNPs on each side). We considered a posterior probability > 0.01 to indicate plausible colocalization, taking into account both the one- and two-causal-variant models and the summed posterior probability of the SNPs in the colocalization results.
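The CLPP arithmetic can be sketched in a few lines. This does not reimplement eCAVIAR: the per-SNP posterior probabilities of causality for the GWAS and for the eQTL/sQTL are taken as given inputs (in practice they come from eCAVIAR's fine-mapping step), and the function names are hypothetical. The per-SNP CLPP is the product of the two posteriors, and the regional CLPP is their sum over the SNPs in the window.

```python
from typing import List

# Threshold from the text: posterior probability > 0.01 is "plausible".
CLPP_THRESHOLD = 0.01


def clpp_per_snp(gwas_posterior: List[float],
                 qtl_posterior: List[float]) -> List[float]:
    """Per-SNP CLPP: probability that the same SNP is causal in both
    the GWAS and the QTL study, as the product of the two posteriors."""
    if len(gwas_posterior) != len(qtl_posterior):
        raise ValueError("posterior vectors must cover the same SNPs")
    return [g * q for g, q in zip(gwas_posterior, qtl_posterior)]


def regional_clpp(gwas_posterior: List[float],
                  qtl_posterior: List[float]) -> float:
    """Regional CLPP: sum of per-SNP CLPPs over the window
    (201 SNPs in the analysis described above)."""
    return sum(clpp_per_snp(gwas_posterior, qtl_posterior))


def is_plausible(clpp: float) -> bool:
    """Apply the > 0.01 plausibility threshold."""
    return clpp > CLPP_THRESHOLD
```

A SNP with a high posterior in only one of the two studies contributes little to the regional CLPP, which is why strong but non-overlapping signals do not register as colocalized.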

Expanded region criteria for colocalization: Because eCAVIAR identified multiple genes with plausible colocalization within the +/- 100 SNP boundary at many loci, we sought stronger evidence for colocalization at those loci by examining an expanded region. We reasoned that a gene or transcript that is causal for the immune trait should not be more strongly associated with another SNP in the region that has little or no evidence of association with the immune trait. Therefore, for each gene or splice variant that had plausible colocalization by eCAVIAR, we performed an expanded region search (+/- 1MB for eQTLs and +/- 500KB for sQTLs) to see whether we could identify one or more SNPs that had a stronger effect in the eQTL/sQTL analysis in the same tissue/dataset, which we called “counter-evidence” SNPs. If eCAVIAR produced plausible evidence of colocalization (posterior probability > 0.01) and we could find no SNPs that met our counter-evidence criteria in the expanded region, we considered the expanded region evidence for colocalization to be strong. If we did find SNPs that met our counter-evidence criteria in the expanded region, we compared the significance level of the eQTL/sQTL association of the counter-evidence SNP with that of the index SNP (the SNP associated with the immune trait). If the -log10 p value of the counter-evidence SNP's eQTL or sQTL association was no more than 1.5 higher than that of the index SNP (the GWAS significant SNP for the immune trait), we considered the expanded region evidence to be intermediate. If the difference in -log10 p values was > 1.5, we considered the expanded region analysis to be negative. To visualize the colocalization in the expanded region, we generated plots that show the -log10 p QTL vs. -log10 p GWAS for all of the GWAS significant SNPs with CLPP > 0.01.
The plots included the association p values for all of the SNPs within +/- 1MB for eQTLs and within +/- 500KB for sQTLs of the gene that had a CLPP > 0.01. These plots are available at Figshare (GTEx expanded region analysis plots: https://doi.org/10.6084/m9.figshare.13089341; TCGA expanded region analysis plots: https://doi.org/10.6084/m9.figshare.13090031). We color-coded these plots by LD, based on the LD matrix from TCGA. Counter-evidence SNPs are found in the top left corner of these plots (i.e., strong association with the eQTL or sQTL but no association with the immune trait). Conversely, if there were no counter-evidence SNPs, the strongest SNPs for association with the immune trait were also the strongest SNPs for association with the eQTL/sQTL.
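The strong / intermediate / negative decision rule can be written out as a small function. This is a sketch under stated assumptions: the counter-evidence SNPs are passed in already identified (the exact criteria for flagging them are not spelled out in the text), the comparison uses the strongest counter-evidence SNP when there are several, and the function and constant names are hypothetical.

```python
from typing import Optional, Sequence

CLPP_THRESHOLD = 0.01   # plausible colocalization, from the text
MAX_NEGLOG10_GAP = 1.5  # allowed excess of counter-evidence over index SNP


def classify_expanded_region(clpp: float,
                             index_neglog10_p: float,
                             counter_neglog10_p: Sequence[float]
                             ) -> Optional[str]:
    """Classify expanded-region evidence for one gene or splice event.

    clpp               : eCAVIAR posterior for the pair
    index_neglog10_p   : -log10 p of the index SNP in the eQTL/sQTL analysis
    counter_neglog10_p : -log10 p of each counter-evidence SNP in the same
                         analysis (empty if none were found)

    Returns None when colocalization was not plausible to begin with.
    Assumption: with several counter-evidence SNPs, the strongest decides.
    """
    if clpp <= CLPP_THRESHOLD:
        return None
    if not counter_neglog10_p:
        return "strong"
    gap = max(counter_neglog10_p) - index_neglog10_p
    return "intermediate" if gap <= MAX_NEGLOG10_GAP else "negative"
```

For instance, an index SNP with -log10 p of 6.0 and a single counter-evidence SNP at 7.0 gives a gap of 1.0 and is classed as intermediate, whereas a counter-evidence SNP at 8.0 gives a gap of 2.0 and is classed as negative.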

Reference Listing

Contributors: Rosalyn Sayaman, Donglei Hu, Mohamad Saad, Elad Ziv, Davide Bedognetti



CRI-iAtlas/iatlas-app documentation built on Feb. 7, 2025, 9:02 p.m.