knitr::opts_chunk$set(echo = TRUE, eval=TRUE)

Despite our fine-mapping efforts, there remained considerable uncertainty of causal variants in most loci. Even if the causal variants are known, assigning target genes can be difficult due to long-range regulation of enhancers.

Our gene mapping procedure prioritizes target genes:

(1) For every putative causal SNP, we assign a weight to each nearby gene, considering multiple ways the SNP may affect a gene. The weight of a gene can be viewed as the probability that the SNP affects that gene.

(2) For each gene, we then aggregate the causal evidence of all SNPs likely targeting this gene, expressed as the weighted sum of the PIPs of all these SNPs. To ensure that the causal evidence of a variant is not counted multiple times when it targets multiple genes, we normalize the SNP-to-gene weights in this calculation. The resulting “gene PIP” approximates the probability of a gene being causal. Similar to variant-level fine-mapping, we also define a “credible gene set”, the set of genes that capture the causal signal at a locus with high probability.

The weights of SNP-gene pairs reflect the strength of biological evidence linking SNPs to genes. For a SNP in an exon/promoter or in a regulatory region linked to a particular gene ("enhancer loops"), we assign a weight of 1 to that gene.

When a SNP cannot be linked to any gene in these ways, its target genes are assigned using a distance weighted function so that nearby genes receive higher weights.

See the Methods section in our paper for details.

knitr::include_graphics("../man/figures/gene.mapping.diagram.png")

In this tutorial, we show the gene mapping procedure using the fine-mapping result from our AFib study.

Input data

Our gene mapping procedure takes the following input data:

Load R packages

library(mapgen)

Fine-mapping summary statistics

Here we use fine-mapping summary statistics from our AFib study.

Keep SNPs with PIP > 1e-5, rename columns, and save as a GRanges object.

finemapstats <- readRDS(system.file("extdata", "AF.finemapping.sumstats.rds", package = "mapgen"))
finemapstats <- process_finemapping_sumstats(finemapstats, 
                                             snp = 'snp', 
                                             chr = 'chr', 
                                             pos = 'pos', 
                                             pip = 'susie_pip', 
                                             pval = 'pval', 
                                             zscore = 'zscore', 
                                             cs = 'cs', 
                                             locus = 'locus',  
                                             pip.thresh = 1e-5)

# For simplicity, only keep the following columns in the steps below
cols.to.keep <- c('snp','chr','pos', 'pip', 'pval', 'zscore', 'cs', 'locus')
finemapstats <- finemapstats[, cols.to.keep]
head(finemapstats, 3)

Genomic annotations

To link SNPs to genes, we use a hierarchical approach to assign the following genomic annotation categories:

  1. Exons/(active) promoters category
  2. Enhancer loops category (ABC, PC-HiC, etc.)
  3. Enhancer regions category (open chromatin regions, etc.)
  4. Introns/UTRs category

Gene annotations

You can download gene annotations (GTF file) from GENCODE, then make gene annotations using the GTF file.

gtf_file <- '/project2/xinhe/shared_data/gencode/gencode.v19.annotation.gtf.gz'
genomic.annots <- make_genomic_annots(gtf_file)

We included gene annotations (hg19) in the package, downloaded from GENCODE release 19.

genomic.annots <- readRDS(system.file("extdata", "genomic.annots.hg19.rds", package = "mapgen"))

gene.annots <- genomic.annots$genes

Open chromatin regions (OCRs)

OCRs <- readRDS(system.file("extdata", "OCRs.hg19.gr.rds", package = "mapgen"))

Exons/promoters category

In this example, we include exons and active promoters in this category for gene mapping. We defined active promoters as promoters overlapping with OCRs (overlap at least 100 bp).

active_promoters <- IRanges::subsetByOverlaps(genomic.annots$promoters, 
                                              OCRs, 
                                              minoverlap = 100)
genomic.annots$exons_promoters <- list(exons = genomic.annots$exons, 
                                       active_promoters = active_promoters)

If you don't have OCR data, you can simply use exons and promoters.

genomic.annots$exons_promoters <- list(exons = genomic.annots$exons, 
                                       promoters = genomic.annots$promoters)

Enhancer loops category

We need chromatin loop data, such as PC-HiC or ABC scores, with the following columns, and save as a GRanges object.

Promoter-capture HiC (PC-HiC)

Here we use promoter-capture HiC (PC-HiC) data from cardiomyocytes (CMs). You may skip this if you do not have relevant PC-HiC data.

pcHiC <- readRDS(system.file("extdata", "pcHiC.CM.gr.rds", package = "mapgen"))
pcHiC <- pcHiC[pcHiC$gene_name %in% gene.annots$gene_name, ] # restrict to protein coding genes
head(pcHiC, 3)

ABC scores

The process_ABC() function processes a data frame of ABC scores from Nasser et al. Nature 2021, and converts it to a GRanges object.

Here we use ABC scores from heart ventricle (from Nasser et al. Nature 2021). You may skip this if you do not have relevant ABC scores.

ABC <- data.table::fread(system.file("extdata", "heart_ventricle-ENCODE_ABC.tsv.gz", package = "mapgen"))
ABC <- process_ABC(ABC, full.element = TRUE)
ABC <- ABC[ABC$gene_name %in% gene.annots$gene_name, ] # restrict to protein coding genes
head(ABC, 3)

Enhancer regions

Here we define enhancer regions by OCRs. You can also use histone mark data to define enhancer regions.

genomic.annots$enhancer_regions <- OCRs[OCRs$peakType!="Promoter",]

Considering the fact that PC-HiC and HiC loops may miss contacts between close regions due to technical reasons, here we also consider enhancer regions within 20 kb of active promoters as "nearby interactions".

nearby20kb <- nearby_interactions(genomic.annots$enhancer_regions,
                                  active_promoters, 
                                  max.dist = 20000)

We include ABC, PC-HiC and nearby interactions in the "enhancer loop" category.

genomic.annots$enhancer_loops <- list(ABC = ABC, pcHiC = pcHiC, nearby20kb = nearby20kb)
summary(genomic.annots)

Run gene mapping

Parameters:

gene.mapping.res <- compute_gene_pip(finemapstats, 
                                     genomic.annots, 
                                     intron.mode = FALSE, 
                                     d0 = 50000,
                                     exon.weight = 1, 
                                     loop.weight = 1)
head(gene.mapping.res)
table(gene.mapping.res$category)

Gene-level result table

gene.pip.res <- extract_gene_level_result(gene.mapping.res, gene.annots)
head(gene.pip.res)
cat(sprintf("%d genes with PIP >= 0.8", 
            length(gene.pip.res$gene_name[gene.pip.res$gene_pip >= 0.8])))

80% credible gene sets

gene.cs <- gene_cs(gene.mapping.res, by.locus = TRUE, gene.cs.coverage = 0.8)
head(gene.cs)

Gene Manhattan plot

label genes with gene PIP > 0.8.

gene_manhattan_plot(gene.pip.res, gene.pip.thresh = 0.8)

Note: in some cases where a gene spans two nearby LD blocks and/or is linked to SNPs in multiple credible sets (L > 1), the gene PIP may exceed 1, which can be interpreted as the expected number of causal variants targeting the gene (see Methods of our paper for details).



kevinlkx/mapgen documentation built on March 31, 2024, 11:03 p.m.