We implemented gchromVAR as an R
package for performing enrichment analyses of genome-wide chromatin accessibility data with summary statistics (per-variant posterior-probabilities) from common-variant associaitons.
This vignette covers the usage of gchromVAR with standard processed input data from the hematopoesis interrogation that we've previously described. This package was designed to provide new functionality by generalizing the chromVAR framework. Notably, while chromVAR requires binary annotations (such as the presence or absence of transcription factor binding motifs), generalized- or genetic-chromVAR relaxes this assumption and allows for continuous annotations using the computeWeightedDeviations
function. For more detailed documentation covering different options for various inputs and additional applications of the chromVAR package, see the additional vignettes on the chromVAR documentation website.
library(chromVAR) library(gchromVAR) library(BuenColors) library(SummarizedExperiment) library(data.table) library(BiocParallel) library(BSgenome.Hsapiens.UCSC.hg19) set.seed(123) register(MulticoreParam(2))
After loading the required packages as shown in the preceding code chunk, note also that the time-consuming functionality in gchromVAR can be parallelized to the number of available cores on your machine. The rest of this vignette will involve using these imported functions and packages to score GWAS enrichments for 16 traits in 18 cell types.
Below is an overview highlighting the cell types with the corresponding colors used throughout this vignette available through BuenColors. Under our schematic of the differentation hierarchy, the 16 terminal blood cell traits that we analyzed with common variant associations are shown.
Figure 1. Overview of hematopoesis with cell types (top; note colors) and GWAS traits (bottom) explored thus far with gchromVAR.
Also, note that the traits are positioned under the hierarchy in accordance with our current understanding of where these cell traits manifest after differentiating. One might hypothesize that the terminal cell types may be the most enriched for those particular traits (e.g. erythroblasts should be enriched for GWAS signal for traits like red blood cell count). One may also expect late-stage progenitors (i.e. the cell types that occur before terminal differentation that may be multi-potent like CMPs or MEPs). With this in mind, we are ready to explore the gchromVAR results.
For using ATAC-seq or other chromatin accessibility data for gchromVAR, we recommend using the proatac pipeline to generate 1 set of consensus, equal-wdith peaks spanning the cellular phenotypes and synthesizing a singular counts table. These data for hematopoesis are already available in the extdata
of gchromVAR. The format of the quantitative chromatin data for the key functionality is a RangedSummarizedExperiment, which we construct using the counts matrix and peak definitions. The last line of code also adds the GC content per peak, which is used in the background model.
countsFile <- paste0(system.file('extdata/paper',package='gchromVAR'), "/hemeCounts.tsv.gz") peaksFile<- paste0(system.file('extdata/paper',package='gchromVAR'), "/hemePeaks.bed.gz") peaksdf <- data.frame(fread(paste0("zcat < ", peaksFile))) peaks <- makeGRangesFromDataFrame(peaksdf, seqnames = "V1", start.field = "V2", end.field = "V3") counts <- data.matrix(fread(paste0("zcat < ", countsFile))) SE <- SummarizedExperiment(assays = list(counts = counts), rowData = peaks, colData = DataFrame(names = colnames(counts))) SE <- addGCBias(SE, genome = BSgenome.Hsapiens.UCSC.hg19)
Next, we want to overlap the GWAS summary statistics with our accessibility peaks. To facilitate this, we've created the importBedScore
function, which will iteratively build another RangedSummarizedExperiment where the same peaks define the row space and the columns are defined as various GWAS annotations. The counts are quantitative measures of the total fine-mapped posterior probabilities of the variants overlapping those peaks.
files <- list.files(system.file('extdata/paper/PP001',package='gchromVAR'), full.names = TRUE, pattern = "*.bed$") head(read.table(files[1])) ukbb <- importBedScore(rowRanges(SE), files, colidx = 5) str(ukbb)
With these two RSE
objects created (SE
for chromatin counts and ukbb
for GWAS data), we can now run the core gchromVAR function, computeWeightedDeviations
.
The functional execution having followed the previous steps is straight forward--
ukbb_wDEV <- computeWeightedDeviations(SE, ukbb) zdf <- reshape2::melt(t(assays(ukbb_wDEV)[["z"]])) zdf[,2] <- gsub("_PP001", "", zdf[,2]) colnames(zdf) <- c("ct", "tr", "Zscore") head(zdf)
The later steps just show re-formatted output for convenience. The last table, zdf
provides a data frame of the Cell type (ct
), trait (tr
) and enrichment z-score for each of the 16x18 combinations.
A statistical way of determining how our results faired is to determine the relative positioning of lineage-specific enrichments. For each trait, we define the appropriate set of lineage-specific tissues (from Figure 1), and then annotate the results data frame. First, we define the four major lineages in hematopoesis:
Ery <- c("HSC", "MPP", "CMP", "MEP", "Ery") Meg <- c("HSC", "MPP", "CMP", "MEP", "Mega") Mye <- c("HSC", "MPP", "CMP", "LMPP", "GMP-A", "GMP-B", "GMP-C", "Mono", "mDC") Lymph <- c("HSC", "MPP", "LMPP", "CLP", "NK", "pDC", "CD4", "CD8", "B")
Next, we annotate the data frame work lineage-specific ("LS") enrichments:
zdf$LS <- (zdf$tr == "BASO_COUNT" & zdf$ct %in% Mye) + (zdf$tr == "EO_COUNT" & zdf$ct %in% Mye) + (zdf$tr == "NEUTRO_COUNT" & zdf$ct %in% Mye) + (zdf$tr == "MONO_COUNT" & zdf$ct %in% Mye) + (zdf$tr == "WBC_COUNT" & zdf$ct %in% c(Mye, Lymph)) + (zdf$tr == "LYMPH_COUNT" & zdf$ct %in% Lymph) + (zdf$tr == "PLT_COUNT" & zdf$ct %in% Meg) + (zdf$tr == "MPV" & zdf$ct %in% Meg) + (zdf$tr == "MEAN_RETIC_VOL" & zdf$ct %in% Ery) + (zdf$tr == "HGB" & zdf$ct %in% Ery) + (zdf$tr == "HCT" & zdf$ct %in% Ery) + (zdf$tr == "MCH" & zdf$ct %in% Ery) + (zdf$tr == "MCV" & zdf$ct %in% Ery) + (zdf$tr == "MCHC" & zdf$ct %in% Ery) + (zdf$tr == "RBC_COUNT" & zdf$ct %in% Ery) + (zdf$tr == "RETIC_COUNT" & zdf$ct %in% Ery)
Finally, we can compute a Mann-Whitney Rank-sum statistic to determine the relative enrichment of the gchromVAR results compared to the non-lineage specific measures:
zdf$gchromVAR_pvalue <- pnorm(zdf$Zscore, lower.tail = FALSE) gchromVAR_ranksum <- sum(1:dim(zdf)[1]*zdf[order(zdf$gchromVAR_pvalue, decreasing = FALSE), "LS"]) permuted <- sapply(1:10000, function(i) sum(1:dim(zdf)[1] * sample(zdf$LS, length(zdf$LS)))) pnorm((mean(permuted) - gchromVAR_ranksum)/sd(permuted), lower.tail = FALSE)
From this statistic, we can see that lineage-specific enrichments for much stronger for gchromVAR than we'd expect by chance. We quantified this for several related methods on the initial landing page, demonstrating the utility of our approach.
Finally, we can do a simple Bonferonni-correction threshold to see that nearly all of the enrichments from gchromVAR are lineage specific. Moverover, these enrichments corroborate known biology in hematopoesis.
zdf[zdf$gchromVAR_pvalue < 0.05/288 , ]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.