VSE calculates the enrichment of associated variant set (AVS) for an array of genomic regions. The AVS is the collection of disjoint LD blocks computed from a list of disease associated SNPs and their linked (LD) SNPs. VSE generates a null distribution of matched random variant sets (MRVSs) from 1000 Genome Project Phase III data that are identical to AVS, LD block by block. It then computes the enrichment of AVS intersecting with user provided genomic features (e.g., histone marks or transcription factor binding sites) compared with the null distribution.
The Genomewide association studies have blossomed recently. These studies have outputted genetic variants that are significantly associated to the patients with the related condition. These genetic variants are genetic predispositions to that particular condition. It is extremely important to check if the genetic predispositions for a particular disease are enriched in any particular genomic features. VSE computes exactly that. With a given list of genetic predisitions along with the closely linked SNPs and a list of genomic features, VSE computes if the AVS is significantly enriched in any given genomic feature. This provides a great primer on what features are etiological important for that disease.
Given that you have a raggr output named "ld.csv" and a sample sheet named "sampleSheet.csv", you can run the following commands for VSE. Read through to learn more.
library("VSE") #Load ld ld <- loadLd("ld.csv", type="raggr") #Make AVS avs <- makeAVS(ld) #Generate null mrvs <- makeMRVS(avs, bgSize=100, mc.cores=8) #Load sample sheet samples <- read.csv("sampleSheet.csv", header=TRUE) #Check intersection matrix intersect_matrix <- intersectMatrix(avs, regions=samples, \ col=c("white","grey10"), scale="none", margins=c(5,5), \ cexRow = 1, cexCol = 0.5, Rowv=NA, Colv=NA) intersect_matrix #Run VSE vse <- variantSetEnrichment(avs, mrvs, samples) #Check normality of null par.original <- par(no.readonly=TRUE); par(mfrow=c(ceiling(length(samples$Peaks)/3), 3), mai=c(1,1,0.5,0.1)) VSEqq(vse) par(par.original) #Get summary result VSESummary(vse) #Visualize VSEplot(vse, las=2, pch=20)
As an example, we are going to analyze the enrichment of Breast Cancer AVS across histone marks available for BCa cells, MCF7. We have collected ChIP-seq data for H3K4me1, H3K27me3, H3K27ac, H3K36me3, H3K4me3 and DNAase-seq (DHS) for MCF7 from ENCODE project. We have also extracted the genetic predispositions (tag SNPs) for BCa from GWAS catalog. The following vignette will show step by step instruction to analyze the enrichment of BCa AVS in the above-mentioned histone marks in MCF7.
The associated SNPs (tag SNPs) for a particular disease can be downloaded from GWAS catalog or other GWAS studies. The tag SNPs are typically representative of their respective LD blocks, and any SNP that are in close linkage disequilibrium has the potential to be causal SNP. Thus, it is important to consider all SNPs that are high LD with the tag SNPs. The easiest and recommended way to calculate the LD blocks is to use the webtool http://raggr.usc.edu. Upload your list of SNP and compute the LD blocks using 1000 Genome Project, Phase III, Oct 2014, Hg19 data, All European population with default setting for other parameters. Save the "Union of results" from the result page and unzip the file. We have done this for BCa tag SNPs obtained from GWAS catalog and got the union result. The resulting file has been provided with this package as an example data. Let's load this data into VSE:
library("VSE") bca.ld <- loadLd(file.path(system.file("extdata", "ld_BCa_raggr.csv", package="VSE")), type="raggr") #bca.ld is a GRanges object bca.ld
At the first step, a disjoint list of associated variant set (AVS) must be constructed, in which, one SNP is present in only one LD block to avoid inflation of enrichment. The disjoint LD blocks are generated by making a network of all SNPs where each node is a SNP and each EDGE is the linkage disequilibrium. A neighborhood in the network is a disjointed LD block.
bca.avs <- makeAVS(bca.ld) #Check the size of each LD block avs.size <- avsSize(bca.avs) head(avs.size) library(ggplot2) ggplot(avs.size, aes(x=tagID, y=Size)) + geom_bar(stat="identity") + coord_flip() + theme_classic() + theme(axis.text.y = element_text(size=5))
bca.avs is a GRangesList object. It can be converted to dataframe or a bed file for other use.
bca.avs
To compute the enrichment of AVS in genomic features, a null set must be constructed. MRVSs are identical to AVS block by block. The argument bgSize denotes the size of the null. Higher the null size, more accurate the analysis. In typical cases, a background size of 100 is sufficient. This is the most time consuming part, but once the MRVSs are computed for a particular AVS, it can be used for several genomic features. The argument mc.cores denotes the the number of cores to be used for this step.
Note for Windows user: You must use mc.cores = 1
bca.mrvs.200 <- makeMRVS(bca.avs, bgSize=200, mc.cores = 8)
Since generating mrvs takes considerably longer preiod, it is ideal to save the MRVS for future use. MRVSs are constant for a particular AVS, thus it is not necessary to regenerate for the same AVS.
# Save MRVS for future use save(bca.mrvs.200, file="bca.mrvs.200.Rda")
We have generated, saved and added the MRVS (size = 200) data for BCa AVS with the package as example. You can load it by:
# Load MRVS load(file.path(system.file("extdata", "bca.mrvs.200.Rda", package="VSE"))) class(bca.mrvs.200)
Once the AVS and MRVSs are ready, you can do intersection analysis of AVS across a set of genomic features. The genomic features could be peak coordinates from ChIP-seq data or any other genomic coordinates in bed format. For example, we have added the ChIP-seq peak data for 5 histone marks and DNAse-seq data for MCF7 obtained from ENCODE.
The list of genomic features must be prepared in a sample sheet, similar to what used for ChIPQC or DiffBind package. For example, we have added a function loadSampleRegions()
which will download 6 region coordinates in bed format and a sample sheet from www.hansenhelab.org/VSE/sample_regions/. If you download these files in a location other the default below, please change the path in the Peaks
column in sampleSheet.csv
.
#Downloading sample regions sampleSheet_path <- loadSampleRegions() #Loading sample sheet samples <- read.csv(sampleSheet_path, header=TRUE) samples
The intersection of each LD block in the AVS with the genomic features can be visualized.
bca.intersect <- intersectMatrix(bca.avs, regions=samples, col=c("white","grey10"), scale="none", margins=c(5,5), cexRow = 1, cexCol = 0.5, Rowv=NA, Colv=NA) bca.intersect
To calculate enrichment, variantSetEnrichment function requires three arguments - an AVS GRangesList object outputted by makeAVS function, a bca.mrvs list outputted by makeMRVS function, and a region dataframe with genomic features. The enrichment can be run by:
bca.vse <- variantSetEnrichment(bca.avs, bca.mrvs.200, samples)
variantSetEnrichment output is a list of three matrices, in which, the second matrix is the normalized null distribution. It is essential that the null distribution is a normal distribution. Normality of the null can be confirmed using a QQ-plot.
par.original <- par(no.readonly=TRUE); par(mfrow=c(ceiling(length(samples$Peaks)/3), 3), mai=c(1,1,0.5,0.1)) VSEqq(bca.vse) par(par.original)
The normality of the distribution is confirmed using Kolmogorv-Smirnov test with the null hypothesis of that the null distribution is different than a normal distribution. The significance of the test is outputted underneath the plots. Any distribution with significant difference (p-value < 0.05) should not be considered for the subsequent analysis. However, VSE transforms the null distribution by Box-cox power transformation which approaches the normality of the data. It is very unlikely that the null distribution will not be normal if the background size is sufficiently large (over 100).
The VSE result can be summarized by:
bca.vse.res <- VSESummary(bca.vse) bca.vse.res
Or a visualization of the result can be be outputted by:
VSEplot(bca.vse, las=2, pch=20, cex=1, cex.main=0.6, padj = 0.05, main="BCa AVS in MCF7 genomic features")
There are several factors to be considered while using VSE:
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.