Variant Set Enrichment

VSE calculates the enrichment of associated variant set (AVS) for an array of genomic regions. The AVS is the collection of disjoint LD blocks computed from a list of disease associated SNPs and their linked (LD) SNPs. VSE generates a null distribution of matched random variant sets (MRVSs) from 1000 Genome Project Phase III data that are identical to AVS, LD block by block. It then computes the enrichment of AVS intersecting with user provided genomic features (e.g., histone marks or transcription factor binding sites) compared with the null distribution.


The Genomewide association studies have blossomed recently. These studies have outputted genetic variants that are significantly associated to the patients with the related condition. These genetic variants are genetic predispositions to that particular condition. It is extremely important to check if the genetic predispositions for a particular disease are enriched in any particular genomic features. VSE computes exactly that. With a given list of genetic predisitions along with the closely linked SNPs and a list of genomic features, VSE computes if the AVS is significantly enriched in any given genomic feature. This provides a great primer on what features are etiological important for that disease.


Given that you have a raggr output named "ld.csv" and a sample sheet named "sampleSheet.csv", you can run the following commands for VSE. Read through to learn more.

#Load ld
ld <- loadLd("ld.csv", type="raggr")
#Make AVS
avs <- makeAVS(ld)
#Generate null
mrvs <- makeMRVS(avs, bgSize=100, mc.cores=8)
#Load sample sheet
samples <- read.csv("sampleSheet.csv", header=TRUE)
#Check intersection matrix
intersect_matrix <- intersectMatrix(avs, regions=samples, \
col=c("white","grey10"), scale="none", margins=c(5,5), \
cexRow = 1, cexCol = 0.5, Rowv=NA, Colv=NA)
#Run VSE
vse <- variantSetEnrichment(avs, mrvs, samples)
#Check normality of null
par.original <- par(no.readonly=TRUE);
par(mfrow=c(ceiling(length(samples$Peaks)/3), 3), mai=c(1,1,0.5,0.1))
#Get summary result
VSEplot(vse, las=2, pch=20)

Running VSE

As an example, we are going to analyze the enrichment of Breast Cancer AVS across histone marks available for BCa cells, MCF7. We have collected ChIP-seq data for H3K4me1, H3K27me3, H3K27ac, H3K36me3, H3K4me3 and DNAase-seq (DHS) for MCF7 from ENCODE project. We have also extracted the genetic predispositions (tag SNPs) for BCa from GWAS catalog. The following vignette will show step by step instruction to analyze the enrichment of BCa AVS in the above-mentioned histone marks in MCF7.

Loading BCa genetic predispositions

The associated SNPs (tag SNPs) for a particular disease can be downloaded from GWAS catalog or other GWAS studies. The tag SNPs are typically representative of their respective LD blocks, and any SNP that are in close linkage disequilibrium has the potential to be causal SNP. Thus, it is important to consider all SNPs that are high LD with the tag SNPs. The easiest and recommended way to calculate the LD blocks is to use the webtool Upload your list of SNP and compute the LD blocks using 1000 Genome Project, Phase III, Oct 2014, Hg19 data, All European population with default setting for other parameters. Save the "Union of results" from the result page and unzip the file. We have done this for BCa tag SNPs obtained from GWAS catalog and got the union result. The resulting file has been provided with this package as an example data. Let's load this data into VSE:

bca.ld <- loadLd(file.path(system.file("extdata", 
#bca.ld is a GRanges object

Making AVS

At the first step, a disjoint list of associated variant set (AVS) must be constructed, in which, one SNP is present in only one LD block to avoid inflation of enrichment. The disjoint LD blocks are generated by making a network of all SNPs where each node is a SNP and each EDGE is the linkage disequilibrium. A neighborhood in the network is a disjointed LD block.

bca.avs <- makeAVS(bca.ld)
#Check the size of each LD block
avs.size <- avsSize(bca.avs)
ggplot(avs.size, aes(x=tagID, y=Size)) + 
  geom_bar(stat="identity") + 
  coord_flip() + 
  theme_classic() + 
  theme(axis.text.y = element_text(size=5))

bca.avs is a GRangesList object. It can be converted to dataframe or a bed file for other use.


Constructing Matched Random Variant Sets(MRVSs)

To compute the enrichment of AVS in genomic features, a null set must be constructed. MRVSs are identical to AVS block by block. The argument bgSize denotes the size of the null. Higher the null size, more accurate the analysis. In typical cases, a background size of 100 is sufficient. This is the most time consuming part, but once the MRVSs are computed for a particular AVS, it can be used for several genomic features. The argument mc.cores denotes the the number of cores to be used for this step.

Note for Windows user: You must use mc.cores = 1

bca.mrvs.200 <- makeMRVS(bca.avs, bgSize=200, mc.cores = 8)

Since generating mrvs takes considerably longer preiod, it is ideal to save the MRVS for future use. MRVSs are constant for a particular AVS, thus it is not necessary to regenerate for the same AVS.

# Save MRVS for future use
save(bca.mrvs.200, file="bca.mrvs.200.Rda")

We have generated, saved and added the MRVS (size = 200) data for BCa AVS with the package as example. You can load it by:

# Load MRVS

Loading genomic features

Once the AVS and MRVSs are ready, you can do intersection analysis of AVS across a set of genomic features. The genomic features could be peak coordinates from ChIP-seq data or any other genomic coordinates in bed format. For example, we have added the ChIP-seq peak data for 5 histone marks and DNAse-seq data for MCF7 obtained from ENCODE.

The list of genomic features must be prepared in a sample sheet, similar to what used for ChIPQC or DiffBind package. For example, we have added a function loadSampleRegions() which will download 6 region coordinates in bed format and a sample sheet from If you download these files in a location other the default below, please change the path in the Peaks column in sampleSheet.csv.

#Downloading sample regions
sampleSheet_path <- loadSampleRegions()
#Loading sample sheet
samples <- read.csv(sampleSheet_path, header=TRUE)

Check intersection of each LD block across genomic features

The intersection of each LD block in the AVS with the genomic features can be visualized.

bca.intersect <- intersectMatrix(bca.avs, 
                                 cexRow = 1, 
                                 cexCol = 0.5, 

Calculate enrichment

To calculate enrichment, variantSetEnrichment function requires three arguments - an AVS GRangesList object outputted by makeAVS function, a bca.mrvs list outputted by makeMRVS function, and a region dataframe with genomic features. The enrichment can be run by:

bca.vse <- variantSetEnrichment(bca.avs, bca.mrvs.200, samples) 

variantSetEnrichment output is a list of three matrices, in which, the second matrix is the normalized null distribution. It is essential that the null distribution is a normal distribution. Normality of the null can be confirmed using a QQ-plot.

par.original <- par(no.readonly=TRUE);
par(mfrow=c(ceiling(length(samples$Peaks)/3), 3), 

The normality of the distribution is confirmed using Kolmogorv-Smirnov test with the null hypothesis of that the null distribution is different than a normal distribution. The significance of the test is outputted underneath the plots. Any distribution with significant difference (p-value < 0.05) should not be considered for the subsequent analysis. However, VSE transforms the null distribution by Box-cox power transformation which approaches the normality of the data. It is very unlikely that the null distribution will not be normal if the background size is sufficiently large (over 100).

Save results

The VSE result can be summarized by:

bca.vse.res <- VSESummary(bca.vse)

Or a visualization of the result can be be outputted by:

        padj = 0.05, 
        main="BCa AVS in MCF7 genomic features")


There are several factors to be considered while using VSE:

  1. VSE is sensitive to the number of tagSNPs. From our trial and error tests, too low number of tagSNPs (below 15) provide imprecise result.
  2. The quality of ChIP-seq data is very important. We recommend users to confirm the quality of the ChIP-seq data and to only use data that are of good quality to avoid false enrichment. There are tools like ChIPQC or Chillin for quality control of ChIP-seq data.
  3. Make sure that you use r2 cutoff of 0.8 when calculating the SNPs in LD.
  4. Make sure that the input files are from the same genomic build (e.g., both SNPs and genomic features are in Hg19).
  5. Null size is a critical factor for reproducible results. Higher the null size better the normalcy of the distribution. 500 is recommended though it was longer time to compute. It is wiser to compute large MRVS once and save.

Try the VSE package in your browser

Any scripts or data that you put into this service are public.

VSE documentation built on May 2, 2019, 4 p.m.