knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
rCOGS is a package for integrating GWAS summary statistics with the p romoter c apture Hi-C (pcHi-C) contact maps in order to prioritise genes for functional followup. COGS stands for C apture Hi-C O mnibus G ene S core and works best with pcHi-C datasets processed using CHiCAGO.
The hardest part of running rCOGS is making sure source files are available and in the correct format. With all files make sure they are relevant to the population that you are studying and use the correct genome build coordinates. Unless explicitly stated all coords are GRCh37 and non-zero based.
Full support files that are suitable for use with data from Javierre et al. are available in ./inst/extdata/
. Note that this vignette uses cutdown versions of these files filtered to use only chr22 to speed things up.
Depending on usage rCOGS requires the following files:
These could be computed using data from the International HapMap project or you can use more recent alternative regions from Berisa et al.. Note that your LD file should match the population that you are studying! If you are using a more targeted platform such as ImmunoChip then you should use the LD regions used to design the platform. What is key is that these regions are non overlapping, it does not matter if they are not contiguous (you will just lose associations that don't overlap). The format is non-zero based BED format as follows
| chr | start | end | |-----|---------|---------| | 1 | 1 | 888659 | | 1 | 888660 | 1891263 | | 1 | 1891264 | 2299651 |
In order to model whether a given SNP is causal rCOGS needs an estimate of the Minor allele Frequency (MAF) in a control set of samples. If you are using your own data you will have this information to hand. If on the other hand you are using summary statistics downloaded from a public resource such as GWAS Catalog you can use a reference set of genotypes such as 1KGenomes again it is important to make sure that these are relevant to the population that you are studying. The format is as follows and each variant to assessed should have a MAF in this file otherwise it will be excluded. Again position are non-zero based.
| chr | pos | maf | |-----|--------|----------| | 1 | 713914 | 0.033325 | | 1 | 714310 | 0.026845 | | 1 | 715265 | 0.037027 |
These are the univariate p-values obtained from GWAS analysis, nowadays the fitting of regression models is mainly performed in PLINK and thus this output format is supported. If however you have used an alternative method to assess association then it should have the following format.
| chr | pos | p | |-----|--------|-------| | 1 | 768253 | 0.617 | | 1 | 781845 | 0.891 | | 1 | 787606 | 0.871 | | 1 | 787844 | 0.977 |
By far the most complicated data file this should have the following format with data taken from Javierre et al.. It is an enhanced version of the peakMatrix format output by CHiCAGO.
| ensg | name | biotype | strand | baitChr | baitStart | baitEnd | baitID | baitName | oeChr | oeStart | oeEnd | oeID | oeName | dist | Monocytes | Macrophages_M0 | Macrophages_M1 | Macrophages_M2 | Neutrophils | Megakaryocytes | Endothelial_precursors | Erythroblasts | Foetal_thymus | Naive_CD4 | Total_CD4_MF | Total_CD4_Activated | Total_CD4_NonActivated | Naive_CD8 | Total_CD8 | Naive_B | Total_B | |-----------------|--------|----------------|--------|---------|-----------|----------|--------|--------------|-------|-----------|-----------|--------|--------|--------|------------------|------------------|------------------|------------------|------------------|------------------|------------------------|------------------|-------------------|------------------|------------------|---------------------|------------------------|------------------|-------------------|------------------|------------------| | ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100023790 | 100032508 | 813903 | . | 129849 | 1.2768325567372 | 3.36925757955278 | 3.50427556762941 | 5.77885715736681 | 0.71946319470011 | 1.57426075233114 | 0.935151981213395 | 2.07176252723988 | 0.518991005066328 | 1.35844284079167 | 1.80051450777455 | 2.14482233107626 | 2.81450087489112 | 1.0646936885784 | 0.329294941260539 | 2.31949322698618 | 2.91843729769059 | | ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100038149 | 100039492 | 813905 | . | 140521 | 2.22145436128498 | 2.34713825641161 | 1.94777355869935 | 1.8326996671119 | 1.19340351282302 | 4.71363557745529 | 1.75066400673428 | 3.10545714323329 | 1.8506228303492 | 1.52337854148901 | 1.92401193124468 | 1.4553080801738 | 0.959541835366562 | 1.47904009071165 | 2.33301812072822 | 5.09250926114187 | 3.5849349597336 |
This file represents the output of an in silico restriction digest for the restriction enzyme employed in the pcHi-C experiment. At the time that this was written this was Hind III. These regions should be zero based and match those in the pcHi-C file above, additionally, the fragid column should refer to the baitID and oeID columns in the pcHi-C dataset above.
| chr | start | end | fragid | |-----|-------|-------|--------| | 1 | 1 | 16007 | 1 | | 1 | 16008 | 24571 | 2 | | 1 | 24572 | 27981 | 3 | | 1 | 27982 | 30429 | 4 |
This file is a list of all the fragments for which capture probes were designed and depends on what annotation you are using. The fragid column again should be compatible with the restriction digest file and pcHi-C file. The ensg should also map to the ensg file in pcHi-C data file. Note that a fragment can contain more than one promoter !
| fragid | ensg | |--------|-----------------| | 218 | ENSG00000272438 | | 218 | ENSG00000230699 | | 219 | ENSG00000241180 | | 220 | ENSG00000268179 | | 220 | ENSG00000223764 | | 220 | ENSG00000187634 | | 223 | ENSG00000187961 |
rCOGS integrates functional information where possible. To do this it needs a catalogue of exonic variants. This can be obtained from a tool such as VEP and needs to have the following format:
| chr | pos | ensg | |-----|----------|-----------------| | 1 | 99883719 | ENSG00000000003 | | 1 | 99883962 | ENSG00000000003 | | 1 | 99884017 | ENSG00000000003 | | 1 | 99884166 | ENSG00000000034 |
You will need to have the devtool R package installed
library(devtools) install_github("ollyburren/rCOGS")
library(rCOGS) library(GenomicRanges) library(magrittr) library(knitr)
Here we load in ld regions - these are checked and output as a GRanges object
This is how to build objects - note due to size you will need to download these supporting objects into ./inst/extdata/
ld.gr <- load_ld_regions('../inst/extdata/hamap_1cM_recomb.bed') ld.gr
data("ld.gr",package="rCOGS") ld.gr
maf.DT <- load_ref_maf('../inst/extdata/uk10k_reference_maf.tab',min.maf=0.05) head(maf.DT)
data("maf.DT",package="rCOGS") head(maf.DT)
N.B here we have a study with 4036 cases and 6959 controls.
t1d.DT <- load_gwas('../inst/extdata/ncooper_t1d_gwas.tab',maf.DT,ld.gr,n.cases=5913,n.controls=8829) head(t1d.DT)
data("t1d.DT",package="rCOGS") head(t1d.DT)
Notice that not all variants are imported, some are ommited as they are below the MAF filter (5%) used in the previous step others don't overlap an LD regions.
note the biotype.filter argument. This can be a vector of biotypes or left blank if everything is required.
feature.sets <- make_pchic('../inst/extdata/javierre_pchic_ensembl75.tab',biotype.filter='protein_coding') names(feature.sets)
data("feature.sets",package="rCOGS") names(feature.sets)
Note that we obtain a list where each element is a data.table
head(feature.sets[[1]])
All Hi-C technologies struggle when assessing interactions between adjacent restricition fragments as it is impossible to resolve what is due to stochastic Brownian motion and which are short range interactions. To offset this COGS has a routine that creates a `Virtual' promter region that consists of a baited fragment and the adjacent 5' and 3' fragments. It takes the digest and design file as arguments.
We add this to the list of regions.
## get a list of all genes included all.genes <- lapply(feature.sets,function(g) unique(g$ensg)) %>% do.call('c',.) %>% unique feature.sets[['VProm']] <- make_vprom('../inst/extdata/hindIII_digest_grch37.tab','./inst/extdata/javierre_design_ensembl75.tab',all.genes)
returns a genomic ranges object
csnps.gr <- make_csnps('../inst/extdata/coding_snps_vep_ensembl75.tab') head(csnps.gr)
data("csnps.gr",package = "rCOGS") head(csnps.gr)
The fundamental unit of a pcHi-C map is a restriction fragment this saves us a lot of computation as we can precompute the support for each fragment to be causal once and then combine to get scores.
digest.gr <- load_digest('../inst/extdata/hindIII_digest_grch37.tab') head(digest.gr)
data("digest.gr",package="rCOGS") head(digest.gr)
MHC/HLA has an extremely complex and long LD structure and the genetic finemapping method employed is unable to effectively deal with it, thus it is best to exclude and analyse separately.
nrow(t1d.DT) mhc.idx <- which(t1d.DT$chr==6 & between(t1d.DT$pos,25e6,35e6)) if(length(mhc.idx)>0) t1d.DT <- t1d.DT[-mhc.idx,] nrow(t1d.DT)
overall.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets) head(overall.scores[order(cogs,decreasing = TRUE),]) %>% kable
target_tissue<-c('Total_CD4_Activated','Total_CD4_NonActivated','Naive_CD4','Total_CD4_MF','Naive_CD8','Total_CD8') tcell.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets,target_tissue) %>% head(.,) head(tcell.scores[order(cogs,decreasing = TRUE),]) %>% kable
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.