knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

What is rCOGS ?

rCOGS is a package for integrating GWAS summary statistics with the p romoter c apture Hi-C (pcHi-C) contact maps in order to prioritise genes for functional followup. COGS stands for C apture Hi-C O mnibus G ene S core and works best with pcHi-C datasets processed using CHiCAGO.

What does rCOGS require ?

The hardest part of running rCOGS is making sure source files are available and in the correct format. With all files make sure they are relevant to the population that you are studying and use the correct genome build coordinates. Unless explicitly stated all coords are GRCh37 and non-zero based.

Full support files that are suitable for use with data from Javierre et al. are available in ./inst/extdata/. Note that this vignette uses cutdown versions of these files filtered to use only chr22 to speed things up.

Depending on usage rCOGS requires the following files:

approximate linkage disequilibrium independent region files

These could be computed using data from the International HapMap project or you can use more recent alternative regions from Berisa et al.. Note that your LD file should match the population that you are studying! If you are using a more targeted platform such as ImmunoChip then you should use the LD regions used to design the platform. What is key is that these regions are non overlapping, it does not matter if they are not contiguous (you will just lose associations that don't overlap). The format is non-zero based BED format as follows

| chr | start | end | |-----|---------|---------| | 1 | 1 | 888659 | | 1 | 888660 | 1891263 | | 1 | 1891264 | 2299651 |

Minor allele frequency estimates in controls

In order to model whether a given SNP is causal rCOGS needs an estimate of the Minor allele Frequency (MAF) in a control set of samples. If you are using your own data you will have this information to hand. If on the other hand you are using summary statistics downloaded from a public resource such as GWAS Catalog you can use a reference set of genotypes such as 1KGenomes again it is important to make sure that these are relevant to the population that you are studying. The format is as follows and each variant to assessed should have a MAF in this file otherwise it will be excluded. Again position are non-zero based.

| chr | pos | maf | |-----|--------|----------| | 1 | 713914 | 0.033325 | | 1 | 714310 | 0.026845 | | 1 | 715265 | 0.037027 |

Trait association statistics

These are the univariate p-values obtained from GWAS analysis, nowadays the fitting of regression models is mainly performed in PLINK and thus this output format is supported. If however you have used an alternative method to assess association then it should have the following format.

| chr | pos | p | |-----|--------|-------| | 1 | 768253 | 0.617 | | 1 | 781845 | 0.891 | | 1 | 787606 | 0.871 | | 1 | 787844 | 0.977 |

pcHi-C data

By far the most complicated data file this should have the following format with data taken from Javierre et al.. It is an enhanced version of the peakMatrix format output by CHiCAGO.

| ensg | name | biotype | strand | baitChr | baitStart | baitEnd | baitID | baitName | oeChr | oeStart | oeEnd | oeID | oeName | dist | Monocytes | Macrophages_M0 | Macrophages_M1 | Macrophages_M2 | Neutrophils | Megakaryocytes | Endothelial_precursors | Erythroblasts | Foetal_thymus | Naive_CD4 | Total_CD4_MF | Total_CD4_Activated | Total_CD4_NonActivated | Naive_CD8 | Total_CD8 | Naive_B | Total_B | |-----------------|--------|----------------|--------|---------|-----------|----------|--------|--------------|-------|-----------|-----------|--------|--------|--------|------------------|------------------|------------------|------------------|------------------|------------------|------------------------|------------------|-------------------|------------------|------------------|---------------------|------------------------|------------------|-------------------|------------------|------------------| | ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100023790 | 100032508 | 813903 | . | 129849 | 1.2768325567372 | 3.36925757955278 | 3.50427556762941 | 5.77885715736681 | 0.71946319470011 | 1.57426075233114 | 0.935151981213395 | 2.07176252723988 | 0.518991005066328 | 1.35844284079167 | 1.80051450777455 | 2.14482233107626 | 2.81450087489112 | 1.0646936885784 | 0.329294941260539 | 2.31949322698618 | 2.91843729769059 | | ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100038149 | 100039492 | 813905 | . | 140521 | 2.22145436128498 | 2.34713825641161 | 1.94777355869935 | 1.8326996671119 | 1.19340351282302 | 4.71363557745529 | 1.75066400673428 | 3.10545714323329 | 1.8506228303492 | 1.52337854148901 | 1.92401193124468 | 1.4553080801738 | 0.959541835366562 | 1.47904009071165 | 2.33301812072822 | 5.09250926114187 | 3.5849349597336 |

Restriction fragment digest file

This file represents the output of an in silico restriction digest for the restriction enzyme employed in the pcHi-C experiment. At the time that this was written this was Hind III. These regions should be zero based and match those in the pcHi-C file above, additionally, the fragid column should refer to the baitID and oeID columns in the pcHi-C dataset above.

| chr | start | end | fragid | |-----|-------|-------|--------| | 1 | 1 | 16007 | 1 | | 1 | 16008 | 24571 | 2 | | 1 | 24572 | 27981 | 3 | | 1 | 27982 | 30429 | 4 |

pcHi-C design/annotation file

This file is a list of all the fragments for which capture probes were designed and depends on what annotation you are using. The fragid column again should be compatible with the restriction digest file and pcHi-C file. The ensg should also map to the ensg file in pcHi-C data file. Note that a fragment can contain more than one promoter !

| fragid | ensg | |--------|-----------------| | 218 | ENSG00000272438 | | 218 | ENSG00000230699 | | 219 | ENSG00000241180 | | 220 | ENSG00000268179 | | 220 | ENSG00000223764 | | 220 | ENSG00000187634 | | 223 | ENSG00000187961 |

Coding SNPs

rCOGS integrates functional information where possible. To do this it needs a catalogue of exonic variants. This can be obtained from a tool such as VEP and needs to have the following format:

| chr | pos | ensg | |-----|----------|-----------------| | 1 | 99883719 | ENSG00000000003 | | 1 | 99883962 | ENSG00000000003 | | 1 | 99884017 | ENSG00000000003 | | 1 | 99884166 | ENSG00000000034 |

Installing COGS

You will need to have the devtool R package installed

library(devtools)
install_github("ollyburren/rCOGS")
library(rCOGS)
library(GenomicRanges)
library(magrittr)
library(knitr)

loading recombination data

Here we load in ld regions - these are checked and output as a GRanges object

This is how to build objects - note due to size you will need to download these supporting objects into ./inst/extdata/

ld.gr <- load_ld_regions('../inst/extdata/hamap_1cM_recomb.bed')
ld.gr
data("ld.gr",package="rCOGS")
ld.gr

loading MAF data filter at 5%

maf.DT <- load_ref_maf('../inst/extdata/uk10k_reference_maf.tab',min.maf=0.05)
head(maf.DT)
data("maf.DT",package="rCOGS")
head(maf.DT)

load association data

N.B here we have a study with 4036 cases and 6959 controls.

t1d.DT <- load_gwas('../inst/extdata/ncooper_t1d_gwas.tab',maf.DT,ld.gr,n.cases=5913,n.controls=8829)
head(t1d.DT)
data("t1d.DT",package="rCOGS")
head(t1d.DT)

Notice that not all variants are imported, some are ommited as they are below the MAF filter (5%) used in the previous step others don't overlap an LD regions.

load in pcHi-C map

note the biotype.filter argument. This can be a vector of biotypes or left blank if everything is required.

feature.sets <- make_pchic('../inst/extdata/javierre_pchic_ensembl75.tab',biotype.filter='protein_coding')
names(feature.sets)
data("feature.sets",package="rCOGS")
names(feature.sets)

Note that we obtain a list where each element is a data.table

head(feature.sets[[1]])

create Virtual Promoter regions

All Hi-C technologies struggle when assessing interactions between adjacent restricition fragments as it is impossible to resolve what is due to stochastic Brownian motion and which are short range interactions. To offset this COGS has a routine that creates a `Virtual' promter region that consists of a baited fragment and the adjacent 5' and 3' fragments. It takes the digest and design file as arguments.

We add this to the list of regions.

## get a list of all genes included
all.genes <- lapply(feature.sets,function(g) unique(g$ensg)) %>% do.call('c',.) %>% unique
feature.sets[['VProm']] <- make_vprom('../inst/extdata/hindIII_digest_grch37.tab','./inst/extdata/javierre_design_ensembl75.tab',all.genes)

load in coding SNPS

returns a genomic ranges object

csnps.gr <- make_csnps('../inst/extdata/coding_snps_vep_ensembl75.tab')
head(csnps.gr)
data("csnps.gr",package = "rCOGS")
head(csnps.gr)

load Digest file

The fundamental unit of a pcHi-C map is a restriction fragment this saves us a lot of computation as we can precompute the support for each fragment to be causal once and then combine to get scores.

digest.gr <- load_digest('../inst/extdata/hindIII_digest_grch37.tab')
head(digest.gr)
data("digest.gr",package="rCOGS")
head(digest.gr)

Optional(Remove MHC)

MHC/HLA has an extremely complex and long LD structure and the genetic finemapping method employed is unable to effectively deal with it, thus it is best to exclude and analyse separately.

nrow(t1d.DT)
mhc.idx <- which(t1d.DT$chr==6 & between(t1d.DT$pos,25e6,35e6))
if(length(mhc.idx)>0)
    t1d.DT <- t1d.DT[-mhc.idx,]
nrow(t1d.DT)

compute overall cogs score

overall.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets)
head(overall.scores[order(cogs,decreasing = TRUE),]) %>% kable

Compute a T cell specific score

target_tissue<-c('Total_CD4_Activated','Total_CD4_NonActivated','Naive_CD4','Total_CD4_MF','Naive_CD8','Total_CD8')
tcell.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets,target_tissue) %>% head(.,)
head(tcell.scores[order(cogs,decreasing = TRUE),]) %>% kable


ollyburren/rCOGS documentation built on Aug. 5, 2020, 2:13 p.m.