In ollyburren/rCOGS: Computes gene prioritisation scores based on promoter capture Hi-C data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

What is rCOGS ?

rCOGS is a package for integrating GWAS summary statistics with the p romoter c apture Hi-C (pcHi-C) contact maps in order to prioritise genes for functional followup. COGS stands for C apture Hi-C O mnibus G ene S core and works best with pcHi-C datasets processed using CHiCAGO.

What does rCOGS require ?

The hardest part of running rCOGS is making sure source files are available and in the correct format. With all files make sure they are relevant to the population that you are studying and use the correct genome build coordinates. Unless explicitly stated all coords are GRCh37 and non-zero based.

Full support files that are suitable for use with data from Javierre et al. are available in ./inst/extdata/. Note that this vignette uses cutdown versions of these files filtered to use only chr22 to speed things up.

Depending on usage rCOGS requires the following files:

approximate linkage disequilibrium independent region files

These could be computed using data from the International HapMap project or you can use more recent alternative regions from Berisa et al.. Note that your LD file should match the population that you are studying! If you are using a more targeted platform such as ImmunoChip then you should use the LD regions used to design the platform. What is key is that these regions are non overlapping, it does not matter if they are not contiguous (you will just lose associations that don't overlap). The format is non-zero based BED format as follows

| chr | start | end | |-----|---------|---------| | 1 | 1 | 888659 | | 1 | 888660 | 1891263 | | 1 | 1891264 | 2299651 |

Minor allele frequency estimates in controls

In order to model whether a given SNP is causal rCOGS needs an estimate of the Minor allele Frequency (MAF) in a control set of samples. If you are using your own data you will have this information to hand. If on the other hand you are using summary statistics downloaded from a public resource such as GWAS Catalog you can use a reference set of genotypes such as 1KGenomes again it is important to make sure that these are relevant to the population that you are studying. The format is as follows and each variant to assessed should have a MAF in this file otherwise it will be excluded. Again position are non-zero based.

| chr | pos | maf | |-----|--------|----------| | 1 | 713914 | 0.033325 | | 1 | 714310 | 0.026845 | | 1 | 715265 | 0.037027 |

Trait association statistics

These are the univariate p-values obtained from GWAS analysis, nowadays the fitting of regression models is mainly performed in PLINK and thus this output format is supported. If however you have used an alternative method to assess association then it should have the following format.

| chr | pos | p | |-----|--------|-------| | 1 | 768253 | 0.617 | | 1 | 781845 | 0.891 | | 1 | 787606 | 0.871 | | 1 | 787844 | 0.977 |

pcHi-C data

By far the most complicated data file this should have the following format with data taken from Javierre et al.. It is an enhanced version of the peakMatrix format output by CHiCAGO.

| ensg | name | biotype | strand | baitChr | baitStart | baitEnd | baitID | baitName | oeChr | oeStart | oeEnd | oeID | oeName | dist | Monocytes | Macrophages_M0 | Macrophages_M1 | Macrophages_M2 | Neutrophils | Megakaryocytes | Endothelial_precursors | Erythroblasts | Foetal_thymus | Naive_CD4 | Total_CD4_MF | Total_CD4_Activated | Total_CD4_NonActivated | Naive_CD8 | Total_CD8 | Naive_B | Total_B | |-----------------|--------|----------------|--------|---------|-----------|----------|--------|--------------|-------|-----------|-----------|--------|--------|--------|------------------|------------------|------------------|------------------|------------------|------------------|------------------------|------------------|-------------------|------------------|------------------|---------------------|------------------------|------------------|-------------------|------------------|------------------| | ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100023790 | 100032508 | 813903 | . | 129849 | 1.2768325567372 | 3.36925757955278 | 3.50427556762941 | 5.77885715736681 | 0.71946319470011 | 1.57426075233114 | 0.935151981213395 | 2.07176252723988 | 0.518991005066328 | 1.35844284079167 | 1.80051450777455 | 2.14482233107626 | 2.81450087489112 | 1.0646936885784 | 0.329294941260539 | 2.31949322698618 | 2.91843729769059 | | ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100038149 | 100039492 | 813905 | . | 140521 | 2.22145436128498 | 2.34713825641161 | 1.94777355869935 | 1.8326996671119 | 1.19340351282302 | 4.71363557745529 | 1.75066400673428 | 3.10545714323329 | 1.8506228303492 | 1.52337854148901 | 1.92401193124468 | 1.4553080801738 | 0.959541835366562 | 1.47904009071165 | 2.33301812072822 | 5.09250926114187 | 3.5849349597336 |

Restriction fragment digest file

This file represents the output of an in silico restriction digest for the restriction enzyme employed in the pcHi-C experiment. At the time that this was written this was Hind III. These regions should be zero based and match those in the pcHi-C file above, additionally, the fragid column should refer to the baitID and oeID columns in the pcHi-C dataset above.

| chr | start | end | fragid | |-----|-------|-------|--------| | 1 | 1 | 16007 | 1 | | 1 | 16008 | 24571 | 2 | | 1 | 24572 | 27981 | 3 | | 1 | 27982 | 30429 | 4 |

pcHi-C design/annotation file

This file is a list of all the fragments for which capture probes were designed and depends on what annotation you are using. The fragid column again should be compatible with the restriction digest file and pcHi-C file. The ensg should also map to the ensg file in pcHi-C data file. Note that a fragment can contain more than one promoter !

| fragid | ensg | |--------|-----------------| | 218 | ENSG00000272438 | | 218 | ENSG00000230699 | | 219 | ENSG00000241180 | | 220 | ENSG00000268179 | | 220 | ENSG00000223764 | | 220 | ENSG00000187634 | | 223 | ENSG00000187961 |

Coding SNPs

rCOGS integrates functional information where possible. To do this it needs a catalogue of exonic variants. This can be obtained from a tool such as VEP and needs to have the following format:

| chr | pos | ensg | |-----|----------|-----------------| | 1 | 99883719 | ENSG00000000003 | | 1 | 99883962 | ENSG00000000003 | | 1 | 99884017 | ENSG00000000003 | | 1 | 99884166 | ENSG00000000034 |

Installing COGS

You will need to have the devtool R package installed

library(devtools)
install_github("ollyburren/rCOGS")

library(rCOGS)
library(GenomicRanges)
library(magrittr)
library(knitr)

loading recombination data

Here we load in ld regions - these are checked and output as a GRanges object

This is how to build objects - note due to size you will need to download these supporting objects into ./inst/extdata/

ld.gr <- load_ld_regions('../inst/extdata/hamap_1cM_recomb.bed')
ld.gr

data("ld.gr",package="rCOGS")
ld.gr

loading MAF data filter at 5%

maf.DT <- load_ref_maf('../inst/extdata/uk10k_reference_maf.tab',min.maf=0.05)
head(maf.DT)

data("maf.DT",package="rCOGS")
head(maf.DT)

load association data

N.B here we have a study with 4036 cases and 6959 controls.

t1d.DT <- load_gwas('../inst/extdata/ncooper_t1d_gwas.tab',maf.DT,ld.gr,n.cases=5913,n.controls=8829)
head(t1d.DT)

data("t1d.DT",package="rCOGS")
head(t1d.DT)

Notice that not all variants are imported, some are ommited as they are below the MAF filter (5%) used in the previous step others don't overlap an LD regions.

load in pcHi-C map

note the biotype.filter argument. This can be a vector of biotypes or left blank if everything is required.

feature.sets <- make_pchic('../inst/extdata/javierre_pchic_ensembl75.tab',biotype.filter='protein_coding')
names(feature.sets)

data("feature.sets",package="rCOGS")
names(feature.sets)

Note that we obtain a list where each element is a data.table

head(feature.sets[[1]])

create Virtual Promoter regions

All Hi-C technologies struggle when assessing interactions between adjacent restricition fragments as it is impossible to resolve what is due to stochastic Brownian motion and which are short range interactions. To offset this COGS has a routine that creates a `Virtual' promter region that consists of a baited fragment and the adjacent 5' and 3' fragments. It takes the digest and design file as arguments.

We add this to the list of regions.

## get a list of all genes included
all.genes <- lapply(feature.sets,function(g) unique(g$ensg)) %>% do.call('c',.) %>% unique
feature.sets[['VProm']] <- make_vprom('../inst/extdata/hindIII_digest_grch37.tab','./inst/extdata/javierre_design_ensembl75.tab',all.genes)

load in coding SNPS

returns a genomic ranges object

csnps.gr <- make_csnps('../inst/extdata/coding_snps_vep_ensembl75.tab')
head(csnps.gr)

data("csnps.gr",package = "rCOGS")
head(csnps.gr)

load Digest file

The fundamental unit of a pcHi-C map is a restriction fragment this saves us a lot of computation as we can precompute the support for each fragment to be causal once and then combine to get scores.

digest.gr <- load_digest('../inst/extdata/hindIII_digest_grch37.tab')
head(digest.gr)

data("digest.gr",package="rCOGS")
head(digest.gr)

Optional(Remove MHC)

MHC/HLA has an extremely complex and long LD structure and the genetic finemapping method employed is unable to effectively deal with it, thus it is best to exclude and analyse separately.

nrow(t1d.DT)
mhc.idx <- which(t1d.DT$chr==6 & between(t1d.DT$pos,25e6,35e6))
if(length(mhc.idx)>0)
    t1d.DT <- t1d.DT[-mhc.idx,]
nrow(t1d.DT)

compute overall cogs score

overall.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets)
head(overall.scores[order(cogs,decreasing = TRUE),]) %>% kable

Compute a T cell specific score

target_tissue<-c('Total_CD4_Activated','Total_CD4_NonActivated','Naive_CD4','Total_CD4_MF','Naive_CD8','Total_CD8')
tcell.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets,target_tissue) %>% head(.,)
head(tcell.scores[order(cogs,decreasing = TRUE),]) %>% kable

ollyburren/rCOGS documentation built on Aug. 5, 2020, 2:13 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ollyburren/rCOGS
Computes gene prioritisation scores based on promoter capture Hi-C data

In ollyburren/rCOGS: Computes gene prioritisation scores based on promoter capture Hi-C data

What is rCOGS ?

What does rCOGS require ?

approximate linkage disequilibrium independent region files

Minor allele frequency estimates in controls

Trait association statistics

pcHi-C data

Restriction fragment digest file

pcHi-C design/annotation file

Coding SNPs

Installing COGS

loading recombination data

loading MAF data filter at 5%

load association data

load in pcHi-C map

create Virtual Promoter regions

load in coding SNPS

load Digest file

Optional(Remove MHC)

compute overall cogs score

Compute a T cell specific score

R Package Documentation

Browse R Packages

We want your feedback!

ollyburren/rCOGS Computes gene prioritisation scores based on promoter capture Hi-C data

In ollyburren/rCOGS: Computes gene prioritisation scores based on promoter capture Hi-C data

What is rCOGS ?

What does rCOGS require ?

approximate linkage disequilibrium independent region files

Minor allele frequency estimates in controls

Trait association statistics

pcHi-C data

Restriction fragment digest file

pcHi-C design/annotation file

Coding SNPs

Installing COGS

loading recombination data

loading MAF data filter at 5%

load association data

load in pcHi-C map

create Virtual Promoter regions

load in coding SNPS

load Digest file

Optional(Remove MHC)

compute overall cogs score

Compute a T cell specific score

R Package Documentation

Browse R Packages

We want your feedback!

ollyburren/rCOGS
Computes gene prioritisation scores based on promoter capture Hi-C data