In BIMSBbioinfo/reg2gene: Predicting the target genes for regulatory elements

library(knitr)
library(reg2gene)
library(InteractionSet)
library(GenomicRanges)
library(rmarkdown)

opts_chunk$set(warning = FALSE,
               message= FALSE,
               fig.align='center',
               fig.path='Figures', 
               dev='png',
               fig.show='hold', 
               cache=FALSE)

Introduction

reg2gene R package was build to perform two main categories of tasks:

1. to build gene expression ~ enhancer activity models -

to associate target genes to regulatory elements genome-wide.

2. to annotate user provided genomic regions to genes -

TASK 1: Building models & predicting target genes set of reg2gene functions allows users to perform:

1. DATA INTEGRATION - e.g. to extract information stored in bigWig files (across different cell types) to:
2. a. quantify (enhancer) activity (regulatory potential) for genomic regions of interest (enhancer regions), and
3. b. quantify gene expression for genes of interest, and/or
4. c. to combine these informations into one object used further for modelling
1. DATA MODELLING (including performance assessment) - model of gene expression - enhancer activity using different algorithms, and combine different models with meta-analysis and voting procedure. Additionally, one can benchmark results and obtain confusion matrix.
1. DATA VISUALIZATION - Reported regulatory regions - genes associations can be visualized as loops in the context of the genome.

Schematic representation of quantification and data integration:

#knitr::include_graphics("https://github.com/BIMSBbioinfo/reg2gene/blob/master/vignettes/Figures/QuantificationDataIntegrationSimplified.png")

knitr::include_graphics("/data/akalin/Projects/AAkalin_reg2gene/reg2gene/vignettes/Figures/QuantificationDataIntegrationSimplified.png")

TASK 2: Annotate user provided genomic regions to the genes

One can annotate regions of interest (ChIP-Seq peaks or similarily defined genomic regions) to the genes they are associated with, based on the result of modelling enhancer-gene associations.

Optionally, one can as well associate genes with diseases reported in the disease-gene databases.

This vignette describes necessary input data for reg2gene package and demonstrates following functionalities :

1) How to quantify gene expression?
2) How to quantify enhancer activity?
3) How to prepare data prior modelling procedure?
4) How to run models of gene expression ~ enhancer activity?
5) How to benchmark reported enhancer-gene associations?
6) How to assess modelling performance?
7) How to visualize reported enhancer-gene associations?
8) How to annotate user provided genomic regions to the genes?

Building models with reg2gene

How to quantify gene expression?

FUNCTION OF CHOICE: bwToGeneExp()

BACKGROUND

Before running a modelling analysis, for genes and enhancers of interest, gene expression and enhancer activity need to be quatified (and normalized) -> Data integration step of the analysis.

IN DETAIL:

In short, this function firstly quantifies exon expressions over pre-defined exon regions using signal from RNA-Seq tracks (bigwig files), then it sums over all exons of a gene of interest to obtain levels of gene expression. With this function, gene expression levels can be quantified over a set of samples, cell types or conditions (list of .bigWig files).

DESCRIPTION

For a gene of interest and across all input samples (.bw files) individaully,the expression is firstly calculated for all exon regions definded in the input GRanges object (exons argument, details below). For quantificatons of enhancer activity over this exon regions, a ScoreMatrixBin() from the genomation package is used.

This is followed by per gene expression quantification as follows:

$Gene Expression=\sum_{i=1}^n (Exon Expression_n * ExonLength_n)/(GeneLength)$

$GeneLength=\sum_{i=1}^n ExonLength_n$

e.g. mean exon expression scores are multiplied by the exon length,summed together and divided by the gene length(a sum of all exon lengths for that gene) After quantifying gene expression, a normalization step ("quantile" or "ratio") can be runned.

NOTES: This function might be extended to work with BAM files in the future, however now it works only with relevant .bigWig files RNA-Seq .bw files can originate from stranded,unstranded or mixed libraries.

bwToGeneExp() OUTPUT:

When quantification using bwToGeneExp() is performed using toy examples of input GRanges objects (regTSS_toy) and a list of bigWig files (regActivityInputExample) a following result is obtained:

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")
regTSS_toy <- GRanges(c(rep("chr1",2),"chr2",rep("chr1",3)),
                      IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                                            c(rep("+",3),rep("-",3)))
regTSS_toy$reg <-  regTSS_toy[c(1,1,3,5,5,5)]
regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",
                                        c(1,1,3,5,5,5))


 bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw))

For each gene, information is quantified first across exons and then summed per gene to quantify gene expression levels.

In the following toy example there are 3 genes: TEST_Reg1, TEST_Reg3 and TEST_Reg5 with different number of exons:

TEST_Reg1 has 2 exons,

TEST_Reg3 has 1 exons,

TEST_Reg3 has 5 exons,

Notice that output GRanges object now has 3 genomic ranges reported which correspond to the toy genes analyzed: TEST_Reg1,TEST_Reg3 and TEST_Reg5. For these genes TSS is location is reported in the genomic ranges object.

Additionally, if exons input is of GInteractions class object, the same output GRanges object is obtained:

require(InteractionSet)

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
 test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")

 regTSS_toy <- GRanges(c(rep("chr1",2),"chr2",rep("chr1",3)),
                       IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                                             c(rep("+",3),rep("-",3)))
  regTSS_toy$reg <-  regTSS_toy[c(1,1,3,5,5,5)]
  regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",c(1,1,3,5,5,5))

 # if exons input is of GInteractions class object, the same output is obtained

 exons= GInteractions(regTSS_toy,regTSS_toy$reg)
    exons$name=regTSS_toy$name
    exons$name2=regTSS_toy$name2

     exons

     print("Which results in:")

    bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw))

How to quantify enhancer activity?

FUNCTION OF CHOICE: regActivity()

BACKGROUND

Before running a modelling analysis, for genes and enhancers of interest, gene expression and enhancer activity need to be quatified (and normalized) -> Data integration step of the analysis.

IN DETAIL

Enhancer activity can be quantified based on the bigWig tracks of ChIP-Seq signals for histone modifications (such as H3K27ac and H3K4me1, but for any other histone modification as well), of DNase-seq signals and of bisulfite-sequencing results for DNA methylation for any pre-defined GRanges object which contains info about enhancer regions in the genome.

To quantify enhancer activity regActivity() is used, and this funcion performs across a set of samples.

DESCRIPTION

For a regulatory regions of interest and across all input samples (.bw files) individaully the activity is calculated using ScoreMatrixBin() from the genomation package. Regulatory activity is measured by averaging logFC for histone modification ChIP-seq profiles, or DNAse signal, or DNA methylation per base. A before, relevant bigWig files are required to calculate activity,but function might be extended to work with BAM files in the future.

An output of a function is a GRanges object where its meta-columns correspond to the calculated activity levels per input sample. Column names correspond to provided sample ids or names.

regActivity() OUTPUT:

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")
regTSS_toy <- GRanges(c(rep("chr1",4),rep("chr2",2)),
                      IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                                            c(rep("+",3),rep("-",3)))
regTSS_toy$reg <-  regTSS_toy[c(1,1,3:6)]
regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",
                                        c(1,1,3:length(regTSS_toy)))
regActivity(regTSS_toy,c(test.bw,test2.bw))

How to prepare data prior modelling procedure?

FUNCTION OF CHOICE: regActivityAroundTSS()

BACKGROUND After quantification has been done, quantified enhancer activities and gene expressions need to be stored in one GRanges object, in the following manner: for each gene include all enhancers within a certain range from the TSS (transcriptional start site) of a gene (by default it is +/-1Mb).

This is important since many enhancers do not necessarily regulate the closest gene (Mifsud et al. 2015, Schoenfelder et al 2015, Javierre et al. 2016), but they bypass nearby genes while forming long-range interactions with targeted promoters (Tolhuis et al. 2002, Lettice et al. 2003) and vice-versa, genes are commonly regulated by more than one enhancer region (Tolhuis et al. 2002).

IN DETAIL

1) Identify enhancers located within certain window from TSS 2) Create GRanges list where, per gene, gene expression & enhancer activity for nearby enhancers will be stored.

DESCRIPTION

An output of regActivityAroundTSS() is a GRangesList object that contains per gene GRanges (names for the GRangesList are unique gene ids/names). Each GRanges stores location of the corresponding TSS and regulatory regions identified around that gene. Metadata for each GRanges object in the GRangesList represents regulatory activity and gene expression quantified across a number of samples (.bw files) that have matched IDs in RNA-Seq and CHiP-Seq,DNase-Seq or bisulfite sequencing experiment.

Thus, GRanges objects have the following metadata columns:

 1. featureType: either "gene" or "regulatory" 
 2. name: name/id for gene and enhancers. Gene name could 
 be id from a database enhancer name should be in the format as follows 
 "chr:start-end" 
 3. name2: a secondary name for the feature, such as gene symbol "PAX6" etc.
 not necessary for enhancers could be NA 
 4. other columns: numeric values for gene expression or regulatory actvity.
 Column names represent sample names/ids

Importantly, only enhancers located within predefined (+/-) upstream/downstream regions of TSS are identified, extracted and reported in output (together with info about gene expression).

!!! Sample id's (corresponding to the cell types or conditions) are included in output object only if both, 1) gene expression values and 2) quantified regulatory activity are available in TSS and regActivity objects. Non-overlapping cell types are excluded.

regActivityAroundTSS() OUTPUT:

regTSS_toy <- GRReg1_toy
  regTSS_toy$bw1 <- rep(1,length(GRReg1_toy))
  regTSS_toy$bw2 <- rep(2,length(GRReg1_toy))
  regTSS_toy$bw3 <- rep(3,length(GRReg1_toy))
regReg_toy <- GRReg2_toy
   regReg_toy$bw1 <- rep(3,length(regReg_toy))
   regReg_toy$bw2 <- rep(4,length(regReg_toy))

regActivityAroundTSS(regReg_toy,regTSS_toy,upstream=1,downstream=1)

NOTE: In the previous example there are 3 genes: TEST_Reg1, TEST_Reg2, TEST_Reg3 so GRanges in GRangesList object are named accordingly.

How to run models of gene expression ~ enhancer activity?

FUNCTION OF CHOICE:associateReg2Gene()

BACKGROUND

To link enhancer activity and gene expression reg2gene functions utilize correlate and/or use other statistical approaches, elastic net and random forests, to model gene expression ~ enhancer activity across different cell types. This approach is based on the observation that enhancers show very high tissue specificity and the level of their activity correlates with gene expression (Visel et al. 2009., Ernst et al. 2011).

ANALYSIS SCOPE

All analyses can be done easily for different types of enhancer-related chromatin marks ( H3K4me1, H3K27ac, DNA methylation, DHS, etc.) and for different sources of data (Roadmap, Blueprint or in-house datasets), as well as different algorithms can be used: dcor, elastic net, random forest, spearman and pearson correlation coefficient.

Additionaly, single-model information (gene-enhancer pairs) can be aggregated by means of meta-analysis across different data-sources or by majority voting decisions across both different modelling procedures/ data-types/data-sources. This analysis is possible only when more that one modelling procedure has been performed.

IN DETAIL

The function implements five methods to associate regulatory activity to genes:

correlation (Pearson and/or Spearman),

distance correlation ("dcor"),

elasticnet and

random forests.

Correlation is implemented to capture linear individual enhancer~gene relationships. Likewise, simple linear regression on scaled data is equivalent to Pearson correlation.

Distance correlation is a metric for statistical dependence between two variables. It can capture non-linear relationships different from correlation.

Elasticnet and randomForests are methods that can capture additivity between regulatory activities and their relationship to gene expression. For details about these methods take a look at:

For all the methods, P-values are estimated by shuffling the gene expression vector B times and getting a null distribution for the estimated statistic, and comparing the original statistic to the null distribution. Estimated statistics are correlation coefficients for correlation and distance correlation methods, regression coefficients for elasticnet and gini importance scores for random forests. P-values from resampling statistics are estimated using Gamma distribution, e.g. gamma distribution based P-values from the null distribution are obtained from resampling (gamma distribution is fit to the null distribution and p-values are calculated based on that fitted distribution).

An output of this function is GInteraction object containing every potential association and between a regulatory region and TSS, and the estimated association statistic: its P-values and Q-values.

For the output of regActivityAroundTSS() object

###############################
#STEP 1.  Getting random and predefined .8 correlation

 require(GenomicRanges)
 require(doMC)
 require(glmnet)
 require(foreach)
 require(stringr)
 require(qvalue)

 ####################################
 # create example

 x <- c(2.000346,2.166255,0.7372374,0.9380581,2.423209, 
      2.599857,4.216959,2.589133,1.848172,3.039659)
 y <- c(2.866875,2.817145,2.1434456,2.9039771,3.819091,5.009990,
      5.048476,2.884551,2.780067,4.053136)
 corrM <- rbind(x,y)

 # define Granges object
  gr0 <- GRanges(seqnames=rep("chr1",2),IRanges(1:2,3:4))

    GeneInfo <- as.data.frame(matrix(rep(c("gene","regulatory"),each=3),
                ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)

        colnames(GeneInfo) <- c("featureType","name","name2")

       mcols(gr0) <- DataFrame(cbind(GeneInfo,corrM))


       gr0

Run associateReg2Gene() as follows:

###############################
#STEP 1.  Getting random and predefined .8 correlation

 require(GenomicRanges)
 require(doMC)
 require(glmnet)
 require(foreach)
 require(stringr)
 require(qvalue)

 ####################################
 # create example

 x <- c(2.000346,2.166255,0.7372374,0.9380581,2.423209, 
      2.599857,4.216959,2.589133,1.848172,3.039659)
 y <- c(2.866875,2.817145,2.1434456,2.9039771,3.819091,5.009990,
      5.048476,2.884551,2.780067,4.053136)
 corrM <- rbind(x,y)

 # define Granges object
  gr0 <- GRanges(seqnames=rep("chr1",2),IRanges(1:2,3:4))

    GeneInfo <- as.data.frame(matrix(rep(c("gene","regulatory"),each=3),
                ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)

        colnames(GeneInfo) <- c("featureType","name","name2")

       mcols(gr0) <- DataFrame(cbind(GeneInfo,corrM))

    print("associateReg2Gene(gr0,cores = 1,B=100)")
    associateReg2Gene(gr0,cores = 1,B=100)

This example was created to have correlation between gene expression and enhancer activity equal to 0.8. This estimated association statistic is reported as coefs and is recalculated as 0.8. Associated P-values are calculated in and reported as pval. Corresponding qval is calculated. However, q-values are set to be NA since, this toy example contains only one gene~enhancer pair, thus one p-value is calculated but q-values cannot be calculated based on only one p-value.

Importantly, this function does not report statistically significant gene-enhancer associations, whereas it reports all input gene-enhancer associations and corresponding test statistics. It is up to researcher to decide which gene-enhancer associations they consider to be statistically significant pairs.

MODELLING ADDITIONS: meta-analysis & voting

In cases when more that one modelling procedure has been performed, one can try to combine these results. This package can perform meta-analysis and voting procedure.

voting

FUNCTION OF CHOICE:voteInteractions()

BACKGROUND

When can you perform voting?

You can perform voting analysis to combine models that differ in algorithm, or method, or cohort (just as meta-analysis) used when modelling.

For example, for Roadmap H3K4me1 dataset 5 different modelling algorithms can be used: dcor, spearman, pearson, elastic net & random forests, and results of all these can be combined by voting.

The results of this voting procedure is a list of gene~enhancer associations which have been confirmed by at least two or more algorithms.

The same thing can be done for combining gene~enhancer associations which have been confirmed by at least two or more methods/experimantal type, e.g. voting by method I used to combine H3K4me1, H3K27ac, DNAme and DNaze modelling results based on for example Roadmap&dcor models.

IN DETAIL

Multiple associations output by different methods and data sets can be combined this way, and only the associations that have support in vote.threshold fraction of the datasets will be retained.

voteInteractions() selects POSITIVES (statistically associated gene~enhancer pairs) for each result of associateReg2Gene() analysis that wants to be combined by majority voting (for example results of H3K4me1 and H3K27ac associateReg2Gene() analysis).

Assessing statistically associated gene-enhancer pairs has been done by filtering the statistics (cutoff.stat) of the elements of the input list (gene-enhancer pairs) based on the defined cutoff value (cutoff.val).

For example, retain only gene-enhancer pairs that are supported by modelling reg2gene analysis using both H3K4me1 and H3K27ac to quantify enhancer activity. Or, retain only gene-enhancer pairs that are supported by modelling reg2gene analysis using pearson, spearman and elasticnet algorithms.

Schematic representation of majority voting:

#knitr::include_graphics("https://github.com/BIMSBbioinfo/reg2gene/master/vignettes/Figures/VOTING.png")

knitr::include_graphics("/data/akalin/Projects/AAkalin_reg2gene/reg2gene/vignettes/Figures/VOTING.png")

voteInteractions() OUTPUT:

is GInteractions object with gene~enhancer interactions voted above predefined threshold (reported as votes column).

require(GenomicRanges)
require(InteractionSet)

gr2 <- gr <- GRanges(seqnames=rep("chr1",3),IRanges(1:3,3:5))
   x <- 1:5
   y <- 2:6
   z <- 10:14
   a <- rep(0,length(x))


   GeneInfo <- as.data.frame(matrix(c(rep("gene",3),rep("regulatory",6)),
               ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)
               colnames(GeneInfo) <- c("featureType","name","name2")

 mcols(gr) <- DataFrame(cbind(GeneInfo,rbind(x,y,z)))
 mcols(gr2) <- DataFrame(cbind(GeneInfo,rbind(x,y,a)))

 # create associateReg2Gene output objects, GInteractions will all 
 # output results

 AssocObject <- reg2gene::associateReg2Gene(gr)
 AssocObject2 <- reg2gene::associateReg2Gene(gr2)

 # input for meta-analysis is list of such objects

 interactions <- list(AssocObject,AssocObject2)
 names(interactions) <- c("H3K4me1","H327ac")

 # Run voteInteractions
 voteInteractions(interactions, 
                  cutoff.stat="pval",
                  cutoff.val=0.05,
                  vote.threshold=0.5)

By defult then following command was runned: voteInteractions(interactions, cutoff.stat = "pval", cutoff.val = 0.05, vote.threshold = 0.5) And this example was based on the following input data:

Additionally, one can perform meta-analysis:

FUNCTION OF CHOICE:metaInteractions()

BACKGROUND

When can you perform meta-analysis?

If you have information coming from different cohorts, or produced by different organizations/data centers,and thus there is no overlap between samples, one can run meta-analysis to combine p-values from individual models [for given combination of enhancer region-gene expression]. Meta-analysis should improve reprodicibility of modelling results.

For example, I runned meta-analysis in the following case: 1) I selected enhancer regions and genes I want to model 2) For selected enhancer regions I quantified and normalized enhancer activity based on H3K4me1 .bigwig tracks for different cell types. I did that for .bigwig tracks that are reported as a result of Roadmap consortium.

However, there are other sources of publically available per cell type H3K4me1 .bigwig files (take a look at the IHEC web site). Thus, I repeated the same quantification procedure for other available individual experiments/consortiums .bigwig files: Blueprint, CEEHRC: CEMT & McGill.

Altogether, for every tested enhancer~gene associations I runned four different models -> one for each data-source. Since, there is no overlap between data coming from different sources I can meta-analyze results of each gene-enhancer model.

NOTE! When you meta-analyze, you can do that only for the corresponding combination of experimental type/method(H3K4me1) and algorithm (spearman). You do not meta-analyze results of different modelling procedures (spearman+pearson, or H3H27ac and DNAme).

IN DETAIL

metaInteractions() combines association P-values and coefficients from different data sets using Fisher's method and weigthed averaging respectively. It is useful to combine datasets produced by different research groups while avoiding problems such as batch effects.

For example, it can be used to combine Roadmap and Blueprint associateReg2Gene gene~enhancer single-model information obtained using across-cell RNA-Seq signals and CHiP-Seq H3K27ac tracks.

Aggregating single-model information (gene-enhancer pairs) by means of meta-analysis across different data-sources increases the statistical power, improves the precision and accuracy of estimates and altogether produces more robust and reproducible results (Kirkwood 1998)

After, combining P-values, q-values are calculated using the qvalue function.

Meta-analysis - simplified pictorial representation:

#knitr::include_graphics("https://github.com/BIMSBbioinfo/reg2gene/blob/master/vignettes/Figures/Meta-Analysis_Simplified.png")


knitr::include_graphics("/data/akalin/Projects/AAkalin_reg2gene/reg2gene/vignettes/Figures/Meta-Analysis_Simplified.png")

metaInteractions() OUTPUT:

is a GInteractions object with an updated statistics from the meta-analysis: meta-analyzed p-values,q-values, test statistics (coefs). As well as n or number of input cell types used while running this meta-analysis.

The output is similar to the output of associateReg2Gene() because it just combines associateReg2Gene() results across different associateReg2Gene() output results.

# creating datasets

require(GenomicRanges)
require(InteractionSet)

gr2 <- gr <- GRanges(seqnames=rep("chr1",3),IRanges(1:3,3:5))
   x <- 1:5
   y <- 2:6
   z <- 10:14
   a <- rep(0,length(x))


   GeneInfo <- as.data.frame(matrix(c(rep("gene",3),rep("regulatory",6)),
               ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)
               colnames(GeneInfo) <- c("featureType","name","name2")

 mcols(gr) <- DataFrame(cbind(GeneInfo,rbind(x,y,z)))
 mcols(gr2) <- DataFrame(cbind(GeneInfo,rbind(x,y,a)))

 # create associateReg2Gene output objects, GInteractions will all 
 # output results

 AssocObject <- reg2gene::associateReg2Gene(gr)
 AssocObject2 <- reg2gene::associateReg2Gene(gr2)

 # input for meta-analysis is list of such objects

 interactions <- list(AssocObject,AssocObject2)
 names(interactions) <- c("Roadmap","Blueprint")

 # Run metaA
 metaInteractions(interactions)

Which was runned as: metaInteractions(interactions)

How to benchmark reported enhancer-gene associations?

FUNCTION OF CHOICE:benchmarkInteractions()

BACKGROUND

Associated gene-enhacer pairs (an output of associateReg2Gene()) can be easily benchmarked using the second set of linked genes and enhancers. For example, coordinates of predicted interacting regions in the genome, as resulted from chromatin conformation capture and related methods (CHiA-PET, 4C, 5C, HiC, PC-HiC) can be used to benchmark results of reg2gene gene-enhacer modelling procedure. It is so since, information about the 3D genome architecture, can be used as a proxy of enhancer-mediated regulation of gene expression (Dekker et al. 2002, Dosie et al. 2006, Simonis et al. 2006, Fullwood et al. 2009, Lieberman-Aiden et al. 2009), and thus can represent a good benchmark dataset to benchmark results of reg2gene modelling approach. Likewise, a benchmark dataset against which you can compare modelling results, can be experimentaly confirmed enhnacer~gene pairs (Visel et al. 2009), or eQTL-gene associations(GTEx Consortium).

IN DETAIL

benchmarkInteractions() takes two GInteractions objects as an input, and the first object (interactions) is benchmarked with respect to the second one (benchInteractions).

If anchor1 of the first GInteractions object (interactions) overlaps either anchor1 or anchor2 of the second GInteractions object (_benchInteractions__), then anchor2 of the first GInteractions necessarily needs to overlap the opposite pair in the benchmark dataset (2nd GInteractions object). In other words, criss-cross overlap of interacting regions is performed: if anchor1 from the benchmark dataset is overlapping anchor2 from the tested set, than anchor2 from the benchmark dataset needs to overlap anchor1 from the tested set, or vice-versa.

Schematic representation of possible benchmarking procedure

#knitr::include_graphics("https://github.com/BIMSBbioinfo/reg2gene/blob/master/vignettes/Figures/BenchSimpleE.png")

knitr::include_graphics("/data/akalin/Projects/AAkalin_reg2gene/reg2gene/vignettes/Figures/BenchSimpleE.png")

Running benchmark:

   interactions <- GInteractions(GRReg1_toy,GRReg1_toy$reg)[2]
   benchInteractions <- GInteractions(GRReg2_toy,GRReg2_toy$reg)

   # removing confusing meta-data
   mcols(interactions) <- NULL

benchmarkInteractions(interactions,
            benchInteractions,
            binary=TRUE)

In the example above,

anchor1 from the test set overlaps anchor1 from the benchmark set (and anchor2 overlaps anchor2).

Anchor 1 of test set:

tmp <- GInteractions(GRReg1_toy,GRReg1_toy$reg)[2]
mcols(tmp) <- NULL
tmp

Overlaps anchor1 of the benchmark set:

tmp <- GInteractions(GRReg1_toy,GRReg1_toy$reg)[5]
mcols(tmp) <- NULL
tmp

However, to be reported as benchmarked anchor1-anchor2 set (a tested gene-enhancer pair) an anchor2 needs to overlap anchor2 as well:

benchmarkInteractions() OUTPUT:

is equal to the input GInteractions object (interactions), BUT with added benchmark results metadata as a "Bench" column.

How to assess modelling performance?

FUNCTION OF CHOICE:confusionMatrix()

BACKGROUND

confusion matrix can be calculated to assess performance of the modelling procedure based on the external benchmark dataset and the resuls of benchmarkInteractions().

IN DETAIL

There are four different elements of the confusion matrix:

TP = (number of) true positive: reg2gene entry that was reported to be 
associated (reported gene-enhancer statistics lower than a predefined
threshold) and was benchmarked.

FP = (number of) false positive: reg2gene entry that was reported to 
be associated (reported gene-enhancer statistics lower than a predefined
threshold) BUT was not overlapped with benchmark dataset

FN = (number of) false negative: reg2gene entry that was NOT reported to
be associated (reported gene-enhancer statistics NOT lower than a predefined 
threshold) BUT is benchmarked

TN = (number of) true negative: reg2gene entry that was NOT reported to be 
associated (reported gene-enhancer statistics NOT lower than a predefined 
threshold) AND is NOT benchmarked. If no benchmark and no statistically
significant data is entered, then 0 is reported.

From previous categories confusionMatrix() calculates following matrices:

$Sensitivity = TP/(TP+FN)$

$Specificity = TN/(TN+FP)$

$Accuracy = (TP+TN)/(TP+FN+FP+TN)$

$PPV = TP/(TP+FP)$

$NPV = TN/(TN+FN)$

$F1 = (2TP)/((2TP)+FP+FN)$

Usaully, benchmarking needs to be performed since confusion matrix is based on the result of benchmarking...

However, I created this example manually:

interactionsBench <- GInteractions(GRReg1_toy,GRReg1_toy$reg)

Bench <- interactionsBench$anchor1.Bench1Exp
Filter <- interactionsBench$anchor1.Filter1Exp
      mcols(interactionsBench) <- NULL

   interactionsBench$Pval <- seq(0, 1, length.out = length(GRReg1_toy))
   interactionsBench$Bench <- Bench
   interactionsBench$Filter <- Filter

  interactionsBench

confusionMatrix should be runned as follows:

interactionsBench <- GInteractions(GRReg1_toy,GRReg1_toy$reg)

Bench <- interactionsBench$anchor1.Bench1Exp
Filter <- interactionsBench$anchor1.Filter1Exp
      mcols(interactionsBench) <- NULL

   interactionsBench$Pval <- seq(0, 1, length.out = length(GRReg1_toy))
   interactionsBench$Bench <- Bench
   interactionsBench$Filter <- Filter

confusionMatrix(interactionsBench,
                thresholdID = "Pval",
                thresholdValue = 0.05,
                benchCol = "Bench",
                prefilterCol = "Filter",
                statistics = "ConfusionMatrix")

How to visualize reported enhancer-gene associations?

FUNCTION OF CHOICE: plotInteractions()

BACKGROUND

For user provided enhancer~gene associations (imported as GInteractions) object plot associations as loops in the context of the genome.

IN DETAIL

Creating a dataset to plot: FTO/IRX3/IRX5 region

enhancers <- GRanges(rep("chr16",6),
                      IRanges(c(53112601,55531601,53777201,
                                53778801,54084001,53946467),
                              c(53114200,55533250, 53778800, 
                                53780400, 54084400 ,53947933)))

genes <- GRanges(rep("chr16",6),
                     IRanges(c(53737874, 54964773, 54320676,
                               53737874, 54964773, 54320676),
                             c(53737874, 54964773, 54320676,
                               53737874, 54964773, 54320676)))

GenomeInteractions <- GInteractions(enhancers,genes)

GenomeInteractions$name2 <- c("FTO","IRX5","IRX3")

GenomeInteractions$pval <- c(0.20857403, 0.72856090, 0.03586015,
                             0.32663439, 0.32534945, 0.03994488)

GenomeInteractions$color <- c("red","blue","grey")

plotInteractions

 plotInteractions(interactions = GenomeInteractions,
                  statistics ="pval",
                  coloring = "color")

You can choose specific gene to plot: FTO

 plotInteractions(interactions = GenomeInteractions,
                        selectGene="FTO")

Or choose specific region to plot: chr16:53112601-53114200

NOTE! This region does not necessarilly need to be equal to the regulatory regions reported in the interactions input objects, whereas it only needs to overlap some regulatory regions.

  plotInteractions(interactions = GenomeInteractions,
               selectRegulatoryRegion = "chr16:53112601-53114200")

Additionaly, benchmark datasets can be plotted as well

 benchInteractions = list(GenomeInteractions[1:3])

  plotInteractions(interactions = GenomeInteractions,
                         coloring = "color",
                         statistics = "pval",
                         benchInteractions = benchInteractions)

Annotating genomic regions to genes

How to annotate regulatory regions to genes?

FUNCTION OF CHOICE: reg2gene()

BACKGROUND

User can annotate regions of interest (ChIP-Seq peaks or any other genomic regions according to the provided enhancer~gene associations object or gene GRanges object.

Function either annotates input regions to their nearby genes or it hierarchical runs annotation procedure (when GIneraction object used as an input provids info about enhancer~gene associations) as follows: first - annotate to promoters, then remaining regions annotate to enhancers, then remaining regions nearby genes.

IN DETAIL

If interactions object is missing, function annotates the input regions to their nearby genes.

When GIneraction object - interactions - is used as an input, then hierarchical association procedure is runned as follows: promoters,enhancers, nearby genes, eg: 1) genomic regions of interest are first considered to be promoters and associated with nearby genes if they are located within a certain distance from TSS of nerbay gene (default +/-1000bp); otherwise 2) remaning genomic regions are overlapped with enhancer regions, and genes associated to that enhancer regions are reported, 3) if no overlap with either promoters nor enhancers is identified, then closest gene is reported if it is located within 1Mb 4) if no gene located within 1Mb has been identified then, this region is filtered out.

IMPORTANT! interactions can store info about enhancer~gene interactions, and in that case Anchor1 in GInteractions object needs to be regulatory region, whereas anchor2 is the location of gene/TSS.

However, it can store info about interactions regardless whether one anchor of these interactions correspond to promoters, for example HiC interactions. In that case set annotateInteractions=FALSE and interactions are firstly annotated to the genes (unannotated interactions are removed from the dataset), and windows are subsequently annotated to genes using annotated interactions object.

Showcase

Creating toy example:

# creating toy example
# 1. windows of interest
   windows <- GRanges(c("chr1:1-2", # 1. overlap prom
                                 "chr2:1-2",  # 2. overlap enh
                                 "chr3:1-2", # 3. overlap tss +/- 1,000,000
                                 "chr4:1-2")) # 4. do not overlap tss +/- 1Mb

     annotationsEnh <- GRanges(c("chr1:1-2",
                                 "chr2:1-2",
                                 "chr3:100000-100002",
                                 "chr4:10000001-10000002"))

     annotationsGenes <- GRanges(c("chr1:1-2",
                                   "chr2:100000-100002",
                                   "chr3:99999-100002",
                                   "chr4:10000001-10000002"))

     seqlengths(annotationsEnh) <- seqlengths(annotationsGenes) <- rep(10000002,
                                                                       4)
  # example of interactions
         interactions = GInteractions(annotationsEnh,annotationsGenes,
                                 name=c("gen1","gen2","gen3","gen4"))

  # example of geneAnnotations   
     geneAnnotations=second(interactions)
     mcols(geneAnnotations) <- mcols(interactions) 

     print("windows of interest")
     windows
      print("toy example of interactions object")
     interactions
      print("toy example of geneAnnotations object")
     geneAnnotations

Running reg2gene annotation function with created toy example:

  # run annotation function: 
 reg2gene(windows=windows,
          interactions=interactions, 
          geneAnnotations = geneAnnotations)

 # which regions are not identified

 reg2gene(windows=windows,
          interactions=interactions, 
          geneAnnotations = geneAnnotations,
          identified=FALSE)

 # if interactions are not available, assign interactions based solely on the 
 # proximity to promoters
 reg2gene(windows = windows,
                 interactions=NULL,
                 geneAnnotations = geneAnnotations)

In addition, if argument identified is set to be FALSE reg2gene() will report genomic regions for which corresponding genes were not identified.

 reg2gene(windows,
          geneAnnotations=geneAnnotations,
          interactions,
          identified=F)

Additional info:

Data description

GRanges input:

This package comes with 2 toy datasets, GRanges objects:

GRReg1_toy

reg2gene::GRReg1_toy

and GRReg2_toy

reg2gene::GRReg2_toy

with genome locations preselected in such way that all possible outputs of the benchmarking procedure is captured.

*Colums which correspond to the expected outcome of the benchmarking procedure are: Bench1Exp, Filter2Exp, Bench2Exp, Filter1Exp

Bench1 and Filter1 corresponds to the benchmarking of GRReg1_toy with GRReg2_toy,

whereas Bench2 and Filter2 corresponds to the benchmarking of GRReg1_toy with itself.

For example, a following object

reg2gene::GRReg1_toy[7]

is benchmarked 3 times with GRReg2_toy

GRReg2_toy[10:12]

.bigWig files as an input

A second necesarry argument for quantification functions is a named list of BigWig file paths is used as an input to link bigwig locations.

As example shows:

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")

regActivityInputExample <- c(test.bw,test2.bw)  

regActivityInputExample

GInteractions objects

Input data for quantification functions are GRanges objects.

However, modelling procedure outputs, by default, GInteractions object from the InteractionSet package.

Toy or any other GRanges objects can be easily organized as GInteractions objects:

    GRReg1_toyGI <- GRReg1_toy

    GRReg1_toyGI <- GInteractions(GRReg1_toyGI,GRReg1_toyGI$reg)

    GRReg1_toyGI[1:3]

Argument description & notes for functions

bwToGeneExp() arguments and notes

INPUT DATA DETAIL description (min arguments):

1. exons

Information about exon and gene coordinates should be a GInteractions object from the InteractionSet package, with Anchor1 representing the exon locations, and Anchor2 represents location of the corresponding gene.

require(InteractionSet)

GRReg1_toyGI <- reg2gene::GRReg1_toy[1]

GRReg1_toyGI <- GInteractions(GRReg1_toyGI,GRReg1_toyGI$reg)

mcols(GRReg1_toyGI) <- mcols(GRReg1_toyGI)[1:3]

GRReg1_toyGI

Optionally, the input GRanges object that contains exon regions over which the expression will be calculated needs to contain a meta-data columns with the following information:

1) reg - GRanges object - corresponding gene location, or TSS location. 2) name (character)- ENSEMBL or other gene identifier; 3) name2(character,optional) - 2nd gene identifier,

Example:

toy <- reg2gene::GRReg1_toy[1]
mcols(toy) <- mcols(toy)[1:3]
toy

It is strongly suggested to adjust seqlengths of this object to be equal to the seqlenghts Hsapiens from the appropriate package (BSgenome.Hsapiens.UCSC.hg19 or whatever needed version).

2. target

A second necesarry argument for quantification functions is a named list of BigWig file paths is used as an input to link bigwig locations.

As example shows:

require(reg2gene)

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")

regActivityInputExample <- c(test.bw,test2.bw)  

regActivityInputExample

Additional arguments:

sampleIDs argument

Additional argumets in bwToGeneExp() adjust for names of RNA-Seq libraries (argument sampleIDs); whether stranded or/and unstranded libraries are included, normalization procedure and parallel processing.

For example, sampleIDs by defult uses basenames of the paths for input .bigWig files as a unique sample ids/names. However, a vector of unique sample ids/names(.bw files) can be used as well. It is necessary that values in this vector are ordered as the .bigWig file paths are. The previous toy example unique sample ids:test and test2 can be modified with the sampleIDs argument to CellType and CellType2 as follows:

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")
regTSS_toy <- GRanges(c(rep("chr1",2),"chr2",rep("chr1",3)),
                      IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                      c(rep("+",3),rep("-",3)))
regTSS_toy$reg <-  regTSS_toy[c(1,1,3,5,5,5)]
regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",
                                              c(1,1,3,5,5,5))


bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw),
            sampleIDs=c("CellType1","CellType2"))

libStrand argument

RNA-Seq stranded and unstranded libraries allowed. BUT!!! It is crucial that forward and reverse RNA-Seq libraries are listed in one on top of other. For example,

sampleIDs <- c("/Reverse1.bw","/Forward1.bw","/Reverse2.bw","/Forward2.bw",
               "Unstranded1")

sampleIDs

needs to be accompanied with

libStrand <- c("+","-","+","-","*")

libStrand

normalize

Normalization procedure in not performed by default but 2 normalizations can be optionally used:a) "quantile"

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")
regTSS_toy <- GRanges(c(rep("chr1",2),"chr2",rep("chr1",3)),
                      IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                      c(rep("+",3),rep("-",3)))
regTSS_toy$reg <-  regTSS_toy[c(1,1,3,5,5,5)]
regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",
                                              c(1,1,3,5,5,5))


print("bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw),
      normalize=\"quantile\")")

bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw),
            normalize="quantile")

If normalize argument is set to be "quantile" activity measures are quantile normalized as implemented in the normalize.quantiles() from preprocessCore package.

and b) "ratio"

test.bw <- system.file("extdata", "test.bw",package = "reg2gene")
test2.bw <- system.file("extdata", "test2.bw",package = "reg2gene")
regTSS_toy <- GRanges(c(rep("chr1",2),"chr2",rep("chr1",3)),
                      IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                      c(rep("+",3),rep("-",3)))
regTSS_toy$reg <-  regTSS_toy[c(1,1,3,5,5,5)]
regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",
                                              c(1,1,3,5,5,5))

print("bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw),
      normalize=\"ratio\")")                                  

bwToGeneExp(exons = regTSS_toy,target = c(test.bw,test2.bw),
            normalize="ratio")

If normalize argument is set to be"ratio" then "median ratio method" implemented as estimateSizeFactorsForMatrix from DESeq2 package is used to normalize activity measures. NOTE: Normalization is runned on the level of gene expression.

summaryOperation

This argument designates which summary operation should be used over the regions Currently, only "mean" is available, but "median" or "sum" will be implemented in the future.

mc.cores An option for parallel processing using mclapply from parallel package.

regActivity() arguments and notes

min arguments

To run this function, minimaly windows GRanges object, and a list of paths to .bigwig files must be used as an input.

windows

windows argument corresponds to a GRanges object which contains regulatory regions in the genome over which the regulatory activity will be calculated. It is strongly suggested to adjust seqlengths of this object to be equal to the seqlenghts of Hsapiens from the appropriate package (BSgenome.Hsapiens.UCSC.hg19 or whatever needed version).

Example:

regTSS_toy <- GRanges(c(rep("chr1",4),rep("chr2",2)),
                      IRanges(c(1,7,9,15,1,15),c(4,8,14,20,4,20)),
                      c(rep("+",3),rep("-",3)))
regTSS_toy$reg <-  regTSS_toy[c(1,1,3:6)]
regTSS_toy$name2 <- regTSS_toy$name <- paste0("TEST_Reg",
                                              c(1,1,3:length(regTSS_toy)))


regTSS_toy

target

A second necesarry argument for regActivity quantification function is a named list of BigWig file paths is used as an input to link bigwig locations. This argument corresponds to the target argument in bwToGeneExp().

Take a look at the example for this argument at target argument in bwToGeneExp().

Other arguments

sampleIDs argument

As in the bwToGeneExp(), this argument adjusts for names of CHiP-Seq libraries. It by defult uses basenames of the paths for input .bigWig files as a unique sample ids/names. However, a vector of unique sample ids/names(.bw files) can be used as well. It is necessary that values in this vector are ordered as the .bigWig file paths are. Take a look at the example for this argument at sampleIDs argument in bwToGeneExp()

Normalization procedure

As in the bwToGeneExp(), normalization procedure in not performed by default but 2 normalizations can be optionally used: "quantile" and "ratio". If normalize argument is set to be "quantile" activity measures are quantile normalized as implemented in the normalize.quantiles() from preprocessCore package. If normalize argument is set to be"ratio" then "median ratio method" implemented as estimateSizeFactorsForMatrix from DESeq2 package is used to normalize activity measures.

Take a look at the example for this argument at normalize argument in bwToGeneExp().

summaryOperation

As in the bwToGeneExp(), this argument designates which summary operation should be used over the regions .Currently, only "mean" is available, but "median" or "sum" will be implemented in the future.

mc.cores

As in the bwToGeneExp(),an option for parallel processing using mclapply from parallel package.

regActivity() specific parameters:

isCovNA

This parameter is important when dealing with DNA methylation datasets, where uncovered bases in bigWig files do not mean zero methylation. When this is set to TRUE, uncovered bases are set to NA.

weightCol This parameter is important when genomic regions have scores other than their coverage values, such as percent methylation, conservation scores, GC content, etc. When used, a numeric column in the meta data is used as weights.

regActivityAroundTSS() arguments and notes

min arguments:

regActivity

A GRanges object with enhancer locations and quantified enhaner activity need to be used as an input,the result of the regActivity()

geneExpression

A GRanges object with TSSes and meta-columns corresponding to expression levels across different cell types or conditions; an output of the bwToGeneExp()), and regActivity

upstream,downstream

number of basepairs upstream/downstream from TSS to look for regulatory regions.

force

(DEFAULT: FALSE) This argument allows to test input enhancers and genes in 1-to-1 manner. Meaning,a gene expression of gene1 is modelled using enhancer1 enhancer activity, gene2 gene expression is modelled with enhancer2 enhancer activity, etc. Thus, input gene expression GRanges object (tss) needs to have equal length as enhancer activity GRanges object (regActivity). It overwrites upstream and downstream arguments since 1-to-1 relationship is tested.

regTSS_toy <- GRReg1_toy
regTSS_toy$bw1 <- rep(1,length(GRReg1_toy))
regTSS_toy$bw2 <- rep(2,length(GRReg1_toy))
regTSS_toy$bw3 <- rep(3,length(GRReg1_toy))
regReg_toy <- GRReg2_toy
regReg_toy$bw1 <- rep(3,length(regReg_toy))
regReg_toy$bw2 <- rep(4,length(regReg_toy))


print("Result of upstream=5,downstream=5")
regActivityAroundTSS(regReg_toy,regTSS_toy,upstream=5,downstream=5)[1]

mc.cores As in the bwToGeneExp(),an option for parallel processing using mclapply from parallel package.

associateReg2Gene() arguments and notes

min arguments

Input

a GRangesList that contains regulatory activity and gene expression results. It has to have specific columns, see above example

method

Modelling method of choice: pearson,spearman,dcor,elasticnet, or randomForest. By default it is set to be: "pearson". See Details for more.

associateReg2Gene() specific arguments:

scaleData

if TRUE (default) the the values for gene expression and regulatory activity will be scaled to have 0 mean and unit variance using base::scale function.

mc.cores An option for parallel processing using mclapply from parallel package.

B number of randomizations used to estimate P-values for coefficients returned by different methods. default 1000

asGInteractions

if TRUE (default) results are reported as GInteractions. Otherwise, a GRanges object with regulatory region coordinates and an additional column "reg" containing gene GRanges is reported:

###############################
#STEP 1.  Getting random and predefined .8 correlation

require(GenomicRanges)
require(doMC)
require(glmnet)
require(foreach)
require(stringr)
require(qvalue)

####################################
# create example

x <- c(2.000346,2.166255,0.7372374,0.9380581,2.423209, 
       2.599857,4.216959,2.589133,1.848172,3.039659)
y <- c(2.866875,2.817145,2.1434456,2.9039771,3.819091,5.009990,
       5.048476,2.884551,2.780067,4.053136)
corrM <- rbind(x,y)

# define Granges object
gr0 <- GRanges(seqnames=rep("chr1",2),IRanges(1:2,3:4))

GeneInfo <- as.data.frame(matrix(rep(c("gene","regulatory"),each=3),
                                 ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)

colnames(GeneInfo) <- c("featureType","name","name2")

mcols(gr0) <- DataFrame(cbind(GeneInfo,corrM))

print("associateReg2Gene(gr0,cores = 1,B=100)")
associateReg2Gene(gr0,cores = 1,B=100,asGInteractions=FALSE)

voteInteractions() arguments and notes

(min arguments):

interactions

A list of GInteractions objects outputed from associateReg2Gene() or metaInteractions(), as shown:

require(GenomicRanges)
require(InteractionSet)

gr2 <- gr <- GRanges(seqnames=rep("chr1",3),IRanges(1:3,3:5))
x <- 1:5
y <- 2:6
z <- 10:14
a <- rep(0,length(x))


GeneInfo <- as.data.frame(matrix(c(rep("gene",3),rep("regulatory",6)),
                                 ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)
colnames(GeneInfo) <- c("featureType","name","name2")

mcols(gr) <- DataFrame(cbind(GeneInfo,rbind(x,y,z)))
mcols(gr2) <- DataFrame(cbind(GeneInfo,rbind(x,y,a)))

# create associateReg2Gene output objects, GInteractions will all 
# output results

AssocObject <- reg2gene::associateReg2Gene(gr)
AssocObject2 <- reg2gene::associateReg2Gene(gr2)

# input for meta-analysis is list of such objects

interactions <- list(AssocObject,AssocObject2)
names(interactions) <- c("H3K4me1","H327ac")

interactions

Since vote.threshold was set to be 0.5, then both gene examples passed majority voting procedure.

voteInteractions() specific arguments

vote.threshold

if vote.threshold was set to be a bit higher: 0.51, then gene2 is eliminated because it was voted for in AssocObject2

require(GenomicRanges)
require(InteractionSet)

gr2 <- gr <- GRanges(seqnames=rep("chr1",3),IRanges(1:3,3:5))
x <- 1:5
y <- 2:6
z <- 10:14
a <- rep(0,length(x))


GeneInfo <- as.data.frame(matrix(c(rep("gene",3),rep("regulatory",6)),
                                 ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)
colnames(GeneInfo) <- c("featureType","name","name2")

mcols(gr) <- DataFrame(cbind(GeneInfo,rbind(x,y,z)))
mcols(gr2) <- DataFrame(cbind(GeneInfo,rbind(x,y,a)))

# create associateReg2Gene output objects, GInteractions will all 
# output results

AssocObject <- reg2gene::associateReg2Gene(gr)
AssocObject2 <- reg2gene::associateReg2Gene(gr2)

# input for meta-analysis is list of such objects

interactions <- list(AssocObject,AssocObject2)
names(interactions) <- c("H3K4me1","H327ac")

# Run voteInteractions
voteInteractions(interactions, 
                 cutoff.stat="pval",
                 cutoff.val=0.05,
                 vote.threshold=0.51)

Thus vote.threshold is a value between 0 and 1, and it designates the threshold needed for fraction of votes necessary to retain an association. As default 0.5: fraction of votes should be greater than or equal to 0.5 to retain association.

cutoff.stat

Which statistics to filter:"qval" or "pval",...

cutoff.val a numeric cutoff that will be used to filter elements in the input list ( cutoff.stat ). If the input object lacks this column, every association in the object will be treated as a valid association prediction.

metaInteractions() arguments and notes

min arguments

interactions

A list of GInteractions objects outputed from associateReg2Gene() as shown:

# creating datasets

require(GenomicRanges)
require(InteractionSet)

gr2 <- gr <- GRanges(seqnames=rep("chr1",3),IRanges(1:3,3:5))
x <- 1:5
y <- 2:6
z <- 10:14
a <- rep(0,length(x))


GeneInfo <- as.data.frame(matrix(c(rep("gene",3),rep("regulatory",6)),
                                 ncol = 3,byrow = TRUE),stringsAsFactors=FALSE)
colnames(GeneInfo) <- c("featureType","name","name2")

mcols(gr) <- DataFrame(cbind(GeneInfo,rbind(x,y,z)))
mcols(gr2) <- DataFrame(cbind(GeneInfo,rbind(x,y,a)))

# create associateReg2Gene output objects, GInteractions will all 
# output results

AssocObject <- reg2gene::associateReg2Gene(gr)
AssocObject2 <- reg2gene::associateReg2Gene(gr2)

# input for meta-analysis is list of such objects

interactions <- list(AssocObject,AssocObject2)
names(interactions) <- c("Roadmap","Blueprint")

# Run metaA
interactions

specific arguments

None

benchmarkInteractions() arguments and notes

An opposite benchmarking example (to the one reported above) is when anchor1 of the test set (interactions) overlaps anchor2 of the benchmark set. Then anchor2 of the test set necessarily needs to overlap anchor1 of the benchmark set:

interactions <- GInteractions(GRReg1_toy,GRReg1_toy$reg)[6]
benchInteractions <- GInteractions(GRReg2_toy,GRReg2_toy$reg)[6]

# removing confusing meta-data
mcols(interactions) <- NULL 
mcols(benchInteractions) <- NULL

print("test set:")
interactions

print("benchmark set:")
benchInteractions

print("after benchmarking results in:")
benchmarkInteractions(interactions,
                      benchInteractions,
                      binary=TRUE)

IMPORTAT NOTE!!!

If anchor1&anchor2 both overlap only one anchor of the anchor pair in the benchmark they are not benchmarked.

However, one need to be careful when benchmarking anchors that overlap within the test set (eg enhancer overlaps gene region), because they will be benchmarked with benchmarkInteractions().

By default Bench column in the output object of the benchmarkInteractions() reportes how many times interactions is observed in the benchmark dataset.

For example, GInteractions pair

chr1 [100, 101] --- chr1 [102, 103]

can be benchmarked 3 times using following benchmark dataset:

tmp <- GInteractions(GRReg2_toy,GRReg2_toy$reg)[10:12]
mcols(tmp) <- NULL
tmp

And results in:

interactions <- GInteractions(GRReg1_toy,GRReg1_toy$reg)[7]
benchInteractions <- GInteractions(GRReg2_toy,GRReg2_toy$reg)[10:12]

mcols(interactions) <- NULL

bench <- benchmarkInteractions(interactions,
                               benchInteractions,
                               binary=FALSE) 

bench

However, benchmarkInteractions() can report as well whether gene~enhancer pair matches some pair in the benchmark dataset or not.

Then set, argument

binary

to be TRUE.

interactions <- GInteractions(GRReg1_toy,GRReg1_toy$reg)[7]
benchInteractions <- GInteractions(GRReg2_toy,GRReg2_toy$reg)[10:12]

# removing confusing meta-data
mcols(interactions) <- NULL
mcols(benchInteractions) <- NULL

interactions
benchInteractions

benchmarkInteractions(interactions,
                      benchInteractions,
                      binary=TRUE)

minimum arguments

interactions a GInteractions object to be benchmarked using

benchInteractions a second GInteractions object used as benchmark dataset. Or a list of GInteractions objects, as follows:

interactions <- GInteractions(GRReg1_toy,GRReg1_toy$reg)
benchInteractions <- GInteractions(GRReg2_toy,GRReg2_toy$reg)


benchDataList <- list(benchInteractions,interactions)
names(benchDataList) <- c("benchData1","benchData2")


benchmarkInteractions(interactions,
                      benchInteractions=benchDataList,
                      ignore.strand=TRUE,
                      binary=FALSE,
                      nCores = 1)

specific arguments:

mc.cores

An option for parallel processing using mclapply from parallel package.

ignore.strand is an argument that should be passed to findOverlaps. When set to TRUE, the strand information is ignored in the overlap analysis.

binary As mentioned before this is FALSE or TRUE, and defines whether function reports how many times tested gene~enhancer pair is present in the benchmark dataset; or whether at all this gene~enhancer pair matches some pair in the benchmark dataset or not

preFilter (default = FALSE) If TRUE, additional column Filter is added to the input object (additionally to the Bench column that is reported by default) that store info whether tested regions have any potential to be benchmarked or not. Meaning, if all regulatory region-TSS pairs [anchor1 and anchor2 from reg2Gene] do not overlap with any benchmark anchor1 or anchor2 location they will be reported to be 0 (or no potential to be benchmarked at all), otherwise it is 1 (possible to be benchmarked). Thus it selects reg2Gene regions only when both regulatory region and TSS have overlapping regions somewhere in the benchmarking set; but not necessarily overlapping regions of the same benchmark pair. This info is important to a priori remove high number of true negatives in regulatoryReg-TSS pairs, before running confusionMatrix since TN can be much more abundant then TP.

forceByName (default = FALSE) If TRUE, it forces benchmark data to have an equal gene coordinates as an input (reg2gene) dataset when gene names overlap. IMPORTANT! Gene coordinates are necessarilly a second anchor of the input reg2gene and benchInteractions objects, and column with gene names needs to be called "name".

confusionMatrix() arguments and notes

min arguments

interactionsBench

GInteractions object with added benchmark benchmarkInteractions() and OPTIONAL Filter metadata (prior this analysis, it is advised to reduce the number of true negatives by including only reg2gene entries that could be potentially benchmarked by setting argument preFilter=TRUE whil running benchmarkInteractions()).

thresholdID

Character (def:NULL), name which indicates a column where statistics of the modelling procedure is stored: for example: "pval". If not changed to the existing column name, then function will return ERROR. This column is filtered such that everything below predefined threshold is considered to be statistically significant association (set to be equal to 0), whereas everything above that threshold is 0". Details below.

statistics

This function output integer vector or a list. If "ConfusionMatrix" is requested then the output is list with following elements:TP, FP, TN, FN, Specificity, Accuracy, PPV, NPV,F1.

Function reports only PPV if statistics = "PPV" the output is:

interactionsBench <- GInteractions(GRReg1_toy,GRReg1_toy$reg)

Bench <- interactionsBench$anchor1.Bench1Exp
Filter <- interactionsBench$anchor1.Filter1Exp
mcols(interactionsBench) <- NULL

interactionsBench$Pval <- seq(0, 1, length.out = length(GRReg1_toy))
interactionsBench$Bench <- Bench
interactionsBench$Filter <- Filter

confusionMatrix(interactionsBench,
thresholdID = "Pval",
thresholdValue = 0.05,
benchCol = "Bench",
prefilterCol = "Filter",
statistics = "PPV")

specific arguments

benchCol

a column name where result of benchmarking procedure is stored. Default: Bench. A vector of 0's and 1's is expected to be stored in

prefilterCol a column name where result of filtering procedure is stored (as part of the benchmarking). Default: Filter. A vector of 0's and 1's is expected to be stored in

thresholdValue

Since associatereg2Gene() reports all input gene-enhancer associations together with the corresponding test statistics, a threshold for this test statistics needs to be set as thresholdValue in confusionMatrix(). Every gene-enhancer pair with test statistics below thresholdValue will be considered to be statistically significant pair - POSITIVE. Otherwise, it is consider to be NEGATIVE. Take a look at an example above.

plotInteractions() arguments and notes

min arguments

interactions

A GInteractions object with anchor1 corresponding to regulatory region, and anchor2 to the TSS of a gene. Minimum meta-data is info about gene names store in column "name2". This object can be produced by modelling, or meta-analysis or voting functions from reg2gene package.

range

(default NULL) this object stores info about gene annotations. If default, function retrieves automatically gene annotations based on TxDb.Hsapiens.UCSC.hg19.knownGene package.

Otherwise, this argument corresponds to the range argument from GeneRegionTrack from Gviz R package, which handles many different data input types:
TxDb, GRanges, GRangesList,data.frame, character scala. In the case you want to write your own documentation, for more details about this argument take a look at documentation of Gviz package.

__selectGene (default NULL) a character vector of gene name symbols (eg "FTO","IRX3", etc.) which user wants to plot.

selectRegulatoryRegion

(default NULL) GRanges object or a character vector which stores info about the region of interest which user wants to plot (format: "chr16:53112601-53114200"). This region does not necessarilly need to be equal to the regulatory regions reported in the interactions input objects, whereas it only needs to overlap some regulatory regions

benchInteractions

(default NULL) A GInteractions object or a list of GInteractions objects storing info about interaction regions in the genome; for example, regions from the literature obtained using chromatin conformation capture related methods.

statistics

(default NULL) Column name where info about the heights of loops is stored. If NULL, then all interaction loops have an equal height Otherwise, loops of different height are plotted based on the info stored in the corresponding column, eg "pval" or "qval"

coloring

(default NULL) Information about colors used to plot enhancer promoter interactions. If it is NULL, then all interactions will be plotted red. Otherwise, an ID of the column from interactions object where info about color is stored. For example, interactions for different genes can be colored differently, or statistically significant/insignificant interactions can be colored differently. User pre-defines this column by itself.

cex.title Controls the size of text which describes each track

plotAllAnchors1 (default FALSE). If interactions argument is a list of GInteractions objects, whether to plot all anchor1 or not, eg whether to plot separate track(s) for enhancer regions defined in different tracks, or only in one track

interactionsNaming Track name for the interaction set

benchmarkNaming Track name for the benchmark interaction set

reg2gene() arguments and notes

min arguments

windows

GRanges object that contains the windows of interest. It could be enhancers, promoters, CpG islands, ChIP-Seq or DNase-Seq peaks...

geneAnnotations GRanges which contains info about genes of interest, or

interactions GInteractions object which stores regulatory regions - gene associations. Info about gene names should be stored as a meta-data.

If both objects are available, then input windows are annotated to the gene if they are located in the vicinity of the TSS of that genes, and this info is stored in the input GRanges object.

Otherwise, hierarchical association of input GRanges object with corresponding genes is performed.

IMPORTANT!

If GInteractions object is used as an input then, anchor1 needs to be regulatory region, whereas anchor2 is location of gene/TSS.

upstream

number of basepairs upstream from TSS to look for input windows. default 1000

downstream

number of basepairs downstream from TSS to look for input windows. default 1000

distance (default 1Mb). Maximal allowed distance between genes TSS and input peaks. Used in the 3rd step of the association procedure (genomic regions is associated to the closest gene if the distance between these two locations is smaller than prediefined distance threshold.)

identified (default TRUE). If TRUE, report genomic regions AND corresponding genes; If FALSE, report regions for which info about gene is missing.

annotateInteractions=FALSE and interactions are firstly annotated to the genes (unannotated interactions are removed from the dataset), and windows are subsequently annotated to genes using annotated interactions object.

sessionInfo

sessionInfo()

Short background of the problem

Enhancers

Enhancers represent distal regulatory regions in the genome that can be located up to 1Mb from the transcription start sites of genes, increase gene expression regardless of their position, orientation and distance to the promoter.

Enhancer-like chromatin marks

Different enhancer-like chromatin marks have been previously used to map these regulatory regions and assess their activity: level of histone modifications (especially H3K4me1 and H3K27ac), nucleosome depletion, open chromatin accessibility, DNA methylation, nucleotide conservation, etc. (Mathelier et al., 2015).
To improve reproducibility of mapping efforts, different chromatin marks were integrated in chromatin features (multiple modifications and more complex elements linked together, Stricker et al. 2017), combined with an unsupervised machine-learning approaches and higher number of cell types included in the mapping analysis (for example Ernst and Kellis 2012 used ChromHMM Core 15-state model and reported more than 900,000 potential enhancer regions across 127 epigenomes from the Roadmap Epigenomics Project (Kundaje et al. 2015).

Mapping potential gene targets of an enhancer regions

1) Mapping of regulatory regions and their targeted genes can be done computationally by correlating gene expression with levels of enhancer- associated chromatin features. Ernst et al. 2011 correlated gene expression with different histone modification marks, including enhancer associated marks H3K27ac and H3K4me1 (ENCODE Project Consortium 2012), Sheffield et al. 2013 correlated DNase I Hypersensitivity and gene expression, whereas Varley et al. 2013 was focused on DNA methylation. This package enables easy integration of all these datasets.

1) A long-range interactions can be mediated by chromatin looping - a mechanism by which enhancers and promoters are brought together in the 3D space of a nucleus, which, at least partially, enables precise regulation of gene expression (Gorkin et al. 2014, Rennie et al. 2017). Thus information about the 3D genome architecture, generated by chromosome conformation capture and related techniques, is can be used as a proxy of enhancer-mediated regulation of gene expression or in these case to benchmark results of reg2gene modelling approach (Dekker et al. 2002, Dosie et al. 2006, Simonis et al. 2006, Fullwood et al. 2009, Lieberman-Aiden et al. 2009).

Literature

Akalin, Altuna, et al. "Genomation: a toolkit to summarize, annotate and visualize genomic intervals." Bioinformatics 31.7 (2014): 1127-1129.

Bolstad, Benjamin Milo. "preprocessCore: A collection of pre-processing functions." R package version 1.0 (2013).

Dabney, Alan, John D. Storey, and G. R. Warnes. "qvalue: Q-value estimation for false discovery rate control." R package version 1.0 (2010).

Dekker, Job, et al. "Capturing chromosome conformation." science 295.5558 (2002): 1306-1311.

Dostie, Josée, et al. "Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. " Genome research 16.10 (2006): 1299-1309.

ENCODE Project Consortium. "An integrated encyclopedia of DNA elements in the human genome." Nature 489.7414 (2012): 57-74.

Ernst, Jason, et al. "Mapping and analysis of chromatin state dynamics in nine human cell types." Nature 473.7345 (2011): 43-49.

Ernst, Jason, and Manolis Kellis. "ChromHMM: automating chromatin-state discovery and characterization." Nature methods 9.3 (2012): 215-216.

Fullwood, Melissa J., et al. "An oestrogen-receptor-α-bound human chromatin interactome." Nature 462.7269 (2009): 58-64.

Gorkin, David U., Danny Leung, and Bing Ren. "The 3D genome in transcriptional regulation and pluripotency." Cell stem cell 14.6 (2014): 762-775.

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. "glmnet: Lasso and elastic-net regularized generalized linear models." R package version 1.4 (2009).

Kundaje, Anshul, et al. "Integrative analysis of 111 reference human epigenomes. " Nature 518.7539 (2015): 317-330.

Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." science 326.5950 (2009): 289-293.

Lonsdale, John, et al. "The genotype-tissue expression (GTEx) project." Nature genetics 45.6 (2013): 580-585.

Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology 15.12 (2014): 550.

Mathelier, Anthony, Wenqiang Shi, and Wyeth W. Wasserman. "Identification of altered cis-regulatory elements in human disease." Trends in Genetics 31.2 (2015): 67-76.

Sheffield, Nathan C., et al. "Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions." Genome research 23.5 (2013): 777-788.

Simonis, Marieke, et al. "Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip (4C)." Nature genetics 38.11 (2006): 1348-1354.

Stricker, Stefan H., Anna Köferle, and Stephan Beck. "From profiles to function in epigenomics." Nature Reviews Genetics (2016).

Varley, Katherine E., et al. "Dynamic DNA methylation across diverse human cell lines and tissues." Genome research 23.3 (2013): 555-567.

Visel, Axel, et al. "VISTA Enhancer Browser—a database of tissue-specific human enhancers." Nucleic acids research 35.suppl 1 (2007): D88-D92.

Wright, Marvin N., and Andreas Ziegler. "ranger: A fast implementation of random forests for high dimensional data in C++ and R." arXiv preprint arXiv:1508.04409 (2015).

BIMSBbioinfo/reg2gene documentation built on May 3, 2019, 6:42 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

BIMSBbioinfo/reg2gene
Predicting the target genes for regulatory elements

In BIMSBbioinfo/reg2gene: Predicting the target genes for regulatory elements

Introduction

Schematic representation of quantification and data integration:

Building models with reg2gene

How to quantify gene expression?

How to quantify enhancer activity?

How to prepare data prior modelling procedure?

How to run models of gene expression ~ enhancer activity?

How to benchmark reported enhancer-gene associations?

How to assess modelling performance?

How to visualize reported enhancer-gene associations?

Annotating genomic regions to genes

How to annotate regulatory regions to genes?

Data description

Argument description & notes for functions

bwToGeneExp() arguments and notes

regActivity() arguments and notes

regActivityAroundTSS() arguments and notes

associateReg2Gene() arguments and notes

voteInteractions() arguments and notes

metaInteractions() arguments and notes

benchmarkInteractions() arguments and notes

confusionMatrix() arguments and notes

plotInteractions() arguments and notes

reg2gene() arguments and notes

sessionInfo

Short background of the problem

Literature

R Package Documentation

Browse R Packages

We want your feedback!

BIMSBbioinfo/reg2gene Predicting the target genes for regulatory elements

In BIMSBbioinfo/reg2gene: Predicting the target genes for regulatory elements

Introduction

Schematic representation of quantification and data integration:

Building models with reg2gene

How to quantify gene expression?

How to quantify enhancer activity?

How to prepare data prior modelling procedure?

How to run models of gene expression ~ enhancer activity?

How to benchmark reported enhancer-gene associations?

How to assess modelling performance?

How to visualize reported enhancer-gene associations?

Annotating genomic regions to genes

How to annotate regulatory regions to genes?

Data description

Argument description & notes for functions

bwToGeneExp() arguments and notes

regActivity() arguments and notes

regActivityAroundTSS() arguments and notes

associateReg2Gene() arguments and notes

voteInteractions() arguments and notes

metaInteractions() arguments and notes

benchmarkInteractions() arguments and notes

confusionMatrix() arguments and notes

plotInteractions() arguments and notes

reg2gene() arguments and notes

sessionInfo

Short background of the problem

Literature

R Package Documentation

Browse R Packages

We want your feedback!

BIMSBbioinfo/reg2gene
Predicting the target genes for regulatory elements