Correlation of epigenetic signals and genes in TADs

knitr::opts_chunk$set(echo = TRUE)

Introduction

The InTAD analysis is focused on the processing of generated object that combines all input datasets. Required input data is the following:

Further explained example of performing the analysis procedure is based on H3K27ac data reflecting activity of enhancers in medulloblastoma brain tumour descrbied in the manuscript from C.Y.Lin, S.Erkek et al., Nature, 2016.

This dataset includes normalized enhancer signals obtained from H3K27ac ChIP-seq data and RNA-seq gene expression RPKM counts from 25 medulloblastoma samples. The test subset is extracted from a selected region inside chromosome 15. Additionally, the coordinates for enhancers and genes along with specific sample annotation are provided.

The analysis starts from preparing and loading the data. Here is the overview of integrated input test data, that can serve as a useful example describing supported input formats:

library(InTAD)
# normalized enhancer signals table
enhSel[1:3,1:3]
# enhancer signal genomic coordinates 
as.data.frame(enhSelGR[1:3])
# gene expression normalized counts
rpkmCountsSel[1:3,1:3]
# gene coordiantes
as.data.frame(txsSel[1:3])
# additional sample info data.frame
head(mbAnnData)

Importantly, there are specific requriements for the input datasets. The names of samples should match in signals and gene expression datasets.

summary(colnames(rpkmCountsSel) == colnames(enhSel))

Next, the genomic regions should be provided for each signal as well as for each gene.

# compare number of signal regions and in the input table
length(enhSelGR) == nrow(enhSel)

The genomic regions reflecting the gene coordinates must include "gene_id" and "gene_name" marks. These are typical GTF format markers. One more mark "gene_type" is also useful to perform filtering of gene expression matrix.

All the requirements are checked during the generation of the InTADSig object. Main part of this object is r Biocpkg("MultiAssayExperiment") subset that combines signals and gene expression. Specific annotation information about samples can be also included for further control and visualization. In provided example for medulloblastoma samples annotation contains various aspects such as tumour subgroup, age, gender, etc.

inTadSig <- newSigInTAD(enhSel, enhSelGR, rpkmCountsSel, txsSel,mbAnnData)

The created object contains MultiAssayExperiment that includes both signals and gene expression data.

inTadSig

During the main object generation there are also available special options to activate parallel computing based on usage of R multi-thread librares and log2 adjustment for gene expression. The generated data subsets can be accessed using specific call functions on the object i.e. signals or exprs.

Notably, the main object can be also loaded from the text files representing the input data using function loadSigInTAD. Refer to the documetation of this function for more details.

Main data analysis using TADs

The usage of input gene expression counts matrix assumes filtering of non- or low expressed genes. However if these counts were not filtered before starting the InTAD analysis it's possible to adjust gene expression limits using function filterGeneExpr. This function provides parameters to control minimum gene expression and type. There is additionally a special option to compute gene expression distribution based on usage of r CRANpkg("mclust") package in order to find suitable minimum gene expression cut limit. Here's example how to use this procedure:

# filter gene expression
inTadSig <- filterGeneExpr(inTadSig, checkExprDistr = TRUE)

The analysis starts from the combination of signals and genes inside the TADs. Since the TADs are known to be stable across various cell types, it's possible to use already known TADs obtained from IMR90 cells using HiC technology (Dixon et al 2012). The human IMR90 TADs regions object is integrated into the package.

# IMR90 hg19 TADs
head(tadGR)

However, since the variance is actually observed between TAD calling methods (i.e. described in detailed review by Rola Dali and Mathieu Blanchette, NAR 2017 ), novel obtained TADs can be also applied for the analysis. The requried format: GRanges object.

Composition of genes and signals in TADs is performed using function combineInTAD that has several options. By default, it marks the signal belonging to the TAD by largest overlap and also takes into account genes that are not overlaping the TADs by connecting them to the closest TAD. This can be sensetive strategy since some genomic regions can be missed due to the limits of input HiC data and variance of existing TAD calling methods.

# combine signals and genes in TADs
inTadSig <- combineInTAD(inTadSig, tadGR)

Final step is the correlation analysis. Various options are avialble for this function i.e. correlation method, computation of q-value to control the evidence strength and visualization of connection proportions. This last option allows to show differences in gene and signal regulations.

par(mfrow=c(1,2)) # option to combine plots in the graph
# perform correlation anlaysis
corData <- findCorrelation(inTadSig,plot.proportions = TRUE)

The result data.frame has a special format. It includes each signal, TAD, gene and correlation information.

head(corData,5)

Further filtering of this result data can be performed by adjusting p-value and correlation effect limits (i.e. p-val < 0.01, positive correlation only).

Integration of chromatin loops

Another clear approach to find contacts between genes and epigenetic signals is to use direct chromatin connections, so called loops. The loops are typically derived from HiC data and there are well-known tools that allow to perform this (e.g. FitHiC or HiCCUPS).

An example of chromatin loops visualisation from IGV showing connection of medulloblastoma tumor specific enhancers to genes. The loops are derived from IMR90 HiC data.

InTAD starting from version 1.9.1 also allows to use HiC loops for the analysis. Main functions to perform this task are combineWithLoops and findCorFromLoops.

To demonstrate this approach InTAD includes an example subset of loops derived from IMR90 cells. This loops data.frame has a specific format where the first 6 columns represent genomic regions for both loop anchors: (chr1,start1,end1,chr2,start2,end2):

loopsDfSel[1:4,1:6]

The loaded loops are applied to find connections between genes and signals using function combineWithLoops. By default 6-column loops format is expected, but the function also supports 4-column format where for the loop anchors only middle positions are provided (as in FitHiC output): (chr1,middlePos1,chr2,middlePos2) .

Howerer in this case loop fragment length is also required and using the variable fragmentLength allows to activate this format. In addition other parameters (e.g. transcription start site width, extension of the loops) can be controlled to increase the sensitivity.

In result the function reports how many connections are detected to be supported by loops and saves them within the returned InTAD object:

inTadSig <- combineWithLoops(inTadSig, loopsDfSel)

In this particular example only 1 connection between gene and enhancer is found. To find if there are correlations between detected connected signal-gene pairs the function fincCorFromLoops is applied. It has a list of options similar to corresponding function findCorrelation for usage of TADs (e.g. correlation method, adjusted p-value):

loopEag <- findCorFromLoops(inTadSig,method = "spearman")

Final result also has format similar to representation of correlations between genes and enhancers within TADs. The only difference is that the loops supporting found connection are included:

loopEag

In general, the focus on loops allows to increase the specificity of the detected connections between signals and genes in order to find possible perspective targets for further investigation. However, ideally it should be applied on HiC data derived from the same research target materials (e.g. same tumor type), while TADs from other sources could be used due to their stability.

Visualization

The package provides post-analysis visualization function: the specific signal and gene can be selected for correlation plot generation. Here's example of verified medulllobastoma Group3-specifc enhancer assoicated gene GABRA5 lying in the same TAD as the enhancer, but not close to the gene:

# example enhancer in correlation with GABRA5
cID <- "chr15:26372163-26398073" 
selCorData <- corData[corData$peakid == cID, ]
selCorData[ selCorData$name == "GABRA5", ] 

For the plot generation it is required to provide the signal id and gene name:

plotCorrelation(inTadSig, cID, "GABRA5",
                xLabel = "RPKM gene expr log2",
                yLabel = "H3K27ac enrichment log2", 
                colByPhenotype = "Subgroup")

Note that in the visualization it's also possible to mark the colours representing the samples using option colByPhenotype based on the sample annotation information included in the generation of the main object. In the provided example medulloblastma tumour subgroups are marked.

Specific genomic region of interest can be also visualised to observe the variance and impact of TADs using special function that works on result data.frame obtained from function findCorrelation. The resulting plot provides the location of signals in X-axis and genes in Y-axis. Each point reflects the correlation stength based on p-value: -log10(P-val). This visualization strategy was introduced in the study by S. Waszak et al, Cell, 2015 focused on investigation of chromatin architecture in human cells.

By default only detected TADs with signals inside are visualized, but it is also possible to include all avaialble TAD regions using special option. Here's the example plot covering the whole chromosome 15 region used in the test dataset:

plotCorAcrossRef(inTadSig,corData,
                 targetRegion = GRanges("chr15:25000000-28000000"), 
                 tads = tadGR)

One more option of this function allows to activaite representation of postive correlation values from 0 to 1 instead of strength.

plotCorAcrossRef(inTadSig,corData,
                 targetRegion = GRanges("chr15:25000000-28000000"), 
                 showCorVals = TRUE, tads = tadGR)

It's also possible to focus on the connections by ignoring the signal/gene locations and focusing only on correlation values by adjusting for symmetery. This is typical approach used for HiC contact data visualization in such tools as for example JuiceBox. This can be activate by using the corresponding option:

plotCorAcrossRef(inTadSig,corData,
                 targetRegion = GRanges("chr15:25000000-28000000"), 
                 showCorVals = TRUE, symmetric = TRUE, tads = tadGR)

These visualization strategies allow to investigate the impact of TADs.

Additional documentation is available for each function via standard R help.

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References

Dali, R. and Blanchette, M., 2017. A critical assessment of topologically associating domain prediction tools. Nucleic acids research, 45(6), pp.2994-3005.

Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S. and Ren, B., 2012. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485(7398), p.376.

Lin, C.Y., Erkek, S., Tong, Y., Yin, L., Federation, A.J., Zapatka, M., Haldipur, P., Kawauchi, D., Risch, T., Warnatz, H.J. and Worst, B.C., 2016. Active medulloblastoma enhancers reveal subgroup-specific cellular origins. Nature, 530(7588), p.57.

Waszak, S.M., Delaneau, O., Gschwind, A.R., Kilpinen, H., Raghav, S.K., Witwicki, R.M., Orioli, A., Wiederkehr, M., Panousis, N.I., Yurovsky, A. and Romano-Palumbo, L., 2015. Population variation and genetic control of modular chromatin architecture in humans. Cell, 162(5)



Try the InTAD package in your browser

Any scripts or data that you put into this service are public.

InTAD documentation built on Nov. 8, 2020, 7:47 p.m.