There are 2 kinds of experimental designs that can be used for a metagene analysis. This document will introduce the first kind, which is the comparison of two groups of regions from the same ChIP-Seq sample. It also illustrates how metagene can statistically test differences between groups of profiles within an experiment.
The H3K27ac histone mark is normally enriched at active compared to poised
transcriptional start sites (TSS) (Roadmap Epigenomics Consortium et al.,2015).
metagene was used to robustly test for the difference in enrichment of
H3K27ac between these two groups of promoter regions in A549 cells. More
information about those data can be found here :
http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#exp_18state
In R, we must first load the packrat package, which enable data usage, and metagene package:
suppressMessages(library(packrat)) packrat::on("../") suppressMessages(library(metagene))
We will use the TSS as defined by the Roadmap Consortium
(Romanoski et al., 2015). For more details on how the data have been
downloaded, please see the R/get_regions.R
file.
Two groups of regions, active and bivalent TSS, will be used.
#data(TssA) # Active TSS #data(TssBiv) # Bivalent TSS load("../data/TssA.RData") # Active TSS load("../data/TssBiv.RData") # Bivalent TSS # We will make sure that all the regions have the same size to avoid having to # scale them during the metagene analysis TssA <- GenomicRanges::resize(TssA, 500, fix = "center") TssBiv <- GenomicRanges::resize(TssBiv, 500, fix = "center") regions <- GenomicRanges::GRangesList(TssA, TssBiv) names(regions) <- c("TssA", "TssBiv") # For memory and speed consideration, we will only use a subset of the regions regions <- lapply(regions, function(x) x[sample(seq_along(x), 1000)]) regions <- GenomicRanges::GRangesList(regions) regions
For the ChIP-Seq of the H3K27ac histone mark, we will use the ENCFF000AKF
,
ENCFF000AKI
, ENCFF000AHC
and ENCFF000AHD
files from ENCODE
(ENCODE Project Consortium, 2012).
The ENCFF000AKF.bam
and ENCFF000AKI.bam
files are two replicates from the same experiment and ENCFF000AHC.bam
and
ENCFF000AHD.bam
are the recommended control files.
bam_files <- c("../inst/extdata/ENCFF000AKF.bam", "../inst/extdata/ENCFF000AKI.bam", "../inst/extdata/ENCFF000AHC.bam", "../inst/extdata/ENCFF000AHD.bam") bam_files
The bam_files object should have 4 elements.
Multiple samples should be included in the same profile. Replicates need to be
combined while controls have to be used as background. The design has to be
constructed so that metagene
will process correctly each file.
design <- data.frame(samples = bam_files, H3K27ac = c(1,1,2,2)) design
The design
uses the value 1
for a dataset and 2
for a control.
The first step is to create a metagene
object using metagene$new()
function. The goal of this first metagene step is to extract the coverage
from all BAM files present in the bam_files
object and to normalize the
signal. Since this step can be computationally intensive, we do not want to do
it every time we want to experiment with a new design. Thus, it is a good idea
to extract the coverage from every BAM files in a single step, save the
results in RData
format and then explore different designs. The design
object will be used in next steps.
mg <- metagene$new(regions = regions, bam_files = bam_files, cores = 2) # We could save this object to avoid re-doing this computationally intensive # step: #save(mg, file = "mg.RData")
The plot()
function allows us to create a metagene graph. By default, the
graph will contain a profile for each combination of group of regions and BAM
file. In this example, we have 2 groups of regions (TssA and TssBiv) and 4 BAM
files (2 samples and 2 controls).
Since we only want to produce a single profile for the TssA and TssBiv groups
of regions, we have to use the design
produced previously. Other than the
samples column that contains the names of the BAM files, there is only one
column in the design
object. This means that there will only be 1 profile
plotted for each group of regions (2 profiles in this example).
The metagene
object can be saved and reused.
df <- mg$plot(design = design, bin_size = 10)
Figure S1. H3K27ac TssA vers H3K27ac TssBiv
metagene has statistically detected differences between the TssA and TssBiv groups using Friedman rank sum test. Evenmore, metagene has robustly computed confidence intervals (CI) of the estimators (mean) for each bin using a bootstrap method.
If we only wanted to plot the profile for one group of region, we could have
specified it using the regions_group
parameter. This parameter is a vector of
the names of the regions we want to keep for the current metagene
plot.
df <- mg$plot(design = design, regions_group = "TssA")
Figure S2. H3K27ac TssA
The list
returned by the plot()
function can be saved and reused.
# We could save the list returned by the plot() function and # use it to create a new graph # save(df, file = "df.RData")
Note: When a GRangesList
object is used for the group of regions in
the new()
function, the exact same name as to be used in the plot()
function for the regions_group
parameter. If we specified a vector of
filename, metagene will automatically name the region using the filename
without the directory path and without the extension.
ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489: 57–74. doi: 10.1038/nature11247. pmid:22955616
Roadmap Epigenomics Consortium et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature, 518: 317–330.
Romanoski, C. E. et al. (2015). Epigenomics: Roadmap for regulation. Nature, 518: 314–316. doi: 10.1038/518314a. pmid:25693562
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.