library(BiocParallel) register(bpstart(SerialParam()))
This vignette is organized hierarchically in terms of level of detail:
In the Quickstart section, we show a basic analysis with chromswitch using a small dataset included in the package.
In the next section, we give a brief overview of the method for detecting chromatin state switches, referencing specific functions in chromswitch which implement the different steps of the method.
In the Walkthrough, we demonstrate a basic analysis chromswitch with a discussion of data import, input, the most important parameters available to the user, and interpretation of chromswitch output. In this section we use the wrapper functions which provide one-line commands to call chromatin state switches based on a single mark or type of input data.
In last section, Step-by-step, we demonstrate how the steps of the method can be run individually to gain finer control over the analysis and show the intermediate results from chromswitch available to the user.
Load r Biocpkg("chromswitch")
:
library(chromswitch)
We will use the package r Biocpkg("rtracklayer")
to import data from BED files:
library(rtracklayer)
We'll start with a toy dataset containing MACS2 narrow peak calls for H3K4me3 ChIP-seq in 3 brain tissues and 3 other adult tissues from the Roadmap Epigenomics Project, restricted to a short region on chromosome 19. In the code below, we import the input to chromswitch and run chromswitch on our dataset. This involves constructing a feature matrix, and we will describe two ways of doing so. We can then call chromatin state switches by thresholding on the value of the Consensus score, which scores the similarity between the cluster assignments and the biological condition labels.
Chromswitch essentially requires 3 inputs:
GRanges
object storing one or more regions of
interestGRanges
objects, each of which stores
peaks or features for one sample, with elements named according to the sample IDs as
specified in the metadataThe latter two inputs define the dataset on which the query is performed.
Each of these inputs can be imported from TSV or BED files. Learn more about GRanges
objects by checking out r Biocpkg("GenomicRanges")
.
Here we use the import()
function from r Biocpkg("rtracklayer")
to import
query regions stored in a BED file.
# Path to BED file in chromswitch installation query_path <- system.file("extdata/query.bed", package = "chromswitch") # Read in BED file, creating a GRanges object query <- rtracklayer::import(con = query_path, format = "BED") query
# Path to TSV in chromswitch meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch") # Read in the table from a 2-column TSV file metadata <- read.delim(meta_path, sep = "\t", header = TRUE) metadata
Here we assume that peaks are in the ENCODE narrowPeak format.
Note: If the metadata file has an additional column containing the path for each sample, then that column can be passed to this function,
e.g. readNarrowPeak(paths = metadata$path, metadata = metadata)
.
# Paths to the BED files containing peak calls for each sample peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"), package = "chromswitch") # Import BED files containing MACS2 narrow peak calls using rtracklayer peaks <- readNarrowPeak(paths = peak_paths, # Paths to files, metadata = metadata) # Metadata dataframe
Run chromswitch using the summary strategy:
callSummary(query = query, # Input 1: Query regions metadata = metadata, # Input 2: Metadata dataframe peaks = peaks, # Input 3: Peaks mark = "H3K4me3") # Arbitrary string describing the data type
Chromswitch outputs a measure of cluster quality (Average_Silhouette), the score predicting a chromatin state switch (Consensus), and the cluster assignment for each sample.
Looking at the Consensus score in each case, which represents the similarity between the cluster assignments and the biological groups of the samples, here we see a good agreement in the first region, indicating a switch, and poor agreement in the second, indicating the absence of a switch. This score takes on values between -1 and 1, where 1 represents a perfect agreement between cluster assignments and the biological conditions of the sample
Run chromswitch using the binary strategy:
callBinary(query = query, # Input 1: Query regions metadata = metadata, # Input 2: Metadata dataframe peaks = peaks) # Input 3: Peaks
The output has the same format for both strategies. In this case, both strategies predict a switch for the first region, and a non-switch for the second region.
Both of these wrapper functions have additional parameters which allow for greater sensitivity and finer control in the analysis. These are explored in the rest of the vignette.
Our method for detecting chromatin state switches involves three steps. These are illustrated in the figure below, which is a schematic for the analysis performed on one query region, based on the reference metadata and peaks or features.
As input, chromswitch requires epigenetic features represented by their genomic coordinates and optionally, some associated statistics. Possible examples include ChIP-seq or DNase-seq peaks, or previously-learned chromatin state segmentations such as from ChromHMM. Here we'll refer to peaks for simplicity, but the analysis is the same for other types of epigenetic features given as intervals.
In the pre-processing phase, the user can set thresholds on any statistics
associated with peaks and filter out peaks below these thresholds
(filterPeaks()
). These statistics can then be normalized genome-wide for each
sample (normalizePeaks()
), which is strongly recommended. More detailed discussion of the normalization process
can be found in the documentation of that function.
Both these steps are optional. We then retrieve all the peaks in each sample
which overlap the query region (retrievePeaks()
).
Next, chromswitch transforms the peaks in the query region into a
sample-by-feature matrix using one of two strategies. In the summary strategy
(summarizePeaks()
), we compute a set of summary statistics from the peaks
in each sample in the query region. These can include the mean, median, and max
of the statistics associated with the peaks in the input, as well as the
fraction of the region overlapped by peak and the number of peaks. Genome-wide normalization of the data is therefore extremely
important if choosing this strategy.
In the binary strategy (binarizePeaks()
) we
construct a binary matrix where each feature corresponds to a unique peak in the
region, and the matrix holds the binary presence or absence calls of each
unique peak in each sample. We obtain the unique peaks by collapsing the union
of all peaks in the region observed in all samples using a parameter p
which specifies how much reciprocal overlap is required between two peaks to
call them the same. Since regions corresponding to the same biological event
can occasionally result in separate peaks during the process of interpretation
of raw signal, peak calling, etc., we also introduce an option to combine
peaks which are separated by less than gap
base pairs (reducePeaks()
).
Finally, chromswitch clusters samples hierarchically
and then scores the similarity between the inferred cluster assignments and the
known biological condition labels of the samples (cluster()
).
Schematic of the method outlined above.
The package ships with a small dataset that we will analyze to detect brain-
specific chromatin state switches. The dataset contains MACS2 narrow
peak calls for H3K4me3 in a short section of chromosome 19 for six samples,
3 adult brain tissues, and 3 other adult tissues. The peaks are available
as BED files, stored in the extdata folder of chromswitch, as well as in the
object H3K4me3
, which is a list of length 6. Each element of the list stores the peak calls as GRanges
objects. GRanges
objects are sets of genomic ranges,
and here, one range describes one peak. The data packaged with chromswitch is described in the manual, and the documentation can be accessed by running ??chromswitch::H3K4me3
in the console.
Genome browser screenshot of H3K4me3 dataset included in the chromswitch package. Red boxes indicate the genes studied in the demo analysis in this section.
Chromswitch essentially requires 3 inputs:
The specification for the inputs and examples of how to import them are described below.
Chromswitch expects query regions in the form of a GRanges
object storing one or more regions of interest. GRanges objects are containers for genomic regions, implemented
in r Biocpkg("GenomicRanges")
. An introduction to these objects can be obtained by running ??GenomicRanges::GRanges_and_GRangesList_slides
in the console.
We will apply chromswitch to 5kbp windows surrounding chromatin state
switches in three genes on chromosome 19. Here, we will read in the query regions
from a BED file using the import()
function from r Biocpkg("rtracklayer")
.
# Path to BED file in chromswitch installation query_path <- system.file("extdata/query.bed", package = "chromswitch") # Read in BED file, creating a GRanges object query <- rtracklayer::import(con = query_path, format = "BED") query
If your query regions are stored in another tabular format, this table can be read
in using read.delim()
and passed to the makeGRangesFromDataFrame()
function from r Biocpkg("GenomicRanges")
, which converts query regions stored in a dataframe with at least 3 columns, chr, start, and end, into a GRanges
object (remember the keep.extra.columns = TRUE
argument to preserve any
additional data associated with the regions, such as gene symbols). Any metadata
columns in the GRanges
object passed to the query
argument will be included
in the chromswitch output (but not used for the analysis).
Chromswitch accepts metadata in the form of a dataframe with at least two columns: Sample which stores sample IDs (these can be any strings), and Condition, which stores the biological condition labels of the samples (these must be strings, with only two possible values in the column). Additional columns are not used.
In the code below, we read in the metadata from a TSV file.
The resulting dataframe can be passed to the metadata
argument of any chromswitch
functions that require it.
# Path to TSV in chromswitch meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch") # Read in the table from a 2-column TSV file metadata <- read.delim(meta_path, sep = "\t", header = TRUE) metadata
Chromswitch expects epigenomic features in the form of a named list of GRanges
objects,
each of which stores
peaks or features for one sample, with elements named according to the sample IDs as
specified in the metadata. The peaks should be in the same order (with respect
to samples) as in the metadata.
The nature of these features are flexible: for example, they can include peak calls
from a ChIP-seq or DNase-seq experiment, or assignments of a certain state obtained
from a previously learned chromatin state segmentation such as from ChromHMM.
These features may be attached to certain metrics quantifying enrichment,
significance, or probabilities, and the format of the data
will control how we import it. Here we focus on narrow peaks for H3K4me3 ChIP-seq, but demonstrate three ways of importing data into
GRanges
objects, each suitable for different formats.
Option 1: To import BED files containing peak MACS2 narow peak calls, we can use a helper function implemented in chromswitch, which processes peaks which follow exactly the ENCODE narrowPeak format.
# Paths to the BED files containing peak calls for each sample peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"), package = "chromswitch") # Import BED files containing MACS2 narrow peak calls using # a helper function from chromswitch peaks <- readNarrowPeak(paths = peak_paths, # Paths to files, metadata = metadata) # Ensure the list is named by sample names(peaks) <- metadata$Sample
Option 2: To read in features in other formats where the first three columns are chr,
start, and end, use r Biocpkg("rtracklayer")
and specify the identity and
type of the other columns. For example, we can read in the narrow peak calls
manually.
The same process can be applied to BED files containing epigenomic features
other than peaks (for example, chromatin state segmentations); the extraCols
argument to rtracklayer::import
should be modified to fit the data. More information about importing BED files can be obtained by
running ??rtracklayer::BEDFile
in the console to access the rtracklayer
documentation.
extra_cols <- c("signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric", "peak" = "numeric") # Obtain a list of GRanges objects containing peak calls peaks <- lapply(peak_paths, rtracklayer::import, format = "bed", extraCols = extra_cols) # Ensure the list is named by sample names(peaks) <- metadata$Sample
Option 3: Alternatively, if your epigenomic features are stored in another tabular format,
read in the files using read.delim()
and convert them to GRanges
objects
afterwards. We demonstrate on the same narrow peak calls:
# Read in all files into dataframes df <- lapply(peak_paths, read.delim, sep = "\t", header = FALSE, col.names = c("chr", "start", "end", "name", "score", "strand", "signalValue", "pValue", "qValue", "peak")) # Convert the dataframes into GRanges objects, retaining # additional columns besides chr, start, end peaks <- lapply(df, makeGRangesFromDataFrame, keep.extra.columns = TRUE) # Ensure the list is named by sample names(peaks) <- metadata$Sample
All three methods described above produce an identical peaks
object.
We will first run a basic analysis using the summary strategy for constructing feature matrices, using the default features under the summary matrix construction strategy: the number of peaks in the region, and the fraction of the region overlapped by peaks. Note that whne the number of peaks is large ($n >> 1$), the output, particularly the heatmap, may be difficult to interpret. All the computations described in the method are wrapped in one command; in later sections we'll explore running each step of the method (preprocessing, feature matrix construction, clustering) individually.
Note that the column names passed to arguments in the wrappers (e.g.
normalize_columns
, summarize_columns
, etc.) must match exactly the column
names in the BED files.
out <- callSummary(query = query, # Input 1: Query regions metadata = metadata, # Input 2: Metadata dataframe peaks = peaks, # Input 3: Peaks mark = "H3K4me3") # Arbitrary string describing the data type out
We can threshold on the consensus score to subset the query regions to those containing putative chromatin state switches:
out[out$Consensus >= 0.75, ]
Now, we'll construct the feature matrix by summarizing on the qValue
and
signalValue
statistics. This means that for each sample, the features used to
the cluster samples in the
region will be the mean, median, and maximum of these two statistics across
peaks. We will apply genome-wide normalization to the same columns we will use
in the feature matrices. Normalization is an optional step, but
strongly recommended.
Let's explore some more options for detecting chromatin state switches
with chromswitch. The options are briefly described in the comments,
but you can obtain additional explanation of arguments and explore others
not covered here by running ??chromswitch::callSummary
. Note that wherever
column names are passed to chromswitch functions, these must match the columns
in the peaks/features data exactly (this is case sensitive).
out2 <- callSummary( # Standard arguments of the function query = query, metadata = metadata, peaks = peaks, # Arbitrary string describing data type mark = "H3K4me3", # For quality control, filter peaks based on associated stats # prior to constructing feature matrices filter = TRUE, # Provide column names and thresholds to use in the same order filter_columns = c("qValue", "signalValue"), filter_thresholds = c(10, 4), # Normalization options normalize = TRUE, # Strongly recommended # By default, set to equal summarize_columns, below normalize_columns = c("qValue", "signalValue"), # Columns to use for for feature matrix construction summarize_columns = c("qValue", "signalValue"), # In addition to summarizing peak statistics, # we can also optionally compute the # fraction of the region overlapped by peaks # and the number of peaks fraction = TRUE, n = FALSE, # TRUE by default, return the optimal number # of clusters, otherwise require k = 2 optimal_clusters = TRUE, # Set this to TRUE to save a PDF of the heatmap # for each region to the current working directory heatmap = FALSE, # Chromswitch uses BiocParallel as a backend for # parallel computations. Analysis is parallelized at the # level of query regions. BPPARAM = BiocParallel::SerialParam()) out2
The summary approach can be applied to epigenomic data where there
are no statistics associated with the features (peaks, states, etc.). In this
case, set summarize_columns = NULL
, filter = FALSE
, normalize = FALSE
,
and ensure that either n
or fraction
(or both) are set to TRUE
.
The binary strategy requires approximately the same basic input as the summary strategy. It also uses two tuning parameters:
gap
which is the distance between peaks in the same sample below which
two peaks should be merged. This preprocessing step is optional, and is
controlled by the option reduce
p
which is the fraction of reciprocal overlap required to call two
peaks the same. This rule is used to obtain a set of unique peaks observed
across samples in the query region, and to assign binary presence or absence
of each peak in each sample to construct the feature matrix.We use default values of gap = 300
and p = 0.4
, but the method is robust
to changes in these parameters within reasonable ranges.
The other option unique to this strategy is n_features
. The number of features
in the matrix used for clustering corresponds to the number of unique peaks
that chromswitch identifies for these samples in the query region, and
this option controls whether to include an additional column recording
the number of features in the output.
Additional parameters can be explored by running ??chromswitch::callBinary
in the console.
out3 <- callBinary( # Standard arguments of the function query = query, metadata = metadata, peaks = peaks, # Logical, controls whether to # reduce gaps between peaks to eliminate noise reduce = TRUE, # Peaks in the same sample which are within this many bp # of each other will be merged gap = 300, # The fraction of reciprocal overlap required to define # two peaks as non unique, used to construct a binary ft matrix p = 0.4, # Include in output the number of features obtained in # each query region n_features = TRUE) out3
Again, threshold the output to obtain putative switches:
out3[out3$Consensus >= 0.75, ]
The basic output of chromswitch is a tidy dataframe which includes:
optimal_clusters = FALSE
,
otherwise, the optimal set of clusters is obtained by selecting the clusters
with the highest average Silhouette width, displayed in the next column)If chromswitch performs hierarchical clustering on a feature matrix with more
than one feature, both wrapper functions, callSummary()
and callBinary()
as well as cluster()
can optionally
produce a heatmap with the resulting dendrogram as a PDF. Whether this
heatmap is produced or not is passed as a logical value to the argument heatmap
.
The title of the heatmap and prefix of the file name can be passed as a string
to title
, while the path to the output directory can be passed to outdir
.
Chromswitch is implemented in a relatively modular way, with functions for each
step of the method. In this section, we repeat the analysis we performed in
the previous sections, except that instead of using the wrapper functions,
callBinary()
and callSummary()
, we perform each step individually, and
allocate some more discussion to options and parameters at each step.
This section leverages the modularity of chromswitch, so the pipe operator %>%
from r CRANpkg("magrittr")
is helpful here.
library(magrittr)
We will also inspect some of the intermediate objects returned by chromswitch
functions, so we load the data manipulation package, r CRANpkg("dplyr")
.
library(dplyr)
After preparing the metadata dataframe (meta
) and importing data into GenomicRanges
objects (H3K4me3
), we can pre-process the data. First, since the peak calls
from MACS2 are associated with a fold change of enrichment of ChIP-seq
signal, a q value for enrichment, etc., we'll set
some thresholds on these statistics and filter out peaks which do not meet them.
The user can pass the names of any numeric columns in the data to the columns
argument, ensuring that a numeric threshold for each is passed to thresholds
,
in the same order. If too few threshold values provided, they will not be
recycled; chromswitch will return an error.
# Number of peaks in each sample prior to filtering lapply(H3K4me3, length) %>% as.data.frame() H3K4me3_filt <- filterPeaks(peaks = H3K4me3, columns = c("qValue", "signalValue"), thresholds = c(10, 4)) # Number of peaks in each sample after filtering lapply(H3K4me3_filt, length) %>% as.data.frame()
There are some additional pre-processing steps which are specific to the approach used to construct feature matrices from the data, and we explain these below.
In the summary approach, the features for each sample are a set of summary
statistics which represent the peaks observed in that sample in the query region.
It's important, therefore, that the features have comparable ranges. We
normalize each statistic genome-wide for each sample. The normalization process
essentially involves rescaling the central part of the data to the range [0, 1]
and bounding the lower and upper outliers to 0 and 1 respectively. The amount
of outliers in each tail to bound can be specified by the user in the tail
parameter, which expects a fraction in [0, 1] (for example, tail = 0.005
results
in bounding the upper and lower 0.5% of the data, the default).
This normalization is optional, but is strongly recommended. The effect of not normalizing is that the hierarchical clustering algorithm will be influenced by small changes between samples, which may lead to false positive chromatin state switch calls.
# Summary of the two statistics we will use downstream in raw data in one sample H3K4me3_filt %>% lapply(as.data.frame) %>% lapply(select, signalValue, qValue) %>% lapply(summary) %>% `[`(1) H3K4me3_norm <- normalizePeaks(H3K4me3_filt, columns = c("qValue", "signalValue"), tail = 0.005) # Summary after normalization H3K4me3_norm %>% lapply(as.data.frame) %>% lapply(select, signalValue, qValue) %>% lapply(summary) %>% `[`(1)
Next, we'll retrieve the peaks in the query region. Here we'll consider the
5 kbp window around the transcription start site of TTYH1, which is a known
brain-specific gene. This returns an object of the LocalPeaks
class.
A LocalPeaks
object is a container for the peaks for a set
of samples in a specific genomic region of interest, as well as the genomic region
itself, and the sample IDs. These components are needed to convert sets of peaks
into rectangular feature-by-sample matrices which we can then use for downstream
analysis - and in particular, as input to a clustering algorithm in order to
call a chromatin state switch.
# TTYH1 ttyh1 <- query[2] ttyh1 ttyh1_pk <- retrievePeaks(peaks = H3K4me3_norm, metadata = metadata, region = ttyh1) ttyh1_pk
We can now construct a sample-by-feature matrix from the filtered and normalized
data in the query region.
To do so, we need to select some summary statistics to be the features
in the matrix. If there are any values associated with the peaks or data,
specifying these in the cols
parameter to summarizePeaks()
results in taking the mean, median, and maximum of each statistic as a separate feature. There are also two
measures which can be calculated from genomic ranges alone: the number of peaks
overlapping the query region (argument n
), and the fraction of the region
overlapped by peaks (argument fraction
). These parameters take logical
values specifying whether or not they should be included.
Here, we take the mean, median, and max of the q value and fold change of peaks in the query region, as well as the fraction of the region overlapped by peaks. When selecting which peak statistics to use in feature construction, it's important to consider whether statistics are redundant, for example, here we avoid using both the p and q values.
summary_mat <- summarizePeaks(localpeaks = ttyh1_pk, mark = "H3K4me3", cols = c("qValue", "signalValue"), fraction = TRUE, n = FALSE) # The sample-by-feature matrix summary_mat
Finally, we can cluster over samples using this matrix, and call a chromatin state switch by assessing the agreement between the inferred cluster assignments and the known biological condition labels of the samples. Since hierarchical clustering results in a dendrogram, we choose the partition of the samples which maximizes the average Silhouette width, which is an internal measure of cluster goodness based on cluster compactness and separation.
cluster(ft_mat = summary_mat, query = ttyh1, metadata = metadata, heatmap = TRUE, title = "TTYH1 - summary", optimal_clusters = TRUE)
The consensus score is equal to 1, indicating a perfect agreement between the inferred clusters and the biological groups (Brain/Other), so we can infer a chromatin state switch around the TSS of TTYH1.
The cluster
function has a heatmap
argument which controls whether a heatmap
is produced as a PDF file in the current working directory or at a path specified by outdir
.
Heatmap showing hierarchical clustering result from chromswitch applied to TTYH1 using the summary strategy
In the binary approach, the features correspond to unique peaks observed in the region across samples. This can be sensitive to noisy data in the region, so we propose an additional pre-processing step prior to this approach. When considering peaks, often, we observe two scenarios:
The way chromswitch handles these is controlled by two tuning parameters. The first
is the gap
parameter, and is used to decide when nearby
peaks should be joined and replaced by one peak. The gap
parameter is the distance
between two peaks below which they should be joined.
Example of a transformation on peaks to join nearby peaks which are likely due
to the same biological event, implemented in chromswitch as reducePeaks()
.
The reducePeaks()
function takes a LocalPeaks
object as input, so we can
use the peaks in the TTYH1 window that we've already obtained.
ttyh1_pk_red <- reducePeaks(localpeaks = ttyh1_pk, gap = 300)
In this approach, we convert the data in the query region into a binary representation
by modelling the presence or absence of each unique peak. In this function,
first the unique peaks are obtained by collapsing down the union of all peaks
supplied in the region, and then we look for each unique peak in each sample
to construct the binary matrix. This involves using the second tuning parameter,
p
, which specifies how much reciprocal overlap two peaks must have in order
to be considered the same peak.
binary_mat <- binarizePeaks(ttyh1_pk_red, p = 0.4) # Chromswitch finds a single unique peak in the region binary_mat
In terms of selecting values for these tuning parameters, our experiments indicate that chromswitch is very robust to changes in these parameters, within reasonable ranges. Visual inspection of the data in a genome browser can be useful for determining exact values, and can help to control the resolution of the analysis.
Again, chromswitch finds good agreement between the cluster assignments and the biological groups in our analysis.
cluster(ft_mat = binary_mat, metadata = metadata, query = ttyh1, optimal_clusters = TRUE)
These steps can also be easily run in a pipeline using the %>%
operator, which is convenient when using individual chromswitch functions to compose an analysis rather than the two wrapper functions.
Bug reports and questions about usage, the method, etc. are welcome. Please open an issue on the GitHub repository for the development version of chromswitch: https://github.com/sjessa/chromswitch/issues/new.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.