preciseTAD Vignette

options(width = 400)
knitr::opts_chunk$set(warnings = FALSE, message = FALSE)

Introduction

preciseTAD is an R package designed to transform domain boundary calling into a supervised machine learning framework. preciseTAD offers full functionality in building the predictor-response data and selecting the best model for precise prediction of boundary regions. This functionality can be broken into 2 primary usages:

Input data

The training/testing data used for modeling is represented as a matrix with rows being genomic regions, columns being genomic annotations, and cells containing measures of association between them. Users have the option to concatenate genomic regions from multiple chromosomes.

To create the row-wise dimension of the data, preciseTAD uses shifted binning, a strategy for making the dimensions of the data matrix used for modeling by segmenting the linear genome into nonoverlapping regions. This step is transparent for the user. To create shifted bins, chromosome-specific bins start at half of the resolution r, and continue in congruent intervals of length r until the end of the chromosome (mod r + r/2), using hg19 genomic coordinates. The shifted genomic bins, are then defined as boundary regions (Y = 1) if they contain a called boundary, and non-boundary regions (Y = 0) otherwise, thus establishing the binary response vector (Y) used for classification. Intuitively, shifted bins are centered on borders between the original bins, thus capturing potential boundaries. We found the shifted binning strategy improves model performance.

The column-wise dimension is formed by genomic annotations, such as transcription factor binding sites (TFBSs), histone modification marks, chromatin states. The ($log_{2}$) distances, which enumerate the genomic distance from the center of each genomic bin to the center of the nearest ChIP-seq peak region of interest, form the feature space. Other feature type options include binary overlaps - an indicator for whether a ChIP-seq region overlaps with genomic bin, count overlaps - the number of overlaps in each bin, and percent overlaps - the percentage of overlap between any bin and the total width of all ChIP-seq regions overlapping it. The customized training/testing data formation offered by preciseTAD allows users to implement any binary classification machine learning algorithm.

preciseTAD functionality and output

preciseTAD implements a random forest (RF) model, allowing tuning hyperparameters and applying feature reduction. The primary inputs are the training and testing data, a list of hyperparameter values, the number of cross-validation folds to use (if a grid of hyperparameter values is provided), and the metric used for optimization. The output includes the model object (necessary for downstream prediction of boundaries), the variable importance values, and a list of performance metrics when validating the model on the testing data (see Table 1). This model is then used to predict base-level precise boundary locations.

To predict the base-level location of domain boundaries, the preciseTAD model is applied for each base annotated with the aforementioned genomic annotations. The base-level probabilities are clustered using density-based clustering and scalable data partitioning techniques, to narrow boundary regions and points. First, the probability vector, $p_{n_{i}}$, is extracted ($n_{i}$ representing the length of chromosome $i$). Next, DBSCAN (Density-based Spatial Clustering of Applications with Noise) [@ester1996density; @hahsler2019dbscan] is applied to the matrix of pairwise genomic distances between bases with $p_{n_{i}} \ge t$, where $t$ is a threshold determined by the user. The resulting clusters of highly predictive bases identified by DBSCAN are termed preciseTAD boundary regions (PTBR). To precisely identify a single base among each PTBR, preciseTAD implements partitioning around medoids (PAM) within each cluster. The corresponding cluster medoid is defined as a preciseTAD boundary point (PTBP), making it the most representative and biologically meaningful base within each clustered PTBR. The output includes a list of genomic coordinates of PTBPs and PTBSs.

preciseTAD allows us to use the pre-trained model to predict the precise location of domain boundaries in cell types without Hi-C data but with genome annotation data. Specifically, only cell-type-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and ZNF143 transcription factor binding sites is required. We provide models pre-trained using GM12878 and K562 genome annotation data, and Arrowhead- and Peakachu-predicted boundaries for chromosome-specific prediction of precise domain boundaries in other cell types.

Getting Started

Installation

# if (!requireNamespace("BiocManager", quietly=TRUE))
#     install.packages("BiocManager")
# BiocManager::install("preciseTAD")
devtools::install_github("dozmorovlab/preciseTAD")
# Alternatively
# devtools::install_github("stilianoudakis/preciseTAD")
library(knitr)
library(e1071)
library(preciseTAD)

Implementation

Model building

Construction of the data matrix

preciseTAD requires users to supply genomic coordinates of the "ground-truth" domain boundaries to establish the response vector (Y). As an example, we consider TAD boundaries derived from the popular Arrowhead TAD caller, a part of the Juicebox suite of tools developed by Aiden Lab [@durand2016juicer]. To get boundaries, Arrowhead was applied on the autosomal chromosomes using 5 kb GM12878 Hi-C data [@rao20143d] (GSE63525). An example of the command line script used to called TADs with Arrowhead is provided below, as well as the first three columns of the resulting TAD coordinates. A more detailed tutorial for implementing Arrowhead can be found here.

arrowhead -c 1 \ #chromosome to call TADs on
  -r 5000 \ #HiC data resolution
  ~/GSE63525_GM12878_insitu_primary.hic \ #location of the .HIC file
  ~/arrowhead_output #location to store the output
data("arrowhead_gm12878_5kb")
head(arrowhead_gm12878_5kb)

Users will then need to transform this TAD coordinate data into a GRanges object of unique boundary coordinates using the preciseTAD::extractBoundaries() function. Here we only extract boundaries for CHR1 and CHR22 (CHR = c("CHR1", "CHR22") parameter). We specify preprocess=FALSE because we are only interested in boundaries and not filtering TADs by length. Lastly, we specify resolution=5000 to match the resolution used by Arrowhead (although this argument is ignored given that preprocess=FALSE). As shown below, there were a total of 1901 unique TAD boundaries reported by Arrowhead for chromosomes 1 and 22.

bounds <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb, preprocess = FALSE, CHR = c("CHR1", "CHR22"), resolution = 5000)
# View unique boundaries
bounds

To identify genomic features best predictive of TAD boundaries, preciseTAD requires functional genomic annotation data. They are used to establish the feature space ($\textbf{X}={X_{1}, X_{2}, \cdots, X_{p} }$). Cell type-specific genomic annotation data can be downloaded from the ENCODE data portal as BED files. Once you have downloaded your preferred list of functional genomic annotations, store them in a specific file location (i.e., "./path/to/BEDfiles"). These files can then be converted into a GRangesList object and used to create the feature space using the preciseTAD::bedToGRangesList() function. The signal argument refers to the column in the BED files containing peak signal strength values and is used to assign metadata to the corresponding GRanges (only necessary for downstream plotting). We have already provided a GRangesList object with 26 transcription factor binding sites (TFBS) specific to the GM12878 cell type. Once you load it in, you can see the list of transcription factors using the following commands.

# path <- "./path/to/BEDfiles"
# tfbsList <- bedToGRangesList(filepath=path, bedList=NULL, bedNames=NULL, pattern = "*.bed", signal=4)

data("tfbsList")
names(tfbsList)
tfbsList

Using the "ground-truth" boundaries and the following TFBS, we can build the data matrix that will be used for predictive modeling. The preciseTAD::createTADdata() function can create the training and testing data. Here, we specify to train on chromosome 1 and test on chromosome 22. Additionally, we specify resolution = 5000 to construct 5kb shifted genomic bins (to match the Hi-C data resolution), featureType = "distance" for a $log_2(distance + 1)$-type feature space, and resampling = "rus" to apply random under-sampling (RUS) on the training data to balance classes of boundary vs. non-boundary regions. We also specify a seed to ensure reproducibility when performing the resampling. The result is a list containing two data frames: (1) the resampled (if specified) training data, and (2) the testing data.

set.seed(123)
tadData <- createTADdata(bounds.GR = bounds,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "distance",
                         resampling = "rus",
                         trainCHR = "CHR1",
                         predictCHR = "CHR22")

# View subset of training data
tadData[[1]][1:5,1:4]
# Check it is balanced
table(tadData[[1]]$y)

# View subset of testing data
tadData[[2]][1:5,1:4]

Feature selection using recursive feature elimination

We can now implement our machine learning algorithm of choice to predict boundary regions. Here, we opt for the random forest algorithm for binary classification. preciseTAD offers functionality for performing recursive feature elimination (RFE) as a form of feature reduction through the use of the preciseTAD::TADrfe() function, which is a wrapper for the rfe function in the caret package [@kuhn2012caret]. preciseTAD::TADrfe() implements a random forest model on the best subset of features from 2 to the maximum number of features in the data by powers of 2, using 5-fold cross-validation. We specify accuracy as the performance metric. An example is shown below.

set.seed(123)
rfe_res <- TADrfe(trainData = tadData[[1]],
                 tuneParams = list(ntree = 500, nodesize = 1),
                 cvFolds = 5,
                 cvMetric = "Accuracy",
                 verbose = TRUE)
# View RFE performances
rfe_res[[1]]

# View the variable importance among top n features across each CV fold
head(rfe_res[[2]])

Recursive feature elimination results indicate that model accuracy begins to stabilize when only considering the top four transcription factors for building random forest predictive models (Figure 1A). After aggregating the variable importance values of the top four TFBS across each cross-fold, we see that the top four most important TFBS in each of the five folds are the SMC3, RAD21, CTCF, and ZNF143 (Figure 1B). These are known components of the loop-extrusion model that has been proposed as a mechanism for the 3D architecture of the human genome [@sanborn2015chromatin; @fudenberg2016formation; @hansen2018recent].

**Figure 1**. (A) Recursive feature elimination (RFE) indicates that performance stabilizes when only considering the top 4 most predictive transcription factor binding sites (TFBS). (B) Aggregate mean variable importance (using mean decrease in accuracy) for the top 4 TFBS across each of the 5 cross-folds.

Implementing a random forest for boundary prediction

Now that we have suitably reduced our feature space, we can implement a random forest algorithm built simply on the TFBS mentioned above (SMC3, RAD21, CTCF, and ZNF143). We can take advantage of the preciseTAD::TADrandomForest() function, which is a wrapper of the randomForest package [@breiman2001random; @liaw2002classification]. We specify the training and testing data, the hyperparameter values, the number of cross-validation folds, the performance metric to consider (here, accuracy), the seed to initialize for reproducibility, an indicator for retaining the model object, an indicator for retaining variable importances, the variable importance measure to consider (here, mean decrease in accuracy (MDA)), and an indicator for retaining model performances based on the test data. The function returns a list containing: 1) a trained model from caret with model information (tadModel[[1]]), 2) a data.frame of variable importance for each feature included in the model (tadModel[[2]]), and 3) a data.frame of various model performance metrics.

# Restrict the data matrix to include only SMC3, RAD21, CTCF, and ZNF143
tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad", 
                                            "Gm12878-Rad21-Haib",
                                            "Gm12878-Smc3-Sydh",
                                            "Gm12878-Znf143-Sydh")]

set.seed(123)
tadData <- createTADdata(bounds.GR   = bounds,
                         resolution  = 5000,
                         genomicElements.GR = tfbsList_filt,
                         featureType = "distance",
                         resampling  = "rus",
                         trainCHR    = "CHR1",
                         predictCHR  = "CHR22")

# Run RF
set.seed(123)
tadModel <- TADrandomForest(trainData  = tadData[[1]],
                            testData   = tadData[[2]],
                            tuneParams = list(mtry     = 2,
                                              ntree    = 500,
                                              nodesize = 1),
                            cvFolds      = 3,
                            cvMetric     = "Accuracy",
                            verbose      = FALSE,
                            model        = TRUE,
                            importances  = TRUE,
                            impMeasure   = "MDA",
                            performances = TRUE)
# View model performances
performances <- tadModel[[3]]
performances$Performance <- round(performances$Performance, digits = 2)
rownames(performances) <- performances$Metric
kable(t(performances), caption = "List of model performances when validating an RF built on CHR1 on CHR22 test data.")

As you may know, there exist other machine learning binary classifiers that can be used in this setting. For example, suppose we opt to implement a support vector machine (SVM). This is easy enough to accomplish given that preciseTAD::createTADdata() conveniently sets up the training and testing data sets. We use the e1071 package to run the SVM with a radial kernel, cost = 1, and gamma = 0.5 using the example command below. We see that the SVM model's accuracy is only 0.67, whereas the accuracy from our random forest was 0.69 (Table 1).

svmModel <- svm(y ~ ., data = tadData[[1]], kernel = "radial", cost = 1, gamma = 0.5)

svmPreds <- predict(svmModel, tadData[[2]][, -1], positive = "Yes")

# View confusion matrix
table(svmPreds, tadData[[2]][, 1])

Precise boundary prediction

Recall that our model classifies boundary regions, in that each prediction refers to a genomic bin of width 5000 bases. To predict boundary coordinates at the base resolution more precisely, we can leverage our model through the use of the preciseTAD::preciseTAD() function. Conceptually, instead of genomic bins, we annotate each base with the selected genomic annotations and featureType. We then apply our model on this annotation matrix to predict the probability of each base is a boundary given the associated genomic annotations. To minimize computational costs, the base-level prediction should be performed in selected regions.

Running preciseTAD

Suppose we want to precisely predict the domain boundary coordinates for the 2Mb section of CHR22:25,500,000-27,500,000. To do so, we specify chromCoords = list(25500000, 27500000). Additionally, we set a threshold of 1.0 for constructing PTBRs using threshold = 1.0. For DBSCAN, we assign 10000 and 3 for the $\epsilon$ and MinPts parameters.

# Run preciseTAD
set.seed(123)
pt <- preciseTAD(genomicElements.GR = tfbsList_filt,
                featureType         = "distance",
                CHR                 = "CHR22",
                chromCoords         = list(25500000, 27500000),
                tadModel            = tadModel[[1]],
                threshold           = 1.0,
                verbose             = FALSE,
                parallel            = NULL,
                DBSCAN_params       = list(10000, 3),
                flank               = 5000)

# What is getting returned? PTBRs - precide TAD boundary regions, and PTBPs - points
names(pt)
# View the results
pt

The output of preciseTAD::preciseTAD() function is a list with 4 elements: 1) the genomic coordinates spanning each preciseTAD predicted region (PTBR), 2) the genomic coordinates of preciseTAD predicted boundaries points (PTBP). 3) a named list including summary statistics of the following: PTBRWidth - PTBR width, PTBRCoverage - the ratio of base level coordinates with probabilities that exceed the threshold to PTBRWidth, DistanceBetweenPTBR - the genomic distance between the end of the previous PTBR and the start of the subsequent PTBR, NumSubRegions - the number of elements in each PTBR cluster, SubRegionWidth - the genomic coordinates spanning the subregion associated with each PTBR, DistBetweenSubRegions - the genomic distance between the end of the previous PTBR-specific region and the start of the subsequent PTBR-specific region, and the normalized enrichment of the genomic annotations used in the model around flanked PTBPs. Normalized enrichment is calculated as the total number of peak regions that overlap with flanked predicted boundaries divided by the number of predicted boundaries. A schematic of how each of the summaries are calculated is presented in Figure 2.

**Figure 2**. A schematic illustrating how each of the diagnostic summaries are calculated in the *preciseTAD()* output.

Using preciseTAD with Juicebox

Juicebox is an interface provided by Aiden Lab that allows for superimposing boundary coordinates onto Hi-C contact maps. To visualize domains flanked by the predicted boundaries, you must first select a Hi-C map. As an example, you can import the contact matrix for GM12878 derived by Rao et al. 2014 by choosing File -> Open... -> Rao and Huntley et al. -> GM12878 -> in situ Mbol -> primary. To format preciseTAD results to use in Juicebox, users can take advantage of preciseTAD::juicer_func(), which is a function available in the preciseTAD R package that transforms a GRanges object into a data frame as shown below.

# Transform
pt_juice <- juicer_func(pt[[2]])

You will then need to save the PTBPs as a BED file using write.table as shown below. Once saved, import them into Juicebox using Show Annotation Panel -> 2D Annotations -> Add Local -> pt_juice.bed.

filepath = "~/path/to/store/ptbps"
write.table(pt_juice, 
            file.path(filepath, "pt_juice.bed"),
            quote = FALSE,
            col.names = FALSE,
            row.names = FALSE,
            sep = "\t")

Cross-cell-type prediction

preciseTAD allows us to use the pre-trained model to predict the precise location of domain boundaries in cell types without Hi-C data but with genome annotation data. Specifically, only the cell type-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and ZNF143 transcription factor binding sites is required. These data need to be assembled in a GRanges object using the aforementioned bedToGRangesList() function (i.e., creating another cell-type-specific tfbsList_filt object).

The pre-trained model tadModel can be obtained by using bedToGRangesList(), extractBoundaries(), createTADdata(), and TADrandomForest() functions, as described. Alternatively, chromosome-specific pre-trained models can be downloaded from the OneDrive storage. We recommend models pre-trained using GM12878 genome annotation data and Peakachu peaks. Note that each model is annotated with "CHRi_holdout", representing the chromosome held out of training, and thus to be used to predict boundaries.

As an example, we consider the Hela-S3 cell line. We will download and use the CHR22_GM12878_5kb_Peakachu model to predict boundaries for chr22 in the Hela-S3 cell line.

# Read in RF model built on chr1-chr21
tadModel <- readRDS("~/Downloads/CHR22_GM12878_5kb_Peakachu.rds")

The Hela-S3-specific BED-formatted TFBS data can be downloaded at ENCODE. The selected TFBS data can be downloaded from the supplementary repository. They can be converted into a GRangesList.

# Establishing Hela-specific genomic annotations
# Reading in CTCF, RAD21, SMC3, and ZNF143 from repository and storing as list
ctcf <- as.data.frame(read_delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/ctcf.bed", delim = "\t", col_names = FALSE))
rad21 <- as.data.frame(read_delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/rad21.bed", delim = "\t", col_names = FALSE))
smc3 <- as.data.frame(read_delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/smc3.bed", delim = "\t", col_names = FALSE))
znf143 <- as.data.frame(read_delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/znf143.bed", delim = "\t", col_names = FALSE))
helaList <- list(ctcf, rad21, smc3, znf143)

hela.GR <- bedToGRangesList(bedList=helaList, bedNames=c("Ctcf", "Rad21", "Smc3", "Znf143"), signal=5)

Now we are ready to predict boundaries on chr22 for Hela-S3.

# Run preciseTAD
set.seed(123)
pt <- preciseTAD(genomicElements.GR = hela.GR,
                featureType         = "distance",
                CHR                 = "CHR22",
                chromCoords         = list(25500000, 27500000),
                tadModel            = tadModel[[1]],
                threshold           = 1.0,
                verbose             = FALSE,
                parallel            = NULL,
                DBSCAN_params       = list(10000, 3),
                flank               = 5000)

pt

References



Try the preciseTAD package in your browser

Any scripts or data that you put into this service are public.

preciseTAD documentation built on Nov. 8, 2020, 6:51 p.m.