TADpole is a computational tool designed to identify and analyze the entire hierarchy of topologically associated domains (TADs) in intra-chromosomal interaction matrices.
install.packages(c('bigmemory', 'cowplot', 'doParallel', 'foreach', 'fpc',
'ggdendro', 'ggplot2', 'ggpubr', 'gridExtra', 'Matrix',
'plyr', 'reshape2', 'rioja', 'viridis', 'zoo'))
Please note that TADpole has been tested with R 3.5.2 and later versions.
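Before installing, it can help to check which dependencies are still missing and whether the running R version meets the tested minimum. The snippet below is a minimal sketch using only base R; the variable names `required`, `r_ok` and `missing_pkgs` are illustrative and not part of TADpole.

```r
# Dependencies listed in the install.packages() call above
required <- c("bigmemory", "cowplot", "doParallel", "foreach", "fpc",
              "ggdendro", "ggplot2", "ggpubr", "gridExtra", "Matrix",
              "plyr", "reshape2", "rioja", "viridis", "zoo")

# TRUE if the running R version is at least the tested minimum
r_ok <- getRversion() >= "3.5.2"

# Names of required packages that are not yet installed
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (!r_ok) message("R is older than 3.5.2; TADpole has not been tested on it.")
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

Passing `missing_pkgs` to `install.packages()` would then install only what is absent.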
Download the repository, either by using wget:
wget https://github.com/paulasoler/TADpole/archive/master.zip
unzip master.zip
mv TADpole-master TADpole
or by cloning the repository:
git clone https://github.com/paulasoler/TADpole.git
R CMD INSTALL TADpole
Note: if you download the zip file from the GitHub website instead, it will be named TADpole-master, so please adapt the unzip command accordingly.
In this repository, we provide a test case from a publicly available Hi-C data set (SRA: SRR1658572) (1).
In the inst/extdata/ directory, we provide a 6 Mb region (chr18:9,000,000-15,000,000) of a human Hi-C dataset at 30 kb resolution:
inst/extdata/raw_chr18_300_500_30kb.tsv
To obtain this interaction matrix, we processed the Hi-C data using the TADbit (2) Python package, which handles all the necessary processing and normalization steps.
To run the main function TADpole, you need to provide an intrachromosomal interaction matrix representing an entire chromosome or a chromosome region. The input is a tab-separated values file containing the interaction matrix (M) with N rows and N columns, where N is the number of bins into which the chromosome region is divided. Each position of the matrix (Mij) contains the interaction value (raw or normalized) between the corresponding pair of genomic bins i and j. We recommend ONED (3) normalization, as it effectively corrects for known experimental biases.
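As a quick sanity check of the input format, the following sketch writes a toy 4 x 4 matrix in the layout described above and reads it back with base R. It assumes the file has no header line and no row names, which is an assumption about the layout rather than something stated here; the toy values are made up.

```r
# Toy symmetric interaction matrix (4 bins), values are illustrative only
toy <- matrix(c(10,  4, 1,  0,
                 4, 12, 5,  1,
                 1,  5, 9,  3,
                 0,  1, 3, 11), nrow = 4, byrow = TRUE)

# Write it as a tab-separated file without header or row names (assumed layout)
toy_file <- tempfile(fileext = ".tsv")
write.table(toy, toy_file, sep = "\t", row.names = FALSE, col.names = FALSE)

# Read it back and check that it is an N x N symmetric matrix
mat <- as.matrix(read.table(toy_file, sep = "\t", header = FALSE))
dimnames(mat) <- NULL
n_bins <- nrow(mat)
is_square <- nrow(mat) == ncol(mat)
is_symmetric <- isSymmetric(mat)
```

For the test case shipped with the package, the same check would be expected to give N = (15,000,000 - 9,000,000) / 30,000 = 200 bins.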
Schematic overview of the TADpole algorithm (for further details, refer to Soler-Vila et al. (4)).
The basic usage is the following:
library(TADpole)
mat_file <- system.file("extdata", "raw_chr18_300_500_30kb.tsv", package = "TADpole")
tadpole <- TADpole(mat_file, chr = "chr18", start = 9000000, end = 15000000, resol = 30000)
logical. Plot the distribution of column coverage to help in selecting a useful value for bad_frac. Mostly for debugging purposes.
centromere_search: logical. Split the matrix by the centromere into two sub-matrices representing the chromosome arms. Useful when working with big matrices (>15000 bins).
The function TADpole returns a tadpole object containing the following descriptors:
clusters: list containing the TADs for each hierarchical level (x) defined by the broken stick model.
x: start and end coordinates of all TADs.
If centromere_search is TRUE, contains the start and end coordinates of the TADs of the full chromosome.
head(tadpole)
$n_pcs
[1] 20
$optimal_n_clusters
[1] 12
$dendro
Call:
rioja::chclust(d = dist(pcs))
Cluster method : coniss
Distance : euclidean
Number of objects: 198
$clusters
$clusters$`2`
start end
1 1 110
2 111 200
...
$scores
1 2 3 4 5 6 7 8 9
1 NA 47,90916 42,22857 39,40353 43,61547 41,24569 0,00000 0,00000 0,00000
2 NA 44,47879 43,28183 45,06219 44,02830 45,38542 49,09032 0,00000 0,00000
...
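The TAD borders in the output are expressed in bins. The sketch below shows one way to pull out a hierarchical level and convert it to genomic coordinates. It uses a mock list with the same field names as the output above (so it runs without the package), and it assumes 1-based bins, where bin b spans from start + (b - 1) * resol to start + b * resol. The mock stores only level 2, so its `optimal_n_clusters` is set to 2 rather than the 12 reported above.

```r
# Mock object mimicking the field names shown in the output (values illustrative)
tadpole_mock <- list(
  optimal_n_clusters = 2,
  clusters = list(`2` = data.frame(start = c(1, 111), end = c(110, 200)))
)

region_start <- 9000000  # start of the analysed region, in bp
resol <- 30000           # bin size, in bp

# Select the partition at the optimal level
level <- as.character(tadpole_mock$optimal_n_clusters)
best <- tadpole_mock$clusters[[level]]

# Convert bin indices to genomic coordinates (assuming 1-based bins)
best$start_bp <- region_start + (best$start - 1) * resol
best$end_bp <- region_start + best$end * resol
```

With a real tadpole object, replacing `tadpole_mock` by the object returned by TADpole should work the same way, provided the structure matches the output shown above.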
TADpole automatically generates a map of the intra-chromosomal interaction matrix under study, together with a histogram showing the distribution of interaction values. In the histogram, a dashed line indicates the number of columns (and corresponding rows) excluded from the analysis for having a low number of interactions (the so-called bad columns). Specifically, the columns (and rows) that contain an empty cell at the main diagonal, and those whose cumulative interactions are below the first percentile (the default), are excluded from the analysis.
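The filtering rule described above can be sketched in a few lines of base R. This is an illustration of the rule as stated, not TADpole's internal code: the function name `flag_bad_columns` is hypothetical, an "empty" diagonal cell is assumed to mean zero or NA, and `bad_frac = 0.01` is assumed to correspond to the first percentile.

```r
# Flag columns with an empty diagonal cell or with total coverage
# below the bad_frac quantile of all column sums (hypothetical helper)
flag_bad_columns <- function(mat, bad_frac = 0.01) {
  col_cov <- colSums(mat, na.rm = TRUE)
  empty_diag <- is.na(diag(mat)) | diag(mat) == 0
  low_cov <- col_cov < quantile(col_cov, probs = bad_frac)
  unname(which(empty_diag | low_cov))
}

# Toy 5 x 5 matrix: bin 3 has an empty diagonal, bin 5 has very low coverage
mat <- matrix(10, nrow = 5, ncol = 5)
mat[3, 3] <- 0
mat[, 5] <- 1
mat[5, ] <- 1

bad <- flag_bad_columns(mat)
```

In this toy matrix, bin 3 is flagged because of its empty diagonal cell and bin 5 because its column sum is the lowest.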
Left, the complete dendrogram obtained from the Hi-C matrix, cut at the maximum number of significant levels (max(ND)) reported by the broken-stick model (including the partitions into 2 up to 16 TADs). Among these levels, the highest-scoring one is selected according to the CH index analysis. Right, Hi-C contact map showing the complete hierarchy of the significant levels selected by the broken-stick model (black lines) along with the optimal one with 12 TADs, identified by the highest CH index (blue line).
plot_hierarchy(mat_file, tadpole, chr = "chr18", start = 9000000, end = 15000000, resol = 30000)
tadpole object.
centromere_search: logical. Split the matrix by the centromere into two sub-matrices representing the chromosome arms. Useful when working with big matrices (>15000 bins).
CH_map(tadpole)
tadpole object.
To compare pairs of topological partitions, P and Q, identified by TADpole at the same level of the hierarchy, we defined a Difference Topology score (DiffT). Specifically, the partitioned matrices are transformed into binary forms p for P, and q for Q, in which each entry pij (qij) is equal to 1 if the bins i and j are in the same TAD and 0 otherwise. Then, the DiffT is computed as the normalized (from 0 to 1) difference between the binarized matrices as a function of the bin index b as:
where N is the total number of bins.
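To make the definition concrete, the sketch below computes a cumulative, normalized difference profile from two toy partitions. It assumes the score at bin b is the sum of |pij - qij| over rows 1 to b and all columns, divided by the same sum over the whole matrix; this reading follows the description above but is an assumption, and the helper names `partition_to_binary` and `difft_sketch` are hypothetical, not the package's diffT implementation.

```r
# Binary N x N matrix: entry (i, j) is 1 if bins i and j fall in the same TAD
partition_to_binary <- function(tads, n_bins) {
  m <- matrix(0L, n_bins, n_bins)
  for (k in seq_len(nrow(tads))) {
    idx <- tads$start[k]:tads$end[k]
    m[idx, idx] <- 1L
  }
  m
}

# Cumulative difference up to each bin, normalized to end at 1
difft_sketch <- function(p, q) {
  row_diff <- rowSums(abs(p - q))
  total <- sum(row_diff)
  if (total == 0) return(rep(0, nrow(p)))  # identical partitions
  cumsum(row_diff) / total
}

# Two toy partitions of 6 bins that disagree on where bin 3 belongs
P <- data.frame(start = c(1, 4), end = c(3, 6))
Q <- data.frame(start = c(1, 3), end = c(2, 6))

p <- partition_to_binary(P, 6)
q <- partition_to_binary(Q, 6)
profile <- difft_sketch(p, q)
```

In this toy case the profile jumps most at bin 3, the only bin assigned to different TADs in the two partitions.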
Here, the DiffT score analysis is used to compare the chromatin partitions at the same hierarchical level determined in two different experiments: control and case.
In the inst/extdata/ directory, there are 2 files in a BED-like format:
inst/extdata/control.bed
inst/extdata/case.bed
control <- read.table(system.file("extdata", "control.bed", package = "TADpole"))
case <- read.table(system.file("extdata", "case.bed", package = "TADpole"))
difft_control_case <- diffT(control, case)
data.frames with a BED-like format with 3 columns: chromosome, start and end coordinates of each TAD, in bins.
The function diffT returns a numeric vector representing the cumulative DiffT score profiles as a function of the matrix bins.
The highest local differences between the two matrices can be identified by the sharpest changes in the slope of the function.
```R
plot(difft_control_case, type = "l")
```
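Besides inspecting the plot, the steepest parts of a cumulative profile can be located numerically. The sketch below takes the per-bin increments of a toy profile (made-up values standing in for the output of diffT) and ranks the bins by their contribution; it is one simple way to read the curve, not a procedure prescribed by the package.

```r
# Toy cumulative profile standing in for the output of diffT()
toy_profile <- c(0.1, 0.2, 0.7, 0.8, 0.9, 1.0)

# Per-bin increase of the cumulative score (the local slope of the curve)
increments <- diff(c(0, toy_profile))

# Bin with the largest local difference between the two partitions
top_bin <- which.max(increments)

# Bins ordered from largest to smallest contribution
ranked_bins <- order(increments, decreasing = TRUE)
```

Here `top_bin` points to bin 3, where the toy profile rises most sharply.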