vignettes/preciseTADhub.md

title: "Predicting TAD/loop boundaries using pre-trained models provided by preciseTADhub" author: - name: Spiro Stilianoudakis affiliation: - &1 Department of Biostatistics, Virginia Commonwealth University, Richmond, VA - name: Mikhail Dozmorov affiliation: - *1 date: 'November 6, 2020' package: preciseTADhub output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{preciseTADhub} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console

Introduction

preciseTADhub is an ExperimentData R package that supplements the preciseTAD software R package. preciseTADhub offers users access to pre-trained random forest classification models used to predict TAD/loop boundary regions. The model building process introduced by preciseTAD (https://doi.org/10.1101/2020.09.03.282186) can be computationally intensive. To avoid this burden, we have provided users with 84 (2 cell lines $\times$ 2 ground truth boundaries $\times$ 21 autosomal chromosomes) .RDS files containing pre-trained models that can be leveraged to predict TAD and/or chromatin loop boundaries at base-level resolution using functionality provided by preciseTAD (https://bioconductor.org/packages/devel/bioc/vignettes/preciseTAD/inst/doc/preciseTAD.html).

Each of the 84 files are stored as lists containing two objects: 1) a train object from \code{caret} with RF model information, and 2) a data.frame of variable importance for each genomic annotation included in the model. The file names are stuctured as follows:

$i$$j$$k$_$l$.rds

where $i$ denotes the chromosome that was used as a holdout {CHR1, CHR2, ..., CHR21, CHR22} (i.e. for testing; meaning all other chromosomes were used for training), j denotes the cell line {GM12878, K562}, k denotes the resolution (size of genomic bins) {5kb, 10kb}, and l denotes the TAD/loop caller used to define ground truth {Arrowhead, Peakachu}.

For example the file named "CHR1_GM12878_5kb_Arrowhead.rds" is a list whose first item is a RF model that was built on data for chromosomes 2-22 (omitting CHR9; see https://doi.org/10.1101/2020.09.03.282186), binned using 5 kb bins, ground truth TAD boundaries were identified using the Arrowhead TAD caller at 5 kb on GM12878. All models included the same number of predictors including CTCF, RAD21, SMC3, and ZNF143. The second item in the list is a data.frame with variable importances for CTCF, RAD21, SMC3, and ZNF143.

The pre-trained models set up users to apply them to predict their own boundaries on chromosomes that were heldout, per the framework in the preciseTAD paper (https://doi.org/10.1101/2020.09.03.282186).

Getting Started

The following is an example of how to predict TAD boundaries at base-level resolution for CHR22 on GM12878, using a pre-trained model stored in preciseTADhub.

Installation

library(preciseTAD)
library(ExperimentHub)

Reading in a file stored on ExperimentHub

Table 1 shows the file names and the corresponding ExperimentHub (EH) IDs. Since we want to make TAD boundary predictions on CHR22 for GM12878, we opt to read in the "CHR1_GM12878_5kb_Arrowhead.rds" file. This corresponds to the EH3895 EHID.

| FileName | EHID | |---------------------------------|---------------| | CHR1_GM12878_5kb_Arrowhead.rds | EH3815 | | CHR1_GM12878_10kb_Peakachu.rds | EH3816 | | CHR1_K562_5kb_Arrowhead.rds | EH3817 | | CHR1_K562_10kb_Peakachu.rds | EH3818 | | CHR2_GM12878_5kb_Arrowhead.rds | EH3819 | | CHR2_GM12878_10kb_Peakachu.rds | EH3820 | | CHR2_K562_5kb_Arrowhead.rds | EH3821 | | CHR2_K562_10kb_Peakachu.rds | EH3822 | | CHR3_GM12878_5kb_Arrowhead.rds | EH3823 | | CHR3_GM12878_10kb_Peakachu.rds | EH3824 | | CHR3_K562_5kb_Arrowhead.rds | EH3825 | | CHR3_K562_10kb_Peakachu.rds | EH3826 | | CHR4_GM12878_5kb_Arrowhead.rds | EH3827 | | CHR4_GM12878_10kb_Peakachu.rds | EH3828 | | CHR4_K562_5kb_Arrowhead.rds | EH3829 | | CHR4_K562_10kb_Peakachu.rds | EH3830 | | CHR5_GM12878_5kb_Arrowhead.rds | EH3831 | | CHR5_GM12878_10kb_Peakachu.rds | EH3832 | | CHR5_K562_5kb_Arrowhead.rds | EH3833 | | CHR5_K562_10kb_Peakachu.rds | EH3834 | | CHR6_GM12878_5kb_Arrowhead.rds | EH3835 | | CHR6_GM12878_10kb_Peakachu.rds | EH3836 | | CHR6_K562_5kb_Arrowhead.rds | EH3837 | | CHR6_K562_10kb_Peakachu.rds | EH3838 | | CHR7_GM12878_5kb_Arrowhead.rds | EH3839 | | CHR7_GM12878_10kb_Peakachu.rds | EH3840 | | CHR7_K562_5kb_Arrowhead.rds | EH3841 | | CHR7_K562_10kb_Peakachu.rds | EH3842 | | CHR8_GM12878_5kb_Arrowhead.rds | EH3843 | | CHR8_GM12878_10kb_Peakachu.rds | EH3844 | | CHR8_K562_5kb_Arrowhead.rds | EH3845 | | CHR8_K562_10kb_Peakachu.rds | EH3846 | | CHR10_GM12878_5kb_Arrowhead.rds | EH3847 | | CHR10_GM12878_10kb_Peakachu.rds | EH3848 | | CHR10_K562_5kb_Arrowhead.rds | EH3849 | | CHR10_K562_10kb_Peakachu.rds | EH3850 | | CHR11_GM12878_5kb_Arrowhead.rds | EH3851 | | CHR11_GM12878_10kb_Peakachu.rds | EH3852 | | CHR11_K562_5kb_Arrowhead.rds | EH3853 | | CHR11_K562_10kb_Peakachu.rds | EH3854 | | CHR12_GM12878_5kb_Arrowhead.rds | EH3855 | | CHR12_GM12878_10kb_Peakachu.rds | EH3856 | | CHR12_K562_5kb_Arrowhead.rds | EH3857 | | CHR12_K562_10kb_Peakachu.rds | EH3858 | | CHR13_GM12878_5kb_Arrowhead.rds | EH3859 | | CHR13_GM12878_10kb_Peakachu.rds | EH3860 | | CHR13_K562_5kb_Arrowhead.rds | EH3861 | | CHR13_K562_10kb_Peakachu.rds | EH3862 | | CHR14_GM12878_5kb_Arrowhead.rds | EH3863 | | CHR14_GM12878_10kb_Peakachu.rds | EH3864 | | CHR14_K562_5kb_Arrowhead.rds | EH3865 | | CHR14_K562_10kb_Peakachu.rds | EH3866 | | CHR15_GM12878_5kb_Arrowhead.rds | EH3867 | | CHR15_GM12878_10kb_Peakachu.rds | EH3868 | | CHR15_K562_5kb_Arrowhead.rds | EH3869 | | CHR15_K562_10kb_Peakachu.rds | EH3870 | | CHR16_GM12878_5kb_Arrowhead.rds | EH3871 | | CHR16_GM12878_10kb_Peakachu.rds | EH3872 | | CHR16_K562_5kb_Arrowhead.rds | EH3873 | | CHR16_K562_10kb_Peakachu.rds | EH3874 | | CHR17_GM12878_5kb_Arrowhead.rds | EH3875 | | CHR17_GM12878_10kb_Peakachu.rds | EH3876 | | CHR17_K562_5kb_Arrowhead.rds | EH3877 | | CHR17_K562_10kb_Peakachu.rds | EH3878 | | CHR18_GM12878_5kb_Arrowhead.rds | EH3879 | | CHR18_GM12878_10kb_Peakachu.rds | EH3880 | | CHR18_K562_5kb_Arrowhead.rds | EH3881 | | CHR18_K562_10kb_Peakachu.rds | EH3882 | | CHR19_GM12878_5kb_Arrowhead.rds | EH3883 | | CHR19_GM12878_10kb_Peakachu.rds | EH3884 | | CHR19_K562_5kb_Arrowhead.rds | EH3885 | | CHR19_K562_10kb_Peakachu.rds | EH3886 | | CHR20_GM12878_5kb_Arrowhead.rds | EH3887 | | CHR20_GM12878_10kb_Peakachu.rds | EH3888 | | CHR20_K562_5kb_Arrowhead.rds | EH3889 | | CHR20_K562_10kb_Peakachu.rds | EH3890 | | CHR21_GM12878_5kb_Arrowhead.rds | EH3891 | | CHR21_GM12878_10kb_Peakachu.rds | EH3892 | | CHR21_K562_5kb_Arrowhead.rds | EH3893 | | CHR21_K562_10kb_Peakachu.rds | EH3894 | | CHR22_GM12878_5kb_Arrowhead.rds | EH3895 | | CHR22_GM12878_10kb_Peakachu.rds | EH3896 | | CHR22_K562_5kb_Arrowhead.rds | EH3897 | | CHR22_K562_10kb_Peakachu.rds | EH3898 | Table: File names and corresponding ExperimentHub (EH) IDs for all 84 .RDS files stored in preciseTADhub.

Below is a function that can be used to read in specific models by file name/EHID.

#Initialize ExperimentHub
hub <- ExperimentHub()
query(hub, "preciseTADhub")
## ExperimentHub with 84 records
## # snapshotDate(): 2020-11-06
## # $dataprovider: preciseTAD
## # $species: Homo sapiens
## # $rdataclass: list
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH3815"]]' 
## 
##            title                          
##   EH3815 | CHR1_GM12878_5kb_Arrowhead.rds 
##   EH3816 | CHR1_GM12878_10kb_Peakachu.rds 
##   EH3817 | CHR1_K562_5kb_Arrowhead.rds    
##   EH3818 | CHR1_K562_10kb_Peakachu.rds    
##   EH3819 | CHR2_GM12878_5kb_Arrowhead.rds 
##   ...      ...                            
##   EH3894 | CHR21_K562_10kb_Peakachu.rds   
##   EH3895 | CHR22_GM12878_5kb_Arrowhead.rds
##   EH3896 | CHR22_GM12878_10kb_Peakachu.rds
##   EH3897 | CHR22_K562_5kb_Arrowhead.rds   
##   EH3898 | CHR22_K562_10kb_Peakachu.rds
myfiles <- query(hub, "preciseTADhub")

# Read in File names and EHIDs
data("FileNames")

# Function for reading in pre-trained model
readEH <- function(chr, cl, gt){
  EHID <- FileNames$EHID[which(FileNames$FileName == paste0(chr, 
                                                            "_",
                                                            cl,
                                                            "_",
                                                            ifelse(gt=="Arrowhead", "5kb", "10kb"),
                                                            "_",
                                                            gt,
                                                            ".rds"))]

  ptm <- myfiles[[EHID]] 

  #keep only the model object
  ptm <- ptm[[1]]

  return(ptm)
}

CHR22_GM12878_5kb_Arrowhead <- readEH("CHR22", "GM12878", "Arrowhead")
## Warning: package 'preciseTADhub' is in use and will not be installed

Use the pre-trained model to predict TAD boundaries on CHR22 for GM12878 at base-level resolution



stilianoudakis/preciseTADhub documentation built on Dec. 31, 2020, 7:34 a.m.