suppressWarnings(library(knitr)) suppressMessages(library(GenomicRanges)) suppressMessages(library(data.table))
This document provides an introduction to the AdaLiftOver package. AdaLiftOver is a handy large-scale computational tool for adaptively identifying orthologous regions across different species using the UCSC liftOver framework. For given query genomic regions, AdaLiftOver incorporates epigenomic signals as well as sequence grammars to priorize and pinpoint candidate target genomic regions.
AdaLiftOver will be submitted to Bioconductor. Currently, AdaLiftOver can be downloaded and installed in R by:
devtools::install_github("ThomasDCY/AdaLiftOver")
AdaLiftOver depends on the following \R{} packages:
(a) r CRANpkg("data.table") is used for fast data manipulation and computation.
(b) r Biocpkg("GenomicRanges") is used for operating genomic intervals.
(c) r CRANpkg("motifmatchr") is used for fast motif scanning.
(d) r CRANpkg("rtracklayer") is the UCSC framework.
(e) r CRANpkg("Matrix") is used for efficient matrix operations.
(f) r CRANpkg("PRROC") is used in the training module to compute the area under curve.
library(AdaLiftOver)
We can apply AdaLiftOver between any two species giving the following inputs. For example, we can map from mouse (mm10) to human (hg38).
The input is a GRanges object containing all query genomic regions.
data("data_example") gr
We can download the UCSC chain file from mm10.hg38.rbest.chain.gz and then import the chain file with the following code.
chain <- rtracklayer::import.chain('mm10.hg38.rbest.chain')
We uses 67 matched ENCODE functional genomic datasets including TF/DNA methylation ChIP-seq and DNase-seq between human and mouse for illustration. The inputs for this part is two GRangesList objects with the same length.
data("epigenome_mm10") epigenome_mm10
data("epigenome_hg38") epigenome_hg38
We first map the query regions in mouse to candidate target regions in human. The key parameters:
gr_target_list <- adaptive_liftover(gr, chain, window = 2000, step_size = 200) gr_target_list <- adaptive_liftover(gr, chain, window = 2000, step_size = 10000)
This function will output a GRangesList object that has a one-to-one correspondence to the query genomic regions.
gr_list
We then utilize the orthologous functional genomic datasets to compute epigenomic signal similarities between query regions and their corresponding candidate target regions. A metadata column epigenome will be added.
gr_list <- compute_similarity_epigenome(gr, gr_list, epigenome_mm10, epigenome_hg38) gr_list
We also compute the sequence grammar similarities. A metadata column grammar will be added.
data('jaspar_pfm_list') gr_list <- compute_similarity_grammar(gr, gr_list, 'mm10', 'hg38', jaspar_pfm_list) gr_list
We then aggregate the epigenome signal and sequence grammar similarities as logistic probability scores and filter the scores with gr_candidate_filter() function. The key parameters:
The gr_candidate_filter() function also allows the logistic regression parameters for the sigmoid function. The logistic regression parameters are estimable with paired epigenome datasets with the training_module() function (see the next section). The default parameters are for mouse and human studies with the ENCODE epigenome repertoire we prepare. Check the documentation for more details.
gr_list_filter <- gr_candidate_filter( gr_list, best_k = 1L, b_interaction = NULL, threshold = 0.5 ) gr_list_filter
We might need to learn the parameters for a pair of matched epigenome datasets other than the ENCODE repertoire we prepare. AdaLiftOver provides a training module to estimate the logistic regression parameters and suggest an optimal score threshold without filtering out too many candidate target regions. The key parameters:
data("training_module_example") gr_candidate gr_true
The training module labels the candidate target regions (gr_candidate) as positives if they overlap with the corresponding epigenome peaks (gr_true) and as negatives otherwise.

training_module(gr_candidate, gr_true, interaction = FALSE) # without the interaction term training_module(gr_candidate, gr_true, interaction = TRUE) # with the interaction term
The training module outputs a data table with the following columns:
We can use the estimated regression coefficients as inputs to the function gr_candidate_filter().
For other model organisms, e.g. rats, we will need to collect orthologous epigenome repertoire from scratch. We hereby illustrate an example.
We can download the UCSC chain file from rat to human rn6.hg38.rbest.chain.gz and then import the chain file with the following code.
chain <- rtracklayer::import.chain('rn6.hg38.rbest.chain')
To provide a minimal example, we download the following human and rat orthologous ChIP-seq H3K4me3 datasets from ChIP-Atlas:
Make sure to install the BSgenome object for rat genome as well.
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("BSgenome.Rnorvegicus.UCSC.rn6")
These datasets are organized as GRangesList objects epigenome_rn6_test, epigenome_hg38_test
data('rat_example')
To conduct LOOCV, we first apply AdaLiftOver to map each rat epigenome dataset to human while excluding the current dataset for epigenomic signal similarity calculations. The following code takes about 15 minutes to run. Please note that leaving out only one pair of epigenome datasets might not be a good idea when there are multiple datasets from the same tissue. Excluding all epigenome datasets from the same tissue is recommended for LOOCV.
library(parallel) tissues <- names(epigenome_rn6_test) gr_candidate_list <- mclapply(1:length(epigenome_rn6_test), function(i) { message(paste(tissues[i], 'is being processed!')) gr <- epigenome_rn6_test[[i]] # current LOOCV fold # leave out the i-th epigenome dataset and train with the rest gr_candidate <- adaptive_liftover( gr, chain, window = 2000, option = 'adaptive', verbose = TRUE ) gr_candidate <- compute_similarity_epigenome( gr_query = gr, gr_target_list = gr_candidate, query_grlist = epigenome_rn6_test[-i], # training data target_grlist = epigenome_hg38_test[-i], # training data verbose = TRUE ) gr_candidate <- compute_similarity_grammar( gr_query = gr, gr_target_list = gr_candidate, query_genome = 'rn6', target_genome = 'hg38', motif_list = jaspar_pfm_list, verbose = TRUE ) return(gr_candidate) }, mc.cores = 4) names(gr_candidate_list) <- tissues
After computing the candidate target regions for each rat epigenome peaks, we first label the candidate target regions in the human genome as positives if they overlap with the corresponding human epigenome peaks and as negatives otherwise.
Then, we estimate the parameters with the training_module() function.
tissues <- names(epigenome_rn6_test) training_result <- rbindlist( lapply(tissues, function(tissue) { dt <- training_module( gr_candidate_list[[tissue]], epigenome_hg38_test[[tissue]], max_filter_proportion = 0.5, interaction = FALSE ) return(dt) })) training_result
We can take the averaged logistic regression parameters as default for rat studies.
print(sessionInfo())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.