suppressWarnings(library(knitr))
suppressMessages(library(GenomicRanges))
suppressMessages(library(data.table))

Introduction

This document provides an introduction to the AdaLiftOver package. AdaLiftOver is a handy large-scale computational tool for adaptively identifying orthologous regions across different species using the UCSC liftOver framework. For given query genomic regions, AdaLiftOver incorporates epigenomic signals as well as sequence grammars to priorize and pinpoint candidate target genomic regions.

Installation

AdaLiftOver will be submitted to Bioconductor. Currently, AdaLiftOver can be downloaded and installed in R by:

devtools::install_github("ThomasDCY/AdaLiftOver")

AdaLiftOver depends on the following \R{} packages:

(a) r CRANpkg("data.table") is used for fast data manipulation and computation. (b) r Biocpkg("GenomicRanges") is used for operating genomic intervals. (c) r CRANpkg("motifmatchr") is used for fast motif scanning. (d) r CRANpkg("rtracklayer") is the UCSC framework. (e) r CRANpkg("Matrix") is used for efficient matrix operations. (f) r CRANpkg("PRROC") is used in the training module to compute the area under curve.

Example

Inputs

library(AdaLiftOver)

We can apply AdaLiftOver between any two species giving the following inputs. For example, we can map from mouse (mm10) to human (hg38).

Query genomic regions

The input is a GRanges object containing all query genomic regions.

data("data_example")
gr

UCSC chain file

We can download the UCSC chain file from mm10.hg38.rbest.chain.gz and then import the chain file with the following code.

chain <- rtracklayer::import.chain('mm10.hg38.rbest.chain')

Orthologous epigenomic signals

We uses 67 matched ENCODE functional genomic datasets including TF/DNA methylation ChIP-seq and DNase-seq between human and mouse for illustration. The inputs for this part is two GRangesList objects with the same length.

data("epigenome_mm10")
epigenome_mm10
data("epigenome_hg38")
epigenome_hg38

Map query regions

We first map the query regions in mouse to candidate target regions in human. The key parameters:

gr_target_list <- adaptive_liftover(gr, chain, window = 2000, step_size = 200)
gr_target_list <- adaptive_liftover(gr, chain, window = 2000, step_size = 10000)

This function will output a GRangesList object that has a one-to-one correspondence to the query genomic regions.

gr_list

Compute similarity scores

We then utilize the orthologous functional genomic datasets to compute epigenomic signal similarities between query regions and their corresponding candidate target regions. A metadata column epigenome will be added.

gr_list <- compute_similarity_epigenome(gr, gr_list, epigenome_mm10, epigenome_hg38)
gr_list

We also compute the sequence grammar similarities. A metadata column grammar will be added.

data('jaspar_pfm_list')
gr_list <- compute_similarity_grammar(gr, gr_list, 'mm10', 'hg38', jaspar_pfm_list)
gr_list

Filter target regions

We then aggregate the epigenome signal and sequence grammar similarities as logistic probability scores and filter the scores with gr_candidate_filter() function. The key parameters:

The gr_candidate_filter() function also allows the logistic regression parameters for the sigmoid function. The logistic regression parameters are estimable with paired epigenome datasets with the training_module() function (see the next section). The default parameters are for mouse and human studies with the ENCODE epigenome repertoire we prepare. Check the documentation for more details.

gr_list_filter <- gr_candidate_filter(
    gr_list,
    best_k = 1L,
    b_interaction = NULL,
    threshold = 0.5
)
gr_list_filter

The training module

Learn the parameters from a pair of epigenome datasets

We might need to learn the parameters for a pair of matched epigenome datasets other than the ENCODE repertoire we prepare. AdaLiftOver provides a training module to estimate the logistic regression parameters and suggest an optimal score threshold without filtering out too many candidate target regions. The key parameters:

data("training_module_example")
gr_candidate
gr_true

The training module labels the candidate target regions (gr_candidate) as positives if they overlap with the corresponding epigenome peaks (gr_true) and as negatives otherwise. An illustration of labeling candidate target regions of a mouse query region with the corresponding human epigenome peaks.  Positive and negative classes are represented by 1 and 0, respectively. The translucent gray bands represent candidate orthologous mappings.

training_module(gr_candidate, gr_true, interaction = FALSE) # without the interaction term
training_module(gr_candidate, gr_true, interaction = TRUE) # with the interaction term

The training module outputs a data table with the following columns:

We can use the estimated regression coefficients as inputs to the function gr_candidate_filter().

Leave one out cross validation (LOOCV) for another epigenome repertoire

For other model organisms, e.g. rats, we will need to collect orthologous epigenome repertoire from scratch. We hereby illustrate an example.

We can download the UCSC chain file from rat to human rn6.hg38.rbest.chain.gz and then import the chain file with the following code.

chain <- rtracklayer::import.chain('rn6.hg38.rbest.chain')

To provide a minimal example, we download the following human and rat orthologous ChIP-seq H3K4me3 datasets from ChIP-Atlas:

Make sure to install the BSgenome object for rat genome as well.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome.Rnorvegicus.UCSC.rn6")

These datasets are organized as GRangesList objects epigenome_rn6_test, epigenome_hg38_test

data('rat_example')

To conduct LOOCV, we first apply AdaLiftOver to map each rat epigenome dataset to human while excluding the current dataset for epigenomic signal similarity calculations. The following code takes about 15 minutes to run. Please note that leaving out only one pair of epigenome datasets might not be a good idea when there are multiple datasets from the same tissue. Excluding all epigenome datasets from the same tissue is recommended for LOOCV.

library(parallel)
tissues <- names(epigenome_rn6_test)

gr_candidate_list <- mclapply(1:length(epigenome_rn6_test), function(i) {
    message(paste(tissues[i], 'is being processed!'))
    gr <- epigenome_rn6_test[[i]] 
    # current LOOCV fold
    # leave out the i-th epigenome dataset and train with the rest

    gr_candidate <- adaptive_liftover(
        gr,
        chain,
        window = 2000,
        option = 'adaptive',
        verbose = TRUE
    )

    gr_candidate <- compute_similarity_epigenome(
        gr_query = gr,
        gr_target_list = gr_candidate,
        query_grlist = epigenome_rn6_test[-i], # training data
        target_grlist = epigenome_hg38_test[-i], # training data
        verbose = TRUE
    ) 

    gr_candidate <- compute_similarity_grammar(
        gr_query = gr,
        gr_target_list = gr_candidate,
        query_genome = 'rn6',
        target_genome = 'hg38',
        motif_list = jaspar_pfm_list,
        verbose = TRUE
    )

    return(gr_candidate)
}, mc.cores = 4)

names(gr_candidate_list) <- tissues

After computing the candidate target regions for each rat epigenome peaks, we first label the candidate target regions in the human genome as positives if they overlap with the corresponding human epigenome peaks and as negatives otherwise.

Then, we estimate the parameters with the training_module() function.

tissues <- names(epigenome_rn6_test)
training_result <- rbindlist(
  lapply(tissues, function(tissue) {
    dt <- training_module(
        gr_candidate_list[[tissue]], 
        epigenome_hg38_test[[tissue]], 
        max_filter_proportion = 0.5, 
        interaction = FALSE
    )
    return(dt)
}))
training_result

We can take the averaged logistic regression parameters as default for rat studies.

Session Information

print(sessionInfo())


ThomasDCY/AdaLiftOver documentation built on June 2, 2025, 12:35 p.m.