In ETHZ-INS/scanMiR: scanMiR

library(BiocStyle)
knitr::opts_chunk$set(crop = NULL)

12-mer dissociation constants

McGeary, Lin et al. (2019) used RNA bind-n-seq (RBNS) to empirically determine the affinities (i.e. dissociation constants) of selected miRNAs towards random 12-nucleotide sequences (termed 12-mers). As expected, bound sequences typically exhibited complementarity to the miRNA seed region (positions 2-8 from the miRNA's 5' end), but the study also revealed non-canonical binding and the importance of flanking di-nucleotides. Based on these data, the authors developed a model which predicts 12-mer dissociation constants (Kd) from the miRNA sequence. scanMiR encodes a compressed version of these predictions in the form of a KdModel object.
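
As a minimal illustration of seed complementarity (this is a base R sketch, not part of the scanMiR API), the canonical 7-nucleotide site can be derived as the reverse complement of miRNA positions 2-8. The miRNA sequence is the one passed to getKdModel further down in this vignette; the helper name revcomp is hypothetical and introduced only for this example.

```r
# Example miRNA sequence (DNA alphabet), as used later in this vignette
mirseq <- "TTAATGCTAATCGTGATAGGGGTT"

# Seed region: positions 2-8 from the miRNA's 5' end
seed <- substr(mirseq, 2, 8)

# Hypothetical helper: reverse complement of a single DNA string
revcomp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)                 # complement each base
  paste(rev(strsplit(comp, "")[[1]]), collapse = "") # reverse the order
}

# Target sequence perfectly complementary to the seed
site7 <- revcomp(seed)
site7
```

The resulting 7-mer occurs inside the 12-mer "CTAGCATTAAGT" that is used as an example below.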

The 12-mer is defined as the 8 nucleotides opposite the miRNA's extended seed region plus flanking dinucleotides on either side:

knitr::include_graphics(system.file('docs', '12mer.png', package = 'scanMiR'))
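
The layout described above can be sketched in base R. This assumes that a 12-mer is written 5' to 3' as two flanking nucleotides, the 8-nucleotide core, then two more flanking nucleotides; split12mer is a hypothetical helper name, not a scanMiR function.

```r
# Hypothetical helper: split 12-mers into flanking di-nucleotides and 8-mer core
# (assumes the layout: 2 flanking nt + 8 core nt + 2 flanking nt)
split12mer <- function(x) {
  stopifnot(is.character(x), all(nchar(x) == 12))
  data.frame(
    flank5 = substr(x, 1, 2),    # upstream flanking di-nucleotide
    core8  = substr(x, 3, 10),   # 8 nt opposite the extended seed region
    flank3 = substr(x, 11, 12),  # downstream flanking di-nucleotide
    stringsAsFactors = FALSE
  )
}

parts <- split12mer(c("CTAGCATTAAGT", "ACGTACGTACGT"))
parts
```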

KdModels

The KdModel class contains the information concerning the sequence (12-mer) affinity of a given miRNA. It is meant to compress the dissociation constant (Kd) predictions from McGeary, Lin et al. (2019) and make them easy to manipulate.

We can take a look at the example KdModel:

library(scanMiR)
data(SampleKdModel)
SampleKdModel

In addition to the information necessary to predict the binding affinity to any given 12-mer sequence, the model contains, minimally, the name and sequence of the miRNA. Since the KdModel class extends the list class, any further information can be stored:

SampleKdModel$myVariable <- "test"

An overview of the binding affinities can be obtained with the following plot:

plotKdModel(SampleKdModel, what="seeds")

The plot gives the -log(Kd) values of the top 7-mers (including both canonical and non-canonical sites), with or without the final "A" vis-à-vis the first miRNA nucleotide.

To predict the dissociation constant (and binding type, if any) of a given 12-mer sequence, you can use the assignKdType function:

assignKdType("ACGTACGTACGT", SampleKdModel)
# or using multiple sequences:
assignKdType(c("CTAGCATTAAGT","ACGTACGTACGT"), SampleKdModel)

The log_kd column contains log(Kd) values multiplied by 1000 and stored as integers (which is more economical when dealing with millions of sites). In the example above, if lkd is the value returned by assignKdType("CTAGCATTAAGT", SampleKdModel)$log_kd, then the actual log(Kd) is lkd/1000, which corresponds to a dissociation constant of exp(lkd/1000). The smaller the value, the stronger the relative affinity.
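
The integer encoding can be illustrated with a short base R sketch. The value of lkd below is made up for the example and is not an actual output of assignKdType.

```r
# Hypothetical stored value (NOT a real scanMiR output): log(Kd) * 1000 as an integer
lkd <- -5129L

log_kd <- lkd / 1000   # back to the natural-log scale
kd     <- exp(log_kd)  # dissociation constant

# Encoding in the other direction: multiply by 1000 and round to an integer
encoded <- as.integer(round(log_kd * 1000))

# Integer vectors take roughly half the memory of double vectors
size_int <- object.size(integer(1e6))
size_dbl <- object.size(numeric(1e6))
c(log_kd = log_kd, kd = kd)
```

Rounding to three decimals on the log scale loses very little precision while allowing the values to be stored as integers.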

KdModelLists

A KdModelList object is simply a collection of KdModel objects. We can build one in the following way:

# we create a copy of the KdModel, and give it a different name:
mod2 <- SampleKdModel
mod2$name <- "dummy-miRNA"
kml <- KdModelList(SampleKdModel, mod2)
kml
summary(kml)

Beyond operations typically performed on a list (e.g. subsetting), some specific slots of the respective KdModels can be accessed, for example:

conservation(kml)

Creating a KdModel object

KdModel objects are meant to be created from a table assigning log_kd values to 12-mer target sequences, as produced by the CNN from McGeary, Lin et al. (2019). For the purpose of this example, we create such a dummy table:

kd <- dummyKdData()
head(kd)

A KdModel object can then be created with:

mod3 <- getKdModel(kd=kd, mirseq="TTAATGCTAATCGTGATAGGGGTT", name = "my-miRNA")

Alternatively, the kd argument can also be the path to the output file of the CNN (and if mirseq and name are in the table, they can be omitted).

Common KdModel collections

The scanMiRData package contains KdModel collections corresponding to all human, mouse and rat miRBase miRNAs.

Under the hood

When calling getKdModel, the dissociation constants are stored as a lightweight, overfitted linear model. The model has base Kd coefficients (stored as integers in object$mer8) for each of the 1024 partially matching 8-mers (i.e. those with at least 4 consecutive matching nucleotides). To these are added 8-mer-specific coefficients (stored in object$fl), each multiplied by a flanking score derived from the flanking di-nucleotides. The flanking score is calculated based on the di-nucleotide effects experimentally measured by McGeary, Lin et al. (2019). To save space, the actual 8-mer sequences are not stored but generated when needed in a deterministic fashion. The 8-mers can be obtained, in the right order, with the getSeed8mers function.
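
The structure of this linear model can be sketched with toy numbers in base R. Everything below is an assumption made for illustration: the coefficient values are invented, only three 8-mers are shown instead of 1024, the sign and scale of the flanking score are arbitrary, and predict_log_kd is a hypothetical function rather than the package's internal implementation.

```r
# Toy coefficients (invented values, NOT taken from a real KdModel)
mer8 <- c(-5200L, -3100L, -1500L)  # base coefficient per 8-mer (log_kd * 1000)
fl   <- c(300L, 150L, 50L)         # 8-mer-specific flanking coefficient

# Hypothetical prediction: base coefficient plus the flanking coefficient
# scaled by a flanking score computed from the two flanking di-nucleotides
predict_log_kd <- function(i, flank_score) {
  as.integer(round(mer8[i] + fl[i] * flank_score))
}

predict_log_kd(1, 0)     # neutral flanks: the base coefficient alone
predict_log_kd(1, -0.8)  # flanks shifting the prediction (assumed sign convention)
```

In this form, each 8-mer needs only two stored numbers, and the effect of the 256 possible flanking combinations is captured by a single score, which is what keeps the object small.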