knitr::opts_chunk$set(collapse=TRUE,comment = "#>") suppressPackageStartupMessages(library(universalmotif)) suppressMessages(suppressPackageStartupMessages(library(MotifDb))) suppressMessages(suppressPackageStartupMessages(library(Logolas))) suppressMessages(suppressPackageStartupMessages(library(TFBSTools))) data(examplemotif) data(MA0003.2)
This vignette will introduce the
universalmotif class and its structure, the import and export of motifs in R, basic motif manipulation, creation, and visualization. For an introduction to sequence motifs, see the introductory vignette. For sequence-related utilities, see the sequences vignette. For motif comparisons and P-values, see the motif comparisons and P-values vignette.
universalmotif package stores motifs using the
universalmotif class. The most basic
universalmotif object exposes the
motif slots; furthermore, the
bkg slots are also stored but not shown.
universalmotif class motifs can be PCM, PPM, PWM, or ICM type.
library(universalmotif) data(examplemotif) examplemotif
A brief description of all the available slots:
name: motif name
altname: (optional) alternative motif name
family: (optional) a word representing the transcription factor or matrix family
organism: (optional) organism of origin
motif: the actual motif matrix
alphabet: motif alphabet
type: motif 'type', one of PCM, PPM, PWM, ICM; see the introductory vignette
icscore: (generated automatically) Sum of information content for the motif
nsites: (optional) number of sites the motif was created from
pseudocount: this value to added to the motif matrix during certain type conversions; this is necessary to avoid
-Infvalues from appearing in PWM type motifs
bkg: a named vector of probabilities which represent the background letter frequencies
bkgsites: (optional) total number of background sequences from motif creation
consensus: (generated automatically) for DNA/RNA/AA motifs, the motif consensus
strand: strand motif can be found on
pval: (optional) P-value from de novo motif search
qval: (optional) Q-value from de novo motif search
eval: (optional) E-value from de novo motif search
multifreq: (optional) higher-order motif representations.
extrainfo: (optional) any extra motif information that cannot fit in the existing slots
The other slots will be shown as they are filled.
library(universalmotif) data(examplemotif) ## The various slots can be accessed individually using `[` examplemotif["consensus"] ## To change a slot, use `[<-` examplemotif["family"] <- "My motif family" examplemotif
Though the slots can easily be changed manually with
[<-, a number of safeguards have been put in place for some of the slots which will prevent incorrect values from being introduced.
library(universalmotif) data(examplemotif) ## The consensus slot is dependent on the motif matrix examplemotif["consensus"] ## Changing this would mean it no longer matches the motif examplemotif["consensus"] <- "GGGAGAG" ## Another example of trying to change a protected slot: examplemotif["strand"] <- "x"
Below the exposed metadata slots, the actual 'motif' matrix is shown. Each position is its own column: row names showing the alphabet letters, and the column names showing the consensus letter at each position.
universalmotif package aims to unify most of the motif-related Bioconductor packages by providing the
convert_motifs() function. This allows for easy transition between supported packages (see
?convert_motifs for a complete list of supported packages).
convert_motifs function is embedded in most of the
universalmotif functions, meaning that compatible motif classes from other packages can be used without needed to manually convert them first. However keep in mind some conversions are terminal. Furthermore, internally, all motifs regardless of class are handled as
universalmotif objects, even if the returning class is not. This will result in at times slightly different objects (though usually no information should be lost).
library(universalmotif) library(MotifDb) data(examplemotif) data(MA0003.2) ## convert from a `universalmotif` motif to another class convert_motifs(examplemotif, "TFBSTools-PWMatrix") ## convert to universalmotif convert_motifs(MA0003.2) ## convert between two packages convert_motifs(MotifDb, "TFBSTools-ICMatrix")
universalmotif package offers a number of
read_*() functions to allow for easy import of various motif formats. These include:
read_cisbp(): CIS-BP [@cisbp]
read_jaspar(): JASPAR [@jaspar]
read_matrix(): generic reader for simply formatted motifs
read_transfac(): TRANSFAC [@transfac]
read_uniprobe(): UniPROBE [@uniprobe]
These functions should work natively with these formats, but if you are generating your own motifs in one of these formats than it must adhere quite strictly to the format. An example of each of these is included in this package (see
Compatible motif classes can be written to disk using:
write_matrix() function, similar to its
read_matrix() counterpart, can write motifs as simple matrices with an optional header. Additionally, please keep in mind format limitations. For example, multiple
MEME motifs written to a single file will all share the same alphabet, with identical background letter frequencies.
universalmotif object can transition between PCM, PPM, PWM, and ICM types seamlessly using the
convert_type() function. The only exception to this is if the ICM calculation is performed with sample correction, or as relative entropy. If this occurs, then back conversion to another type will be inaccurate (and
convert_type() would not warn you, since it can't know this has taken place).
library(universalmotif) data(examplemotif) ## This motif is currently a PPM: examplemotif["type"]
When converting to PCM, the
nsites slot is needed to tell it how many sequences it originated from. If empty, 100 is used.
For converting to PWM, the
pseudocount slot is used to determine if any correction should be applied:
examplemotif["pseudocount"] convert_type(examplemotif, "PWM")
You can either change the
pseudocount slot manually beforehand, or pass one to
convert_type(examplemotif, "PWM", pseudocount = 1)
There are a couple of additional options for ICM conversion:
relative_entropy. The former uses the
TFBSTools:::schneider_correction() function (and thus requires that the
TFBSTools package be installed) for sample size correction. The latter uses the
bkg slot to calculate information content.
examplemotif["nsites"] <- 10 convert_type(examplemotif, "ICM", nsize_correction = FALSE) convert_type(examplemotif, "ICM", nsize_correction = TRUE) examplemotif["bkg"] <- c(A = 0.4, C = 0.1, G = 0.1, T = 0.4) convert_type(examplemotif, "ICM", relative_entropy = TRUE)
universalmotif package includes the
merge_motifs() function to combine motifs. Motifs are first aligned, and the best match found before the motif matrices are averaged. The implementation for this is identical to that used by
compare_motifs() (see the motif comparisons vignette for more information).
library(universalmotif) m1 <- create_motif("TTAAACCCC", name = "1") m2 <- create_motif("AACC", name = "2") m3 <- create_motif("AACCCCGG", name = "3") view_motifs(c(m1, m2, m3)) view_motifs(merge_motifs(c(m1, m2, m3), method = "PCC"))
Get the reverse complement of a motif.
library(universalmotif) data(examplemotif) ## Quickly switch to the reverse complement of a motif ## Original: examplemotif ## Reverse complement: motif_rc(examplemotif)
Since not all motif formats or programs support RNA alphabets by default, the
switch_alph() function can quickly go between DNA and RNA motifs.
library(universalmotif) data(examplemotif) ## DNA --> RNA switch_alph(examplemotif) ## RNA --> DNA motif <- create_motif(alphabet = "RNA") motif switch_alph(motif)
Get rid of low information content edges on motifs, such as
CGGGC. The 'amount' of trimming can also be controlled by setting a minimum required information content.
library(universalmotif) motif <- create_motif("NNGCSGCGGNN") motif trim_motifs(motif)
Round off near-zero probabilities.
motif1 <- create_motif("ATCGATGC", pseudocount = 5, type = "PPM", nsites = 100) motif2 <- round_motif(motif1) view_motifs(c(motif1, motif2), dedup.names = TRUE)
universalmotif class motifs can be created using the
new constructor, the
universalmotif package provides the
create_motif() function which aims to provide a simpler interface to motif creation. The
universalmotif class was initially designed to work natively with DNA, RNA, and amino acid motifs. Currently though, it can handle any custom alphabet just as easily. The only downsides to custom alphabets is the lack of support for certain slots such as the
create_motif() function will be introduced here only briefly; see
?create_motif for details.
Should you wish to make use of the
universalmotif functions starting from a unsupported motif class, you can instead create
universalmotif class motifs using the
motif.matrix <- matrix(c(0.7, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.7), nrow = 4) motif <- create_motif(motif.matrix, alphabet = "RNA", name = "My motif", pseudocount = 1, nsites = 20, strand = "+") ## The 'type', 'icscore' and 'consensus' slots will be filled for you motif
As a short aside: if you have a motif formatted simply as a matrix, you can still use it with the
universalmotif package functions natively without creating a motif with
convert_motifs() also has the ability to handle motifs formatted as matrices. However it is much safer to first specify the motif beforehand with
If all you have is a particular consensus sequence in mind, you can easily create a full motif using
create_motif(). This can be convenient if you'd like to create a quick motif to use with an external program such as from the
MEME suite or
motif <- create_motif("CCNSNGG", nsites = 50, pseudocount = 1) ## Now to disk: ## write_meme(motif, "meme_motif.txt") motif
If you wish, it's easy to create random motifs. The values within the motif are generated using
rgamma() to avoid creating low information content motifs. If background probabilities are not provided, then they are generated with
You can change the probabilities used to generate the values within the motif matrix:
create_motif(bkg = c(A = 0.2, C = 0.4, G = 0.2, T = 0.2))
With a custom alphabet:
create_motif(alphabet = "QWERTY")
There are several packages which offer motif visualization capabilities, such as
universalmotif package has chosen
ggseqlogo as the default implementation, and used to drive the
universalmotif package function
view_motifs(). Here I will briefly show how to use these to visualize
universalmotif class motifs.
library(universalmotif) data(examplemotif) ## With the native `view_motifs` function: view_motifs(examplemotif) ## For all the following examples, simply passing the functions a PPM is ## sufficient motif <- convert_type(examplemotif, "PPM") ## Only need the matrix itself motif <- motif["motif"] ## seqLogo: seqLogo::seqLogo(motif) ## motifStack: motifStack::plotMotifLogo(motif) ## Logolas: colnames(motif) <- seq_len(ncol(motif)) Logolas::logomaker(motif, type = "Logo") ## ggseqlogo: ggseqlogo::ggseqlogo(motif)
ggseqlogo offer many additional options for logo customization, including custom alphabets as well as manually determining the heights of each letter, via the
ggplot2 packages respectively.
motifStack package allows for a number of different motif stacking visualizations. The
universalmotif package, while not capable of emulating these, still offers basic stacking via
view_motifs(). The motifs are aligned using
library(universalmotif) library(MotifDb) motifs <- convert_motifs(MotifDb[1:3]) view_motifs(motifs)
Though PCM, PPM, PWM, and ICM type motifs are still widely used today, a few 'next generation' motif formats have been proposed. These wish to add another layer of information to motifs: positional interdependence. To illustrate this, consider the following sequences:
# | Sequence -- | -------- 1 | CAAAACC 2 | CAAAACC 3 | CAAAACC 4 | CTTTTCC 5 | CTTTTCC 6 | CTTTTCC : (#tab:seqs2) Example sequences.
This becomes the following PPM:
Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 -------- | --- | --- | --- | --- | --- | --- | --- A | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 C | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 T | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 : (#tab:ppm2) Position Probability Matrix.
Based on the PPM representation, all three of CAAAACC, CTTTTCC, and CTATACC are equally likely. Though looking at the starting sequences, should CTATACC really be considered so? For transcription factor binding sites, it is not always so. By incorporating this type of information into the motif, it can allow for increased accuracy in motif searching. A few implementations of this include: TFFM by @tffm, BaMM by @bamm, and KSM by @ksm.
universalmotif package implements its own, rather simplified, version of this concept. Plainly, the standard PPM has been extended to include
k-letter frequencies, with
k being any number higher than 1. For example, the 2-letter version of the table \@ref(tab:ppm2) motif would be:
Position | 1 | 2 | 3 | 4 | 5 | 6 -------- | --- | --- | --- | --- | --- | --- AA | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 AC | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 AG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 AT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 CA | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 CC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 CG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 CT | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 GA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 GC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 GG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 GT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 TA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 TC | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 TG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 TT | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 : (#tab:multi) 2-letter probability matrix.
This format shows the probability of each letter combined with the probability of the letter in the next position. The seventh column has been dropped, since it is not needed: the information in the sixth column is sufficient, and there is no eighth position to draw 2-letter probabilities from. Now, the probability of getting CTATACC is no longer equal to CTTTTCC and CAAAACC. This information is kept in the
multifreq slot of
universalmotif class motifs. To add this information, use the
library(universalmotif) motif <- create_motif("CWWWWCC", nsites = 6) sequences <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3)) motif.k2 <- add_multifreq(motif, sequences, add.k = 2) ## Alternatively: # motif.k2 <- create_motif(sequences, add.multifreq = 2) motif.k2
view_motifs() cannot be used to visualize this higher order motif representation. However, this can be done via the
library(Logolas) logomaker(motif.k2["multifreq"][["2"]], type = "Logo", color_type = "per_symbol")
This information is most useful with functions such as
enrich_motifs(). Though other tools in the
universalmotif can work with
multifreq motifs (such as
compare_motifs()), keep in mind they are not as well supported as regular motifs (getting P-values from
multifreq motifs is exponentially slower, and P-values from using
multifreq motifs are not available by default). See the sequences vignette for using
scan_sequences() with the
A number of convenience functions are included for manipulating motifs.
For DNA, RNA and AA motifs, the
universalmotif will automatically generate a
consensus string slot. Furthermore,
create_motif() can generate motifs from consensus strings. The internal functions for these have been made available:
library(universalmotif) get_consensus(c(A = 0.7, C = 0.1, G = 0.1, T = 0.1)) consensus_to_ppm("G")
Filter a list of motifs, using the
universalmotif slots with
library(universalmotif) library(MotifDb) ## Let us extract all of the Arabidopsis and C. elegans motifs (note that ## conversion from the MotifDb format is terminal) motifs <- filter_motifs(MotifDb, organism = c("Athaliana", "Celegans")) ## Only keeping motifs with sufficient information content and length: motifs <- filter_motifs(motifs, icscore = 10, width = 10) head(summarise_motifs(motifs))
Get a random set of sequences which are created using the probabilities of the motif matrix, in effect generating motif sites, with
library(universalmotif) data(examplemotif) sample_sites(examplemotif)
Shuffle a set of motifs with
shuffle_motifs(). The original shuffling implementation is taken from
shuffle_sequences(), described in the sequences vignette.
library(universalmotif) library(MotifDb) motifs <- convert_motifs(MotifDb[1:50]) head(summarise_motifs(motifs)) motifs.shuffled <- shuffle_motifs(motifs, k = 3) head(summarise_motifs(motifs.shuffled))
Motif matches in a set of sequences are typically obtained using logodds scores. Several functions are exposed to reveal some of the internal work that goes on.
get_matches(): show all possible sequence matches above a certain score
get_scores(): obtain all possible scores from all possible sequence matches
motif_score(): translate score thresholds to logodds scores
prob_match(): return probabilities for sequence matches
score_match(): return logodds scores for sequence matches
library(universalmotif) data(examplemotif) examplemotif ## Get the min and max possible scores: motif_score(examplemotif) ## Show matches above a score of 10: get_matches(examplemotif, 10) ## Get the probability of a match: prob_match(examplemotif, "TTTTTTT", allow.zero = FALSE) ## Score a specific sequence: score_match(examplemotif, "TTTTTTT") ## Take a look at the distribution of scores: plot(density(get_scores(examplemotif)))
convert_type() will take care of switching the current type for
universalmotif objects, the individual type conversion functions are also available for personal use. These are:
These functions take a one dimensional vector. To use these for matrices:
library(universalmotif) m <- create_motif(type = "PCM")["motif"] m apply(m, 2, pcm_to_ppm)
position_icscore() can be used to get the total information content per position:
library(universalmotif) position_icscore(c(0.7, 0.1, 0.1, 0.1))
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.