knitr::opts_chunk$set(collapse=TRUE, comment = "#>") suppressPackageStartupMessages(library(universalmotif)) suppressPackageStartupMessages(library(Biostrings)) data(ArabidopsisPromoters) data(ArabidopsisMotif)
This vignette goes through generating your own sequences from a specified background model, shuffling sequences whilst maintaining a certain
k-let size, and the scanning of sequences and scoring of motifs. For an introduction to sequence motifs, see the introductory vignette. For a basic overview of available motif-related functions, see the motif manipulation vignette. For a discussion on motif comparisons and P-values, see the motif comparisons and P-values vignette.
Biostrings package offers an excellent suite of functions for dealing with biological sequences. The
universalmotif package hopes to help extend these by providing the
shuffle_sequences() functions. The first of these,
create_sequences(), in it's simplest form generates a set of letters in random order, then passes these strings to the
Biostrings package. The number and length of sequences can be specified. The probabilities of individual letters can also be set.
freqs option of
create_sequences() also takes higher order backgrounds. In these cases the sequences are constructed in a Markov-style manner, where the probability of each letter is based on which letters precede it.
library(universalmotif) library(Biostrings) ## Create some DNA sequences for use with an external program (default ## is DNA): sequences.dna <- create_sequences(seqnum = 500, freqs = c(A=0.3, C=0.2, G=0.2, T=0.3)) ## writeXStringSet(sequences.dna, "dna.fasta") sequences.dna ## Amino acid: create_sequences(alphabet = "AA") ## Any set of characters can be used create_sequences(alphabet = paste0(letters, collapse = ""))
Sequence backgrounds can be retrieved for DNA and RNA sequences with
"Biostrings. Unfortunately, no such
Biostrings function exists for other sequence alphabets. The
universalmotif package proves
get_bkg() to remedy this. Similarly, the
get_bkg() function can calculate higher order backgrounds for any alphabet as well. It is recommended to use the original
Biostrings for very long DNA and RNA sequences whenever possible though, as it is much faster than
library(universalmotif) ## Background of DNA sequences: dna <- create_sequences() get_bkg(dna, k = 1:2) ## Background of non DNA/RNA sequences: qwerty <- create_sequences("QWERTY") get_bkg(qwerty, k = 1:2)
When performing de novo motif searches or motif enrichment analyses, it is common to do so against a set of background sequences. In order to properly identify consistent patterns or motifs in the target sequences, it is important that there be maintained a certain level of sequence composition between the target and background sequences. This reduces results which are derived purely from differential letter frequency biases.
In order to avoid these results, typically it desirable to use a set of background sequences which preserve a certain
k-let size (such as dinucleotide or trinucleotide frequencies in the case of DNA sequences). Though for some cases a set of similar sequences may already be available for use as background sequences, usually background sequences are obtained by shuffling the target sequences, while preserving a desired
k-let size. For this purpose, a commonly used tool is
uShuffle [@ushuffle]. The
universalmotif package aims to provide its own
k-let shuffling capabilities for use within
universalmotif package offers three different methods for sequence shuffling:
linear. The first method,
euler, can shuffle sequences while preserving any desired
k-let size. Furthermore 1-letter counts will always be maintained. However in order for this to be possible, the first and last letters will remain unshuffled. This method is based on the initial random Eulerian walk algorithm proposed by @markovmodel2 and the subsequent cycle-popping algorithm detailed by @eulerAlgo for quickly and efficiently finding Eulerian walks.
The second method,
markov can only guarantee that the approximate
k-let frequency will be maintained, but not that the original letter counts will be preserved. The
markov method involves determining the original
k-let frequencies, then creating a new set of sequences which will have approximately similar
k-let frequency. As a result the counts for the individual letters will likely be different. Essentially, it involves a combination of determining
k-let frequencies followed by
create_sequences(). This type of shuffling is discussed by @markovmodel.
The third method
linear preserves the original 1-letter counts exactly, but uses a more crude shuffling technique. In this case the sequence is split into sub-sequences every
k-let (of any size), which are then re-assembled randomly. This means that while shuffling the same sequence multiple times with
method = "linear" will result in different sequences, they will all have started from the same set of
k-length sub-sequences (just re-assembled differently).
library(universalmotif) library(Biostrings) data(ArabidopsisPromoters) ## Potentially starting off with some external sequences: # ArabidopsisPromoters <- readDNAStringSet("ArabidopsisPromoters.fasta") euler <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "euler") markov <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "markov") linear <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "linear") k1 <- shuffle_sequences(ArabidopsisPromoters, k = 1)
Let us compare how the methods perform:
o.letter <- get_bkg(ArabidopsisPromoters, 1) e.letter <- get_bkg(euler, 1) m.letter <- get_bkg(markov, 1) l.letter <- get_bkg(linear, 1) data.frame(original=o.letter$count, euler=e.letter$count, markov=m.letter$count, linear=l.letter$count, row.names = DNA_BASES) o.counts <- get_bkg(ArabidopsisPromoters, 2) e.counts <- get_bkg(euler, 2) m.counts <- get_bkg(markov, 2) l.counts <- get_bkg(linear, 2) data.frame(original=o.counts$count, euler=e.counts$count, markov=m.counts$count, linear=l.counts$count, row.names = get_klets(DNA_BASES, 2))
Since biological sequences are usually contained in
XStringSet class objects,
shuffle_sequences() are designed to work with such objects. For cases when strings are not
XStringSet objects, the following functions are available:
count_klets(): alternative to
shuffle_string(): alternative to
library(universalmotif) string <- "DASDSDDSASDSSA" count_klets(string, 2) shuffle_string(string, 2)
get_klets() function can be used to get a list of all possible
k-lets for any sequence alphabet:
library(universalmotif) get_klets(c("A", "S", "D"), 2)
There are many motif-programs available with sequence scanning capabilities, such as HOMER and tools from the MEME suite. The
universalmotif package does not aim to supplant these, but rather provide convenience functions for quickly scanning a few sequences without needing to leave the
R environment. Furthermore, these functions allow for taking advantage of the higher-order (
multifreq) motif format described here.
Two scanning-related functions are provided:
enrich_motifs(). The latter simply runs
scan_sequences() twice on a set of target and background sequences. Given a motif of length
scan_sequences() considers every possible
n-length subset in a sequence and scores it using the PWM format. If the match surpasses the minimum threshold, it is reported. This is case regardless of whether one is scanning with a regular motif, or using the higher-order (
multifreq) motif format (the
multifreq matrix is converted to a PWM).
Before scanning a set of sequences, one must first decide the minimum logodds threshold for retrieving matches. This decision is not always the same between scanning programs out in the wild, nor is it usually told to the user what the cutoff is or how it is decided. As a result,
universalmotif aims to be as transparent as possible in this regard by allowing for complete control of the threshold. For more details on PWMs, see the introductory vignette.
One way is to set a cutoff between 0 and 1, then multiplying the highest possible PWM score to get a threshold. The
matchPWM() function from the
Biostrings package for example uses a default of 0.8 (shown as
"80%"). This is quite arbitrary of course, and every motif will end up with a different threshold. For high information content motifs, there is really no right or wrong threshold; as they tend to have fewer non-specific positions. This means that incorrect letters in a match will be more punishing. To illustrate this, contrast the following PWMs:
library(universalmotif) m1 <- create_motif("TATATATATA", nsites = 50, type = "PWM", pseudocount = 1) m2 <- matrix(c(0.10,0.27,0.23,0.19,0.29,0.28,0.51,0.12,0.34,0.26, 0.36,0.29,0.51,0.38,0.23,0.16,0.17,0.21,0.23,0.36, 0.45,0.05,0.02,0.13,0.27,0.38,0.26,0.38,0.12,0.31, 0.09,0.40,0.24,0.30,0.21,0.19,0.05,0.30,0.31,0.08), byrow = TRUE, nrow = 4) m2 <- create_motif(m2, alphabet = "DNA", type = "PWM") m1["motif"] m2["motif"]
In the first example, sequences which do not have a matching base in every position are punished heavily. The maximum logodds score in this case is approximately 20, and for each incorrect position the score is reduced approximately by 5.7. This means that a threshold of zero would allow for at most three mismatches. At this point, it is up to you how many mismatches you would deem appropriate.
This thinking becomes impossible for the second example. In this case, mismatches are much less punishing, to the point that one could ask: what even constitutes a mismatch? The answer to this question is much more difficult in cases such as these. An alternative to manually deciding upon a threshold is to instead start with maximum P-value one would consider appropriate for a match. If, say, we want matches with a P-value of at most 0.001, then we can use
motif_pvalue() to calculate the appropriate threshold (see the comparisons and P-values vignette for details on motif P-values).
motif_pvalue(m2, pvalue = 0.001)
scan_sequences() function offers the ability to scan using the
multifreq slot, if available. This allows to take into account inter-positional dependencies, and get matches which more faithfully represent the original sequences from which the motif originated.
library(universalmotif) library(Biostrings) data(ArabidopsisPromoters) ## A 2-letter example: motif.k2 <- create_motif("CWWWWCC", nsites = 6) sequences.k2 <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3)) motif.k2 <- add_multifreq(motif.k2, sequences.k2)
head(scan_sequences(motif.k2, ArabidopsisPromoters, RC = TRUE, threshold = 0.9, threshold.type = "logodds"))
Using 2-letter information to scan:
head(scan_sequences(motif.k2, ArabidopsisPromoters, use.freq = 2, RC = TRUE, threshold = 0.9, threshold.type = "logodds"))
As an aside: the previous example involved calling
add_multifreq() separately. In this case however this could have been simplified to just calling
create_motif() and using the
library(universalmotif) library(Biostrings) sequences <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3)) motif <- create_motif(sequences, add.multifreq = 2:3)
universalmotif package offers the ability to search for enriched motif sites in a set of sequences via
enrich_motifs(). There is little complexity to this, as it simply runs
scan_sequences() twice: once on a set of target sequences, and once on a set of background sequences. After which the results between the two sequences are collated and run through enrichment tests. The background sequences can be given explicitly, or else
enrich_motifs() will create background sequences on its own by using
shuffle_sequences() on the target sequences.
Let us consider the following basic example:
library(universalmotif) data(ArabidopsisMotif) data(ArabidopsisPromoters) enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, shuffle.k = 3, threshold = 0.001, RC = TRUE)
Here we can see that the motif is significantly enriched in the target sequences. The
Pval was calculated by calling
One final point: always keep in mind the
threshold parameter, as this will ultimately decide the number of hits found. (A bad threshold can lead to a false negative.)
universalmotif class motifs can be gapped, which can be used by
enrich_motifs(). Note that gapped motif support is currently limited to these two functions. All other functions will ignore the gap information, and even discard them in functions such as
First, obtain the component motifs:
library(universalmotif) data(ArabidopsisPromoters) m1 <- create_motif("TTTATAT", name = "PartA") m2 <- create_motif("GGTTCGA", name = "PartB")
Then, combine them and add the desired gap. In this case, a gap will be added between the two motifs which can range in size from 4-6 bases.
m <- cbind(m1, m2) m <- add_gap(m, gaploc = ncol(m1), mingap = 4, maxgap = 6) m
Now, it can be used directly in
scan_sequences(m, ArabidopsisPromoters, threshold = 0.4, threshold.type = "logodds")
universalmotif package provides the
motif_peaks() function, which can test for positionally preferential motif sites in a set of sequences. This can be useful, for example, when trying to determine whether a certain transcription factor binding site is more often than not located at a certain distance from the transcription start site (TSS). The
motif_peaks() function finds density peaks in the input data, then creates a null distribution from randomly generated peaks to calculate peak P-values.
library(universalmotif) data(ArabidopsisMotif) data(ArabidopsisPromoters) hits <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters, RC = FALSE) res <- motif_peaks(hits$start, seq.length = unique(width(ArabidopsisPromoters)), seq.count = length(ArabidopsisPromoters)) ## Significant peaks: res$Peaks
Using the datasets provided in this package, a significant motif peak was found about 100 bases away from the TSS. If you'd like to simply know the locations of any peaks, this can be done by setting
max.p = 1.
The function can also output a plot:
In this plot, red dots are used to indicate density peaks and the blue line shows the P-value cutoff.
universalmotif package provides a simple wrapper to the powerful motif discovery tool
MEME [@meme3]. To run an analysis with
MEME, all that is required is a set of
XStringSet class sequences (defined in the
Biostrings package), and
run_meme() will take care of running the program and reading the output for use within
The first step is to check that R can find the
MEME binary in your
$PATH by running
run_meme() without any parameters. If successful, you should see the default
MEME help message in your console. If not, then you'll need to provide the complete path to the
MEME binary. There are two options:
library(universalmotif) ## 1. Once per session: via `options()` options(meme.bin = "/path/to/meme/bin/meme") run_meme(...) ## 2. Once per run: via `run_meme()` run_meme(..., bin = "/path/to/meme/bin/meme")
Now we need to get some sequences to use with
run_meme(). At this point we can read sequences from disk or extract them from one of the
library(universalmotif) data(ArabidopsisPromoters) ## 1. Read sequences from disk (in fasta format): library(Biostrings) # The following `read*()` functions are available in Biostrings: # DNA: readDNAStringSet # DNA with quality scores: readQualityScaledDNAStringSet # RNA: readRNAStringSet # Amino acid: readAAStringSet # Any: readBStringSet sequences <- readDNAStringSet("/path/to/sequences.fasta") run_meme(sequences, ...) ## 2. Extract from a `BSgenome` object: library(GenomicFeatures) library(TxDb.Athaliana.BioMart.plantsmart28) library(BSgenome.Athaliana.TAIR.TAIR9) # Let us retrieve the same promoter sequences from ArabidopsisPromoters: gene.names <- names(ArabidopsisPromoters) # First get the transcript coordinates from the relevant `TxDb` object: transcripts <- transcriptsBy(TxDb.Athaliana.BioMart.plantsmart28, by = "gene")[gene.names] # There are multiple transcripts per gene, we only care for the first one # in each: transcripts <- lapply(transcripts, function(x) x) transcripts <- unlist(GRangesList(transcripts)) # Then the actual sequences: # Unfortunately this is a case where the chromosome names do not match # between the two databases seqlevels(TxDb.Athaliana.BioMart.plantsmart28) #>  "1" "2" "3" "4" "5" "Mt" "Pt" seqlevels(BSgenome.Athaliana.TAIR.TAIR9) #>  "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC" # So we must first rename the chromosomes in `transcripts`: seqlevels(transcripts) <- seqlevels(BSgenome.Athaliana.TAIR.TAIR9) # Finally we can extract the sequences promoters <- getPromoterSeq(transcripts, BSgenome.Athaliana.TAIR.TAIR9, upstream = 1000, downstream = 0) run_meme(promoters, ...)
Once the sequences are ready, there are few important options to keep in mind. One is whether to conserve the output from
MEME. The default is not to, but this can be changed by setting the relevant option:
run_meme(sequences, output = "/path/to/desired/output/folder")
The second important option is the search function (
objfun). Some search functions such as the default
classic do not require a set of background sequences, whilst some do (such as
de). If you choose one of the latter, then you can either let
MEME create them for you (it will shuffle the target sequences) or you can provide them via the
Finally, choose how you'd like the data imported into
R. Once the
MEME program exits,
run_meme() will import the results into
read_meme(). At this point you can decide if you want just the motifs themselves (
readsites = FALSE) or if you'd like the original sequence sites as well (
readsites = TRUE, the default). Doing the latter gives you the option of generating higher order representations for the imported
MEME motifs as shown here:
motifs <- run_meme(sequences) motifs.k23 <- mapply(add_multifreq, motifs$motifs, motifs$sites)
There are a wealth of other
MEME options available, such as the number of desired motifs (
nmotifs), the width of desired motifs (
maxw), the search mode (
mod), assigning sequence weights (
weights), using a custom alphabet (
alph), and many others. See the output from
run_meme() for a brief description of the options, or visit the online manual for more details.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.