library(knitr) knit_hooks$set(plot = function(x, options) { paste('<figure><img src="', opts_knit$get('base.url'), paste(x, collapse = '.'), '"><figcaption>', options$fig.cap, '</figcaption></figure>', sep = '') }) library(motifcounter) library(MotifDb) library(seqLogo) opts_chunk$set(fig.path="fig/")
This software package grew out of the work that I did to obtain my PhD. If it is of help for your analysis, please cite
@Manual{, title = {motifcounter: R package for analysing TFBSs in DNA sequences}, author = {Wolfgang Kopp}, year = {2017}, doi = {10.18129/B9.bioc.motifcounter} }
Details about the compound Poisson model are available under
@article{improvedcompound, title={An improved compound Poisson model for the number of motif hits in DNA sequences}, author={Kopp, Wolfgang and Vingron, Martin}, journal={Bioinformatics}, pages={btx539}, year={2017}, publisher={Oxford University Press} }
# Estimate a background model on a set of sequences bg <- readBackground(sequences, order)
# Normalize a given PFM new_motif <- normalizeMotif(motif)
# Evaluate the scores along a given sequence scores <- scoreSequence(sequence, motif, bg)
# Evaluate the motif hits along a given sequence hits <- motifHits(sequence, motif, bg)
# Evaluate the average score profile average_scores <- scoreProfile(sequences, motif, bg)
# Evaluate the average motif hit profile average_hits <- motifHitProfile(sequences, motif, bg)
# Compute the motif hit enrichment enrichment_result <- motifEnrichment(sequences, motif, bg)
Transcription factors (TFs) play a crucial role in gene regulation. They function by recognizing and binding to specific DNA stretches that are usually 5-30bp in length which are referred to as transcription factor binding sites (TFBSs). TF-binding acts on the neighboring genes by up- or down-regulating their gene expression levels.
The aim of the motifcounter
package is to provide statistical
tools for studying putative TFBSs in given DNA sequence, including
the presence and location of TFBSs and the enrichment of TFBSs.
motifcounter
The main ingredients for an analysis with motifcounter
consist of
A PFM represents the affinity of a TF to bind a certain DNA segment. A
large set of known PFMs can be acquired e.g. from
the MotifDb
package [@motifdb].
On the other hand,
the background model defines the properties of unbound DNA sequences.
motifcounter
implements the background model as an
order-$d$ Markov model, where $d$ is prescribed by the user.
The advantage of using higher-order background models is that they
are able to capture higher-order sequence features which is crucial
for studying naturally occurring DNA sequences (e.g. CpGs islands).
Using the PFM and the background model, motifcounter
computes
the motif score for a given DNA sequence,
which is defined to be the log-likelihood ratio
between the PFM and the background.
The motif score represents a measure that indicates whether a certain
position in the DNA sequence is bound or unbound by the TF. Intuitively,
the higher the score, the more like does the sequence represent a TFBS.
The motif scores are also used to determine
motif hits (e.g. putative TFBSs) in the DNA sequence.
To this end, motifcounter
uses a predetermined score threshold
and calls putative TFBSs whenever the observed score at a give
position is greater or equal to the score threshold.
motifcounter
establishes the score threshold
automatically based on 1) the
score distribution and 2) the user-prescribed false positive level $\alpha$.
To this end, the score distribution is determined by
an efficient dynamic programming algorithm for general order-$d$
background models. Details of the algorithm are described our paper (see above).
Testing for motif hit enrichment in motifcounter
is based
on the number of motif hits that are observed in a set of DNA sequences.
In order to be able to judge significance of the observed number of hits
(e.g. 10 predicted TFBSs in the sequence of length 10kb),
the package approximates the distribution of the number of motif hits
in random DNA sequences with matched lengths.
Accordingly, motifcounter
provides two fast and accurate alternatives
for approximating this distribution:
Both of these methods support higher-order background models and
account for the self-overlapping structure of the motif.
For example, a repeat-like word, e.g. 'AAAA', likely gives rise to a string of mutually overlapping hits which
are referred to as clumps [@reinert, @pape]^[
By contrast, a simple binomial approximation [@rsat1,@rahmann]
does not account for self-overlapping matches.].
motifcounter
not only account for overlapping motif hits
with respect to a single DNA strand, but also
for overlapping reverse complementary hits,
if both DNA strands are scanned for motif hits.
It is essential
to account for clumping, as that influences the distribution
of the number of motif hits and thereby the motif hit enrichment test.
Ignoring this effect could cause misleading statistical conclusions.
The background model is used to specify the properties of unbound DNA sequences. That is, it plays a role as a reference for identifying putative TFBSs as well as for judging motif hit enrichment.
motifcounter
offers the opportunity to use
order-$d$ Markov model with user-defined $d$.
The background model is estimated on a set of user-provided
DNA sequences which are supplied as
DNAStringSet
-objects from the Biostrings
Bioconductor package.
The following code fragment illustrates how an order-$1$ background model is estimated from a given set of DNA sequences:
order <- 1 file <- system.file("extdata", "seq.fasta", package = "motifcounter") seqs <- Biostrings::readDNAStringSet(file) bg <- readBackground(seqs, order)
motifcounter
handles motifs in terms of
position frequency matrices (PFMs), which are commonly
used to represent the binding affinity of transcription factors (TFs).
A convenient source of known motifs is the MotifDb
Bioconductor package [@motifdb],
which shall be the basis for our tutorial.
For example, we retrieve the motif for the human Pou5f1 (or Oct4)
transcription factor as follows
# Extract the Oct4 motif from MotifDb library(MotifDb) oct4 <- as.list(query(query(query(MotifDb, "hsapiens"), "pou5f1"), "jolma2013"))[[1]] motif <- oct4 # Visualize the motif using seqLogo library(seqLogo) seqLogo(motif)
By default, motifcounter
identifies TFBS with a
the false positive probability of $\alpha=0.001$.
The user might want to change
the stringency level of $\alpha$, which is facilitated by motifcounterOptions
:
alpha <- 0.01 motifcounterOptions(alpha)
For other options consult ?motifcounterOptions
.
For the following example, we explore the DNA sequences of a set of Oct4-ChIP-seq peaks that were obtained in human hESC by the ENCODE project [@encode2012]. The peak regions were trimmed to 200 bps centered around the midpoint.
file <- system.file("extdata", "oct4_chipseq.fa", package = "motifcounter") oct4peaks <- Biostrings::readDNAStringSet(file)
The motifcounter
package provides functions for exploring
position- and strand-specific putative TFBSs in individual DNA sequences.
One way to explore a given DNA sequence for TFBSs is by
utilizing scoreSequence
. This function returns the per position and
strand scores for a given Biostring::DNAString
-object
(left panel below).
To put the observed scores into perspective, the right panel shows
the theoretical score distribution in random sequences, which
is obtained by scoreDist
^[The score distribution is computed using an
efficient dynamic programming algorithm.]. Scores at the tail of
the distribution occur very rarely by chance.
Those are also the ones that give rise to TFBS predictions:
# Determine the per-position and per-strand scores scores <- scoreSequence(oct4peaks[[1]], motif, bg) # As a comparison, compute the theoretical score distribution sd <- scoreDist(motif, bg) par(mfrow = c(1, 2)) # Plot the observed scores, per position and per strand plot(1:length(scores$fscores), scores$fscores, type = "l", col = "blue", xlab = "position", ylab = "score", ylim = c(min(sd$score), max(sd$score)), xlim = c(1, 250)) points(scores$rscores, col = "red", type = "l") legend("topright", c("forw.", "rev."), col = c("blue", "red"), lty = c(1, 1)) # Plot the theoretical score distribution for the comparison plot(sd$dist, sd$scores, type = "l", xlab = "probability", ylab = "")
To obtain the predicted TFBSs positions and strands, motifcounter
provides
the function motifHits
. This function calls motif hits
if the observed score exceeds a pre-determined score
threshold^[The threshold is determined for a user-defined false
positive level $\alpha$ (e.g. $\alpha=0.001$) based on
the theoretical score distribution.].
# Call putative TFBSs mhits <- motifHits(oct4peaks[[1]], motif, bg) # Inspect the result fhitpos <- which(mhits$fhits == 1) rhitpos <- which(mhits$rhits == 1) fhitpos rhitpos
In the example sequence, we obtain no motif hit on the forward strand and one motif hit on the reverse strand at position 94. The underlying DNA sequence at this hit can be retrieved by
oct4peaks[[1]][rhitpos:(rhitpos + ncol(motif) - 1)]
Next, we illustrate how a relaxed stringency level influences
the number of motif hits. Using motifcounterOptions
, we prescribe a
false positive probability of $\alpha=0.01$ (the default was $\alpha=0.001$).
This will increase the tendency of producing motif hits
# Prescribe a new false positive level for calling TFBSs motifcounterOptions(alpha = 0.01) # Determine TFBSs mhits <- motifHits(oct4peaks[[1]], motif, bg) fhitpos <- which(mhits$fhits == 1) rhitpos <- which(mhits$rhits == 1) fhitpos rhitpos
Now we obtain four hits on each strand.
While, scoreSequence
and motifHits
can be applied to study
TFBSs in a single DNA sequence (given by a DNAString
-object),
one might also be interested in the
average score or motif hit profiles across multiple sequences of equal length.
This might reveal positional constraints of the motif occurrences
with respect to e.g. the TSS, or the summit of ChIP-seq peaks.
On the one hand,
motifcounter
provides the method scoreProfile
which can be applied for Biostrings::DNAStringSet
-objects.
# Determine the average score profile across all Oct4 binding sites scores <- scoreProfile(oct4peaks, motif, bg) plot(1:length(scores$fscores), scores$fscores, type = "l", col = "blue", xlab = "position", ylab = "score") points(scores$rscores, col = "red", type = "l") legend("bottomleft", legend = c("forward", "reverse"), col = c("blue", "red"), lty = c(1, 1))
On the other hand, motifHitProfile
constructs a similar profile by computing
the position and strand specific mean motif hit frequency
motifcounterOptions() # let's use the default alpha=0.001 again # Determine the average motif hit profile mhits <- motifHitProfile(oct4peaks, motif, bg) plot(1:length(mhits$fhits), mhits$fhits, type = "l", col = "blue", xlab = "position", ylab = "score") points(mhits$rhits, col = "red", type = "l") legend("bottomleft", legend = c("forward", "reverse"), col = c("blue", "red"), lty = c(1, 1))
A central feature of motifcounter
represents a sophisticated
novel approach for identifying motif hit enrichment in DNA sequences.
To this end, the package contains the method motifEnrichment
,
which evaluates the P-value associated with the number of motif hits
that are found in the observed sequence, compared to the background model.
# Enrichment of Oct4 in Oct4-ChIP-seq peaks result <- motifEnrichment(oct4peaks[1:10], motif, bg) result
The method returns a list that contains
pvalue
as well as fold
. While, the pvalue
represents
the probability that more or equally many hits are produced in
random DNA sequences, fold
represents the fold-enrichment of
motif hits relative the
expected number of hits in random DNA sequences.
That is, it represents a measure of the effect size.
motifEnrichment
scans both DNA strands
for motif hits and draws its statistical conclusions based
on the compound Poisson model. However, motif enrichment can
also be performed with respect to scanning
single strands (e.g. when analyzing RNA sequences).
Please consult ?motifEnrichment
for the single strand option.
motifEnrichment
may optionally invoke two alternative approaches for
approximating the P-value, 1) by a compound Poisson approxmiation
and 2) by a combinatorial approximation (see ?motifEnrichment
).
As a rule of thumb, we recommend the use
compound Poisson model for studying long (or many ) DNA sequences with
a fairly stringent $\alpha$ (e.g. 0.001 or smaller).
On the other hand, if a relaxed $\alpha$ is
desired for your analysis (e.g $\alpha\geq 0.01$),
the combinatorial approximation is likely to give more accurate results.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.