motifbreakR | R Documentation |
Predict The Disruptiveness Of Single Nucleotide Polymorphisms On Transcription Factor Binding Sites.
motifbreakR(
snpList,
pwmList,
threshold = 0.85,
filterp = FALSE,
method = "default",
show.neutral = FALSE,
verbose = FALSE,
bkg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25),
BPPARAM = bpparam()
)
snpList |
The output of |
pwmList |
An object of class |
threshold |
Numeric; the maximum p-value for a match to be called or a minimum score threshold |
filterp |
Logical; filter by p-value instead of by pct score. |
method |
Character; one of |
show.neutral |
Logical; include neutral changes in the output |
verbose |
Logical; if running serially, show verbose messages |
bkg |
Numeric Vector; the background probabilites of the nucleotides
used with method= |
BPPARAM |
a BiocParallel object see |
motifbreakR works with position probability matrices (PPM). PPM
are derived as the fractional occurrence of nucleotides A,C,G, and T at
each position of a position frequency matrix (PFM). PFM are simply the
tally of each nucleotide at each position across a set of aligned
sequences. With a PPM, one can generate probabilities based on the
genome, or more practically, create any number of position specific
scoring matrices (PSSM) based on the principle that the PPM contains
information about the likelihood of observing a particular nucleotide at
a particular position of a true transcription factor binding site. What
follows is a discussion of the three different algorithms that may be
employed in calls to the motifbreakR function via the method
argument.
Suppose we have a frequency matrix M
of width n
(i.e. a
PPM as described above). Furthermore, we have a sequence s
also of
length n
, such that
s_{i} \in \{ A,T,C,G \}, i = 1,\ldots n
.
Each column of
M
contains the frequencies of each letter in each position.
Commonly in the literature sequences are scored as the sum of log probabilities:
Equation 1
F( s,M ) = \sum_{i = 1}^{n}{\log( \frac{M_{s_{i},i}}{b_{s_{i}}} )}
where b_{s_{i}}
is the background frequency of letter s_{i}
in
the genome of interest. This method can be specified by the user as
method='log'
.
As an alternative to this method, we introduced a scoring method to
directly weight the score by the importance of the position within the
match sequence. This method of weighting is accessed by specifying
method='ic'
(information content). A general representation
of this scoring method is given by:
Equation 2
F( s,M ) = p_{s} \cdot \omega_{M}
where p_{s}
is the scoring vector derived from sequence s
and matrix
M
, and w_{M}
is a weight vector derived from M
. First, we
compute the scoring vector of position scores p
Equation 3
p_{s} = ( M_{s_{i},i} ) \textrm{\ \ \ where\ \ \ } \frac{i = 1,\ldots n}{s_{i} \in \{ A,C,G,T \}}
and second, for each M
a constant vector of weights
\omega_{M} = ( \omega_{1},\omega_{2},\ldots,\omega_{n} )
.
There are two methods for producing \omega_{M}
. The first, which we
call weighted sum, is the difference in the probabilities for the two
letters of the polymorphism (or variant), i.e.
\Delta p_{s_{i}}
, or the difference of the maximum and minimum
values for each column of M
:
Equation 4.1
\omega_{i} = \max \{ M_{i} \} - \min \{ M_{i} \}\textrm{\ \ \ \ where\ \ \ \ \ \ }i = 1,\ldots n
The second variation of this theme is to weight by relative entropy.
Thus the relative entropy weight for each column i
of the matrix is
given by:
Equation 4.2
\omega_{i} = \sum_{j \in \{ A,C,G,T \}}^{}{M_{j,i}\log_2( \frac{M_{j,i}}{b_{i}} )}\textrm{\ \ \ \ \ where\ \ \ \ \ }i = 1,\ldots n
where b_{i}
is again the background frequency of the letter i
.
Thus, there are 3 possible algorithms to apply via the method
argument. The first is the standard summation of log probabilities
(method='log'
). The second and third are the weighted sum and
information content methods (method='default'
and method='ic'
) specified by
equations 4.1 and 4.2, respectively. motifbreakR assumes a
uniform background nucleotide distribution (b
) in equations 1 and
4.2 unless otherwise specified by the user. Since we are primarily
interested in the difference between alleles, background frequency is
not a major factor, although it can change the results. Additionally,
inclusion of background frequency introduces potential bias when
collections of motifs are employed, since motifs are themselves
unbalanced with respect to nucleotide composition. With these cautions
in mind, users may override the uniform distribution if so desired. For
all three methods, motifbreakR scores and reports the reference
and alternate alleles of the sequence
(F( s_{\textsc{ref}},M )
and
F( s_{\textsc{alt}},M )
), and provides the matrix scores
p_{s_{\textsc{ref}}}
and p_{s_{\textsc{alt}}}
of the SNP (or
variant). The scores are scaled as a fraction of scoring range 0-1 of
the motif matrix, M
. If either of
F( s_{\textsc{ref}},M )
and
F( s_{\textsc{alt}},M )
is greater than a user-specified
threshold (default value of 0.85) the SNP is reported. By default
motifbreakR does not display neutral effects,
(\Delta p_{i} < 0.4
) but this behaviour can be
overridden.
Additionally, now, with the use of TFMPvalue-package
, we may filter by p-value of the match.
This is unfortunately a two step process. First, by invoking filterp=TRUE
and setting a threshold at
a desired p-value e.g 1e-4, we perform a rough filter on the results by rounding all values in the PWM to two
decimal place, and calculating a scoring threshold based upon that. The second step is to use the function calculatePvalue()
on a selection of results which will change the Refpvalue
and Altpvalue
columns in the output from NA
to the p-value
calculated by TFMsc2pv
. This can be (although not always) a very memory and time intensive process if the algorithm doesn't converge rapidly.
a GRanges object containing:
REF |
the reference allele for the variant |
ALT |
the alternate allele for the variant |
snpPos |
the coordinates of the variant |
motifPos |
The position of the motif relative the the variant |
geneSymbol |
the geneSymbol corresponding to the TF of the TF binding motif |
dataSource |
the source of the TF binding motif |
providerName , providerId |
the name and id provided by the source |
seqMatch |
the sequence on the 5' -> 3' direction of the "+" strand that corresponds to DNA at the position that the TF binding motif was found. |
pctRef |
The score as determined by the scoring method, when the sequence contains the reference variant allele, normalized to a scale from 0 - 1. If |
pctAlt |
The score as determined by the scoring method, when the sequence contains the alternate variant allele, normalized to a scale from 0 - 1. If |
scoreRef |
The score as determined by the scoring method, when the sequence contains the reference variant allele |
scoreAlt |
The score as determined by the scoring method, when the sequence contains the alternate variant allele |
Refpvalue |
p-value for the match for the pctRef score, initially set to |
Altpvalue |
p-value for the match for the pctAlt score, initially set to |
altPos |
the position, relative to the reference allele, of the alternate allele |
alleleDiff |
The difference between the score on the reference allele and the score on the alternate allele |
alleleEffectSize |
The ratio of the |
effect |
one of weak, strong, or neutral indicating the strength of the effect. |
each SNP in this object may be plotted with plotMB
See snps.from.rsid
and snps.from.file
for
information about how to generate the input to this function and
plotMB
for information on how to visualize it's output
library(BSgenome.Hsapiens.UCSC.hg19)
# prepare variants
load(system.file("extdata",
"pca.enhancer.snps.rda",
package = "motifbreakR")) # loads snps.mb
pca.enhancer.snps <- sample(snps.mb, 20)
# Get motifs to interrogate
data(hocomoco)
motifs <- sample(hocomoco, 50)
# run motifbreakR
results <- motifbreakR(pca.enhancer.snps,
motifs, threshold = 0.85,
method = "ic",
BPPARAM=BiocParallel::SerialParam())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.