View source: R/compare_motifs.R
compare_motifs | R Documentation |
Compare motifs using one of several available metrics. See the "Motif comparisons and P-values" vignette for detailed information.
compare_motifs(motifs, compare.to, db.scores, use.freq = 1,
use.type = "PPM", method = "PCC", tryRC = TRUE, min.overlap = 6,
min.mean.ic = 0.25, min.position.ic = 0, relative_entropy = FALSE,
normalise.scores = FALSE, max.p = 0.01, max.e = 10, nthreads = 1,
score.strat = "a.mean", output.report, output.report.max.print = 10)
motifs | See |
compare.to | |
db.scores | |
use.freq | |
use.type | |
method | |
tryRC | |
min.overlap | |
min.mean.ic | |
min.position.ic | |
relative_entropy | |
normalise.scores | |
max.p | |
max.e | |
nthreads | |
score.strat | |
output.report | |
output.report.max.print | |
The following metrics are available:
Euclidean distance (EUCL) (Choi et al. 2004)
Weighted Euclidean distance (WEUCL)
Kullback-Leibler divergence (KL) (Kullback and Leibler 1951; Roepcke et al. 2005)
Hellinger distance (HELL) (Hellinger 1909)
Squared Euclidean distance (SEUCL)
Manhattan distance (MAN)
Pearson correlation coefficient (PCC)
Weighted Pearson correlation coefficient (WPCC)
Sandelin-Wasserman similarity (SW), or sum of squared distances (Sandelin and Wasserman 2004)
Average log-likelihood ratio (ALLR) (Wang and Stormo 2003)
Lower limit ALLR (ALLR_LL) (Mahony et al. 2007)
Bhattacharyya coefficient (BHAT) (Bhattacharyya 1943)
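To make the list above more concrete, here is a minimal base R sketch of several of these metrics applied to a single pair of motif columns. The formulas are the common textbook definitions (with a symmetrised KL), and the helper name column_scores and the example values are assumptions made for illustration; the package's internal implementation may use different scaling or weighting.

```r
# Two example PPM columns (probabilities of A, C, G, T); values are made up.
col1 <- c(A = 0.7, C = 0.1, G = 0.1, T = 0.1)
col2 <- c(A = 0.4, C = 0.3, G = 0.2, T = 0.1)

# Hypothetical helper using textbook per-column definitions; not necessarily
# identical to what compare_motifs() computes internally.
column_scores <- function(x, y) {
  c(
    EUCL  = sqrt(sum((x - y)^2)),                        # Euclidean distance
    SEUCL = sum((x - y)^2),                              # squared Euclidean distance
    MAN   = sum(abs(x - y)),                             # Manhattan distance
    HELL  = sqrt(sum((sqrt(x) - sqrt(y))^2)) / sqrt(2),  # Hellinger distance
    KL    = 0.5 * (sum(x * log(x / y)) + sum(y * log(y / x))),  # symmetrised KL
    PCC   = cor(x, y),                                   # Pearson correlation
    SW    = 2 - sum((x - y)^2),                          # Sandelin-Wasserman
    BHAT  = sum(sqrt(x * y))                             # Bhattacharyya coefficient
  )
}

scores <- column_scores(col1, col2)
round(scores, 3)
```

Comparing a column with itself gives zero for the distance-type metrics and the maximum value for the similarity-type metrics, which is a quick way to check which direction a given metric runs in.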
Comparisons are calculated between two motifs at a time. All possible alignments
are scored, and the best score is reported. Within an alignment, scores are
calculated individually between columns. How those column scores are combined to
generate the final alignment score depends on score.strat.
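The alignment procedure described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: best_alignment_pcc is a hypothetical helper, motifs are plain 4-row probability matrices, columns are scored with PCC, column scores are combined with an arithmetic mean (in the spirit of score.strat = "a.mean"), and only ungapped alignments are considered.

```r
# Hypothetical helper: score every ungapped alignment of two PPMs
# (rows = A, C, G, T; columns = positions) and return the best mean PCC.
# Note: cor() returns NA for perfectly flat columns, which this sketch
# does not handle.
best_alignment_pcc <- function(m1, m2, min.overlap = 2, try.rc = TRUE) {
  score_one <- function(a, b) {
    wa <- ncol(a); wb <- ncol(b)
    ov <- min(min.overlap, wa, wb)   # overlap cannot exceed the shorter motif
    best <- -Inf
    # offset: position of b's first column relative to a's first column
    for (offset in (ov - wb):(wa - ov)) {
      ia <- seq_len(wa)
      ib <- ia - offset              # column of b aligned to each column of a
      keep <- ib >= 1 & ib <= wb
      if (sum(keep) < ov) next
      per_col <- mapply(function(i, j) cor(a[, i], b[, j]), ia[keep], ib[keep])
      best <- max(best, mean(per_col))  # combine column scores by arithmetic mean
    }
    best
  }
  fwd <- score_one(m1, m2)
  if (!try.rc) return(fwd)
  # Reverse complement of a DNA PPM: reverse both row order and column order
  rc <- m2[nrow(m2):1, ncol(m2):1, drop = FALSE]
  max(fwd, score_one(m1, rc))
}

# Made-up example motif; each group of four values is one column (A, C, G, T).
m1 <- matrix(c(0.7, 0.1, 0.1, 0.1,
               0.1, 0.6, 0.2, 0.1,
               0.1, 0.1, 0.7, 0.1,
               0.2, 0.1, 0.1, 0.6,
               0.5, 0.3, 0.1, 0.1),
             nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

sub_score <- best_alignment_pcc(m1, m1[, 2:4])      # sub-motif of m1
m1_rc     <- m1[4:1, 5:1]                           # reverse complement of m1
rc_score  <- best_alignment_pcc(m1, m1_rc, try.rc = TRUE)
fwd_score <- best_alignment_pcc(m1, m1_rc, try.rc = FALSE)
```

Because the reverse-complement search only adds candidate alignments, the score with try.rc = TRUE can never be lower than the forward-only score.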
See the "Motif comparisons and P-values" vignette for a description of the
various metrics. Note that PCC, WPCC, SW, ALLR, ALLR_LL and BHAT are
similarities; higher values mean more similar motifs. For the remaining metrics,
values closer to zero represent more similar motifs.
Small pseudocounts are automatically added when one of the following methods
is used: KL, ALLR, ALLR_LL, IS. This is to avoid zeros in the calculations.
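The following base R sketch shows why zeros are a problem for a log-based metric such as KL, and how a small pseudocount avoids it. The helper names kl and add_pseudocount, the pseudocount size of 0.01, and the symmetrised form of KL are all assumptions made for illustration; the value and formula used internally by the package may differ.

```r
# Symmetrised Kullback-Leibler divergence between two probability vectors
# (textbook form, used here only for illustration).
kl <- function(x, y) 0.5 * (sum(x * log(x / y)) + sum(y * log(y / x)))

# Hypothetical helper: add a small pseudocount, then renormalise to sum to 1.
add_pseudocount <- function(x, pseudo = 0.01) {
  (x + pseudo) / sum(x + pseudo)
}

col1 <- c(A = 1.0, C = 0.0, G = 0.0, T = 0.0)
col2 <- c(A = 0.5, C = 0.5, G = 0.0, T = 0.0)

raw_kl      <- kl(col1, col2)  # not finite: zeros lead to log(0) and 0 / 0 terms
smoothed_kl <- kl(add_pseudocount(col1), add_pseudocount(col2))  # finite

is.finite(raw_kl)
is.finite(smoothed_kl)
```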
A note regarding P-values: P-values are precomputed using the make_DBscores()
function. If db.scores is not given, a set of internal P-values precomputed
from the JASPAR2018 CORE motifs is used. These precalculated scores depend on
the lengths of the motifs being compared. This takes into account that
comparing small motifs with larger motifs leads to higher scores, since the
probability of finding a high-scoring alignment is higher.
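The length effect described above can be illustrated with a small simulation in base R. This is a toy model, not the procedure used by make_DBscores(): alignment scores are replaced by independent uniform random draws, the helper names n_alignments and expected_best are hypothetical, and the alignment count assumes ungapped alignments with a fixed minimum overlap.

```r
set.seed(1)

# Number of ungapped alignments of two motifs of widths w1 and w2 with at
# least `min.overlap` overlapping columns (assumes both widths >= min.overlap).
n_alignments <- function(w1, w2, min.overlap = 6) {
  w1 + w2 - 2 * min.overlap + 1
}

# Stand-in for alignment scores: the best of k independent uniform draws,
# averaged over many repetitions. Purely illustrative.
expected_best <- function(k, reps = 5000) {
  mean(replicate(reps, max(runif(k))))
}

# A width-6 motif compared with motifs of width 6, 12 and 24.
k <- c(n_alignments(6, 6), n_alignments(6, 12), n_alignments(6, 24))
best <- sapply(k, expected_best)

data.frame(alignments = k, expected.best.score = best)
```

Even though every individual score comes from the same distribution, the expected best score rises with the number of candidate alignments, which is why a score threshold that ignores motif length would favour comparisons against longer motifs.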
The default P-values have been precalculated for regular DNA motifs. They
are of little use for motifs with a different number of alphabet letters
(or even the multifreq slot).
matrix if compare.to is missing; DataFrame otherwise. For the latter, function
args are stored in the metadata slot.
Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca
Bhattacharyya A (1943). “On a measure of divergence between two statistical populations defined by their probability distributions.” Bulletin of the Calcutta Mathematical Society, 35, 99-109.
Choi I, Kwon J, Kim S (2004). “Local feature frequency profile: a method to measure structural similarity in proteins.” PNAS, 101, 3797-3802.
Hellinger E (1909). “Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen.” Journal für die reine und angewandte Mathematik, 136, 210-271.
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260-D266.
Kullback S, Leibler RA (1951). “On information and sufficiency.” The Annals of Mathematical Statistics, 22, 79-86.
Itakura F, Saito S (1968). “Analysis synthesis telephony based on the maximum likelihood method.” In 6th International Congress on Acoustics, C-17.
Mahony S, Auron PE, Benos PV (2007). “DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies.” PLoS Computational Biology, 3.
Pietrokovski S (1996). “Searching databases of conserved sequence regions by aligning protein multiple-alignments.” Nucleic Acids Research, 24, 3836-3845.
Roepcke S, Grossmann S, Rahmann S, Vingron M (2005). “T-Reg Comparator: an analysis tool for the comparison of position weight matrices.” Nucleic Acids Research, 33, W438-W441.
Sandelin A, Wasserman WW (2004). “Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics.” Journal of Molecular Biology, 338, 207-215.
Wang T, Stormo GD (2003). “Combining phylogenetic data with co-regulated genes to identify motifs.” Bioinformatics, 19, 2369-2380.
convert_motifs(), motif_tree(), view_motifs(), make_DBscores()
motif1 <- create_motif(name = "1")
motif2 <- create_motif(name = "2")
motif1vs2 <- compare_motifs(c(motif1, motif2), method = "PCC")
## To get a dist object:
as.dist(1 - motif1vs2)
motif3 <- create_motif(name = "3")
motif4 <- create_motif(name = "4")
motifs <- c(motif1, motif2, motif3, motif4)
## Compare motif "2" to all the other motifs:
if (R.Version()$arch != "i386") {
  compare_motifs(motifs, compare.to = 2, max.p = 1, max.e = Inf)
}
## If you are working with a large list of motifs and the min.mean.ic
## option is not set to zero, you may get a number of failed comparisons
## due to low IC. To filter the list of motifs to avoid these, use
## the average_ic() function to remove motifs with low average IC:
## Not run:
library(MotifDb)
motifs <- convert_motifs(MotifDb)[1:100]
compare_motifs(motifs)
#> Warning in compare_motifs(motifs) :
#> Some comparisons failed due to low IC
motifs <- motifs[average_ic(motifs) > 0.5]
compare_motifs(motifs)
## End(Not run)