references:
- id: rahmann_power_2003
  type: article-journal
  title: On the power of profiles for transcription factor binding site detection
  container-title: Statistical Applications in Genetics and Molecular Biology
  author:
  - family: Rahmann
    given: Sven
  - family: Müller
    given: Tobias
  - family: Vingron
    given: Martin
  issued:
    date-parts:
    - - 2003
  volume: '2'
  page: Article7
  DOI: 10.2202/1544-6115.1032
  ISSN: '1544-6115'
Position count matrices describe empirical nucleotide frequencies in short sequence motifs in biological sequences, usually DNA. They are often used to describe transcription factor binding sites, and can be employed to predict binding sites in collections of sequences of interest. For this, log-odds scores for matches to a count matrix are typically computed for each possible location on the sequences of interest. This can be done, for example, with the matchPWM function from the Biostrings package. How does one decide whether a score is large? Often, cutoffs such as 80% of the maximal value are employed. However, more precise cutoffs may be desirable: for example, a cutoff corresponding to a 1% probability of one or more false hits in a given sequence (false discovery rate), or a cutoff corresponding to a 90% probability of discovering a sequence generated by the probabilities inherent in the count matrix, if present in a sequence. @rahmann_power_2003 addressed this problem. This package brings their methodology to R/Bioconductor.
This vignette describes the capabilities of the profileScoreDist package. The package performs three main tasks, each described in the article by Rahmann et al.
1. regularizeMatrix() adds pseudocounts to a position count matrix. It does so using a particularly careful method: pseudocounts are drawn to match the overall nucleotide distribution of the count matrix, but are weighted for each position. The weights are smaller the more the count distribution for a given position deviates from the overall nucleotide distribution. In this way, fewer pseudocounts are added for positions with distinct signals. See the article by Rahmann et al. for more information.

2. computeScoreDist() computes probability distributions for the standard log-odds matrix scores, for nucleotide sequences assumed to be either signal (generated by the probabilities inherent in the count matrix) or background.

3. scoreDistCutoffs() estimates score cutoffs for specified FDRs, FNRs, or a tradeoff between these two, from the distributions and scores calculated in step 2. The tradeoff is specified as the point where $c \times$FDR = FNR; the user specifies the parameter $c$.

Step 2 is the computational bottleneck, due to the discrete convolution algorithm involved. The algorithm was therefore implemented in C++ and should be relatively optimized: the performance penalty imposed by R will be insignificant.
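The discrete-convolution idea behind step 2 can be sketched in a few lines of base R: starting with all probability mass on score zero, the distribution of the accumulated score is convolved with one motif position at a time. The per-position scores and the uniform background below are toy values, and this is only the general technique, not the package's actual C++ implementation.

```r
## Toy per-position log-odds scores on a discrete grid (rows: A, C, G, T)
position.scores <- matrix(c( 1.0, -0.5,
                            -0.5,  1.0,
                            -1.0,  0.5,
                             0.5, -1.0),
                          nrow = 4, byrow = TRUE)
background <- rep(0.25, 4)  # assumed background nucleotide probabilities

## start with all mass on score 0, then convolve in one position at a time
dist <- data.frame(score = 0, prob = 1)
for (j in seq_len(ncol(position.scores))) {
  combo <- expand.grid(k = seq_len(nrow(dist)), b = 1:4)
  s <- dist$score[combo$k] + position.scores[combo$b, j]
  p <- dist$prob[combo$k] * background[combo$b]
  ## sum the probabilities of all paths leading to the same total score
  dist <- aggregate(list(prob = p), list(score = s), sum)
}
dist
```

Because scores are discretized to a grid, the number of distinct totals stays manageable even as positions accumulate, which is what makes the exact convolution feasible.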
To use the package and interpret the results properly, the user is assumed to have read the article by @rahmann_power_2003.
A typical workflow would be to first regularize a position count matrix with regularizeMatrix(). Then, the distributions for the regularized matrix would be computed for an ensemble of GC contents with computeScoreDist(). Finally, score cutoffs for a given FDR (or FNR, or a tradeoff between the two) would be calculated with scoreDistCutoffs().
The following example code applies these three steps to reproduce the results reported by Rahmann et al. for the INR (or cap) count matrix.
We load the package:
library(profileScoreDist)
The cap signal (or INR element) was analyzed by Rahmann et al. For convenience, this public domain count matrix is included in the package.
data(INR)
INR
This matrix is regularized:
inr.reg <- regularizeMatrix(INR)
inr.reg
The weak C signal at position 6 gets more pseudocounts than the strong signal at position 2, as expected.
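This qualitative behavior, fewer pseudocounts for positions with distinct signals, can be mimicked with a generic base-R sketch. The Euclidean-distance weight below is a simple stand-in, not the exact weighting regularizeMatrix() uses (see Rahmann et al. for that), and the count matrix is made up.

```r
## Toy count matrix (columns are motif positions)
counts <- matrix(c(2, 9, 0, 1,
                   1, 0, 1, 6,
                   4, 1, 8, 2,
                   3, 0, 1, 1),
                 nrow = 4,
                 dimnames = list(c("A", "C", "G", "T"), NULL))

## overall nucleotide distribution of the matrix
overall <- rowSums(counts) / sum(counts)

## per-position weight: smaller the more the position's frequencies
## deviate from the overall distribution (Euclidean distance stand-in)
colfreq <- sweep(counts, 2, colSums(counts), "/")
d <- sqrt(colSums((colfreq - overall)^2))
w <- 1 - d / sqrt(2)

## pseudocounts match the overall distribution, scaled per position
reg <- counts + outer(overall, w)
round(reg, 2)
```

Here position 3, whose frequencies are closest to the overall distribution, receives the largest pseudocounts, while the more distinct positions receive fewer.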
Probability distributions will be computed for discrete values of the possible log-odds scores. The granularity (intervals between discrete scores for which probabilities are computed) must be specified. Further, probability distributions should in practice be computed for different GC contents corresponding to the different sequences of interest. Here, we will compute distributions for 1%, 2%, up to 99% GC content.
granularity <- 0.05
gcgran <- 0.01
gcmin <- 0.01
gcmax <- 0.99
## gc fractions to consider
gcs <- seq(gcgran*round(gcmin/gcgran), gcgran*round(gcmax/gcgran), gcgran)
The distributions are then computed for each GC content:
## compute probability distributions distlist <- lapply(gcs, function(x) computeScoreDist(inr.reg, x, granularity))
The profile score distribution is returned as a profileScore object. This object contains the signal and background distributions, as well as the scores. These are accessible with the signalDist, backgroundDist, and score methods, respectively.
A plot of the results for GC content 50% neatly reproduces Figure 1 in the article by Rahmann et al.
distlist[[50]]
plotDist(distlist[[50]])
With the distributions for the possible scores, we may compute score cutoffs resulting in a 5% FDR or a 5% FNR. We also obtain the cutoff where FDR = FNR (c = 1):
ab5 <- scoreDistCutoffs(distlist[[50]], 500, 1, c=1, 0.05)
## 5% FDR
ab5$cutoffa
## 5% FNR
ab5$cutoffb
## FDR = FNR
ab5$cutoffopt
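The $c \times$FDR = FNR criterion behind the optimal cutoff can be illustrated with toy distributions. The normal-shaped signal and background distributions on an integer score grid below are made up for illustration; the package computes the real distributions from the count matrix.

```r
## Toy discretized score distributions for background and signal
grid <- seq(-3, 3, by = 1)
bg  <- dnorm(grid, mean = -1); bg  <- bg / sum(bg)   # background scores
sig <- dnorm(grid, mean =  1); sig <- sig / sum(sig) # signal scores
c.par <- 2  # tradeoff parameter c

## for each candidate cutoff t on the grid:
## alpha = P(score >= t | background), beta = P(score < t | signal)
alpha <- rev(cumsum(rev(bg)))
beta  <- c(0, cumsum(sig)[-length(sig)])

## pick the grid point where c * alpha is closest to beta
cutoff <- grid[which.min(abs(c.par * alpha - beta))]
cutoff
```

Raising $c$ penalizes false hits more, pushing the cutoff toward the high-score tail of the background distribution.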
These cutoffs may then be employed in downstream analysis.