geteffect | R Documentation |
geteffect functions calculate the combined likelihood that the fold changes observed for all guides of a gene between two time points or conditions is smaller than expected at certain effect size(s).
geteffect_c( guides, t1, t0, normfun = "sum", normsubset, a, g, gw, gl = 11, subset, effectrange, output = "range", exactci = FALSE, semiexact = FALSE, ebcfun = "max", minprob = 10^-20 ) geteffect_r( guides, r, rse, a, g, gw, gl = 11, subset, effectrange, output = "range", exactci = FALSE, semiexact = FALSE, ebcfun = "max", minprob = 10^-20 )
guides |
Character vector. Guides are assumed to start with the gene name, followed by an underscore, followed by a number or sequence unique within that gene. |
t1 |
Integer vector. Read counts of the test arm |
t0 |
Integer vector. Read counts of the control arm |
normfun |
Character string. Specify with which function to standardize the data. Default = "sum" |
normsubset |
Integer vector. Specify the indices of features that function as controls. If omitted, all features are used. |
a |
Numeric. Estimated potential population doublings between time points. |
g |
Numeric vector. Specify representative guide efficacies. If omitted,
an exponentially decreasing set of guide efficacies with length |
gw |
Numeric vector. Specify the relative prevalence of each guide
efficacy provided with |
gl |
Numeric. Number of guide efficacies to test. Only used when
|
subset |
Integer vector. Specify the indices of features for which to return output. All features will be used for empirical Bayes correction. |
effectrange |
Numeric vector. Sequence of effect values to test. If omitted, a range will be calculated based on the highest and lowest rate ratios in the data set. |
output |
Character string. Specify which output to generate. Can be either "range", "exact", or "both". Default = "range" |
exactci |
Logical or numeric. Specify the confidence interval of the exact effect as a fraction between 0 and 1. Only applicable if exact values are calculated. If FALSE, confidence interval is omitted. Default = FALSE |
semiexact |
Logical. If TRUE, exact effect values are estimated based on the likelihoods of effect values flanking a log likelihood of 0 (or the respective log likelihood of the confidence interval). Note that a finer resolution of effect values than provided by default are recommended when applying this approach. Default = FALSE |
ebcfun |
Character string or logical. Specify the function that is used to select gene-wise rate ratios for the empirical Bayes correction. Suggested options are "max", "median", and "mean". Note that "max" selects the most extreme rate ratio, not the highest. If FALSE, no empirical Bayes correction is performed, which will generally lead to overestimation of negative effects. Default = "max" |
minprob |
Numeric. Cutoff point for the lowest considered likelihood. Default = 10^-20 |
r |
Numeric vector. Log2-transformed rate ratios |
rse |
Numeric vector. Standard error of the rate ratios |
geteffect uses all guides targeting a gene, and incorporates the
distribution of guide efficacies (given as parameters by the user) and
effect values (interpolated from the data). The process is as follows. For
a (predefined) effect value, the expected fold change is calculated at a
range of different guide efficacies. Next, the probability that the
observed fold change is smaller than the expected fold change is calculated
at each guide efficacy. This number is multiplied with the density of each
guide efficacy (i.e. the weight of each guide efficacy, specified by the
user or by default similar to the distribution used in
CRISPRsim
). This yields the probability that the fold change
observed for a specific guide is smaller than expected given the effect
value. The probability is converted to a log10 odds. The log-odds are
summed for each guide targeting the same gene, and the sum is divided by
the square root of the number of guides (see notes). Finally, the log-odds
are corrected for the prior probability of the effect value, which is
interpolated from the data using the ebcfun
. The best estimate of
effect value for a gene is when the probability of being lower or higher
than that effect value or equal, and therefore the log-odds are 0. Besides
the estimates at the predefined effect values, closer estimations for each
gene (or a subset of the data set) may be obtained in the form of exact or
semiexact values.
Raw count data can be used as input for the geteffect_c
function. If
replicates are present, the function geteffect_r
will take the
precalculated log2 rate ratio (i.e. log2 fold change) and the accompanying
standard error of each rate ratio as input. These values can be obtained by
using the rrep
function. These may also be derived from other
packages that can analyze high-throughput count data.
Returns the following (depending on input arguments):
range - data frame with log10 odds of effect per gene
exact - vector with effect estimates, or data frame with effect estimates including confidence interval
While the calculations on the ranged output are fully vectorized, the
exact output needs to be calculated one gene at a time. And three times
when a confidence interval is included. I have noticed that computation
time does not scale linearly after a certain number of genes, but takes
even longer (this seemed to happen around 500 genes in my case). I guess
that this might mean I ran out of memory, although I do not understand why
this would happen. Therefore, I would not recommend running exact
calculations for a whole genome data set. If you want "more exact" results
than given by the default range, set effectrange
yourself (e.g.
seq(-2, 1, by = 0.01)
). Use semiexact == TRUE
to get closer
approximations and unique effect values without crashing the computer. Note
that this does require the precalculation of a range!
Using CRISPRsim
to validate the performance of this function,
the exact effect values end up very close to the input effect values, but
especially genes that lack an efficacious guide are underestimated. Due to
this, the confidence intervals are a bit off. Specifically, the confidence
limits are too tight on the "more extreme than" side, meaning a gene might
actually have a higher probability of a stronger effect than the confidence
interval would suggest.
A minimum likelihood was introduced with the minprob
option to
prevent Inf and -Inf results. This may prevent errors and help plotting.
The sum of the log-odds is divided by the square root of the number of
guides. Therefore, it is not the odds of this particular gene having a more
negative effect value than the tested value, but the odds of any gene with
this many guides having a more negative effect value. The example code in
geneodds
visualizes the combination of odds in this fashion.
Jos B. Poell
rrep
, CRISPRsim
, odds2pq
ut1 <- CRISPRsim(100, 4, a = c(3,3), allseed = 100, t0seed = 10, repseed = 1, perfectseq = TRUE) ut2 <- CRISPRsim(100, 4, a = c(3,3), allseed = 100, t0seed = 20, repseed = 3, perfectseq = TRUE) ut3 <- CRISPRsim(100, 4, a = c(3,3), allseed = 100, t0seed = 30, repseed = 5, perfectseq = TRUE) cgi <- ut1$d > -0.05 & ut1$d < 0.025 df <- data.frame(guides = ut1$guides, T01 = ut1$t0, T02 = ut2$t0, T03 = ut3$t0, UT1 = ut1$t6, UT2 = ut2$t6, UT3 = ut3$t6, stringsAsFactors = FALSE) r <- rrep(t1 = df[,5:7], t0 = df[,2:4], normsubset = cgi) results <- geteffect_r(df$guides, r = r$r, rse = r$se, a = 6, output = "both", semiexact = TRUE) d <- rle(ut1$d)$values plot(d, results$exact, xlab = "real effect", ylab = "estimated effect")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.