geteffect: Approximate effect of gene knockout on growth

geteffectR Documentation

Approximate effect of gene knockout on growth

Description

geteffect functions calculate the combined likelihood that the fold changes observed for all guides of a gene between two time points or conditions is smaller than expected at certain effect size(s).

Usage

geteffect_c(
  guides,
  t1,
  t0,
  normfun = "sum",
  normsubset,
  a,
  g,
  gw,
  gl = 11,
  subset,
  effectrange,
  output = "range",
  exactci = FALSE,
  semiexact = FALSE,
  ebcfun = "max",
  minprob = 10^-20
)

geteffect_r(
  guides,
  r,
  rse,
  a,
  g,
  gw,
  gl = 11,
  subset,
  effectrange,
  output = "range",
  exactci = FALSE,
  semiexact = FALSE,
  ebcfun = "max",
  minprob = 10^-20
)

Arguments

guides

Character vector. Guides are assumed to start with the gene name, followed by an underscore, followed by a number or sequence unique within that gene.

t1

Integer vector. Read counts of the test arm

t0

Integer vector. Read counts of the control arm

normfun

Character string. Specify with which function to standardize the data. Default = "sum"

normsubset

Integer vector. Specify the indices of features that function as controls. If omitted, all features are used.

a

Numeric. Estimated potential population doublings between time points.

g

Numeric vector. Specify representative guide efficacies. If omitted, an exponentially decreasing set of guide efficacies with length gl will be created.

gw

Numeric vector. Specify the relative prevalence of each guide efficacy provided with g. If omitted, a top-heavy distribution will be created to roughly represent guide efficacy distribution in CRISPRsim

gl

Numeric. Number of guide efficacies to test. Only used when g is not specified, in which case it needs to be at least 4. Default = 11

subset

Integer vector. Specify the indices of features for which to return output. All features will be used for empirical Bayes correction.

effectrange

Numeric vector. Sequence of effect values to test. If omitted, a range will be calculated based on the highest and lowest rate ratios in the data set.

output

Character string. Specify which output to generate. Can be either "range", "exact", or "both". Default = "range"

exactci

Logical or numeric. Specify the confidence interval of the exact effect as a fraction between 0 and 1. Only applicable if exact values are calculated. If FALSE, confidence interval is omitted. Default = FALSE

semiexact

Logical. If TRUE, exact effect values are estimated based on the likelihoods of effect values flanking a log likelihood of 0 (or the respective log likelihood of the confidence interval). Note that a finer resolution of effect values than provided by default are recommended when applying this approach. Default = FALSE

ebcfun

Character string or logical. Specify the function that is used to select gene-wise rate ratios for the empirical Bayes correction. Suggested options are "max", "median", and "mean". Note that "max" selects the most extreme rate ratio, not the highest. If FALSE, no empirical Bayes correction is performed, which will generally lead to overestimation of negative effects. Default = "max"

minprob

Numeric. Cutoff point for the lowest considered likelihood. Default = 10^-20

r

Numeric vector. Log2-transformed rate ratios

rse

Numeric vector. Standard error of the rate ratios

Details

geteffect uses all guides targeting a gene, and incorporates the distribution of guide efficacies (given as parameters by the user) and effect values (interpolated from the data). The process is as follows. For a (predefined) effect value, the expected fold change is calculated at a range of different guide efficacies. Next, the probability that the observed fold change is smaller than the expected fold change is calculated at each guide efficacy. This number is multiplied with the density of each guide efficacy (i.e. the weight of each guide efficacy, specified by the user or by default similar to the distribution used in CRISPRsim). This yields the probability that the fold change observed for a specific guide is smaller than expected given the effect value. The probability is converted to a log10 odds. The log-odds are summed for each guide targeting the same gene, and the sum is divided by the square root of the number of guides (see notes). Finally, the log-odds are corrected for the prior probability of the effect value, which is interpolated from the data using the ebcfun. The best estimate of effect value for a gene is when the probability of being lower or higher than that effect value or equal, and therefore the log-odds are 0. Besides the estimates at the predefined effect values, closer estimations for each gene (or a subset of the data set) may be obtained in the form of exact or semiexact values.

Raw count data can be used as input for the geteffect_c function. If replicates are present, the function geteffect_r will take the precalculated log2 rate ratio (i.e. log2 fold change) and the accompanying standard error of each rate ratio as input. These values can be obtained by using the rrep function. These may also be derived from other packages that can analyze high-throughput count data.

Value

Returns the following (depending on input arguments):

  • range - data frame with log10 odds of effect per gene

  • exact - vector with effect estimates, or data frame with effect estimates including confidence interval

Note

While the calculations on the ranged output are fully vectorized, the exact output needs to be calculated one gene at a time. And three times when a confidence interval is included. I have noticed that computation time does not scale linearly after a certain number of genes, but takes even longer (this seemed to happen around 500 genes in my case). I guess that this might mean I ran out of memory, although I do not understand why this would happen. Therefore, I would not recommend running exact calculations for a whole genome data set. If you want "more exact" results than given by the default range, set effectrange yourself (e.g. seq(-2, 1, by = 0.01)). Use semiexact == TRUE to get closer approximations and unique effect values without crashing the computer. Note that this does require the precalculation of a range!

Using CRISPRsim to validate the performance of this function, the exact effect values end up very close to the input effect values, but especially genes that lack an efficacious guide are underestimated. Due to this, the confidence intervals are a bit off. Specifically, the confidence limits are too tight on the "more extreme than" side, meaning a gene might actually have a higher probability of a stronger effect than the confidence interval would suggest.

A minimum likelihood was introduced with the minprob option to prevent Inf and -Inf results. This may prevent errors and help plotting.

The sum of the log-odds is divided by the square root of the number of guides. Therefore, it is not the odds of this particular gene having a more negative effect value than the tested value, but the odds of any gene with this many guides having a more negative effect value. The example code in geneodds visualizes the combination of odds in this fashion.

Author(s)

Jos B. Poell

See Also

rrep, CRISPRsim, odds2pq

Examples

ut1 <- CRISPRsim(100, 4, a = c(3,3), allseed = 100, t0seed = 10,
                 repseed = 1, perfectseq = TRUE)
ut2 <- CRISPRsim(100, 4, a = c(3,3), allseed = 100, t0seed = 20,
                 repseed = 3, perfectseq = TRUE)
ut3 <- CRISPRsim(100, 4, a = c(3,3), allseed = 100, t0seed = 30,
                 repseed = 5, perfectseq = TRUE)
cgi <- ut1$d > -0.05 & ut1$d < 0.025
df <- data.frame(guides = ut1$guides,
                 T01 = ut1$t0, T02 = ut2$t0, T03 = ut3$t0,
                 UT1 = ut1$t6, UT2 = ut2$t6, UT3 = ut3$t6,
                 stringsAsFactors = FALSE)
r <- rrep(t1 = df[,5:7], t0 = df[,2:4], normsubset = cgi)
results <- geteffect_r(df$guides, r = r$r, rse = r$se, a = 6,
                       output = "both", semiexact = TRUE)
d <- rle(ut1$d)$values
plot(d, results$exact, xlab = "real effect",
     ylab = "estimated effect")


tgac-vumc/CSSA documentation built on Oct. 10, 2022, 7:27 p.m.