sortingsim: Simulate a selection-based CRISPR-Cas9 pooled screen
In tgac-vumc/CSSA: Analysis and Simulation Tools for CRISPR-Cas9 Pooled Screens

sortingsim

R Documentation

Description

sortingsim simulates a selection-based CRISPR-Cas9 pooled screen with user-defined parameters. It is in effect a simplified version of CRISPRsim, because growth and passaging are not simulated: a number of cells enter an assay, and the selected and unselected cells are collected for sequencing. The function was mainly created with FACS-based screens in mind, but other selections can be modeled as well, and drug treatment is still an option. As in CRISPRsim, "infected" cells either have a successful knockout (with its associated effect) or not, depending on guide efficacy. Cells are then sorted, which in this simulator means that all cells with a certain knockout have a base probability of being positively selected, modified by a gene-specific score.

Note that this function takes the number of sorted cells as input, of which only a fraction is selected. This fraction is roughly equal to baseprob. The output of this function is a data frame that contains the guide-relevant parameters and the sequencing coverage per guide for the selected and unselected arms. Simulated screens can help researchers design their experimental setup, and they offer a platform for evaluating analysis methods for sorting-based pooled gene knockout screens.

Usage

sortingsim(
  genes,
  guides,
  g,
  f,
  d,
  e,
  baseprob = 0.1,
  hitfraction = 1/200,
  hitsup,
  hitfactor = 10,
  efraction,
  eup,
  efactor,
  sortedcells,
  seqdepth,
  offtargets = FALSE,
  allseed = NULL,
  gseed,
  fseed,
  dseed,
  eseed,
  oseed,
  t0seed,
  repseed,
  perfectsampling = FALSE,
  perfectseq,
  returnall = FALSE,
  outputfile
)

Arguments

`genes`	Single integer or character vector. Specify how many or which genes to include in the experiment respectively. Not required when a full list of guides is given.
`guides`	Single integer, integer vector or character vector. In case of a single integer, specify by how many guides each gene is represented. In case of an integer vector, specify per gene by how many guides it is represented. In case of a character vector, guides are assumed to contain a gene name, followed by an underscore, followed by an identifier within that gene (e.g. a number or a nucleotide sequence).
`g`	Numeric vector. Specify guide efficacies per guide. If omitted, guide efficacies will be sampled from a representative distribution.
`f`	Integer vector. Specify guide abundance at time of infection per guide. If omitted, guide abundance will be sampled from a representative distribution.
`d`	Numeric vector. Specify gene-specific modifier of probability of selection. This is a modifier of the odds. If omitted, effect of gene knockout will be sampled from three distributions, depending on base probability and hit factor. If the length of the vector does not match the number of genes, values will be randomly sampled from the specified distribution!
`e`	Numeric vector. Specify treatment-specific selection effect per gene. If omitted, effects will be sampled from a representative distribution. If the length of the vector does not match the number of genes, values will be randomly sampled from the specified distribution!
`baseprob`	Numeric. Baseline probability of selection. Needs to be larger than 0 and smaller than 1. Default = 0.1
`hitfraction`	Numeric. Fraction of genes that affect selection significantly. Default = 1/200
`hitsup`	Numeric. Fraction of hits of which the selection probability is multiplied by `hitfactor`. The selection probability of the other (down) hits is divided by `hitfactor`. Defaults to `1-baseprob`
`hitfactor`	Numeric. Multiplication factor with which hits affect selection probability on average. Default = 10
`efraction`	Numeric. Fraction of genes that affects selection specifically in this treatment arm. Defaults to `hitfraction`
`eup`	Numeric. Same as hitsup, but now relating to treatment effects. Defaults to 0.5
`efactor`	Numeric. Multiplication factor for treatment-specific effects. Defaults to `hitfactor`
`sortedcells`	Integer. Number of cells put through the simulated selection. Note that this is the sum of the selected and unselected cells!
`seqdepth`	Integer. Specify the number of sequencing reads devoted to each experimental arm. If omitted, depth will default to 500 times the number of guides
`offtargets`	Logical or numeric. Specify the fraction of off-targets. If TRUE, 1 in 1000 guides (0.001) will target a different gene. Default = FALSE
`allseed`	Integer. All unspecified seeds default to this value plus an increment of 1 for each different seed. If NULL, the unspecified seeds are randomly generated. Default = NULL
`gseed`	Integer. Specify seed for guide efficacy assignment
`fseed`	Integer. Specify seed for infectious units assignment, which dictates a guide's abundance at the start of the experiment
`dseed`	Integer. Specify seed for assignment of the gene-specific selection modifiers (`d`)
`eseed`	Integer. Specify seed for assignment of the treatment-specific selection effects (`e`)
`oseed`	Integer. Specify seed for off-target selection
`t0seed`	Integer. Specify seed for t0, which encompasses assignment of successful knockout cells versus no knockout cells for each guide
`repseed`	Integer. Specify the seed after t0
`perfectsampling`	Logical. If TRUE, all sampling steps are replaced by simple equations to calculate representation of guides. Useful as null control to isolate the effect of sampling. Default = FALSE
`perfectseq`	Logical. If TRUE, sequencing results are a perfect representation (though still rounded) of guides in the harvested cells. Applicable to speed up simulations, assuming sequencing is sufficiently deep. Defaults to `perfectsampling`
`returnall`	Logical. If TRUE, function returns a list with the simulated data in the guidesdf, summary per gene in the genesdf, and parameters. Default = FALSE
`outputfile`	Character string. When used, returned data frame will be saved as a tab-delimited text to the specified file path
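
The character-vector form of `guides` described above can be illustrated with a small base R sketch. The gene names, the number of guides per gene and the variable names are made up for the example, and the way the gene is recovered from the guide name is an assumption; how sortingsim parses the names internally is not documented here.

```r
# Hypothetical gene names, used only for illustration.
genes <- c("TP53", "KRAS", "MYC")
guides_per_gene <- 4

# Build guide names of the form <gene>_<identifier>.
guides <- paste0(rep(genes, each = guides_per_gene), "_",
                 rep(seq_len(guides_per_gene), times = length(genes)))

# Recover the gene from each guide name, assuming the gene name itself
# contains no underscore (an assumption of this sketch).
gene_of_guide <- sub("_.*$", "", guides)
```

A vector built this way could be passed as `guides`, in which case `genes` is not required according to the argument description.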

Details

sortingsim performs a genome-wide (or subsetted) pooled CRISPR knockout screen ending in a binary selection. Perhaps even more so than growth-based screens, the outcome of such screens can be a massive black box. As of yet, no specific analysis methods have been published for this kind of screen, but this simulator can help assess candidate methods. The parameters are highly customizable, so I sincerely recommend reading the documentation for all the options. It is always possible to provide your own gene-specific selection modifiers or guide efficacies if you are not happy with the provided distributions. Seeds are relevant if you want to create replicate screens. You can easily "practice" by simulating some small experiments (i.e. limit the number of genes).

The basis of the simulation is as follows. Cells have an a priori probability baseprob of being selected. The corresponding odds baseprob/(1-baseprob) are multiplied by the gene-specific modifier d (and optionally by the gene-specific treatment modifier e). These odds are converted back to the modified probability mod_prob, which is used to determine how many cells with a specific knockout are selected. Each guide has an efficacy, which is the chance of creating a successful knockout. The modifiers are applied only in case of a successful knockout. Selected and not-selected cells are sequenced separately, both to the indicated sequencing depth.
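
The odds arithmetic described above can be sketched in a few lines of base R. This only illustrates the stated steps and is not the package's internal code: the helper name `modified_prob`, the example values for `g` and `ncells`, and the use of binomial draws for the sampling step are all assumptions made for the sketch.

```r
# Hypothetical helper (not part of CSSA): combine the base probability with
# the odds modifiers d and e and return the modified selection probability.
modified_prob <- function(baseprob, d = 1, e = 1) {
  base_odds <- baseprob / (1 - baseprob)  # a priori odds of selection
  mod_odds <- base_odds * d * e           # apply gene and treatment modifiers
  mod_odds / (1 + mod_odds)               # convert odds back to a probability
}

baseprob <- 0.1
mod_prob <- modified_prob(baseprob, d = 10)  # an "up" hit with d = 10

g <- 0.8        # assumed guide efficacy
ncells <- 1000  # assumed number of sorted cells carrying this guide

# Expected number of selected cells without sampling noise: knockout cells
# are selected with mod_prob, cells without knockout with baseprob.
expected_selected <- ncells * (g * mod_prob + (1 - g) * baseprob)

# The same step with sampling, here assumed to be binomial.
set.seed(1)
ko_cells <- rbinom(1, ncells, g)
selected <- rbinom(1, ko_cells, mod_prob) +
  rbinom(1, ncells - ko_cells, baseprob)
```

With baseprob = 0.1 and d = 10, the odds go from 1/9 to 10/9, so mod_prob equals 10/19 (about 0.53) rather than 10 times baseprob. This is why d is described as a modifier of the odds and not of the probability itself.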

Value

Returns a data frame with every row representing a single guide. Contains the pertinent parameters of each guide and the number of sequencing reads of selected and not-selected cells. If the argument returnall is set to TRUE, the function instead returns a list that contains this guide data frame, a data frame with the true values for the genes, and all parameters.

Note

If you specify an inverted hitfactor (e.g. 0.1 instead of 10), the direction of your hits is reversed: the "up" hits become less likely to be selected and the "down" hits more likely.

While it would also make sense to be able to specify how many cells are positively selected (this could be your FACS setup, of course), this is not directly compatible with this simulator. Instead, you can divide the number of cells you want by baseprob and use the result as input for sortedcells. If that does not come close (because of some wonky parameters, perhaps), or if you want to be more precise, you can do a test run with the argument returnall set to TRUE. One of the returned values is selectedcells, which corresponds to the number of cells used as input for sequencing of the selected arm. It follows that notselectedcells equals sortedcells minus selectedcells.
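
The conversion from a desired number of selected cells to sortedcells is simple arithmetic. A minimal sketch follows; the variable name `wanted_selected` and the example numbers are assumptions chosen for illustration.

```r
# Assumed example values, not taken from the package.
wanted_selected <- 2e6  # cells you aim to collect in the selected arm
baseprob <- 0.1         # baseline probability of selection

# Approximate total number of cells to put through the selection.
sortedcells <- round(wanted_selected / baseprob)

# This value could then be passed on, for example:
# sortingsim(1000, 4, baseprob = baseprob, sortedcells = sortedcells,
#            returnall = TRUE)
```

Because hits shift the selection probability of some cells away from baseprob, the realized number of selected cells will only approximate the target.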

Author(s)

Jos B. Poell

Examples

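# 18000 genes with 4 guides each; sampling steps replaced by equations
sortdf <- sortingsim(18000, 4, e = TRUE, perfectsampling = TRUE)
# Collapse the per-guide columns to runs of identical values
# (one value per gene as long as neighbouring genes differ)
d <- rle(sortdf$d)$values
lod <- log(d)
e <- rle(sortdf$e)$values
loe <- log(e)
plot(lod, loe, main = "log odds of selection")
# Observed enrichment per guide (pseudocount of 1 avoids log(0))
enrichment <- log(sortdf$selected+1)-log(sortdf$notselected+1)
# Log odds of selection for cells with a successful knockout
kocell_logodds <- log(sortdf$mod_prob)-log(1-sortdf$mod_prob)
# Color by guide efficacy: blue = low, red = high
plot(kocell_logodds, enrichment, pch = 16,
     cex = 0.75, col = rgb(sortdf$g, 0, 1-sortdf$g))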

tgac-vumc/CSSA documentation built on Oct. 10, 2022, 7:27 p.m.