# gCrisprTools and the Analysis of Pooled Screening Data

### 1. Overview of gCrisprTools

Competitive screening experiments, in which bulk cell cultures infected with a heterogeneous viral library are experimentally manipulated to identify guide RNAs or shRNAs that influence cell viability, are conceptually straightforward but often challenging to implement. Here, we present gCrisprTools, an R/Bioconductor analysis suite facilitating quality assessment, target prioritization, and interpretation of arbitrarily complex competitive screening experiments. gCrisprTools provides functionality for detailed and principled analysis of diverse aspects of these experiments, either as a standalone pipeline or as an extension to alternative analytical approaches.

#### 1.1 Installation

Install gCrisprTools in the usual way:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("gCrisprTools")
```


#### 1.2 Citing gCrisprTools

If you use gCrisprTools while developing a publication, please cite the following paper:

[Bioinformatics App Note Citation, to be updated later]

#### 1.3 Explore the Vignettes Folder

This vignette is only one of the resources provided in gCrisprTools to help you understand, analyse, and explore pooled screening data. As appropriate, please see the /vignettes subdirectory for additional documentation describing example code, and the /inst directory for more information about algorithm implementation and package layout.

#### 1.4 Dependencies

gCrisprTools uses the existing Biobase framework for data storage and manipulation and consequently depends heavily on the Biobase and limma packages.

```r
library(Biobase)
library(limma)
library(gCrisprTools)
```


### 2. Inputs

#### 2.1 Counting Cassettes from Sequencing Data

To use the various methods available in this package, you will first need to assemble your screen data into an ExpressionSet object containing cassette abundance counts in the assayData slot, retrievable with exprs(). This package assumes that end users are familiar enough with the R/Bioconductor framework and their own sequencing pipelines to extract raw cassette counts from FASTQ files and to compose them into an ExpressionSet. For newer users, read counting may be facilitated with cutadapt or other software designed for this purpose; details about the composition of ExpressionSet objects can be found in the Biobase vignette.

#### 2.2 An ExpressionSet of Cassette Counts

Raw cassette counts should be contained within an ExpressionSet object, with the counts retrievable with exprs(). The column names (colnames()) should correspond to unique sample identifiers, and the row names (row.names()) should correspond to identifiers uniquely specifying each cassette of interest.

```r
data("es", package = "gCrisprTools")
es
```
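For users assembling their own data, the following is a minimal sketch of composing such an object from a raw count matrix. The cassette names, sample names, and counts are invented for illustration, and the final step assumes that the Biobase package is installed.

```r
# Hypothetical raw counts: rows are cassettes, columns are samples.
counts <- matrix(
    c(120,  98,  15,  22,
      340, 310, 295, 401,
        5,   0,  88, 102),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        c("gRNA_001", "gRNA_002", "gRNA_003"),
        c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2")
    )
)

# Identifiers must be unique along both dimensions.
stopifnot(!anyDuplicated(rownames(counts)), !anyDuplicated(colnames(counts)))

# Wrap the matrix in an ExpressionSet (skipped if Biobase is not installed).
if (requireNamespace("Biobase", quietly = TRUE)) {
    toy_es <- Biobase::ExpressionSet(assayData = counts)
    print(Biobase::exprs(toy_es))
}
```

Sample-level metadata can be attached through the phenoData slot; see the Biobase vignette for details.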

#### 2.3 An Annotation Object

gCrisprTools requires an annotation object mapping the individual cassettes to genes or other genomic features for most applications. The annotation object should be provided as a named data.frame, with columns describing the 'geneID' and 'geneSymbol' of the target elements to which each cassette is annotated. These columns should contain character vectors with elements that uniquely describe the targets in the screen; by convention, the geneID field contains an official identifier that unambiguously describes each target element in a manner suitable for external software (e.g., an Entrez ID). The geneSymbol column indicates a more human-readable descriptor, such as a gene symbol.

The annotation object may optionally contain other columns with additional information about the corresponding cassettes.

```r
data("ann", package = "gCrisprTools")
```
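As a rough illustration of the structure described above, a toy annotation object might be built as follows. The cassette identifiers are hypothetical, and the use of row names to carry the cassette identifiers is an assumption of this sketch rather than a documented requirement; the `ann` object shipped with the package is the authoritative example.

```r
# Hypothetical annotation for three cassettes targeting two genes.
toy_ann <- data.frame(
    geneID     = c("7157", "7157", "1956"),  # Entrez IDs, stored as character
    geneSymbol = c("TP53", "TP53", "EGFR"),
    row.names  = c("gRNA_001", "gRNA_002", "gRNA_003"),
    stringsAsFactors = FALSE
)

# Sanity checks on the required columns and their types.
stopifnot(
    all(c("geneID", "geneSymbol") %in% colnames(toy_ann)),
    is.character(toy_ann$geneID),
    is.character(toy_ann$geneSymbol)
)

# Number of cassettes annotated to each target.
table(toy_ann$geneSymbol)
```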


#### 2.4 A Sample Key

Many gCrisprTools functions require or are enhanced by a sample key detailing the experimental groups of the samples included in the study. This key should be provided as a named factor, with names perfectly matching the colnames of the ExpressionSet. The first level of the sample key should correspond to the 'control' condition, indexing samples whose cassette distributions are expected to be minimally distorted by experimental treatments.
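A sample key of this form can be sketched in base R as shown below. The sample names and treatment labels are invented for illustration; in practice the names should be taken from the colnames of your own ExpressionSet.

```r
# Hypothetical sample names and treatment labels.
samples   <- c("Ctrl_1", "Ctrl_2", "Treat_1", "Treat_2")
treatment <- c("Control", "Control", "Expansion", "Expansion")

# Build the factor and make the control condition the first level.
toy_sk <- relevel(factor(treatment), ref = "Control")
names(toy_sk) <- samples

# The first level should be the control condition.
levels(toy_sk)[1]
```

Using relevel() makes the choice of reference level explicit instead of relying on the alphabetical ordering that factor() applies by default.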

### 6. Visualization of Results

After identifying candidate targets, various aspects of the contrast may be visualized with gCrisprTools.

#### 6.1 ct.topTargets

The ct.topTargets function enables simple visualization of the model effect estimates (log2 fold changes) and associated uncertainties of all cassettes associated with the top-ranking targets.

```r
ct.topTargets(fit,
              resultsDF,
              ann,
              targets = 10,
              enrich = TRUE)
```

#### 6.2 ct.stackGuides

In some screens it can be useful to visualize the degree of library distortion associated with the strongest signals. Such an approach can supply additional confidence in a particular candidate of interest by showing that clear differences are evident outside of the linear modeling framework (which may be inaccurate in heavily distorted libraries).

```r
ct.stackGuides(
    es,
    sk,
    plotType = "Target",
    annotation = ann,
    subset = names(sk)[grep('Expansion', sk)]
)
```

#### 6.3 ct.viewGuides

gCrisprTools provides methods to visualize the behavior of individual cassettes annotated to a target of interest, and to position these within the observed distribution of effect sizes across all cassettes in the experiment.

```r
ct.viewGuides("Target1633", fit, ann)
```


#### 6.4 ct.makeContrastReport and ct.makeReport

As with the Quality Control components of an individual screen, gCrisprTools provides functionality to automatically generate contrast-level reports.

```r
# Not run:
path2Contrast <-
    ct.makeContrastReport(eset = es,
                          fit = fit,
                          sampleKey = sk,
                          results = resultsDF,
                          annotation = ann,
                          comparison.id = NULL,
                          identifier = 'Crispr_Contrast_Report')
```


If you wish, you can also make a single report encompassing both quality control and the contrast of interest.

```r
# Not run:
path2report <-
    ct.makeReport(fit = fit,
                  eset = es,
                  sampleKey = sk,
                  annotation = ann,
                  results = resultsDF,
                  aln = aln,
                  outdir = ".")
```


### 7. Hypothesis Testing

In addition to identifying targets of interest within a screen, it may be worthwhile to ask more comprehensive questions about the targets identified. gCrisprTools provides a series of basic functions for determining the enrichment of known or unknown target groups within the context of a screen.

#### 7.1 ct.PantherPathwayEnrichment

If a screen was performed with a library targeting genes, gCrisprTools can provide basic ontological enrichment testing. This function annotates Entrez gene IDs contained in the geneID column of the annotation object to pathways contained in the PANTHER database, and then checks for significant enrichment or depletion of these pathways using a hypergeometric test.

```r
# Not run:
enrichmentResults <-
    ct.PantherPathwayEnrichment(
        resultsDF,
        pvalue.cutoff = 0.01,
        enrich = TRUE,
        organism = 'mouse'
    )
```

```
> head(enrichmentResults)   # Note: Pathway names have been edited for display purposes.
                         PATHWAY nGenes sigGenes expected     odds            p        FDR
1 EGF receptor signaling pathway    200       14 5.240550 3.332647 0.0004498023 0.03958260
2          FGF signaling pathway    230       14 5.949958 2.869779 0.0016284304 0.07165094
3     Insulin/MAP kinase cascade    138        9 3.714916 2.785331 0.0101632822 0.20272465
4          CCKR signaling map ST    331       15 8.211061 2.148459 0.0126368744 0.20272465
5               p38 MAPK pathway    145        9 3.891333 2.641968 0.0135928688 0.20272465
6              B cell activation    204       11 5.336192 2.368705 0.0146442541 0.20272465
```
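To make the hypergeometric test concrete, the following base R sketch computes an expected overlap and an enrichment P-value for a single pathway. All counts are invented for illustration, and the calculation is a generic textbook version of the test, not a copy of the package's internal implementation.

```r
# Invented numbers for illustration only.
universe    <- 2000  # targets tested in the screen
significant <- 100   # targets passing the significance cutoff
pathway     <- 50    # targets annotated to the pathway
overlap     <- 8     # pathway members among the significant targets

# Expected overlap under the null hypothesis of no association.
expected <- pathway * significant / universe

# P(X >= overlap) when `significant` targets are drawn without replacement
# from a universe containing `pathway` annotated targets.
p_enrich <- phyper(overlap - 1, pathway, universe - pathway, significant,
                   lower.tail = FALSE)

expected
p_enrich
```

Subtracting one from the observed overlap is needed because phyper() with lower.tail = FALSE returns the probability of strictly exceeding its first argument.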

#### 7.2 ct.targetSetEnrichment, ct.ROC, and ct.PRC

In some cases, it may be useful to ask whether a set of known targets is disproportionately enriched or depleted within a screen. gCrisprTools provides functions for answering these sorts of questions with ct.ROC(), which generates Receiver Operating Characteristic (ROC) curves for a specified gene set within a screen, and ct.PRC(), which draws precision-recall curves. When called, both functions return the raw data necessary to reproduce or combine these results, along with appropriate statistics for assessing the significance of the overall signal within the specified target set (via a hypergeometric test).

```r
data("essential.genes", package = "gCrisprTools")  # Artificial list created for demonstration
data("resultsDF", package = "gCrisprTools")
ROC <- ct.ROC(resultsDF, essential.genes, stat = "deplete.p")
str(ROC)

PRC <- ct.PRC(resultsDF, essential.genes, stat = "deplete.p")
str(PRC)
```
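For readers unfamiliar with these curves, the base R sketch below shows how ROC and precision-recall coordinates can be derived from a ranked list of targets. The target names, P-values, and target set are invented, and the code illustrates the general technique only; it does not reproduce the output structure of ct.ROC() or ct.PRC().

```r
# Invented per-target depletion P-values and a made-up "known" target set.
p_deplete <- c(geneA = 0.001, geneB = 0.02, geneC = 0.04,
               geneD = 0.30,  geneE = 0.55, geneF = 0.90)
known_set <- c("geneA", "geneC", "geneE")

# Rank targets from most to least significant, then flag set membership.
ranked <- names(sort(p_deplete))
hit    <- ranked %in% known_set

# Cumulative true and false positive rates as the cutoff is relaxed.
tpr <- c(0, cumsum(hit)  / sum(hit))
fpr <- c(0, cumsum(!hit) / sum(!hit))

# Area under the ROC curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Precision and recall at each cutoff.
precision <- cumsum(hit) / seq_along(hit)
recall    <- cumsum(hit) / sum(hit)

auc
```

An AUC near 0.5 indicates that the target set is spread evenly through the ranking, while values approaching 1 indicate that set members concentrate among the most significant targets.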


Alternatively, the significance of the enrichment within the target set may be assessed directly with ct.targetSetEnrichment.

```r
targetsTest <- ct.targetSetEnrichment(resultsDF, essential.genes, enrich = FALSE)
str(targetsTest)
```


[^1]: Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics. 2012;28(4):573-80. PMID:22247279

[^2]: Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, Irizarry RA, Liu JS, Brown M, Liu XS. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 2014;15(12):554. PMID:25476604

[^3]: Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57(1):289–300. MR 1325392.

```r
sessionInfo()
```



gCrisprTools documentation built on Nov. 1, 2018, 3:02 a.m.