In cbg-ethz/perturbatr: Statistical Analysis of High-Throughput Genetic Perturbation Screens

  BiocStyle::markdown()

  knitr::opts_chunk$set(echo = TRUE)
  options(warn = -1)
  library(dplyr)
  library(tibble)
  library(methods)
  library(perturbatr)
  data(rnaiscreen)
  rnaiscreen <- dataSet(rnaiscreen) %>%    
    dplyr::select(Condition, Replicate, GeneSymbol,
                  Perturbation, Readout, Control, 
                  Design, ScreenType, Screen) %>%
    as.tibble()

Introduction

perturbatr does stage-wise analysis of large-scale genetic perturbation screens for integrated data sets consisting of multiple screens. For multiple integrated perturbation screens a hierarchical model that considers the variance between different biological conditions is fitted. That means that we first estimate relative effect sizes for all genes. The resulting hit lists is then further extended using a network propagation algorithm to correct for false negatives.

Here we show an example data analysis using a pan-viral data set of three RNAi screening studies.

Data analysis tutorial

This tutorial walks you to the basic functionality of perturbatr.

Creating a `PerturbationData` object

You supposedly start with something like a data.frame or tibble:

  head(rnaiscreen)

To analyse a data set, we first need to create a perturbation data set:

  rnaiscreen <- methods::as(rnaiscreen, "PerturbationData")

Coercing your data.frame to PerturbationData will automatically warn you if your table is formatted wrongly. You need at least the following column names order to be able to do analysis of perturbation screens using perturbatr:

Condition: an identifier that best describes the respective screen. For instance this can be the name of a virus for pathogen screens, the name of a cell line, organoid or the like. The condition describes a single data set, i.e. if you want to integrate multiple different data sets, make sure to give each a different condition.
Replicate: an integer representing the replicate number of a screen.
GeneSymbol: the HUGO identifier, ENTREZ id, etc. as character.
Perturbation: a siRNA id or gRNA id that describes the knockout/knockdown for the gene.
Readout: a normalized readout like a log-fold change for gRNAs, a GFP signal, etc.
Control: vector of integers representing perturbations that have been used as negative or positive controls. A negative control is marked with '-1', a positive control with '1' and a normal sample with '0'.

Depending on how you want to model the readout using the hierarchical model, you might want to add additional columns. For the sake of simplicity this suffices though.

Working with `PerturbationData` S4 objects

A PerturbationData object consists of a single slot that stores your data. We bundled your data into an S4 object such that dispatch is easier to handle and to make sure that your data set has the correct columns:

  rnaiscreen
  dataSet(rnaiscreen)

PerturbationData has some basic filter and rbind functionality. Similar to dplyr::filter you can select rows by some predicate(s). In the example below we extract all rows from the data set that have a positive readout.

  perturbatr::filter(rnaiscreen, Readout > 0)

Filtering on multiple rows works by just adding predicates:

  perturbatr::filter(rnaiscreen, Readout > 0, Replicate == 2)

If you want to combine data sets you can call rbind, which will automatically dispatch on PerturbationData object:

  dh <- perturbatr::filter(rnaiscreen, Readout > 0, Replicate == 2)
  rbind(dh, dh)

Data analysis using a hierarchical model and network diffusion

Finally, after having set up the data set, we analyse it using a hierarchical model and network diffusion. As noted above, if you want to analyse multiple data sets, make sure that every data set corresponds to a unique Condition.

First, let's have a rough look at the data set that we are using:

  plot(rnaiscreen)

We have roughly the same number of replicates per gene, but the HCV screen has less genes than the SARS and DENV data sets. That is no problem however, because we automatically filter such that the genes are same. We also automatically remove positive controls.

Next we rank the genes using a hierarchical model which requires describing the model to use as R formula. The data has the following covariates:

  dataSet(rnaiscreen) %>% str()

Here, variables like Replicate, Plate, RowIdx/ColIdx should not be associated with a change in the response Readout as we normalized the data and corrected for batch effects. However, the Readouts should be different between ScreenTypes:

  dataSet(rnaiscreen) %>% pull(ScreenType) %>% unique()

where E/R represents that the screen has measured the effect of a gene knockdown during the entry and replication stages of the viral lifecycle while A/R repesents the gene knockdown's effect having been measures during the assembly and release stages of the lifecycle. In the life cycle of positive-sense RNA viruses we know that viruses make use of different host factors during their life cycle. That means while some genes are required during entry and replication, others might play a role in assembly and release of the virions. So we have reason to believe that the stage of the infection also introduces a clustering effect. Hence, we could model the data as:

$$y_{gctp} \mid\gamma_g, \delta_{gc}, \zeta_t , \xi_{ct} \sim \mathcal{N}(x_c \beta + \gamma_g + \delta_{cg} + \zeta_t + \xi_{ct}, \sigma^2),$$ where $y_{gctp}$ is the readout for gene $g$, virus $c$, stage of the viral lifecycle $t$ (E/R vs A/R) and $p$ is the perturbation (siRNA) used for gene $g$. We estimate the parameters of the model using lme4 [@bates2014lme4]:

  frm <- Readout ~ Condition +
                   (1|GeneSymbol) + (1|Condition:GeneSymbol) +
                   (1|ScreenType) + (1|Condition:ScreenType)
  res.hm <- hm(rnaiscreen, formula = frm)

Plotting the res.hm objet yields a list of multiple plots. The first plot shows the first 25 strongest gene effects ranked by their absolute effect sizes. Most of the genes are colored in blue which represents that a gene knockdown leads to an inhibition of viral growth on a pan-viral level. Bars colored in red represent genes for which a knockdown results in increased viral viability. If you are interested in the complete ranking of genes, use geneEffects(res.hm).

  pl <- plot(res.hm)
  pl[[1]]

The second plots shows the nested gene effects, i.e. the estimated effects of a gene knockdown for a single virus. The genes shown here are the same as in the first plot, so it might be possible that there are nested gene effects that are stronger which are just not plotted. You can get all nested gene effects using nestedGeneEffects(res.hm).

  pl[[2]]

Next we might want to smooth the effect from the hierarchical model using network diffusion, by that possibly reduce the number of some false negatives. For that we need to supply a graph as a data.frame and call the diffuse function:

  graph <- readRDS(
    system.file("extdata", "graph_small.rds",package = "perturbatr"))
  diffu <- perturbatr::diffuse(
        res.hm, graph=graph, delete.nodes.on.degree=1, r=0.35,
        take.largest.component=TRUE, correct.for.hubs=FALSE)

If we plot the results we get a list of reranked genes. Note that the ranking uses the network diffusion computes a stationary distribution of a Markov random walk with restarts.

  plot(diffu)

Further note that we used a very small network here. You might want to redo this analysis with the full graph which is located in system.file("extdata", "graph_full.rds",package = "perturbatr").