Essentiality associated with continuous expression: Project Achilles (Cheung et al. 2011) per-gene continuous expression vs. essentiality

 

DOWNLOAD R CODE FILE FOR THIS TUTORIAL

 

First DOWNLOAD SIMEM R CODE zip bundle containing the R code files loaded below.

  Also note that, while the code below uses relative paths, you should edit these to point to where you've stored the code, annotation or data files.

### To install required packages, uncomment and run this
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase", "preprocessCore", "genefilter"))
# install.packages(c("blme", "doParallel", "ggplot2", "locfit", "MASS", "plyr", "reshape"))

suppressPackageStartupMessages(library("Biobase"))
suppressPackageStartupMessages(library("blme"))
suppressPackageStartupMessages(library("doParallel"))
suppressPackageStartupMessages(library("genefilter"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("locfit"))
suppressPackageStartupMessages(library("MASS"))
suppressPackageStartupMessages(library("plyr"))
suppressPackageStartupMessages(library("preprocessCore"))
suppressPackageStartupMessages(library("reshape"))


source("../../R/data_format_lib.R")
source("../../R/model_lib.R")
source("../../R/simem_lib.R")

 

The Cheung et al. 2011 Project Achilles screen data formatted as an ExpressionSet, including measurement weights, can be downloaded HERE.

load("../../data/shrna/achilles_screens_with_weights.eset")
fdat = fData(achilles_screens)
pheno = pData(achilles_screens)

achilles_screens

 

Project Achilles screens were performed using a pool of ~54000 hairpins/reagents mapping to ~12000 genes. The reagent annotation table we'll load, including the weight column which will be used to exclude some reagents, can be downloaded HERE. In this case, any reagents with a mean intensity measurement below 5 (on the log2 scale) in the universal T0 sample are excluded in the analysis. Furthermore, the shRNA reagents in the Achilles screens are identified using the probeset_id column. We'll specify later how to alter the simem() function to recognize this fact, since by default simem() assumes reagents identified by a trcn_id column.

hp = read.delim("../../data/annotations/achilles_hairpin_annotations.txt", header=T, as.is=T, check.names=F)
hpWeights = hp[,c("probeset_id", "gene_id", "weight")]

 

The formatted CCLE microarray-based expression data for these genes can be downloaded HERE.

expr = read.delim("../../data/expression/ccle_achilles_expression.txt", header=T, as.is=T, check.names=F)

# Perform the analysis for ids contained in both the expression matrix and the shRNA data.
commonIds = intersect(expr$gene_id, fdat$gene_id)

# Subset the shRNA data to only include the Achilles screens profiled by CCLE expressions
samplesToInclude = which(pheno$cell_line %in% colnames(expr))
achilles = achilles_screens[,samplesToInclude]

 

When we want to specify a different set of covariate values (such as expression) for each gene, we do not specify the covariate and annotationsPerCellLine input parameters for the simem() function. These two parameters are used only when the same covariate values (such as cancer subtype) apply to all genes. Instead, we'll specify the annotationsPerId = expr input parameter. In order for the expr data frame to be properly formatted, the first column of expr needs to be gene_id, containing the integer Entrez gene ids. The second and subsequent columns need to be screen/cell line names.

For the purposes of this example, to reduce computation time, we'll specify a few known oncogenes whose essentiality increases with expression.

testIds = c(598, #BCL2L1
            3845, #KRAS
            7849, #PAX8
            6663 #SOX10
)

 

By default, the simem() function assumes that columns identifying gene ids, symbols, reagent ids, cell lines, replicates and time-points are specified in the screen data ExpressionSet. The default column names assumed by the simem() function are obtained using getDefaultVariableMap():

t(getDefaultVariableMap())

 

As noted, the Achilles reagents are identified using the fdat$probeset_id column. We'll specify this in the simem() function by adding a variableMap=vars input parameter, as follows:

vars = getDefaultVariableMap()
vars$reagentId = "probeset_id"
vars$timeNum = "timeNum"
vars$timeGroup = "timepointIndex"

 

If you want to perform a genome-wide analysis, simply omit the geneIds = commonIds parameter. This analysis typically takes several hours but can be greatly reduced using the parallelNodes parameter on multi-core Linux or OS X systems.

 

Since the Achilles data is an end-point assay, we must specify the endPoint=TRUE parameter for the simem() function. Furthermore, signal-noise measurement weights do not apply to this analysis case. To specify an analysis using precision weights (inverseVarianceWeights = TRUE), we must ensure that we've added these weights to the ExpressionSet beforehand (DETAILED HERE).

 

We're now ready to predict genes whose essentiality is significantly associated with expression.

results = simem(achilles,
               geneIds=testIds,
               reagentWeights=hpWeights,
               annotationsPerId=expr, 
               analyzeReagents=TRUE,
               inverseVarianceWeights=TRUE,
               endPoint=TRUE,
               parallelNodes=1,
               variableMap=vars)

 

For illustration purposes, we'll rerun the same code using multiple (2) parallel nodes.

results = simem(achilles,
               geneIds=testIds,
               reagentWeights=hpWeights,
               annotationsPerId=expr, 
               analyzeReagents=TRUE,
               inverseVarianceWeights=TRUE,
               endPoint=TRUE,
               parallelNodes=2,
               variableMap=vars)

 

Here are the gene-level summaries (more detailed gene-level results can be obtained using results$gene_detailed). The difference parameter can be interpreted as the magnitude of the average change in dropout slope associated with each unit increase in the gene's log2 microarray intensity value. In other words, on average the dropout is faster (more negative), indicating increased essentiality, in cell lines with higher expression.

options(width=100)

results$gene

 

Since analyzeReagents = TRUE parameter was specified, per-reagent context-specific essentiality predictions are also available (more detailed gene-level results can be obtained using results$reagent_detailed)

options(width=100)
reagent = results$reagent
reagent[order(reagent$symbol, reagent$probeset_id), ]


neellab/simem documentation built on May 23, 2019, 1:31 p.m.