Two-Class Differential Essentiality Analysis - With Different Class Assignments Per Gene (e.g., For Per Gene Amplification vs. Essentiality analysis)


Previously we detailed how to perform a two-class analysis using HER2+ status as an example (see HER2+ two class analysis). This analysis assumes that the cell line class assignments are identical for all genes. In other words, the HER2+ cell lines are fixed, and the essentiality of each gene is considered in the same HER2+ cell lines vs. the remaining HER2- cell lines.

However, an often-asked analysis question is whether the essentiality of a gene is associated with a change in a genomic variable specific to that gene. An example highlighted in our manuscript is the analysis of a gene's copy gain or loss vs. its essentiality. We may be interested in determining which genes are more essential in copy-gained cell lines (putative copy number-associated driver oncogenes), or which genes are more essential in copy-loss cell lines (putative CYCLOPS genes, see Nijhawan et al. 2013).

For this example, we detail how to generate predictions of copy gain-associated oncogene drivers from our Breast data. The workflow for this analysis is similar to the HER2+ example, except that we will detail how to find, format and input per-gene copy number status.

As before, first download the simem R code, and remember that while the code below uses relative paths, you should edit these to point to where you've stored the code, annotation or data files.

### To install required packages, uncomment and run this
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase", "preprocessCore", "genefilter"))
# install.packages(c("blme", "doMC", "ggplot2", "locfit", "MASS", "plyr", "reshape"))

suppressPackageStartupMessages(library("Biobase"))
suppressPackageStartupMessages(library("blme"))
suppressPackageStartupMessages(library("doMC"))
suppressPackageStartupMessages(library("genefilter"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("locfit"))
suppressPackageStartupMessages(library("MASS"))
suppressPackageStartupMessages(library("plyr"))
suppressPackageStartupMessages(library("preprocessCore"))
suppressPackageStartupMessages(library("reshape"))


source("../../R/data_format_lib.R")
source("../../R/model_lib.R")
source("../../R/simem_lib.R")

 

The screen data formatted as an ExpressionSet, loaded below, can be downloaded here.

load("../../data/shrna/breast_screens_with_weights.eset")
breast_screens

 

The reagent annotation table we're about to load, including the weight column which will be used to exclude some reagents, can be downloaded here.

hp = read.delim("../../data/annotations/hairpin_annotations.txt", header=T, as.is=T, check.names=F)
hpWeights = hp[,c("trcn_id", "gene_id", "weight")]

 

The per-gene breast cell line CNA log-R ratios can be downloaded here here. We will then dichotomize the continuous log R-Ratio values using a threshold of 0.2, above which we consider a gene to exhibit copy gain in the CNA data. We will also remove all identifier columns except for gene_id, which allows us to match a gene's copy gain status across cell lines to its essentiality. The simem() function will try to match gene ids from the essentiality data and the copy number data, and only analyze a gene if its gene_id is present in both datasets. Thus it is not critial to filter the gene_ids of the copy number data to match those of the essentiality data.

cna = read.delim("../../data/cna/breast_cna_lrr.txt", header=T, as.is=T, check.names=F)
# Store the gene_ids
geneIds = cna$gene_id
# Separate the log R-Ratio values and dichotomize them
lrr = as.matrix(cna[,-grep("ensembl_id|gene_id|symbol", colnames(cna))])
gains = ifelse(lrr >= 0.2, "gain", "other")
gains = cbind.data.frame(gene_id=geneIds, gains, stringsAsFactors=F)

 

In order to run a per-gene analysis, we must omit the covariate parameter from the simem() function (or alternately, specify covariate = NULL), and set the annotationsPerId = gains parameter. The simem() function will then extract the cell line class assignments ("gain" or "other") for each gene from the gains data frame.

 

As before, we want to determine whether a gene is more essential in "gain" cell lines compared to all "other", so we'll set the covariateFactorOrder = c("other", "gain") parameter in the simem() function.

 

For the purposes of this example, to reduce computation time, we'll perform the analysis for some copy-gain-driven Breast cancer oncogenes

geneList = c(596, #BCL2
            595, #CCND1
            1956, #EGFR
            2064, #ERBB2
            2099, #ESR1
            2263, #FGFR2
            7022) #TFAP2C

 

If we want to perform a genome-wide analysis, simply omit the geneIds = signalingPathway parameter. This analysis typically takes ~18 hours, but the computation type can be dramatically reduced by parallelizing the modeling process on multiple processor cores (when available). To use 6 processor cores, for example, specify the parallelNodes = 6 parameter. This option is only available on Linux or OS X at present.

 

If we specify an analysis using both precision and signal-noise measurement weights (inverseVarianceWeights = TRUE and signalProbWeights = TRUE parameters, respectively), we must ensure that we've added these weights to the ExpressionSet beforehand (detailed here).

 

Combining all the above, we're ready to perform the differential essentiality analysis.

results = simem(screens = breast_screens,
                geneIds = geneList,
                reagentWeights = hpWeights,
                annotationsPerCellLine = status,
                inverseVarianceWeights = TRUE,
                signalProbWeights = TRUE,
                analyzeReagents = TRUE,
                covariateFactorOrder = c("other", "gain"), 
                parallelNodes = 1
                )

 

As before, here are the predictions for genes known to be more essential in association with copy number gains.

options(width=100)

results$gene

 

And here are the hairpin-level results.

options(width=100)
reagent = results$reagent
reagent[order(reagent$symbol, reagent$trcn_id), ]

 



neellab/simem documentation built on May 23, 2019, 1:31 p.m.