In order to complete this analysis, the level 4 dataset should be obtained from the lincs project. In addition, the ExpSpect package requires the following packages: R6
, rhdf5
, hgu133plus2.db
, RCurl
, RJSONIO
.
The expression data matrix is large (~120GB), and may be obtained using the s3cmd command line tool on Linux-like systems:
# configure s3cmd with your keys s3cmd --configure # now get the data file. takes 30-60 minutes with a fast data connection. s3cmd get s3://data.lincscloud.org/l1000/level4/zspc_n1328098x22268.gctx s3cmd get s3://data.lincscloud.org/l1000/metadata/inst.info
The instance metadata from LINCS is provided as tab delimited file (inst.info) which is unwieldy to work with in an R session due to it size. Therefore, the first step is to convert this file to hdf5 format using a helper function provided by the ExpSpect package:
library(ExpSpect) # adjust path below as needed. lincs <- LINCS$new("/mnt/lincs/zspc_n1328098x22268.gctx") lincs <- LINCS$new("/mnt/lincs/gex_epsilon_n1429794x978.gctx") # Generate hdf5 version of info file. Only needs to be done once. lincs$info2hdf5("/mnt/lincs/inst.info", "/desired/path/to/inst_info.gctx") # now tell your lincs object where you saved the file just created lincs$setInfoFile("/mnt/lincs/inst_info.gctx")
Now we are ready to begin.
First we will extract some signatures of interest from the LINCS expression data. To begin we will restrict our analysis to a a single cell line. Latest information on the cell lines profiled in the LINCS expression data set can be retrieved as follows.
View(lincs$cellInfo())
We can then restrict our data set to one or more cell lines of interest as well as restricting our analysis to clinically relevant (i.e., FDA approved) drugs. Finally, we will restrict our analysis to just the L1000 landmark genes that were directly measured by the LINCS gene expression assays. All of these filters can be applied using convenience functions provided by the LINCS class. Because there are tens of thousands of rows and over 1 million columns of data to filter, the following steps each take a little time (for example, on a c4.2xlarge AWS instance the following steps take a total of 42 seconds).
lincs$selectCell(c("HA1E")) lincs$selectCell(c("FIBRNPC")) lincs$selectFDA() lincs$selectL1000()
Internally these convenience functions call metadata
, setRows
, and setCols
methods of the LINCS class. These methods are public and you can call them yourself to set other filters on the dataset. For example, you could achieve the equivalent of selectCell(c("FBRNPC"))
like this:
cids <- lincs$cellInfo('"cell_id":"FIBRNPC"')$cellid lincs$setCols(which(lincs$metadata(rows=4) %in% cids))
Note that by default setCols
and setRows
(as well as selectFDA
and selectCell
that utilize them) filters based on the currently selected set of columns and rows. The metadata
method returns only data for the currently selected set of columns, so this should be intuitive when creating a new filter based on values from metadata
. If for some reason you want to restore the full set of columns or rows before applying your filter, use the reset=TRUE
option for setCols
and setRows
(and then ensure the column or row indexes you supply relate to the full dataset, not just any currently selected rows or columns). The selectL1000
option, in contrast, resets any current filter on the data rows since it is assumed you want all of the L1000 genes.
You can now extract the data matrix for the selected genes and instances. Extracting the million or so datapoints for this example takes less than a minute on our test server. Extracting larger datasets will of course take longer.
datamatrix <- lincs$data()
If you plan to use a given data extract repeatedly, it is a good idea to save the resulting datamatrix to disk as an .rda
file as this will be much faster to load than re-extracting the data (due to the fact that the many file seeks required to extract your selected data from the gctx file is time consuming given its size despite the inherent speed of the rhdf5
package).
Although there are only roughly 980 landmark genes, the imputation that the LINCS project performs to reconstitute the entire genome from the landmark genes results in multiple probesets per entrez id in the level 4 dataset (even for the landmark genes), giving about 1700 rows of data for these 980 genes. The resulting probeset ids are those corresponding to the Affymetrix HG U133 Plus 2.0 array. To facilitate comparison to other datasets from other platforms, it is useful to summarize this data down to a common gene identifier. The ExpSpect package defines a utility function to summarize the data from probesets down to entrez id (as mapped by the hgu133plus2.db annotation package from Bioconductor). The default summary method is mean
, but you may supply an alternative summarization function if you wish.
This step is not necessary if you are going to be interrogating this data with gene expression data that also uses Affymetrix HG U133 Plus 2.0 probeset ids as the gene identifier (such as preprocessed data from GEO using that platform).
# takes a couple of minutes datamatrix_entrez <- lincs$summarizeGenes(lincs$data(), FUN=mean)
Next we will give an example of calculating enrichment scores based on the data we extracted above. In this example, we will be comparing the imputed LINCS fold changes with pre-processed data from GEO. The data we will be using was produced using the Affymetrix HG U133 Plus 2.0 arrays (GPL570), so we do not need to summarize to a different common gene id. Instead we can use the datamatrix object from above directly (skipping the summarizeGenes step).
#library(GEOquery) #sirolim <- getGEO('GSE22224', destdir="/mnt/lincs/GEO")[[1]] data("fdadrugs") # maps lincs pert_iname to common drug name as recognized by FDA data("ha1e") # pre-extracted LINCS object, restricted to ha1e cell line, L1000 genes, FDA approved drugs data("sirolim") # sirolimus data obtained from GEO data <- exprs(sirolim) # identify subgroups of interest treat <- grep("SRL", pData(sirolim)$characteristics_ch1.2) ctrl <- grep("none", pData(sirolim)$characteristics_ch1.2) #calculate the scores. when using permutation, this takes a few minutes expct <- ExpSpect$new() expct$calcScores(lincs$data(), data[,treat], data[,ctrl], 'excount', perm=TRUE, iterations=1000, parallel=TRUE) expct$calcScores(lincs$data(), data[,treat], data[,ctrl], 'excos', perm=TRUE, iterations=1000) expct$calcScores(lincs$data(), data[,treat], data[,ctrl], 'cmap', perm=FALSE) # rennotate perturbagens to common drug names and plot ixm<- match(lincs$metadata()[2,], fdadrugs$pert_id) expct$labels <- fdadrugs$pert_iname[ixm] expct$plot(aggregate=TRUE, FUN=max, squelch=3, highlight="irolimus", main="ExpSpect analysis of siroliums exposed PBMCs")
library(GEOquery) asthma <- getGEO('GSE16032', destdir="/mnt/lincs/GEO")[[1]] save(asthma, file="asthma.rda") load("asthma.rda") data("fdadrugs") # maps lincs pert_iname to common drug name as recognized by FDA # identify subgroups of interest treat <- grep("Acute", pData(asthma)$title) ctrl <- grep("Conv", pData(asthma)$title) data <- exprs(asthma) #calculate the scores. when using permutation, this takes a few minutes data("HA1E") # pre-extracted LINCS object, restricted to ha1e cell line, L100 genes, FDA approved drugs expct <- ExpSpect$new() expct$calcScores(lincs$data(), data[,treat], data[,ctrl], 'excount', perm=TRUE, parallel=TRUE, iterations=1000) # reannotate perturbagens to common drug names and plot ixm<- match(lincs$metadata()[2,], fdadrugs$pert_id) expct$labels <- fdadrugs$pert_iname[ixm] f <- function(x) { return(quantile(x[which(x>0)], 0.8)) } sc <- expct$scores sc <- expct$scores[tt] lb <- expct$labels[tt] expct$plot(aggregate=TRUE, FUN=f, squelch=5, labels=20, highlight=NA, main="ExpSpect analysis of asthmatic patients") ix <- match(expct$labels, atc[,'substance']) expct$labels <- atc[ix, 3]
library(GEOquery) paclitax <- getGEO('GSE50830', destdir="/mnt/lincs/GEO")[[1]] data("fdadrugs") # maps lincs pert_iname to common drug name as recognized by FDA data("ha1e") # pre-extracted LINCS object, restricted to ha1e cell line, L1000 genes, FDA approved drugs # identify subgroups of interest treat <- grep("PTX", pData(paclitax)$title) ctrl <- grep("control", pData(paclitax)$title) treat <- 1:3 ctrl <- 7:9 data <- exprs(paclitax) #calculate the scores. when using permutation, this takes a few minutes expct <- ExpSpect$new() expct$calcScores(lincs$data(), data[,treat], data[,ctrl], 'excount', perm=TRUE, iterations=10, parallel=TRUE) # rennotate perturbagens to common drug names and plot ixm<- match(lincs$metadata()[2,], fdadrugs$pert_id) expct$labels <- fdadrugs$pert_iname[ixm] ix <- match(expct$labels, atc[,'substance']) expct$labels <- atc[ix, 4] f <- function(x) { return(quantile(x[which(x>0)], 0.8)) } expct$plot(aggregate=TRUE, FUN=max, squelch=3, labels=10, highlight="paclitax", main="ExpSpect analysis of paclotaxel exposed cell lines")
The following code snippet will create Rdata files containing lincs objects for each cell id in the LINCS data file.
for(i in unlist(lincs$cellInfo()$cellid)) { print(i) lincs <- LINCS$new("/mnt/lincs/zspc_n1328098x22268.gctx") lincs$setInfoFile("/mnt/lincs/inst_info.gctx") tryCatch({ lincs$selectCell(c(i)) lincs$selectFDA() lincs$selectL1000() tt <- lincs$data() # force data load into memory. larger Rdata file but data available immediately. save(lincs, file=paste('data/', i, '.rda', sep="")) }, warning = function(w) { print(w) }, error = function(e) { print(e) }) }
GSE13849
library(GEOquery) jia <- getGEO('GSE13849', destdir="/mnt/lincs/GEO")[[1]] save(jia, file="data/jia.rda") data("fdadrugs") # maps lincs pert_iname to common drug name as recognized by FDA data("HA1E") # pre-extracted LINCS object, restricted to ha1e cell line, L1000 genes, FDA approved drugs # identify subgroups of interest treat <- grep("RF+", pData(jia)$title) ctrl <- grep("CTRL", pData(jia)$title) data <- exprs(jia) #calculate the scores. when using permutation, this takes a few minutes expct <- ExpSpect$new() expct$calcScores(lincs$data(), data[,treat], data[,ctrl], 'excount', perm=TRUE, iterations=200) # reannotate perturbagens to common drug names and plot ixm<- match(lincs$metadata()[2,], fdadrugs$pert_id) expct$labels <- fdadrugs$pert_iname[ixm] f <- function(x) { return(quantile(x[which(x>0)], 0.75)) } hl <- atc[grep("non-ster", atc[,3]),5] hl <- c(hl, 'carprofen') expct$plot(aggregate=TRUE, FUN=f, squelch=5, labels=20, highlight=hl, main="ExpSpect analysis of NSAID exposed PBMCs") ix <- match(expct$labels, atc[,'substance']) expct$labels <- atc[ix, 3] expct$labels[which(atc[ix,5] %in% hl)] <- "non-steroidal anti-inflammatories"
Plot plates
library(ggplot2) lincs <- LINCS$new("../q2norm_n1328098x22268.gctx") md <- lincs$metadata() plate <- gsub("_X.*", "", md['rna_plate',]) col <- rep(1:36,36)[1:length(res)] row <- rep(35:1, each=36)[1:length(res)] filter <- function(field, values) { res <- tapply(tolower(md[field,]), INDEX = plate, FUN = function(x) { sum(x %in% tolower(values)) > 1}) } filter2 <- function(field) { res <- tapply(tolower(md[field,]), INDEX = plate, FUN = function(x) { length(unique(x)) }) } data <- data.frame(val=filter("pert_type", c("ctl_vehicle", "ctl_vector", "ctl_untreat")), row=row, col=col) data <- data.frame(val=factor(filter2("pert_dose")), row=row, col=col) ggplot(data=data, aes(x=col, y=row, col)) + geom_point(aes(fill=val), size=5, pch=21, color="gray") + labs(title="Number of dose levels on each plate") + scale_fill_manual(values = colorRampPalette(colors=c("gray", "orange"))(length(unique(data$val)))) #plate <- gsub("_X.*", "", md['rna_plate',]) #well <- md['det_well',] #well <- gsub(",.*", "", well) #cell <- md['cell_id',] #pert <- md['pert_desc',] #plateRow <- strtoi(sapply(substr(well, 1, 1), charToRaw), base=16) - 64 #plateRow <- match(substr(well, 1, 1), LETTERS) #plateCol <- as.numeric(substr(well, 2, 3)) #p <- plate[1] #plt <- matrix(0, nrow=16, ncol=24) #ix <- which(plate == p) #ixx <- which(pert[ix] == "DMSO") # populate the plate #for(i in ix) { # plt[plateRow[i], plateCol[i]] <- pert[i] #}
Fetch is_gold data and append to metadata.
library(httr) url <- "http://api.lincscloud.org/a2/siginfo?q={%22is_gold%22:%22true%22}&user_key=lincsdemo&f={%22distil_id%22:1}&c=TRUE" count <- content(GET(url))$count url <- "http://api.lincscloud.org/a2/siginfo?q={%22is_gold%22:%22true%22}&user_key=lincsdemo&f={%22distil_id%22:1}&l=1000&sk=" d <- numeric(0) for(s in seq(0,count, 1000)) { res <- GET(paste(url, s, sep="")) d <- c(d, unlist(content(res, as = "parsed"))) print(s) } ix <- which(names(d) == "_id") d <- d[-ix] is_gold <- colnames(md) %in% d md['is_gold',] <- is_gold saveRDS(md, "data/metadata.rds")
get l1000 ids and level 3 ids
l1000_ps <- gsub(" ", "", h5read("../gex_epsilon_n1429794x978.gctx", "/0/META/ROW/id")) l3_ps <- gsub(" ", "", h5read("../q2norm_n1328098x22268.gctx", "/0/META/ROW/id")) ix <- which(l3_ps %in% l1000_ps) # it's the first 978 rows! yay!
data <- h5read("../q2norm_n1328098x22268.gctx", "0/DATA/0/matrix", index=list(c(1:978), c(1:100)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.