library(knitr) opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
The The Library of Integrated Network-Based Cellular Signatures (LINCS)
program has created over 1 million transcriptional profiles
[@pmid29195078, @pmid29199020] that
detail the effects of thousands of drugs and genetic perturbagens in
gene expression space (known as the "L1000" dataset as it is based on the
measured expression of approximately 1000 highly variable genes).
This creates a Rosetta stone of sorts, allowing
us to translate between disease states, drug treatments, and genetics.
This resource holds the promise of unlocking new therapies for diseases, and
deeper understanding of how cells respond to drugs and gene perturbation.
The LINCS L1000 data set has three components: the gene expression data itself (available as a "GCTX" file which is in HDF5 format), a metadata file (as an "INFO" file, in tab delimited format), and additional metadata that is made available through a web-based API (http://clue.io). Analysis of this dataset thus requires familiarity with the contents and formats of these three resources, as well as the R tools required to interact with them. The goal of the slinky package is to relieve the user of details of file access and web calls and provide a simplified interface to these various resources.
For example, consider the high level method function loadL1K
. If
provided a single argument (the name of a perturbagen of interest, for
example "amoxicillin"), the function does the following:
For users more familiear with the underlying data resources,
much more granular control can be achieved through optional arguments to
loadL1K
or to lower level functions including clue
, and
readGCTX
. Examples of these and other functions provided by this
package are described in the balance of this vignette.
The gene expression data derived from a single cell culture well treated with a single perturbagen has historically been referred to as an "instance" in the CMAP/LINCS program. I will use "instance" in this way. For any given perturbagen, there will be many instances encompassing different doses, durations, and cell lines as well as technical replicates.
Slinky uses the rhdf5 package to access slices of the L1000 data from disk. As a result, you only need enough memory to hold the data slices you are working with, and possibly the metadata as well. (The entire metadata data frame for all 1.3 million instances requires a modest 220MB of RAM). The development machine on which the slinky package was primarily developed has 16GB of RAM and can load and manipulate tens of thousands of instances from the L1000 dataset without difficulty. Of course, YMMV depending on the specifications of your computer.
The examples in this vignette can be completed with the demonstration gctx
and info
files that are installed along with the package. Note: the
expression data in the demo gctx file has been truncated from double precsion
floating point to integer to keep the package size under 4MB as required by
the Bioconductor project.
To move on and conduct your own analysis, you will need access to
a LINCS L1000 data file, such as the Phase I Level 3 data file.
This file may already be available from
your local bioinformatics core. If you need or want to download it yourself,
note that these datafiles are quite large (up to 40GB). A robust multithreaded download
client makes this much faster and less prone to failures due to connectivity
hiccups. For example, you might try:
```{bash, eval=FALSE}
aria2c -x 8 -s 8 https://goo.gl/3TigFI gzip -d *.gz
Alternatively, you can use the slinky package itself to fetch this file. Again, this will take a while and may fail if your connection is not rock solid. ```r sl <- new("Slinky") download(sl, type = "expression", level="3")
You will also need the metadata describing the instances in that file.
You could download it yourself, but note that the slinky package will automatically
fetch the default phase I info file automatically if and when it needs it,
and even try to place it in the package
installation directory so it is there the next time. (If it cannot be moved to
that directory due to permission issues, it will just be left in your current
working directory. The slinky
package also checks in the current working
directory for the file before downloading it again.)
If using phase 2 data, you must download the corresponding
INFO file yourself. Automated support for the phase 2 data is planned for our
next update.
download(sl, type = "info")
Details on the LINCS data files are available here.
Access to the LINCS api and clue.io requires a user key, which can be obtained
from the LINCS by registering at https://clue.io (free for non-commercial use).
You can provide your key directly in the call to
Slinky$new("your_key_here")
, or define the environment variable
CLUE_API_KEY
and set it to your key. Alternatively, you can store your key as
a single line in a text file so it does not get included in your project files
(which may, among other thing, be stored in public repositories, etc.).
library(slinky) #update following lines with your details: key <- "YOUR_API_KEY" gctx <- "/path/to/GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx" info <- "/path/to/GSE92742_Broad_LINCS_inst_info.txt.gz" sl <- Slinky(key, gctx, info)
For the purposes of this vignette, we will use a small subset of the L1000
data distributed with this package, and the demo API key from clue.io.
Please do not use the demo key for actual analysis. Note: the data in this
demonstration dataset has been rounded to integer to keep the package size
under 4MB as required by the BioConductor project.
library(slinky) user_key <- httr::content(httr::GET("https://api.clue.io/temp_api_key"), as = "parsed")$user_key sl <- Slinky(user_key, system.file("extdata", "demo.gctx", package = "slinky"), system.file("extdata", "demo_inst_info.txt", package = "slinky"))
There are several ways to identify a useful slice of the L1000 data set to load. The first way is to figure out the column and row indices you are interested in and specify these directly. For example, you might do this by filtering the info file that accompanies the L1000 data from GEO. This approach does not even require an API key since it does not call the API.
(It may be useful to note that the "landmark" genes are the first 978 rows of the dataset. The remaining rows contain the inferred expression for the rest of the transcriptome.)
The info file specified at object instantiation is loaded when the object
is created and its data is made available through the metadata
accessor
slot. We can use this data to identify which columns of L1000 expression
data we wish to load. Note that for performance, it is much faster to subset
the slinky object rather than subset the data. The latter will load the
entire dataset which may be very large. An example is given below
col.ix <- which(metadata(sl)$pert_iname == "amoxicillin" & metadata(sl)$cell_id == "MCF7") data <- readGCTX(sl[,col.ix]) ## This would be slower: #data <- readGCTX(sl)[,col.ix]
Note that loading contiguous data out of the gctx file is much faster than data
spread throughout the file. For example, loading 1000 consecutive instances
(over the network) takes 3.4 seconds in our environment, while loading 1000
random instances takes almost 10 times that long. When the info file is
loaded, it is reordered to match the order of columns in the gctx file.
(Interestingly, the info file provided by GEO is not in the same order as the
gctx file). Depending on your use case, you may be able to optimize loading
by accessing more nearly contiguous data slices as needed.
While not quite as fast, querying the clue.io API is simpler, ensures you have the most uptodate metadata, and also allows you to query on metadata not present in the info file provided via GEO. For example, we could identify all the instances that are not only perturbed by amoxicillin in the A375 cell line, but are also designated as "gold" by the LINCS program (by virtue of having low variance across replicates):
amox_gold <- clueInstances(sl, where_clause = list("pert_type" = "trt_cp", "pert_iname" = "amoxicillin", "cell_id" = "MCF7", "is_gold" = TRUE), poscon = "omit")
Note the poscon = "omit"
argument. There are a small number of instances
that have a pert_type
of trt_poscon
in clue.io's profiles
endpoint but
are recoded as trt_cp
in the sigs
endpoint. Since the is_gold
metadata
must be obtained from the sigs
endpoint, this can cause confusion/errors in
downstream analysis. Therefore, these samples can be omitted from query
results. In fact, this is the default behavior of the clueInstances
method
(the argument was simply specified above for emphasis).
You could then load these instances by "brute force" as we did above.
ix <- which(colnames(sl) %in% amox_gold) amox_gold_data <- readGCTX(sl[ ,ix])
But it is usually more useful to keep this data annotated with its
corresponding metadata in the form of an SummarizedExperiment
. This can be
easily done with the toSummarizedExperiment
method.
amox_gold_sumex <- as(sl[ ,ix], "SummarizedExperiment")
In fact, you could have skipped the trouble of looking up the ids in the first
place and simply passed the where_clause
to loadL1K
directly.
amox_gold_sumex <- loadL1K(sl, where_clause = list("pert_type" = "trt_cp", "pert_iname" = "amoxicillin", "cell_id" = "MCF7", "is_gold" = TRUE))
Note that by default, toSummarizedExperiment
loads all the genes in the
dataset. If only the L1000 landmark genes are desired, you can either pass
the inferred = FALSE
argument, or explicitly request just the first 978 rows.
Both methosd are shown below.
amox_gold_sumex <- loadL1K(sl, where_clause = list("pert_type" = "trt_cp", "pert_iname" = "amoxicillin", "cell_id" = "MCF7", "is_gold" = TRUE), inferred = FALSE) # equivalent to amox_gold_sumex <- loadL1K(sl[1:978, ], where_clause = list("pert_type" = "trt_cp", "pert_iname" = "amoxicillin", "cell_id" = "MCF7", "is_gold" = TRUE), inferred = FALSE) # equivalent to amox_gold_sumex <- loadL1K(sl[1:978, ], where_clause = list("pert_type" = "trt_cp", "pert_iname" = "amoxicillin", "cell_id" = "MCF7", "is_gold" = TRUE), inferred = FALSE) amox_gold_sumex <- amox_gold_sumex[1:978, ]
If you want some other subset of genes, simply subset the Slinky
object
accordingly. The entrez gene ids of the genes in the object can be retrieved by
the rownames
method (and likewise, the distil_ids
can be retrieved using
the colnames
methodD).
rownames(sl)[1:5] colnames(sl)[1:5] # note subsetting first will be faster as it avoids loading in the entire # set of names from the gctx file: rownames(sl[1:5, ]) colnames(sl[, 1:5]) # sanity check all.equal(as.character(colnames(sl)), as.character(metadata(sl)$distil_id))
Here is a more complex example. Let us identify those instances treated with
FDA approved compounds that are part of the "gold" subset of highly consistent
instances (as defined by LINCS). Definitively identifying FDA approved drugs
is a little tricky due to subtle (and not so subtle) differences in
vocabularies, but we exploit the "repositioning" data available at the
rep_drugs
endpoint to achieve our ends. Since we want to access the clue
API directly, we will use the lower level clue
method here rather
than loadL1K
.
fda <- clue(sl, "rep_drugs", where_clause = list("status_source" = list(like = "FDA Orange"), "final_status" = "Launched", "animal_only" = "0", "in_cmap" = TRUE), verbose = FALSE) fda_pert <- clueInstances(sl, poscon = "omit", where_clause = list("pert_type" = "trt_cp", "is_gold" = TRUE, "pert_iname" = list("inq" = fda$pert_iname)), verbose = FALSE)
Note the inq
syntax used above to provide an array of possible values to
match. This is a vernacular of the loopback
framework on which the clue.io
service is built. Further details can be obtained from the
loopback website, or
the numerous examples at the clue.io API.
We do not have access to most of these instances in the demo GCTX file, but
with the full GCTX file we could then load this data by passing the list
of ids as an argument to the loadL1K
method as follows:
fda_gold_sumex <- loadL1K(sl, ids = fda_pert))
Reading all 30320 instances from the full gctx file from
GEO takes about 5 minutes in our environment (in part because these instances
are scattered all accross the file). If you plan on reusing larger slices of
the data like this, it would be a good idea to save them as .rds
files for
rapid loading in the future.
It is often desirable to identify suitable controls for the perturbed samples
in your desired dataset. A simplistic approach would be just to identify all
the instances that were treated with the appropriate control. For instances of
type trt_cp
(compound perturbed instances), the vehicle is usually, though
not always, DMSO. For instances of type trt_sh
and trt_oe
, the control
would be empty vector.
The slinky package can query clue.io to identify the appropriate controls for a list of ids.
veh <- clueVehicle(sl, amox_gold, verbose=FALSE)
We could then identify the corresponding samples treated with that vehicle in our data set using any of the approaches outlined above. For example
ix <- which(metadata(sl)$pert_iname %in% veh$pert_vehicle) amox_and_control <- loadL1K(sl, ids = c(amox_gold, metadata(sl[, ix])$inst_id))
However, there are 54,707 DMSO treated instances in LINCS. You may not want
all of them every time you need controls for compound perturbed samples. A
better strategy might be to restrict the controls to those on the same plate as
your perturbed samples. The slinky package can retrieve the appropriate same
plate controls for trt_cp
, trt_sh
, and trt_oe
instance types, including a
mixture of those types.
ids.ctrl <- controls(sl, ids = amox_gold)$distil_id amox_and_control <- loadL1K(sl, ids = c(amox_gold, ids.ctrl))
In fact, you can skip the step of identifying the same-plate controls
altogether and simply specify controls=TRUE
when calling
loadL1K
:
amox_and_control <- loadL1K(sl, ids = amox_gold, controls = TRUE)
A common goal in analysis of the LINCS L1000 data is to identify gene signatures that describe the transcriptional footprint of a compound or gene. This might be done with a goal of matching drugs to diseases, identifying novel targets of drugs, or deconvoluting the key pathways involved in a disease.
The slinky package facilitates this analysis by identifying and loading
suitable data subsets necessary to either define or apply gene signatures. The
package takes this one step further by formatting the data for characteristic
direction analysis [@pmid24650281] as implemented in the GeoDE
package.
This functionality is wrapped in the diffexp
function.
Our goal is to incorporate other popular methods of differential gene expression analysis into this pipeline (analagous to the uniform interface the caret package provides for machine learning).
You can provide the diffexp
function two expression sets (treated and
control) to analyze. Alternatively, you can simply provide
the treated samples as an expression set and diffexp
will automatically
identify and load the corresponding same-plate control samples in the same
manner as discussed above.
Alternatively, you can simply specify a perturbagen of interest as the "treat"
argument. (You can also provide additional parameters to the "where_clause"
argument to further narrow your query). Then diffexp
will load the
corresponding data, and identify and load the corresponding same-plate
controls.
cd_vector <- diffexp(sl, treat = "amoxicillin", split_by_plate = FALSE, verbose = FALSE) head(cd_vector)
Note that not all the matching samples were found in our small demonstration gctx file. The function ran using the available data, but presents a message that some data was missing.
If split_by_plate = FALSE
, the function returns a vector of scores for each
gene in the dataset representing the magnitude and sign of that gene's
component vector
(i.e. its contribution to the characteristic direction of your dataset). In an
(arguably) ideal situation, you would only compare instances to controls from
the same plate. But to perform the characteristic direction analysis (or
any other imagineable statistical analysis) requires at least two samples in
the treated group and two in the control group. We do not always have this
luxury. So in the case of the amoxicillin, we have to combine our treated and
control samples across plates into a single analysis.
However, even in this scenario, the control samples identified by diffexp
still come from the same plates as the treated samples (even if there is only
one treated or one control on a given plate).
In the next example, we are able to split the analysis into plates because the perturbagen is present in replicates on each plate.
cd_vecs <- diffexp(sl, treat = "E2F3", where_clause = list("pert_type" = "trt_sh", "cell_id" = "MCF7"), split_by_plate = TRUE, verbose = FALSE) cd_vecs[1:5,1:3]
Here, diffexp
returns a matrix of scores, one row per gene and one column per
plate. The plates could then be summarized in some fashion. One nice approach
is to take the rank product [@pmid15327980] of the scores which identifies
those genes that are most consistently up regulated (the smaller the rank
product, the more likely the gene is to be differentially expressed.) This
approach can also identify down regulated genes, of course, provided the values
are sorted in the opposite direction.
# negate the values so 1 = the most up regulated gene # (rank sorts in ASCENDING order, which is not what we want) ranks <- apply(-cd_vecs, 2, rank) n <- nrow(ranks) rp <- apply(ranks, 1, function(x) { (prod(x/n))})
If the user has the org.Hs.eg.db
package installed, then the gene symbols
corresponding to the most upregulated genes in this case could be identified.
suppressMessages(library(org.Hs.eg.db)) entrez_ids <- names(sort(rp, decreasing=FALSE))[1:5] entrez_ids <- entrez_ids[which(entrez_ids %in% ls(org.Hs.egSYMBOL))] as.vector(unlist(mget(entrez_ids, org.Hs.egSYMBOL)))
This list of genes then could be used as a signature for E2F3 knockdown (though ideally some permutation based significance analysis would be performed first to further filter the gene list).
Having ready access to both the expression data and metadata facilitates visualization of the data, including some basic quality control measures. For example, we could explore to what extent batch (plate) effects are driving differences between samples.
suppressMessages(library(ggplot2)) suppressMessages(library(Rtsne)) sumex <- loadL1K(sl[seq_len(978), seq_len(131)]) set.seed(100) ts <- Rtsne(t(SummarizedExperiment::assays(sumex)[[1]]), perplexity = 10) tsne_plot <- data.frame(x = ts$Y[,1], y = ts$Y[,2], treatment = sumex$pert_iname, plate = sumex$rna_plate) ggplot(tsne_plot) + geom_point(aes(x = x, y = y, color = plate)) + labs(x = "TSNE X", y = "TSNE Y") + theme(axis.title = element_text(face = "bold", color = "gray"))
Ideally, the points would cluster by treatment and not by plate. Alas, as can be seen in the resulting figure above. Compare that to the same plot colored by treatment.
ggplot(tsne_plot) + geom_point(aes(x = x, y = y, color = treatment)) + labs(x = "TSNE X", y = "TSNE Y") + theme(axis.title = element_text(face = "bold", color = "gray"))
This underscores the importance of constructing your analysis of this data set in a way that controls for batch effects.
Note that presently, the Slinky object does not support loading multiple dataset (e.g. the phase 2 and phase 1 data together). However, you can create a separate Slinky object for each dataset, extract data of interest into \code{summarizedExperiment} objects with the \code{loadL1K} method and then simply merge them with cbind. Note that you will need to first verify that the two objects contain the same genes. For example:
sl1 <- Slinky(key, gctx1, info1) sl2 <- Slinky(key, gctx2, info2) ix <- which(match(rownames(sl1), rownames(sl2))) ix.na <- which(is.na(ix)) sl1 <- sl1[-ix.na, ] ix <- ix[-ix.na] sl2 <- sl2[ix,] sl <- cbind(sl1, sl2)
For the foreseeable future, our main emphasis will be to expand the methods that slinky supports for identifying differentially expressed genes and/or calculating enrichment of gene signatures in samples. High priority will of course be given to any bugs identiftied in the software, as well as other features that seem to be of high value to any other users of the software. Users are encouraged to either email the package maintainer or raise issues on the package's github repository with either bug reports or feature requests.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.