Digital Expression Explorer 2 (or DEE2 for short) is a repository of processed RNA-seq data in the form of counts. It was designed so that researchers could undertake re-analysis and meta-analysis of published RNA-seq studies quickly and easily. As of April 2020, over 1 million SRA runs have been processed.
This package provides an interface to access these expression data programmatically.
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("getDEE2")
library("getDEE2")
The first step is to download the list of accession numbers of available datasets with the getDEE2Metadata function, specifying a species name.
Supported species include celegans, athaliana and drerio, which are used in the examples here.
If the species name is incorrect, an error will be thrown.
mdat <- getDEE2Metadata("celegans")
head(mdat)
If you already have an SRA project accession number in mind (e.g. SRP009256), you can check whether its datasets are present.
mdat[which(mdat$SRP_accession %in% "SRP009256"),]
DEE2 data are centred around SRA run accession numbers. These SRR accessions can be obtained like this:
mdat1 <- mdat[which(mdat$SRP_accession %in% "SRP009256"),]
SRRvec <- as.vector(mdat1$SRR_accession)
SRRvec
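The subsetting steps above can be tried offline on a small stand-in table. This is a minimal sketch that assumes only the two column names used in this document (SRR_accession and SRP_accession); the table toy_mdat and every accession in it are invented placeholders, not real DEE2 metadata.

```r
# Toy stand-in for the table returned by getDEE2Metadata();
# the accessions below are placeholders, not real SRA records.
toy_mdat <- data.frame(
  SRR_accession = c("SRR_toy_1", "SRR_toy_2", "SRR_toy_3", "SRR_toy_4"),
  SRP_accession = c("SRP_toy_A", "SRP_toy_A", "SRP_toy_B", "SRP_toy_A"),
  stringsAsFactors = FALSE
)

# Keep only the runs that belong to the project of interest
toy_project <- toy_mdat[toy_mdat$SRP_accession %in% "SRP_toy_A", ]

# Extract the run accessions as a plain character vector
toy_SRRvec <- as.vector(toy_project$SRR_accession)
toy_SRRvec
```

The resulting character vector has the same shape as the SRRvec object that is later passed to getDEE2.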
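DEE2 data are obtained with the getDEE2 function, which takes a species name and a vector of SRR accessions, along with the optional 'metadata', 'counts' and 'outfile' arguments described next.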
First, the function queries the metadata to make sure that the requested datasets are present. If metadata is not specified, then it will download a 'fresh' copy of the metadata. It then fetches the requested expression data and constructs a SummarizedExperiment object. The 'counts' parameter controls the type of counts provided:
GeneCounts: STAR gene-level counts (this is the default)
TxCounts: Kallisto transcript-level counts
Tx2Gene: transcript counts aggregated (summed) to the gene level
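To make the Tx2Gene idea concrete, the sketch below sums transcript-level counts within each gene using base R. It illustrates the aggregation concept only and is not taken from the package's code; the matrix tx_counts, the map tx2gene, and all identifiers and values are hypothetical.

```r
# Toy transcript-level counts: rows are transcripts, columns are runs.
# All identifiers and values are invented for illustration.
tx_counts <- matrix(
  c(10, 0,
     5, 3,
     7, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), c("run1", "run2"))
)

# Hypothetical transcript-to-gene map (tx1 and tx2 belong to geneA)
tx2gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB")

# Sum transcript counts within each gene, one row per gene
gene_counts <- rowsum(tx_counts, group = unname(tx2gene[rownames(tx_counts)]))
gene_counts
```

Here geneA receives the summed counts of tx1 and tx2, while geneB simply carries over the counts of tx3.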
If 'outfile' is defined, then files will be downloaded to the specified path. If it is not defined, then the files are downloaded to a temporary directory and deleted immediately after use.
The SRR numbers need to exactly match those in SRA.
Here is an example of using the SRR vector as defined above.
suppressPackageStartupMessages(library("SummarizedExperiment"))
x <- getDEE2("celegans",SRRvec,metadata=mdat,counts="GeneCounts")
x
# show sample level metadata
colData(x)[1:7]
# show the counts
head(assays(x)$counts)
You can also specify the SRR accessions directly in the function call, but be sure to type them correctly. If any SRR accessions are not present in the database, a warning message is shown.
x <- getDEE2("celegans",c("SRR363798","SRR363799","SRR3581689","SRR3581692"), metadata=mdat,counts="GeneCounts")
In this case the accessions SRR3581689 and SRR3581692 are A. thaliana accessions and therefore not present in the C. elegans accession list.
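If you prefer to catch such mismatches before requesting data, the requested accessions can be compared against the metadata first. The sketch below is one possible approach under stated assumptions: the vector available stands in for mdat$SRR_accession, and all accessions are invented placeholders.

```r
# Placeholder accession list standing in for mdat$SRR_accession
available <- c("SRR_toy_1", "SRR_toy_2", "SRR_toy_3")

# Accessions we intend to request; the last two are not in the list
requested <- c("SRR_toy_1", "SRR_toy_2", "SRR_toy_8", "SRR_toy_9")

# Split the request into accessions that are listed and those that are not
present <- intersect(requested, available)
absent  <- setdiff(requested, available)

# Report anything that would not be found
if (length(absent) > 0) {
  message("Not in metadata: ", paste(absent, collapse = ", "))
}
present
```

Only the accessions in present would then be worth passing on as the SRR vector.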
DEE2 data are suitable for downstream analysis with edgeR, DESeq2, and many other gene expression and pathway enrichment tools. For more information about working with SummarizedExperiment objects, refer to the rnaseqGene package, which describes a workflow for differential gene expression analysis of SummarizedExperiment objects.
The option to obtain DEE2 data in the legacy format is provided for completeness but is no longer recommended. It returns DEE2 data as a list object with slots for gene counts, transcript counts, gene length, transcript length, quality control data, sample metadata summary, sample metadata (full) and any absent datasets.
x <- getDEE2("celegans",SRRvec,metadata=mdat,legacy=TRUE)
names(x)
head(x$GeneCounts)
head(x$TxCounts)
head(x$QcMx)
head(x$GeneInfo)
head(x$TxInfo)
DEE2 has processed many projects containing dozens to thousands of runs, which are available from the DEE2 webpage. These large project datasets are easiest to access with the "bundles" functionality described in this section. The three functions are:
list_bundles: downloads a list of available bundles for a species
query_bundles: checks whether a particular SRA project or GEO series accession number is available
getDEE2_bundle: fetches the expression data for a particular accession and loads it as a SummarizedExperiment object
In this first example, we search for a dataset with SRA project accession number SRP058781 and load the gene level counts.
bundles <- list_bundles("athaliana")
head(bundles)
query_bundles(species="athaliana",query="SRP058781",
    col="SRP_accession",bundles=bundles)
x <- getDEE2_bundle("athaliana", "SRP058781",
    col="SRP_accession",counts="GeneCounts")
assays(x)$counts[1:6,1:4]
Similarly, it is possible to search with GEO series numbers, as in the next example.
x <- getDEE2_bundle("drerio", "GSE106677",
    col="GSE_accession",counts="GeneCounts")
assays(x)$counts[1:6,1:4]
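Conceptually, searching by SRA project or by GEO series amounts to looking up an accession in a chosen column of the bundle table. The sketch below illustrates that idea offline and is not the package's implementation: the helper bundle_available and the table toy_bundles are hypothetical, and only the column names SRP_accession and GSE_accession come from the calls shown in this document.

```r
# Placeholder bundle table; only the two column names come from this document
toy_bundles <- data.frame(
  SRP_accession = c("SRP_toy_A", "SRP_toy_B", "SRP_toy_C"),
  GSE_accession = c("GSE_toy_1", NA, "GSE_toy_3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `query` occurs in column `col` of `bundles`
bundle_available <- function(bundles, query, col) {
  stopifnot(col %in% colnames(bundles))
  query %in% bundles[[col]]
}

# Look up by SRA project accession, then by GEO series accession
bundle_available(toy_bundles, "SRP_toy_B", col = "SRP_accession")
bundle_available(toy_bundles, "GSE_toy_2", col = "GSE_accession")
```

Passing the column name as an argument mirrors the col parameter used in the calls above, so the same lookup serves both accession types.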