getDEE2: Programmatic access to the DEE2 RNA expression dataset

Background

Digital Expression Explorer 2 (or DEE2 for short) is a repository of processed RNA-seq data in the form of counts. It was designed so that researchers could undertake re-analysis and meta-analysis of published RNA-seq studies quickly and easily. As of April 2020, over 1 million SRA runs have been processed.

For further information about the resource, refer to the journal article and project homepage.

This package provides an interface to access these expression data programmatically.

Getting started

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("getDEE2")
library("getDEE2")

Searching for datasets of interest starting with accession numbers

The first step is to download the list of accession numbers of available datasets with the getDEE2Metadata function, specifying a species name. Options for species currently are:

If the species name is incorrect, an error will be thrown.

mdat <- getDEE2Metadata("celegans")
head(mdat)
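
Since an incorrect species name raises an error, a script that loops over several species may want to catch that error rather than stop. The following is a minimal sketch, not part of the package: `safe_metadata` is a hypothetical helper, and its `fetch` argument is an assumption added only so that the metadata function can be swapped for a stand-in when testing offline.

```r
# Hypothetical helper (not part of getDEE2): returns NULL instead of
# stopping when the metadata request fails, e.g. for a misspelled species.
safe_metadata <- function(species, fetch = getDEE2::getDEE2Metadata) {
    tryCatch(
        fetch(species),
        error = function(e) {
            message("Could not get metadata for '", species, "': ",
                conditionMessage(e))
            NULL
        }
    )
}

# Usage (requires the getDEE2 package and network access):
# mdat <- safe_metadata("celegans")
```

Checking the result with is.null() before continuing keeps the rest of the script from failing on a missing table.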

If you already have an SRA project accession number in mind (e.g. SRP009256), you can check whether its datasets are present.

mdat[which(mdat$SRP_accession %in% "SRP009256"),]

DEE2 data are centred around SRA run accession numbers. These SRR accessions can be obtained like this:

mdat1 <- mdat[which(mdat$SRP_accession %in% "SRP009256"),]
SRRvec <- as.vector(mdat1$SRR_accession)
SRRvec
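
The same subsetting pattern extends to several projects at once. The sketch below uses a small made-up table in place of the real metadata so that it runs without a download; only the column names SRR_accession and SRP_accession come from the examples above, and every accession value in it is invented for illustration.

```r
# Toy stand-in for the metadata table; the accession values are invented.
mdat_toy <- data.frame(
    SRR_accession = c("SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"),
    SRP_accession = c("SRP000001", "SRP000001", "SRP000002", "SRP000003"),
    stringsAsFactors = FALSE
)

# Select the runs that belong to any of several projects of interest
projects <- c("SRP000001", "SRP000003")
keep <- mdat_toy$SRP_accession %in% projects
srr_toy <- mdat_toy$SRR_accession[keep]
srr_toy

# Count the selected runs per project
table(mdat_toy$SRP_accession[keep])
```

A logical vector from %in% can index the table directly, so wrapping it in which() is optional here.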

Fetching DEE2 data using SRA run accession numbers

The general syntax for obtaining DEE2 data is this:

getDEE2(species,SRRvec,metadata,outfile=NULL,counts="GeneCounts")

First, the function queries the metadata to make sure that the requested datasets are present. If metadata is not specified, then it will download a 'fresh' copy of the metadata. It then fetches the requested expression data and constructs a SummarizedExperiment object. The 'counts' parameter controls the type of counts provided:

If 'outfile' is defined, then files will be downloaded to the specified path. If it is not defined, then the files are downloaded to a temporary directory and deleted immediately after use.

The SRR numbers need to exactly match those in SRA.

Here is an example of using the SRR vector as defined above.

suppressPackageStartupMessages(library("SummarizedExperiment"))
x <- getDEE2("celegans",SRRvec,metadata=mdat,counts="GeneCounts")
x
# show sample level metadata
colData(x)[1:7]
# show the counts
head(assays(x)$counts)

You can specify the SRR accessions directly in the function call, but be sure to type them correctly. If any SRR accessions are not present in the database, a warning message is shown.

x <- getDEE2("celegans",c("SRR363798","SRR363799","SRR3581689","SRR3581692"),
    metadata=mdat,counts="GeneCounts")

In this case the accessions SRR3581689 and SRR3581692 are A. thaliana accessions and therefore not present in the C. elegans accession list.
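
To find out in advance which accessions would trigger that warning, the requested vector can be compared against the metadata before fetching. This is a minimal sketch in base R; `available` is a stand-in for the mdat$SRR_accession column and holds only the two C. elegans runs named above, so that the example runs without a download.

```r
# Accessions requested in the example above
requested <- c("SRR363798", "SRR363799", "SRR3581689", "SRR3581692")

# Stand-in for mdat$SRR_accession; with real data use the column itself
available <- c("SRR363798", "SRR363799")

# Split the request into runs that are absent and runs that are present
missing_runs <- setdiff(requested, available)
present_runs <- intersect(requested, available)

if (length(missing_runs) > 0) {
    message("Not in the metadata: ", paste(missing_runs, collapse = ", "))
}
```

Passing only present_runs to getDEE2 would then avoid requesting runs that are not listed for the species.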

Downstream analysis

DEE2 data are well suited to downstream analysis with edgeR, DESeq2, and many other gene expression and pathway enrichment tools. For more information about working with SummarizedExperiment objects, refer to the rnaseqGene package, which describes a workflow for differential gene expression analysis of SummarizedExperiment objects.
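
As a rough illustration of the first steps such an analysis usually takes, the sketch below filters out lowly expressed genes and scales counts to counts per million in base R. The matrix is a made-up stand-in for assays(x)$counts, and the mean-count threshold of 10 is an arbitrary assumption; edgeR and DESeq2 provide their own filtering and normalisation, which should be preferred for real analyses.

```r
# Toy count matrix standing in for assays(x)$counts (values are invented)
counts <- matrix(
    c(500, 400, 600,
        0,   1,   0,
      300, 350, 250,
      200, 249, 150),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene", LETTERS[1:4]), paste0("run", 1:3))
)

# Keep genes whose mean count across runs reaches an arbitrary threshold
filtered <- counts[rowMeans(counts) >= 10, , drop = FALSE]

# Scale each run (column) by its library size to get counts per million
cpm <- t(t(filtered) / colSums(filtered)) * 1e6
round(cpm)
```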

Legacy function

The option to obtain DEE2 data in the legacy format is provided for completeness but is no longer recommended. It gives DEE2 data in the form of a list object with slots for gene counts, transcript counts, gene length, transcript length, quality control data, sample metadata summary, sample metadata (full) and any absent datasets.

x <- getDEE2("celegans",SRRvec,metadata=mdat,legacy=TRUE)
names(x)
head(x$GeneCounts)
head(x$TxCounts)
head(x$QcMx)
head(x$GeneInfo)
head(x$TxInfo)

Large project bundles

The DEE2 project has processed many SRA projects containing dozens to thousands of runs (available from the DEE2 website). These large project datasets are easiest to access with the "bundles" functionality described here. The three functions are:

  1. list_bundles downloads a list of available bundles for a species

  2. query_bundles checks whether a particular SRA project or GEO series accession number is available

  3. getDEE2_bundle fetches the expression data for a particular accession and loads it as a SummarizedExperiment object

In this first example, we search for a dataset with SRA project accession number SRP058781 and load the gene level counts.

bundles <- list_bundles("athaliana")
head(bundles)
query_bundles(species="athaliana",query="SRP058781",
    col="SRP_accession",bundles=bundles)
x <- getDEE2_bundle("athaliana", "SRP058781",
    col="SRP_accession",counts="GeneCounts")
assays(x)$counts[1:6,1:4]

Similarly, it is possible to search with GEO series numbers, as in the next example.

x <- getDEE2_bundle("drerio", "GSE106677",
    col="GSE_accession",counts="GeneCounts")
assays(x)$counts[1:6,1:4]

Session Info

sessionInfo()



getDEE2 documentation built on Nov. 8, 2020, 7:46 p.m.