restfulSE -- experiments with HDF5 server content wrapped in SummarizedExperiment

suppressPackageStartupMessages({
library(restfulSE)
library(GO.db)
library(org.Hs.eg.db)
library(SummarizedExperiment)
library(ExperimentHub)
library(AnnotationHub)
})

restfulSE

This R package includes proof-of-concept code illustrating several approaches to SummarizedExperiment design with assays stored out-of-memory.

HDF5 server backed SummarizedExperiment

HDF Server "extends the HDF5 data model to efficiently store large data objects (e.g. up to multi-TB data arrays) and access them over the web using a RESTful API." In this restfulSE package, several data structures are introduced

We maintain, thanks to a grant from the National Cancer Institute, the server http://h5s.channingremotedata.org:5000/. Visit this URL to get a flavor of the server structure: datasets, groups, and datatypes are high-level elements to be manipulated to work with data values from the server.

Illustration with 10x genomics 1.3 million neurons

We used Martin Morgan's TENxGenomics package to transform the sparse-formatted HDF5 supplied by 10x into a dense HDF5 matrix to support natural slicing. Thanks to native compression in HDF5, the data volume expansion is modest.

A helper function in the restfulSE package creates a RESTfulSummarizedExperiment instance that points to the full numerical dataset.

library(restfulSE)
my10x = se1.3M()
my10x

As an exercise, we acquire the ENSEMBL identifiers for mouse genes annotated to hippocampus development, which has GO ID GO:0021766, and check counts for 10 genes on 6 samples:

library(org.Mm.eg.db)
hippdev = select(org.Mm.eg.db, 
    keys="GO:0021766", keytype="GO", column="ENSEMBL")$ENSEMBL
hippdev = intersect(hippdev, rownames(my10x))
unname(assay(my10x[ hippdev[1:10], 10001:10006]))

The result:

      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    0    0    0    0    0    0
 [2,]    0    0    0    0    0    0
 [3,]    0    0    0    1    0    0
 [4,]    0    1    2    6    5    0
 [5,]    0    0    0    0    0    0
 [6,]    1    2    4    8    7    3
 [7,]    0    0    0    0    0    0
 [8,]    0    0    0    0    0    2
 [9,]    0    0    0    0    0    0
[10,]    3    0    3    0    1    9

Illustration with GTEx tissue expression

We exported the content of the recount2 GTEx gene-level quantifications to our HDF5 server. A convenience function is available:

tiss = gtexTiss()
tiss

We'll use this remote data as a tool for investigating transcriptional patterns in brain anatomy. We can identify the samples from brain using the 'smtsd' colData element:

binds = grep("Brain", tiss$smtsd)
table(tiss$smtsd[binds][1:100]) # check diversity in 100 samples

We'll identify genes annotated to neurotrophic functions using another convenience function in this package:

ntgenes = goPatt(termPattern="neurotroph")
head(ntgenes)

Some details

Motivation

Extensive human and computational effort is expended on downloading and managing large genomic data at site of analysis. Interoperable formats that are accessible via generic operations like those in RESTful APIs may help to improve cost-effectiveness of genome-scale analyses.

In this report we examine the use of HDF5 server as a back end for assay data, mediated through the RangedSummarizedExperiment API for interactive use.

A modest server configured to deliver HDF5 content via a RESTful API has been prepared and is used in this vignette.

Executive summary

We want to provide rapid access to array-like data. We'll work with the Banovich 450k data as there is a simple check against an in-memory representation.

suppressPackageStartupMessages({
library(restfulSE)
library(SummarizedExperiment)
library(Rtsne)
library(rhdf5client)
})
library(restfulSE)
bigec2 = H5S_source("http://h5s.channingremotedata.org:5000")
bigec2
dsmeta(bigec2)[1:2,] # two groups
dsmeta(bigec2)[1,2][[1]] # all dataset candidates in group 1

We use double-bracket subscripting to grab a reference to a dataset from an H5S source.

banref = bigec2[["assays"]] # arbitrary name assigned long ago
banref

We build a SummarizedExperiment by combining an assay-free RangedSummarizedExperiment with this reference.

ehub = ExperimentHub::ExperimentHub()
myfiles <- AnnotationHub::query(ehub , "restfulSEData")
banoSE = myfiles[["EH551"]] # no assay data
ds = H5S_Array("http://h5s.channingremotedata.org:5000", "assays")
assays(banoSE) = SimpleList(betas=ds)
banoSE

We can update the SummarizedExperiment metadata through subsetting operations, and then extract the relevant assay data. The data are retrieved from the remote server with the assay method.

rbanoSub = banoSE[5:8, c(3:9, 40:50)] 
assay(rbanoSub) 

10xGenomics examples

t-SNE for a set of genes annotated to hippocampus

We have used Martin Morgan's TENxGenomics package to create a dense HDF5 representation of the assay data, and placed it on the bigec2 server. The metadata are available as se100k in this package; we have used EnsDb.Mmusculus.v79 to supply gene ranges where available; genes reported but without addresses are addressed at chr1:2 with width 0. The rows are sorted by genomic address within chromosomes.

tenx100k = se100k()
tenx100k

We will subset genes annotated to hippocampus development. Here are some related categories:

12092 GO:0021766                      hippocampus development
12096 GO:0021770            parahippocampal gyrus development
34609 GO:0097410      hippocampal interneuron differentiation
34631 GO:0097432 hippocampal pyramidal neuron differentiation
34656 GO:0097457                      hippocampal mossy fiber
35169 GO:0098686       hippocampal mossy fiber to CA3 synapse
42398 GO:1990026            hippocampal mossy fiber expansion
library(org.Mm.eg.db)
atab = select(org.Mm.eg.db, keys="GO:0021766", keytype="GO", columns="ENSEMBL")
hg = atab[,"ENSEMBL"]
length(hgok <- intersect(hg, rownames(tenx100k)))

This is a very scattered collection of rows in the matrix. We acquire expression measures for genes annotated to hippocampus on 4000 samples. t-SNE is then used to project the log-transformed measures to the plane.

hipn = assay(tenx100k[hgok,1:4000])  # slow
d = dist(t(log(1+hipn)), method="manhattan")
proj = Rtsne(d)
plot(proj$Y)

A set of genes related to the visual cortex

Tasic et al. (Nature neuro 2016, DOI 10.1038/nn.4216) describe single cell analysis of the adult murine brain, identify clusters of cells with distinct transcriptional profiles and anatomic location, and enumerate lists of genes that discriminate these clusters. The tasicST6 DataFrame provides details.

#data("tasicST6", package = "restfulSEData")
ehub = ExperimentHub::ExperimentHub()
myfiles <- AnnotationHub::query(ehub , "restfulSEData")
myfiles[["EH557"]] -> tasicST6
tasicST6

Key high-level discrimination concerns cells regarded as GABAergic vs. glutamatergic (inhibitory vs excitatory neurotransmission).

Background

Banovich et al. published a subset of DNA methylation measures assembled on 64 samples of immortalized B-cells from the YRI HapMap cohort.

library(restfulSE)
#data("banoSEMeta", package = "restfulSEData")
ehub = ExperimentHub::ExperimentHub()
myfiles <- AnnotationHub::query(ehub , "restfulSEData")
myfiles[["EH551"]] -> banoSEMeta
banoSEMeta

The numerical data have been exported using H. Pages' saveHDF5SummarizedExperiment applied to the banovichSE SummarizedExperiment in the yriMulti package. The HDF5 component is simply copied into the server data space on the remote server.

Hierarchy of server resources

Server

Given the URL of a server running HDF5 server, we create an instance of H5S_source:

mys = H5S_source(serverURL="http://h5s.channingremotedata.org:5000")
mys

Groups

The server identifies a collection of 'groups'. For the server we are working with, only one group, at the root, is of interest.

groups(mys)

Links for a group

There is a class to hold the link set for any group:

lin1 = rhdf5client::links(mys,1)
lin1


Try the restfulSE package in your browser

Any scripts or data that you put into this service are public.

restfulSE documentation built on May 2, 2018, 3:31 a.m.