suppressPackageStartupMessages({
suppressMessages({
library(hsdsRclient)
})
})

Introduction

We are learning how to design R code to work with HSDS, an object store interface to HDF5. The hsdsRclient package will likely merge with rhdf5client soon (Q4 2017).

Please note the version of rhdf5client used here. There have been issues with the handling of binary transfer from HSDS that are not fully resolved.

sessionInfo()

Some very basic benchmarking

There is a long-running server on XSEDE jetstream that can be contacted from outside. We demonstrate sample-level sums for the 10x 1.3 million neuron dataset.

hostPath = "/home/john/tenx_full.h5"
serverPort = "http://149.165.156.174:5101/"
txsrc = HSDS_source(serverPort, hostPath) # top level object
tx2 = HSDS_dataset(txsrc) # get a sliceable reference
apply(tx2[1:6, 1:27998],1,sum) # compute sample-level sums for first six

Let's benchmark this activity and then drill a little deeper to the communication details.

library(microbenchmark)
microbenchmark(apply(tx2[1:6, 1:27998],1,sum), times=5) 

There is a slot called transfermode in tx2:

tx2@transfermode

Change to 'binary':

tx2@transfermode = "binary"
microbenchmark(txs6 <- apply(tx2[1:6, 1:27998],1,sum), times=5) 
txs6

A modest improvement.

Distributing the request over multiple ports

Let's do 500 samples with the simple call to binary interface.

microbenchmark(s500 <- apply(tx2[1:500, 1:27998],1,sum), times=2) 

Some answers:

s500[1:6]
summary(s500)

It seems difficult to use multicore computing with the communications required here. So we use socket-based job distribution on a multicore machine.

library(BiocParallel)
spar = SnowParam(2, type="SOCK")
register(spar)

With SnowParam, a complete environment needs to be prepared on each slave. We encapsulate the retrieval task in function encaps.

encaps = function (specl = list(inds = 1:2, port = "5101"))
{
    stopifnot(all.equal(names(specl), c("inds", "port")))
    library(hsdsRclient)
    hostPath = "/home/john/tenx_full.h5"
    serverPort = sprintf("http://149.165.156.174:%s/", specl$port)
    txsrc = HSDS_source(serverPort, hostPath)
    tx2 = HSDS_dataset(txsrc)
    apply(tx2[specl$inds, 1:27998], 1, sum)
}

We will now query port 5101 with two parallel requests.

system.time(sans <- bplapply(list(
     list(inds=1:250, port="5101"),
     list(inds=251:500, port="5101")), encaps))
summary(unlist(sans))

Now use two different ports.

system.time(sans <- bplapply(list(
     list(inds=1:250, port="5101"),
     list(inds=251:500, port="5102")), encaps))
summary(unlist(sans))

Now use four different ports.

system.time(sans <- bplapply(list(
     list(inds=1:125, port="5101"),
     list(inds=126:250, port="5102"),
     list(inds=251:375, port="5103"),
     list(inds=376:500, port="5104")), encaps))
summary(unlist(sans))

Now use four different ports, full initial yield per port.

system.time(sans <- bplapply(list(
     list(inds=1:500, port="5101"),
     list(inds=501:1000, port="5102"),
     list(inds=1001:1500, port="5103"),
     list(inds=1501:2000, port="5104")), encaps))
summary(unlist(sans))

Do it sequentially.

system.time(sans <- lapply(list(
     list(inds=1:500, port="5101"),
     list(inds=501:1000, port="5102"),
     list(inds=1001:1500, port="5103"),
     list(inds=1501:2000, port="5104")), encaps))
summary(unlist(sans))


vjcitn/hsdsRclient documentation built on May 14, 2019, 2:01 a.m.