About

The R package yolo is designed to subset large data by row and column attributes using the familiar RangedSummarizedExperiment (rse) and SummarizedExperiment (se) structures, without holding the matrix of values in memory. To achieve this, an rseHandle S4 object is defined that inherits from the RangedSummarizedExperiment class and adds two slots that map the current object's row and column indices to the original indices in the file(s). Analogously, the seHandle inherits from the SummarizedExperiment class and is used when the row space is not a GRanges object. We refer to rseHandle and seHandle objects collectively as yoloHandle objects.

The getvalues command can then evaluate a yoloHandle object and pull the data from the hard disk into memory. While adding and subsetting yoloHandle objects is endomorphic (i.e. it returns an object of the same class, rseHandle or seHandle, as the one supplied by the user), the output of getvalues is a RangedSummarizedExperiment object or a SummarizedExperiment object, depending on which kind of handle is evaluated.
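The idea of a handle that carries only index maps can be sketched in a few lines of base R. This is a toy illustration, not yolo's implementation: the in-memory matrix `backend` stands in for the file on disk, and the names `make_handle`, `subset_handle`, and `get_values` are hypothetical.

```r
# Toy stand-in for an on-disk matrix (hypothetical; yolo reads from HDF5/SQLite)
backend <- matrix(1:20, nrow = 5, ncol = 4)

# A "handle" stores only index maps into the backend, never the values
make_handle <- function(n_row, n_col) {
  list(row_map = seq_len(n_row), col_map = seq_len(n_col))
}

# Subsetting composes the index maps and returns another handle
subset_handle <- function(h, i, j) {
  list(row_map = h$row_map[i], col_map = h$col_map[j])
}

# Evaluation is the only step that touches the values
get_values <- function(h, backend) {
  backend[h$row_map, h$col_map, drop = FALSE]
}

h  <- make_handle(nrow(backend), ncol(backend))
h1 <- subset_handle(h, c(5, 2, 3), c(4, 1))  # reorder and drop rows/columns
h2 <- subset_handle(h1, c(3, 1), 2)          # backend rows 3 and 5, column 1
get_values(h2, backend)
```

Because every subset is expressed relative to the original indices, the values can be fetched correctly at the end no matter how many times the handle was reordered along the way.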

Libraries

library(GenomicRanges)
library(SummarizedExperiment)
library(yolo)
library(RSQLite)
library(rhdf5)

Build Data

In the current implementation of yolo, we support storing data in HDF5 and SQLite file formats. Tables in these files may be either sparse (three columns) or in a normal matrix representation. Though not directly part of this package, we show examples of how to export R data objects and files to HDF5 and SQLite file formats using the rhdf5 and RSQLite packages.

Notes:

1) The combination of "sparse" and "hdf5" is not supported.

2) By convention, all parameters throughout these functions should have no capital letters.
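To make the two table layouts concrete, the following base R sketch converts a small sparse (row, column, value) table into a normal matrix and back. The data are invented for illustration, and the sketch assumes 1-based indices with zero as the fill value for missing entries.

```r
# Sparse (triplet) representation: one record per non-zero entry
sparse <- data.frame(
  row    = c(1L, 2L, 3L),
  column = c(1L, 3L, 2L),
  value  = c(5L, 7L, 2L)
)

# Normal representation: fill a dense matrix from the triplets
dense <- matrix(0L, nrow = 3, ncol = 3)
dense[cbind(sparse$row, sparse$column)] <- sparse$value
dense

# And back again: keep only the non-zero cells
idx  <- which(dense != 0L, arr.ind = TRUE)
back <- data.frame(row = idx[, "row"], column = idx[, "col"], value = dense[idx])
back
```

The sparse layout stores three numbers per non-zero entry, so it pays off only when most of the matrix is zero.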

sparse SQLite

Below is one simple way to convert a .csv file in sparse matrix format into a .sqlite file.

f1name <- "d1.sqlite"

db <- dbConnect(SQLite(), dbname=f1name)
ft <- list(row="INTEGER", column="INTEGER", value="INTEGER")
df1 <- read.table(system.file("extdata", "dat1.csv", package = "yolo"), sep = "," , header = TRUE)
head(df1)
dbWriteTable(conn=db, name="data", value=df1, field.types=ft)
dbDisconnect(db)

The commands above create the "d1.sqlite" file, which can be linked to the appropriate row and column data to create an rseHandle object. First, we import these data:

readt <- read.table(system.file("extdata", "dat1_row.bed", package = "yolo"))
rowData1 <- GRanges(setNames(readt,  c("chr", "start", "stop")))
colData1 <- read.table(system.file("extdata", "dat1_col.txt", package = "yolo"))

Next, we can build our rseHandle object using the function below.

d1 <- yoloHandleMake(rowData1, colData1, lookupFileName = f1name)
d1

The yoloHandleMake function requires a GRanges object for the rowData when creating an rseHandle, and a DataFrame object when constructing an seHandle. In both cases, the constructor also takes an object that can be coerced into a DataFrame for the colData and a valid name of the file that holds the matrix values on the backend. When the constructor is called, it checks that the specified objects are compatible with one another. In other words, it makes sure that the dimensions of the rowData and colData match the dimensions in the backend file.
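As a rough illustration of what such a dimension check could look like for a sparse table, consider the base R sketch below. The helper `dims_fit` is hypothetical and is not part of yolo; it assumes 1-based row and column indices.

```r
# Hypothetical validity check for a sparse (row, column, value) table:
# every index must point at an annotated row and an annotated sample
dims_fit <- function(sparse, n_rows, n_cols) {
  all(sparse$row >= 1 & sparse$row <= n_rows) &&
    all(sparse$column >= 1 & sparse$column <= n_cols)
}

toy <- data.frame(row = c(1L, 4L), column = c(2L, 3L), value = c(8L, 5L))

dims_fit(toy, n_rows = 4, n_cols = 3)  # indices fit a 4 x 3 annotation
dims_fit(toy, n_rows = 3, n_cols = 3)  # row 4 would have no annotation
```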

Three other parameters can be specified in the yoloHandleMake function. First, lookupTableName specifies the index/table of the backend values. By default, the constructor assumes "data", as we specified in the dbWriteTable command earlier in the vignette. Another important parameter is lookupFileType, which can be either "sparse" (the default) or "normal". For a sparse matrix, we assume two columns labeled "row" and "column" in addition to a third that holds the values (see the ft variable defined before the dbWriteTable call above). For a "normal" matrix, the lookup simply indexes by row and column position, so the names are not relevant for that operation. Finally, lookupFileFormat can be either "HDF5" or "sqlite". The call to yoloHandleMake above used the default values for all three:

(lookupTableName = "data", lookupFileType = "sparse", lookupFileFormat = "sqlite")
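To give a sense of why the sparse layout pairs naturally with SQLite, the sketch below filters a toy sparse table by row and column index with a single SQL query. It uses an in-memory database and invented data, and the query is only an assumption about how such a lookup might be written, not the query yolo issues internally.

```r
library(RSQLite)

# In-memory database so that nothing is written to disk
con <- dbConnect(SQLite(), ":memory:")

# Toy sparse table with the three columns described above
toy <- data.frame(
  row    = c(1L, 1L, 2L, 3L, 3L),
  column = c(1L, 2L, 2L, 1L, 3L),
  value  = c(4L, 9L, 1L, 6L, 2L)
)
dbWriteTable(con, "data", toy)

# Pull only the records for rows 1 and 3 and columns 1 and 2.
# "row" and "column" are quoted because they can clash with SQL keywords.
hits <- dbGetQuery(
  con,
  'SELECT "row", "column", value FROM data
   WHERE "row" IN (1, 3) AND "column" IN (1, 2)'
)
dbDisconnect(con)
hits
```

Only the matching records leave the database, so the amount of data read scales with the size of the subset rather than with the size of the full table.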

normal HDF5

Another implementation uses HDF5. Currently, yolo only supports the "normal" matrix representation with HDF5 (sparse matrices are not supported). This is because the author couldn't find a way to filter rows based on values. This package supports putting multiple tables in either an HDF5 or SQLite file, and the implementation would look similar to the following.

f2name <- "dat.hdf5"
h5createFile(f2name)

# Read and Reshape 3 data objects to a normal matrix
df1 <- read.table(system.file("extdata", "dat1.csv", package = "yolo"), sep = "," , header = TRUE)
dat1m <- reshape2::acast(df1, row ~ column, fill = 0)
df2 <- read.table(system.file("extdata", "dat2.csv", package = "yolo"), sep = "," , header = TRUE)
dat2m <- reshape2::acast(df2, row ~ column, fill = 0)
df3 <- read.table(system.file("extdata", "dat3.csv", package = "yolo"), sep = "," , header = TRUE)
dat3m <- reshape2::acast(df3, row ~ column, fill = 0)

# Write to file
h5write(dat1m, f2name, "dat1")
h5write(dat2m, f2name, "dat2")
h5write(dat3m, f2name, "dat3")

h5ls(f2name)

To create an rseHandle for the first dataset:

d1h <- yoloHandleMake(rowData1, colData1, lookupFileName = f2name, lookupTableName = "dat1",
                     lookupFileFormat = "HDF5", lookupFileType = "normal")
d1h

We'll also create an rseHandle object for the third data object, referencing the same HDF5 file but different colData (dat1 and dat3 were designed to have the same rowData).

colData3 <- read.table(system.file("extdata", "dat3_col.txt", package = "yolo"))
d3h <- yoloHandleMake(rowData1, colData3, lookupFileName = f2name, lookupTableName = "dat3",
                     lookupFileFormat = "HDF5", lookupFileType = "normal")

normal SQLite

For normal matrices, we recommend the HDF5 format. However, SQLite is also supported if a user prefers it. Our package assumes 1) the existence of a "row_names" field in the table (automatically generated when row.names = TRUE, as shown below) and 2) that the column names correspond to the sample names (the row names of the colData) in the collated object. Below is an example of this construction.

colnames(dat3m) <- rownames(colData3)
dat3m <- data.frame(dat3m)
db <- dbConnect(SQLite(), dbname=f1name)
dbWriteTable(conn=db, name="data3", value=dat3m, row.names=TRUE)
dbListFields(db, "data3")
dbDisconnect(db)

d3s <- yoloHandleMake(rowData1, colData3, lookupFileName = f1name, lookupTableName = "data3",
                     lookupFileFormat = "sqlite", lookupFileType = "normal")

Again, we recommend working with HDF5 files for normal matrices. For sparse matrices, SQLite is currently the only supported format.

Addition

Users can add multiple rseHandle objects together as long as two conditions hold:

1) The rowRanges/rowData are the same.

2) The names in colData are the same.

Even though d1 pulls from a sparse sqlite file and d3h pulls from a normal HDF5 file, these two handles can be joined together because these two criteria are met. Notice that the resulting object has 35 samples.

d13 <- d1 + d3h
d13

This feature allows samples from different experiments to be joined together at a high level, again without reading any value data into memory. Only the column and row metadata are held.
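The two conditions for addition can be mimicked with plain data frames. The sketch below is a simplified stand-in, not yolo's own check: `can_add` is a hypothetical helper, the row annotation is an ordinary data.frame rather than a GRanges object, and all data are invented.

```r
# Toy row annotation shared by both handles, plus two sets of sample metadata
rows   <- data.frame(chr = c("chr1", "chr1"), start = c(100, 500), stop = c(200, 600))
cols_a <- data.frame(group = c("g1", "g2"), row.names = c("s1", "s2"))
cols_b <- data.frame(group = c("g2", "g3", "g3"), row.names = c("s3", "s4", "s5"))

# Hypothetical check: identical row annotation and identical colData column names
can_add <- function(rows_x, rows_y, cols_x, cols_y) {
  identical(rows_x, rows_y) && identical(names(cols_x), names(cols_y))
}

# When both conditions hold, joining is just stacking the sample metadata
if (can_add(rows, rows, cols_a, cols_b)) {
  combined <- rbind(cols_a, cols_b)
}
nrow(combined)
```

No matrix values are involved at any point. The join operates only on the annotations, which is what keeps it cheap.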

Subsetting

Users can subset using the [ and subsetByOverlaps calls that they are accustomed to in a standard RangedSummarizedExperiment.

dss1 <- d13[,d13@colData$group == "group4"]
chr1reg <- GRanges(seqnames=c("chr1"),ranges=IRanges(start=c(3338300),end=c(3422000)))
dss2 <- subsetByOverlaps(dss1, chr1reg)
d_small <- dss2[c(2,3,6,7,10), c(2,6,7,3)]
d_small

Get values

Through this process of adding and subsetting, we've jumbled up our samples. Not to worry! When we call the getvalues function, the correct representation of our matrix data is preserved because the handle keeps track of the indices of our files, rows, and columns.

rse_small <- getvalues(d_small)
class(rse_small)
assay(rse_small, 1)

Up until this getvalues command, none of the matrix values had been read from disk into memory. Thus, we could add and remove samples, as well as filter rows based on GRanges/DataFrame or index logic, while maintaining the correct annotations corresponding to our data.

Cleanup

With no further use for our files on disk, we can tidy up and remove them.

file.remove(f1name)
file.remove(f2name)

Session info

sessionInfo()


caleblareau/yolo documentation built on May 13, 2019, 11:01 a.m.