The R package yolo is designed to subset large data based on column and row
attributes, using the familiar RangedSummarizedExperiment (rse) and
SummarizedExperiment (se) structures, without
holding the matrix of values in memory. To achieve this, an rseHandle S4 object
is defined that inherits from the RangedSummarizedExperiment class with the addition
of two other slots that map the current object's row and column indices to the
original indices in the file(s). Analogously, the seHandle class inherits from the
SummarizedExperiment class when the row space is not a GRanges object. Jointly,
we refer to the union of rseHandle and seHandle objects as yoloHandle objects.
The getvalues command can then evaluate a
yoloHandle object and pull the data from the hard disk into memory. While adding and
subsetting a yoloHandle object is endomorphic (i.e. it returns the same
rseHandle or seHandle class supplied by the user),
the output of getvalues is a RangedSummarizedExperiment object or a
SummarizedExperiment object, depending on which is evaluated.
library(GenomicRanges)
library(SummarizedExperiment)
library(yolo)
library(RSQLite)
library(rhdf5)
In the current implementation of yolo, we support storing data in HDF5
and SQLite file formats. Tables in these files may either be sparse
(three columns) or in a normal matrix representation. Though not
directly part of this package, we show examples of how to export
R data objects and files to HDF5 and SQLite file formats using the
rhdf5 and RSQLite packages.
Notes:
1) The combination of "sparse" and "hdf5" is not supported.
2) By convention, all parameters throughout these functions should have no capital letters.
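To make the two layouts concrete, here is a toy illustration; the sparse_df and normal_m objects below are made up for this example and are not part of the package data, but the column names row, column, and value mirror the sparse convention used throughout this vignette.

# A toy sparse (three-column) representation: one record per nonzero value
sparse_df <- data.frame(row = c(1, 1, 2, 3),
                        column = c(1, 3, 2, 3),
                        value = c(5, 2, 7, 1))
sparse_df

# The equivalent "normal" (dense) matrix representation
normal_m <- reshape2::acast(sparse_df, row ~ column, value.var = "value", fill = 0)
normal_m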
Below is one simple example of converting
a .csv file that is in a sparse matrix format into a .sqlite file.
f1name <- "d1.sqlite" db <- dbConnect(SQLite(), dbname=f1name) ft <- list(row="INTEGER", column="INTEGER", value="INTEGER") df1 <- read.table(system.file("extdata", "dat1.csv", package = "yolo"), sep = "," , header = TRUE) head(df1) dbWriteTable(conn=db, name="data", value=df1, field.types=ft) dbDisconnect(db)
The commands above create the "d1.sqlite" file, which can be linked to appropriate
column and row data to create an rseHandle object. First, we import these data--
readt <- read.table(system.file("extdata", "dat1_row.bed", package = "yolo"))
rowData1 <- GRanges(setNames(readt, c("chr", "start", "stop")))
colData1 <- read.table(system.file("extdata", "dat1_col.txt", package = "yolo"))
Next, we can build our rseHandle object using the constructor function below.
d1 <- yoloHandleMake(rowData1, colData1, lookupFileName = f1name)
d1
The yoloHandleMake function requires a GRanges object for the rowData
when creating an rseHandle, and a DataFrame object when constructing
an seHandle object. In both cases, the constructor also takes
an object that can be coerced into a DataFrame for the colData and the name
of a valid file that contains the values of the matrices on the backend. When the
constructor function is called, additional checks determine the validity of
the construction to ensure that the specified objects will play nicely
together. In other words, the constructor checks that the dimensions
of the rowData and colData match the dimensions of the data in the backend file.
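As a rough illustration of the kind of check involved (this is not the constructor's actual code), the extent of the sparse SQLite table created above can be queried and compared against the annotation objects along these lines.

# Illustrative only: compare the backend table's extent to the annotations
db <- dbConnect(SQLite(), dbname = f1name)
backend_dims <- dbGetQuery(db, 'SELECT MAX("row") AS nrow, MAX("column") AS ncol FROM data')
dbDisconnect(db)

# For a sparse table, the largest stored index cannot exceed the
# dimensions implied by the rowData and colData
stopifnot(backend_dims$nrow <= length(rowData1),
          backend_dims$ncol <= nrow(colData1))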
Three other parameters can be specified in the yoloHandleMake function. Namely,
the lookupTableName specifies the index/table of the backend values. By default,
the constructor assumes "data", as we specified in the dbWriteTable command earlier
in the vignette. Another important parameter is the lookupFileType, which can be
specified as either "sparse" (the default) or "normal". For a sparse matrix,
we assume two columns labeled "row" and "column" in addition to a third that holds
the specific values (see the ft variable defined earlier in the vignette). For a "normal"
matrix, the lookup simply indexes off of row and column positions, so the column
names are not relevant for that operation. Finally, the lookupFileFormat can
be either "HDF5" or "sqlite". The call to the yoloHandleMake function above
used all default values--
(lookupTableName = "data", lookupFileType = "sparse", lookupFileFormat = "sqlite")
Another implementation uses HDF5. Currently, yolo only supports the "normal"
matrix implementation for HDF5 (sparse matrices are not supported), because
the author could not find a way to filter HDF5 rows based on their values. This package
supports putting multiple tables in either an HDF5 or SQLite file, and the
implementation would look similar to the following.
f2name <- "dat.hdf5" h5createFile(f2name) # Read and Reshape 3 data objects to a normal matrix df1 <- read.table(system.file("extdata", "dat1.csv", package = "yolo"), sep = "," , header = TRUE) dat1m <- reshape2::acast(df1, row ~ column, fill = 0) df2 <- read.table(system.file("extdata", "dat2.csv", package = "yolo"), sep = "," , header = TRUE) dat2m <- reshape2::acast(df2, row ~ column, fill = 0) df3 <- read.table(system.file("extdata", "dat3.csv", package = "yolo"), sep = "," , header = TRUE) dat3m <- reshape2::acast(df3, row ~ column, fill = 0) # Write to file h5write(dat1m, "dat.hdf5","dat1") h5write(dat2m, "dat.hdf5","dat2") h5write(dat3m, "dat.hdf5","dat3") h5ls("dat.hdf5")
To create an rseHandle for the first dataset--
d1h <- yoloHandleMake(rowData1, colData1, lookupFileName = f2name,
                      lookupTableName = "dat1", lookupFileFormat = "HDF5",
                      lookupFileType = "normal")
d1h
We'll also create an rseHandle object for the third data object referencing
the same HDF5 file but different colData. (dat1 and dat3 were designed
to have the same rowData).
colData3 <- read.table(system.file("extdata", "dat3_col.txt", package = "yolo"))
d3h <- yoloHandleMake(rowData1, colData3, lookupFileName = f2name,
                      lookupTableName = "dat3", lookupFileFormat = "HDF5",
                      lookupFileType = "normal")
For normal matrices, we recommend the HDF5 construct. However, if a user
prefers SQLite, this is supported. Our package assumes 1) the existence of
a "row_names" attribute in the table (automatically generated when
row.names = TRUE, as shown below) and 2) that each column name
corresponds to the sample names (i.e. the names of the colData) in the
collated object. Below is an example of this construction.
colnames(dat3m) <- rownames(colData3)
dat3m <- data.frame(dat3m)
db <- dbConnect(SQLite(), dbname = f1name)
dbWriteTable(conn = db, name = "data3", value = dat3m, row.names = TRUE)
dbListFields(db, "data3")
dbDisconnect(db)

d3s <- yoloHandleMake(rowData1, colData3, lookupFileName = f1name,
                      lookupTableName = "data3", lookupFileFormat = "sqlite",
                      lookupFileType = "normal")
Again, we recommend working with HDF5 files for normal matrices. For sparse matrices, SQLite is currently the only supported format.
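At this point the SQLite file holds both the original sparse table and the new normal-matrix table, and the HDF5 file holds three datasets. If you would like to confirm what each backend file currently contains, something along these lines should work.

# List the tables/datasets currently stored in each backend file
db <- dbConnect(SQLite(), dbname = f1name)
dbListTables(db)   # should include "data" and "data3"
dbDisconnect(db)
h5ls(f2name)       # should list "dat1", "dat2", and "dat3"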
Users can add multiple rseHandle objects together as long as
two conditions hold (a quick way to verify them is sketched after the list)--
1) The rowRanges/rowData are the same
2) The names in colData are the same
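Since the handle classes inherit from RangedSummarizedExperiment, the standard accessors should apply, so a sanity check along the following lines can be used before adding two handles; this is illustrative only, and it reads "names in colData" as the column names of the colData DataFrames.

# Illustrative check of the two conditions before adding d1 and d3h;
# this relies on the standard RangedSummarizedExperiment accessors,
# which the handle classes inherit
all(rowRanges(d1) == rowRanges(d3h))
identical(colnames(colData(d1)), colnames(colData(d3h)))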
Even though d1 pulls from a sparse sqlite file and
d3h pulls from a normal HDF5 file, these two handles
can be joined together because these two criteria are met.
Notice that the resulting object has 35 samples.
d13 <- d1 + d3h
d13
This feature allows samples from different experiments to be joined together at a high level, again without reading any value data into memory aside from the column and row metadata.
Users can subset using the [ and subsetByOverlaps calls that they
are accustomed to in a standard RangedSummarizedExperiment.
dss1 <- d13[, d13@colData$group == "group4"]
chr1reg <- GRanges(seqnames = c("chr1"),
                   ranges = IRanges(start = c(3338300), end = c(3422000)))
dss2 <- subsetByOverlaps(dss1, chr1reg)
d_small <- dss2[c(2, 3, 6, 7, 10), c(2, 6, 7, 3)]
d_small
Through this process of adding and subsetting, we've jumbled up our samples.
Not to worry! Using the getvalues function, the correct representation of our
matrix data is preserved by keeping track of the indices
of our files, rows, and columns.
rse_small <- getvalues(d_small)
class(rse_small)
assay(rse_small, 1)
Up until this getvalues command, none of the values of the matrix
were read into memory. Thus, we could add and remove samples,
as well as filter row regions based on GRanges/DataFrame or index logic,
and still maintain the correct annotations corresponding to our data.
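As a quick way to see this, the standard accessors on the recovered object should show the same 5 rows and 4 samples that were selected in d_small above.

# Inspect the recovered object's annotations; d_small above kept
# 5 rows and 4 samples, and rse_small should reflect the same subset
dim(rse_small)
rowRanges(rse_small)
colData(rse_small)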
Since we have no further use for the files on disk, we can tidy up and remove them.
file.remove(f1name)
file.remove(f2name)
sessionInfo()