Description Semantics Loom Files Chunk-based iteration Extending loomR
loomR provides an interface for working with loom files in a loom-specific way. We provide routines for validating loom files and for iterating in chunks through the data within a loom file, as well as a platform on which other packages can build support for loom files. Unlike other HDF5 packages, loomR actively protects a loom file's structure, enabling the user to focus on their analysis and not worry about the integrity of their data.
Throughout all loomR-related documentation and writing, the following styles for distinguishing between loom files, `loom` objects, and loomR will be used.
When talking about loom files, or the actual HDF5 file on disk, the word 'loom' will be written in normal text. Capitalization will follow a language's rules for
capitalization in sentences. For English, that means if the word 'loom' appears at the beginning of a sentence
and is being used to refer to a loom file, it will be capitalized. Otherwise, it will be lowercase.
For `loom` objects, or the object within R, the word 'loom' will always be lowercase and written in monospaced text.
When referring to the package loomR, it will always be written in normal text with the 'l', 'o's, and 'm' lowercased and
the 'R' uppercased. This style will be used throughout documentation for loomR as well as any vignettes and tutorials
produced by the authors.
Loom files are an HDF5-based format for storing and interacting with large single-cell RNAseq datasets.
Each loom file has at least six parts to it: the raw expression data (`matrix`), groups for gene- and cell-metadata (`row_attrs` and `col_attrs`, respectively), groups for gene-based and cell-based cluster graphs (`row_graphs` and `col_graphs`, respectively), and `layers`, a group containing alternative representations of the data in `matrix`.
Each dataset within the loom file has rules as to what size it may be, creating a structure for the entire loom file and all the data within.
This structure is enforced to ensure that data remains intact and retrievable when spread across the various datasets in the loom file.
`matrix`
The dataset that sets the dimensions for most other datasets within a loom file. This dataset has 'n' genes and 'm' cells. Due to the way that loomR presents data, this will appear as 'm' rows and 'n' columns. However, other HDF5 libraries will generally present the data as 'n' rows and 'm' columns.
`row_attrs` and `col_attrs`
These are one- or two-dimensional datasets where a specific dimension is of length 'n', for row attributes, or 'm', for column attributes. Within loomR, this must be the second dimension of two-dimensional datasets, or the length of one-dimensional datasets. Most other HDF5 libraries will show this specific dimension as the first dimension for two-dimensional datasets, or the length of one-dimensional datasets.
`row_graphs` and `col_graphs`
Unlike other datasets within a loom file, these are not controlled by `matrix`. Instead, within these groups are groups for
specific graphs. Each graph group will have three datasets that represent the graph in
coordinate format: `a` for row indices, `b` for column indices, and `w` for values. Each dataset within a graph must be one-dimensional and all datasets within a graph must be
the same length. Different graphs need not be the same length as each other.
`layers`
Each dataset within `layers` must have the exact same dimensions as `matrix`.
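To make the size rules above concrete, here is a minimal sketch that checks them on an in-memory stand-in for a loom file. It does not use loomR or HDF5 at all: the `mock` list and the `CheckLoomShape` helper are hypothetical names invented for illustration, and the list simply mirrors the group names described above, using the loomR orientation (cells as rows, genes as columns).

```r
# Hypothetical in-memory stand-in for a loom file (not a real loomR object)
mock <- list(
  matrix = matrix(0, nrow = 3, ncol = 4),               # 3 cells x 4 genes
  layers = list(spliced = matrix(0, nrow = 3, ncol = 4)),
  row_attrs = list(gene_names = paste0('gene', 1:4)),   # length n (genes)
  col_attrs = list(cell_names = paste0('cell', 1:3)),   # length m (cells)
  col_graphs = list(knn = list(a = c(1L, 2L), b = c(2L, 3L), w = c(0.5, 0.8)))
)

# Hypothetical helper: return the names of datasets that break the size rules
CheckLoomShape <- function(x) {
  n.cells <- nrow(x$matrix)
  n.genes <- ncol(x$matrix)
  problems <- character(0)
  # Every layer must match the dimensions of matrix exactly
  for (name in names(x$layers)) {
    if (!identical(dim(x$layers[[name]]), dim(x$matrix))) {
      problems <- c(problems, paste0('layers/', name))
    }
  }
  # Length of a 1-D attribute, or second dimension of a 2-D attribute
  AttrLength <- function(a) if (is.null(dim(a))) length(a) else dim(a)[2]
  for (name in names(x$row_attrs)) {
    if (AttrLength(x$row_attrs[[name]]) != n.genes) {
      problems <- c(problems, paste0('row_attrs/', name))
    }
  }
  for (name in names(x$col_attrs)) {
    if (AttrLength(x$col_attrs[[name]]) != n.cells) {
      problems <- c(problems, paste0('col_attrs/', name))
    }
  }
  # Within one graph, a, b, and w must all have the same length
  for (group in c('row_graphs', 'col_graphs')) {
    for (name in names(x[[group]])) {
      sizes <- lengths(x[[group]][[name]][c('a', 'b', 'w')])
      if (length(unique(sizes)) != 1) {
        problems <- c(problems, paste0(group, '/', name))
      }
    }
  }
  problems
}

CheckLoomShape(mock)  # no problems for the well-formed stand-in

# Breaking a layer's dimensions is reported by name
bad <- mock
bad$layers$spliced <- matrix(0, nrow = 2, ncol = 2)
CheckLoomShape(bad)
```

loomR's own validation routines operate on the HDF5 file itself; this sketch only restates the rules in executable form.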
As loom files can theoretically hold million-cell datasets, performing analysis on these datasets can be impossible because such a dataset may not fit
in memory. To combat this problem, `loom` objects offer native chunk-based iteration through
the `batch.scan`, `batch.next`, `map`, and `apply` methods. This section will cover the first two methods; the latter
two are covered in the loomR tutorial.
`batch.scan` and `batch.next` are the heart of all chunk-based iteration in the `loom` object. These two methods make
use of an `itertools::ichunk` object to chunk through the data in a loom file. Due to the way that R works, `batch.scan`
initializes the iterator and `batch.next` moves through the iterator.
The `batch.scan` method will break a dataset in the loom file into chunks, based on a chunk size given to it. `batch.scan` will
work on any dataset, except for two-dimensional attributes and any graph dataset. When iterating over `matrix`
and the layers, the `MARGIN` argument tells the `loom` object which way to chunk the data. A `MARGIN` of 1 will chunk over genes while a `MARGIN` of 2 will chunk
over cells. For one-dimensional attributes, `MARGIN` is ignored. `batch.scan` returns an integer vector whose length is the number of iterations
it takes to iterate over the dataset selected.
Pulling data in chunks is done by `batch.next`. This method simply returns the next chunk of data. If `return.data = FALSE` is passed,
`batch.next` will instead return the indices of the next chunk. When using these methods, we recommend storing the results of `batch.scan`
and iterating through this vector to keep track of where the `loom` object is in the iteration.
```r
# Set up the iterator on the `loom` object lfile
batch <- lfile$batch.scan(dataset.use = 'matrix', MARGIN = 2)
# Iterate through the dataset, pulling data
# If `return.data = FALSE` is passed, the indices
# of the next chunk will be returned instead
for (i in batch) {
  data.use <- lfile$batch.next()
}
```
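For readers without a loom file at hand, the same scan-then-next pattern can be sketched in base R. This is not how loomR implements chunking (the text says it relies on `itertools::ichunk`); `ChunkIndices` is a hypothetical helper and `mat` is an ordinary in-memory matrix standing in for `matrix`, with cells as rows.

```r
# Hypothetical helper: split the indices 1..n into consecutive chunks
ChunkIndices <- function(n, chunk.size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / chunk.size)))
}

# Stand-in for the data in matrix: 10 cells (rows) by 4 genes (columns)
mat <- matrix(seq_len(40), nrow = 10, ncol = 4)

# "Scan": decide on the chunks up front and keep a vector of iteration numbers
chunks <- ChunkIndices(n = nrow(mat), chunk.size = 4)
batch <- seq_along(chunks)

# "Next": pull one chunk per iteration and process only that piece
cell.sums <- numeric(nrow(mat))
for (i in batch) {
  rows <- chunks[[i]]                    # indices of the current chunk
  data.use <- mat[rows, , drop = FALSE]  # data for the current chunk
  cell.sums[rows] <- rowSums(data.use)
}
```

Only one chunk of rows is held in `data.use` at a time, which is the property that makes chunk-based iteration useful for datasets that do not fit in memory.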
The `loom` class is the heart of loomR. This class is written in the
R6 object style and can be extended in three ways.
For each of the following, one should use discretion about when `return` is used instead of `invisible`. As `loom` objects are merely
handles to loom files, any function or method that modifies the file should not need to return anything. However, we recommend always returning
the `loom` object invisibly, using `invisible`. While not necessary for functionality, it means that objects in a user's environment
won't get overwritten if they try to reassign their `loom` object to the output of a function. For functions and methods that don't modify the
loom file, and instead return data, the `return` function should be used.
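The convention can be illustrated with a small base R sketch. An environment is used here as a stand-in for a `loom` handle because, like a handle, it is modified in place; `handle`, `AddAttribute`, and `GetAttribute` are hypothetical names and not part of loomR.

```r
# Hypothetical stand-in for a file handle: environments are modified in place
handle <- new.env()
handle$attrs <- list()

# A modifying function: returns the handle invisibly
AddAttribute <- function(object, name, value) {
  object$attrs[[name]] <- value
  invisible(object)
}

# A function that only reads data: uses return
GetAttribute <- function(object, name) {
  return(object$attrs[[name]])
}

# Nothing is printed, and reassigning the result is harmless
handle <- AddAttribute(handle, 'n.cells', 3)
GetAttribute(handle, 'n.cells')
```

Had `AddAttribute` returned `NULL` instead, the reassignment on the second-to-last line would have replaced the user's handle with `NULL`.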
The first way to extend `loom` objects is by subclassing the object and making a new R6 class. This allows new classes to
declare custom R6 methods and gain access to all of the `loom` object's methods, including S3- and S4-style methods.
New classes can also override any methods for `loom` objects, allowing the extender to change the core behaviour of `loom` objects.
While this option allows the greatest control and access to the `loom` object, it involves the greatest amount of work,
as one would need to write a new R6 class and all the associated boilerplate code. As such, we recommend subclassing `loom` objects
when a new class is needed, but would advise developers to use the other methods of extending `loom` objects for simpler tasks.
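A minimal sketch of R6 subclassing follows. It needs the R6 package but deliberately does not inherit from the real `loom` class, so that it runs without loomR or a loom file: `Handle` and `AnnotatedHandle` are hypothetical classes. With loomR installed, the `inherit` argument would presumably point at the `loom` generator instead.

```r
# Hypothetical stand-in parent class (the real parent would be the loom class)
Handle <- R6::R6Class(
  classname = 'Handle',
  public = list(
    filename = NULL,
    initialize = function(filename) {
      self$filename <- filename
    },
    describe = function() {
      return(paste('handle to', self$filename))
    }
  )
)

# Subclass: adds a new method and overrides an existing one
AnnotatedHandle <- R6::R6Class(
  classname = 'AnnotatedHandle',
  inherit = Handle,
  public = list(
    notes = character(0),
    # New method; modifies the object, so it returns self invisibly
    add.note = function(note) {
      self$notes <- c(self$notes, note)
      invisible(self)
    },
    # Overridden method; super gives access to the parent's version
    describe = function() {
      return(paste0(super$describe(), ' (', length(self$notes), ' notes)'))
    }
  )
)

obj <- AnnotatedHandle$new(filename = 'example.loom')
obj$add.note('first')$add.note('second')  # chaining works because self is returned
obj$describe()
```

Returning `self` invisibly from modifying methods follows the recommendation above and also allows calls to be chained.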
The second way is by writing S4-style methods for `loom` objects. loomR exports the `loom` class as an S4 class, allowing
one to write highly-specialized methods that enforce class-specificity and can change behaviour based on the classes of other objects provided to
a function. S4 methods look like normal functions to the end user, but can do different things based on the class, or classes, of the objects passed to them.
This allows for highly-customized routines without cluttering a package's namespace, as only the generic function is exported. S4 methods can also be
written for generics exported by other packages, provided that the generic has been imported before writing new methods. Furthermore, generics
and methods can be kept internal, and R will dispatch the appropriate method as if the generic were exported. However, S4 methods have the drawback
of not autocompleting arguments in the terminal or RStudio. This means that the user may need to keep documentation open while using these methods,
which detracts from the user-friendliness of these methods. Finally, while there is less boilerplate in declaring S4 generics and methods than in
declaring R6 classes and methods, there is still more to write than for our last method. As such, we recommend S4 methods for anyone who needs method
dispatch for internal functions only.
```r
#' @export SomeFunction
methods::setGeneric(
  name = 'SomeFunction',
  def = function(object, ...) {
    return(standardGeneric(f = 'SomeFunction'))
  }
)
# Note, no extra Roxygen notes needed
methods::setMethod(
  f = 'SomeFunction',
  signature = c('object' = 'loom'),
  definition = function(object, loom.param, ...) {
    # do something
  }
)
```
As R6 objects are based on S3 objects, the final way to extend `loom` objects is by writing S3-style methods. These methods involve the
least amount of boilerplate to set up. S3 generics are written just like normal functions, albeit with a few differences. First, they have
two arguments: the argument that determines the class for dispatching and `...` to pass other arguments to methods. Second, the only
thing an S3 generic needs to do is call `UseMethod` to allow R to dispatch based on the class of whatever the object is. Unlike S4 methods,
S3 methods provide tab-autocompletion for method-specific arguments, providing help messages along the way. This means that S3 methods are more
user-friendly than S4 methods. Like S4 methods, S3 methods can use S3 generics declared by other packages, with the same assumptions about
imports applying here as well. However, S3 methods cannot be kept internal, and must be exported for R to properly dispatch the method. This means
that a package's namespace will have n + 1 functions declared for every S3 generic, where n is the number of classes a method is declared for and the
one extra is for the generic. Furthermore, as the methods themselves are exported, anyone can simply use the method directly rather than go through
the generic and have R dispatch a method based on object class. Despite these drawbacks, S3 methods are how we recommend one extends loomR unless
one needs the specific features of R6 classes or S4-style methods.
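As a minimal sketch of the pattern just described, the following defines an S3 generic and a method for class `loom` in base R. The `SomeFunction` name is reused from the S4 example purely for illustration, `loom.param` is a hypothetical argument, and `mock` is an empty stand-in object that merely carries the `loom` class so that dispatch can be demonstrated without a loom file.

```r
# S3 generic: the dispatch argument plus ..., and a call to UseMethod
SomeFunction <- function(object, ...) {
  UseMethod(generic = 'SomeFunction', object = object)
}

# Method for `loom` objects; method-specific arguments go before ...
SomeFunction.loom <- function(object, loom.param = 1, ...) {
  message('called the loom method with loom.param = ', loom.param)
  invisible(object)
}

# Fallback for every other class
SomeFunction.default <- function(object, ...) {
  stop('SomeFunction is not implemented for class ', class(object)[1])
}

# Stand-in object (assumption: a real `loom` object has 'loom' as its first class)
mock <- structure(list(), class = c('loom', 'R6'))
SomeFunction(mock, loom.param = 5)
```

In a package, the generic and each method would additionally need to be exported, as discussed above.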