loomR-package: An R interface for loom files
In mojaveazure/loomR: An R interface for loom files

Description Semantics Loom Files Chunk-based iteration Extending loomR

loomR provides an interface for working with loom files in a loom-specific way. We provide routines for validating loom files and for iterating in chunks through the data within a loom file, as well as a platform for other packages to build support for loom files. Unlike other HDF5 packages, loomR actively protects a loom file's structure, enabling the user to focus on their analysis and not worry about the integrity of their data.

Throughout all loomR-related documentation and writing, the following styles for distinguishing between loom files, loom objects, and loomR will be used. When talking about loom files, or the actual HDF5 file on disk, the word 'loom' will be written in normal text. Capitalization will follow a language's rules for capitalization in sentences. For English, that means if the word 'loom' appears at the beginning of a sentence and is being used to refer to a loom file, it will be capitalized. Otherwise, it will be lowercase. For loom objects, or the object within R, the word 'loom' will always be lowercase and written in monospaced text. When referring to the package loomR, it will always be written in normal text with the 'l', 'o's, and 'm' lowercased and the 'R' uppercased. This style will be used throughout documentation for loomR as well as any vignettes and tutorials produced by the authors.

Loom files are an HDF5-based format for storing and interacting with large single-cell RNAseq datasets. Each loom file has at least six parts to it: the raw expression data (matrix), groups for gene- and cell-metadata (row_attrs and col_attrs, respectively), groups for gene-based and cell-based cluster graphs (row_graphs and col_graphs, respectively), and layers, a group containing alternative representations of the data in matrix. Each dataset within the loom file has rules as to what size it may be, creating a structure for the entire loom file and all the data within. This structure is enforced to ensure that data remains intact and retrievable when spread across the various datasets in the loom file.

matrix: The dataset that sets the dimensions for most other datasets within a loom file. This dataset has 'n' genes and 'm' cells. Due to the way that loomR presents data, this will appear as 'm' rows and 'n' columns. However, other HDF5 libraries will generally present the data as 'n' rows and 'm' columns.
row_attrs and col_attrs: These are one- or two-dimensional datasets where a specific dimension is of length 'n', for row attributes, or 'm', for column attributes. Within loomR, this must be the second dimension of two-dimensional datasets, or the length of one-dimensional datasets. Most other HDF5 libraries will show this specific dimension as the first dimension for two-dimensional datasets, or the length of one-dimensional datasets.
row_graphs and col_graphs: Unlike other datasets within a loom file, these are not controlled by matrix. Instead, within these groups are groups for specific graphs. Each graph group will have three datasets that represent the graph in coordinate format: a for row indices, b for column indices, and w for values. Each dataset within a graph must be one-dimensional and all datasets within a graph must be the same length. Different graphs need not be the same length as each other.
layers: Each dataset within layers must have the exact same dimensions as matrix.
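
The size rules above can be sketched with plain R objects. This is only an illustration under stated assumptions: the list `mock` and the helper `check.mock` are hypothetical and are not part of loomR, which performs its own validation on the HDF5 file itself.

```r
# Mock of a loom file's layout using plain R objects (hypothetical; not loomR's API).
# Dimensions follow loomR's presentation: m cells as rows, n genes as columns.
n.genes <- 4
n.cells <- 3
mock <- list(
  matrix = matrix(0, nrow = n.cells, ncol = n.genes),
  layers = list(spliced = matrix(1, nrow = n.cells, ncol = n.genes)),
  row_attrs = list(gene_names = paste0('gene', seq_len(n.genes))),
  col_attrs = list(cell_names = paste0('cell', seq_len(n.cells))),
  col_graphs = list(knn = list(a = c(1, 2, 3), b = c(2, 3, 1), w = c(0.5, 0.2, 0.9)))
)

# Hypothetical helper: TRUE when every size rule described above holds
check.mock <- function(x) {
  dims <- dim(x$matrix)
  # Layers must match matrix exactly
  layers.ok <- all(vapply(x$layers, function(l) identical(dim(l), dims), logical(1)))
  # Attributes: the length (1-D) or the second dimension (2-D) must match
  attr.len <- function(a) if (is.null(dim(a))) length(a) else dim(a)[2]
  rows.ok <- all(vapply(x$row_attrs, function(a) attr.len(a) == dims[2], logical(1)))
  cols.ok <- all(vapply(x$col_attrs, function(a) attr.len(a) == dims[1], logical(1)))
  # Graphs: a, b, and w must share one length within each graph
  graph.ok <- function(g) {
    lens <- vapply(g[c('a', 'b', 'w')], length, integer(1))
    length(unique(lens)) == 1
  }
  graphs.ok <- all(vapply(c(x$row_graphs, x$col_graphs), graph.ok, logical(1)))
  layers.ok && rows.ok && cols.ok && graphs.ok
}

check.mock(mock)
```

Adding a layer whose dimensions differ from matrix makes the check fail, which is the kind of inconsistency the loom structure is meant to rule out.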

As loom files can theoretically hold million-cell datasets, analysis can be impossible when it requires holding the entire dataset in memory. To combat this problem, loom objects offer native chunk-based iteration through the batch.scan, batch.next, map, and apply methods. This section will cover the former two methods; the latter two are covered in the loomR tutorial.

batch.scan and batch.next are the heart of all chunk-based iteration in the loom object. These two methods make use of an itertools::ichunk object to chunk through the data in a loom file. Due to the way that R works, batch.scan initializes the iterator and batch.next moves through the iterator.

The batch.scan method will break a dataset in the loom file into chunks, based on a chunk size given to it. batch.scan will work on any dataset, except for two-dimensional attributes and any graph dataset. When iterating over matrix and the layers, the MARGIN argument tells the loom object which way to chunk the data. A MARGIN of 1 will chunk over genes while a MARGIN of 2 will chunk over cells. For one-dimensional attributes, MARGIN is ignored. batch.scan returns an integer vector whose length is the number of iterations it takes to iterate over the dataset selected.

Pulling data in chunks is done by batch.next. This method simply returns the next chunk of data. If return.data = FALSE is passed, batch.next will instead return the indices of the next chunk. When using these methods, we recommend storing the results of batch.scan and iterating through this vector to keep track of where the loom object is in the iteration.

  # Set up the iterator on the `loom` object lfile
  batch <- lfile$batch.scan(dataset.use = 'matrix', MARGIN = 2)
  # Iterate through the dataset, pulling data
  # If `return.data = FALSE` is passed, the indices
  # of the next chunk will be returned instead
  for (i in batch) {
    data.use <- lfile$batch.next()
  }
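
The chunking pattern itself can be illustrated without a loom file. The following is a minimal base-R analogy, not loomR's implementation: an in-memory matrix stands in for the on-disk dataset, and the names `mat`, `chunks`, and `totals` are assumptions made for the sketch.

```r
# Stand-in for the on-disk matrix: 10 cells (rows) by 5 genes (columns)
set.seed(42)
mat <- matrix(rpois(50, lambda = 3), nrow = 10, ncol = 5)

# Analogous to batch.scan: split the cell indices into chunks of a fixed size
chunk.size <- 4
cells <- seq_len(nrow(mat))
chunks <- split(cells, ceiling(cells / chunk.size))

# Analogous to repeated batch.next calls: handle one chunk at a time,
# so only chunk.size cells are ever needed at once
totals <- numeric(nrow(mat))
for (idx in chunks) {
  chunk <- mat[idx, , drop = FALSE]
  totals[idx] <- rowSums(chunk)
}
```

Because each chunk is processed independently, the per-cell totals match what a single pass over the full matrix would give, while the working set stays bounded by the chunk size.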

The loom class is the heart of loomR. This class is written in the R6 object style and can be extended in three ways. For each of the following, one should be discretionary about when return is used instead of invisible. As loom objects are merely handles to loom files, any function or method that modifies the file should not need to return anything. However, we recommend always returning the loom object invisibly, using invisible. While not necessary for functionality, it means that objects in a user's environment won't get overwritten if they try to reassign their loom object to the output of a function. Functions and methods that don't modify the loom file, and instead return data, should use the return function.
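
A small sketch of this convention, assuming an environment as a stand-in for a loom handle (both have reference semantics); `add.attr` and `count.attrs` are hypothetical names, not loomR functions.

```r
# An environment stands in for a loom handle: modifying it changes shared state
handle <- new.env()
handle$n.attrs <- 0

# Hypothetical modifier: changes the handle and returns it invisibly
add.attr <- function(object) {
  object$n.attrs <- object$n.attrs + 1
  invisible(object)
}

# Hypothetical accessor: only reads data, so it uses return
count.attrs <- function(object) {
  return(object$n.attrs)
}

add.attr(handle)            # nothing is printed at the console
handle <- add.attr(handle)  # reassignment still leaves a valid handle
count.attrs(handle)
```

Had `add.attr` returned NULL instead, the reassignment on the second call would have replaced the user's handle with NULL.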

The first way to extend loom objects is by subclassing the object and making a new R6 class. This allows new classes to declare custom R6 methods and gain access to all of the loom object's methods, including S3- and S4-style methods. New classes can also overwrite any methods for loom objects, allowing the extender to change the core behaviour of loom objects. While this option allows the greatest control and access to the loom object, it involves the greatest amount of work as one would need to write a new R6 class and all the associated boilerplate code. As such, we recommend subclassing loom objects when a new class is needed, but would advise developers to use the other methods of extending loom objects for simpler tasks.
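
As a rough sketch of what subclassing looks like, the example below inherits from a stand-in class rather than from loomR's loom class, so that it runs without a loom file. It assumes the R6 package is installed; `FakeLoom`, `MyLoom`, `describe`, and `n.genes` are hypothetical names chosen for illustration.

```r
library(R6)

# Stand-in for loomR's loom class (hypothetical; the real class wraps an HDF5 file)
FakeLoom <- R6Class(
  classname = 'FakeLoom',
  public = list(
    shape = NULL,
    initialize = function(shape) {
      self$shape <- shape
    },
    describe = function() {
      paste('loom with', self$shape[1], 'cells')
    }
  )
)

# Subclass: adds a new method and overrides an inherited one
MyLoom <- R6Class(
  classname = 'MyLoom',
  inherit = FakeLoom,
  public = list(
    n.genes = function() {
      self$shape[2]
    },
    describe = function() {
      # super gives access to the parent's version of the method
      paste(super$describe(), 'and', self$n.genes(), 'genes')
    }
  )
)

obj <- MyLoom$new(shape = c(10, 5))
obj$describe()
```

With the real package, the `inherit` argument would presumably point at loomR's loom generator instead of the stand-in.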

The second way is by writing S4-style methods for loom objects. loomR exports the loom class as an S4 class, allowing one to write highly-specialized methods that enforce class-specificity and can change behaviour based on the classes of other objects provided to a function. S4 methods look like normal functions to the end user, but can do different things based on the class, or classes, of the objects passed to them. This allows for highly-customized routines without cluttering a package's namespace, as only the generic function is exported. S4 methods can also be written for generics exported by other packages, assuming the generic has been imported before writing new methods. Furthermore, generics and methods can be kept internal, and R will dispatch the appropriate method as if the generic were exported. However, S4 methods have the drawback of not autocompleting arguments in the terminal or RStudio. This means that the user may need to keep documentation open while using these methods, which detracts from their user-friendliness. Finally, while there is less boilerplate in declaring S4 generics and methods than in declaring R6 classes and methods, there is still more to write than for our last method. As such, we recommend S4 methods for anyone who needs method dispatch for internal functions only.

  #' @export SomeFunction
  methods::setGeneric(
    name = 'SomeFunction',
    def = function(object, ...) {
      return(standardGeneric(f = 'SomeFunction'))
    }
  )

  # Note, no extra Roxygen notes needed
  methods::setMethod(
    f = 'SomeFunction',
    signature = c('object' = 'loom'),
    definition = function(object, loom.param, ...) {
      # do something
    }
  )
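
To show dispatch on the classes of more than one argument, here is a self-contained sketch using a stand-in S4 class instead of loom. The class `fakeloom`, its slot `n.cells`, and the generic `Combine` are all hypothetical.

```r
library(methods)

# Stand-in S4 class (hypothetical); with loomR one would use 'loom' in the signature
setClass('fakeloom', representation(n.cells = 'numeric'))

setGeneric(
  name = 'Combine',
  def = function(x, y, ...) {
    standardGeneric('Combine')
  }
)

# Method chosen when both arguments are fakeloom objects
setMethod(
  f = 'Combine',
  signature = c(x = 'fakeloom', y = 'fakeloom'),
  definition = function(x, y, ...) {
    new('fakeloom', n.cells = x@n.cells + y@n.cells)
  }
)

# Method chosen when the second argument is a plain number
setMethod(
  f = 'Combine',
  signature = c(x = 'fakeloom', y = 'numeric'),
  definition = function(x, y, ...) {
    new('fakeloom', n.cells = x@n.cells + y)
  }
)

a <- new('fakeloom', n.cells = 10)
b <- new('fakeloom', n.cells = 5)
Combine(a, b)@n.cells
Combine(a, 2)@n.cells
```

The caller sees a single function, `Combine`, while R selects the method from the classes of both `x` and `y`.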

As R6 objects are based on S3 objects, the final way to extend loom objects is by writing S3-style methods. These methods involve the least amount of boilerplate to set up. S3 generics are written just like normal functions, albeit with a few differences. First, they have two arguments: the argument that determines the class for dispatching and ... to pass other arguments to methods. Second, the only thing an S3 generic needs to do is call UseMethod to allow R to dispatch based on the class of whatever the object is. Unlike S4 methods, S3 methods provide tab-autocompletion for method-specific arguments, providing help messages along the way. This means that S3 methods are more user-friendly than S4 methods. Like S4 methods, S3 methods can use S3 generics declared by other packages, with the same assumptions about imports applying here as well. However, S3 methods cannot be kept internal, and must be exported for R to properly dispatch the method. This means that a package's namespace will have n + 1 functions declared for every S3 generic, where n is the number of classes a method is declared for and the one extra is for the generic. Furthermore, as the methods themselves are exported, anyone can simply use the method directly rather than go through the generic and have R dispatch a method based on object class. Despite these drawbacks, S3 methods are how we recommend one extends loomR unless one needs the specific features of R6 classes or S4-style methods.

  #' @export somefunction
  somefunction <- function(object, ...) {
    UseMethod('somefunction', object)
  }

  #' @export somefunction.loom
  #' @method somefunction loom
  somefunction.loom <- function(object, loom.param, ...) {
    # do something
  }
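
The dispatch behaviour described above can be tried out without loomR. In this sketch a bare list carrying a class vector stands in for a loom object; the class vector, the default method, and the default value of `loom.param` are assumptions made for illustration.

```r
# Generic: one dispatch argument plus ... for method-specific arguments
somefunction <- function(object, ...) {
  UseMethod('somefunction', object)
}

# Method for objects whose class vector includes 'loom'
somefunction.loom <- function(object, loom.param = 1, ...) {
  paste('loom method, loom.param =', loom.param)
}

# Fallback for everything else
somefunction.default <- function(object, ...) {
  'default method'
}

# Stand-in object: R6 instances carry an S3 class vector, mimicked here
obj <- structure(list(), class = c('loom', 'R6'))

somefunction(obj, loom.param = 2)  # dispatches to somefunction.loom
somefunction(1)                    # falls through to somefunction.default
somefunction.loom(obj)             # the method can also be called directly
```

The last call illustrates the drawback noted above: because the method is an ordinary exported function, nothing stops a user from bypassing the generic.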