R/package.R


#' @importFrom methods setOldClass
NULL

#' An R interface for loom files
#'
#' loomR provides an interface for working with loom files in a loom-specific way.
#' We provide routines for validating loom files,
#' iterating with chunks through data within the loom file,
#' and a platform for other packages to build support for loom files.
#' Unlike other HDF5 packages, loomR actively protects a loom file's structure, enabling the
#' user to focus on their analysis and not worry about the integrity of their data.
#'
#' @section Semantics:
#' Throughout all loomR-related documentation and writing, the following styles for distinguishing between loom files,
#' \code{loom} objects, and loomR will be used. When talking about loom files, or the actual HDF5 file on disk,
#' the word 'loom' will be written in normal text. Capitalization will be done based on a language's rules for
#' capitalization in sentences. For English, that means if the word 'loom' appears at the beginning of a sentence
#' and is being used to refer to a loom file, it will be capitalized. Otherwise, it will be lowercase.
#' For \code{loom} objects, or the object within R, the word 'loom' will always be lowercase and written in monospaced text.
#' When referring to the package loomR, it will always be written in normal text with the 'l', 'o's, and 'm' lowercased and
#' the 'R' uppercased. This style will be used throughout documentation for loomR as well as any vignettes and tutorials
#' produced by the authors.
#'
#' @section Loom Files:
#' Loom files are an HDF5-based format for storing and interacting with large single-cell RNAseq datasets.
#' Each loom file has at least six parts to it:
#' the raw expression data (\code{matrix}),
#' groups for gene- and cell-metadata (\code{row_attrs} and \code{col_attrs}, respectively),
#' groups for gene-based and cell-based cluster graphs (\code{row_graphs} and \code{col_graphs}, respectively),
#' and \code{layers}, a group containing alternative representations of the data in \code{matrix}.
#' Each dataset within the loom file has rules as to what size it may be, creating a structure for the entire loom file and all the data within.
#' This structure is enforced to ensure that data remains intact and retrievable when spread across the various datasets in the loom file.
#'
#' \describe{
#'   \item{\code{matrix}}{
#'     The dataset that sets the dimensions for most other datasets within a loom file. This dataset has 'n' genes and 'm' cells.
#'     Due to the way that loomR presents data, this will appear as 'm' rows and 'n' columns. However, other HDF5 libraries will
#'     generally present the data as 'n' rows and 'm' columns.
#'   }
#'   \item{\code{row_attrs} and \code{col_attrs}}{
#'     These are one- or two-dimensional datasets where a specific dimension is of length 'n', for row attributes, or 'm', for column attributes.
#'     Within loomR, this must be the second dimension of two-dimensional datasets, or the length of one-dimensional datasets. Most other
#'     HDF5 libraries will show this specific dimension as the first dimension for two-dimensional datasets, or the length of one-dimensional
#'     datasets.
#'   }
#'   \item{\code{row_graphs} and \code{col_graphs}}{
#'     Unlike other datasets within a loom file, these are not controlled by \code{matrix}. Instead, within these groups are groups for
#'     specific graphs. Each graph group will have three datasets that represent the graph in
#'     \href{https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)}{coordinate format}: \code{a} for row indices, \code{b} for
#'     column indices, and \code{w} for values. Each dataset within a graph must be one-dimensional and all datasets within a graph must be
#'     the same length. Not all graphs must be the same length as each other.
#'   }
#'   \item{\code{layers}}{Each dataset within \code{layers} must have the exact same dimensions as \code{matrix}.}
#' }
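#'
#' The coordinate format used by graph groups can be illustrated with a small base R sketch. The vectors below are made-up
#' stand-ins for a graph's \code{a}, \code{b}, and \code{w} datasets, and the indices are assumed to be one-based as in R;
#' indices stored on disk may follow a different convention.
#' \preformatted{
#'   # Hypothetical graph with three edges among four nodes
#'   a <- c(1L, 1L, 3L)     # row indices
#'   b <- c(2L, 4L, 4L)     # column indices
#'   w <- c(0.5, 0.2, 0.9)  # edge values
#'   # All three datasets within a graph must be the same length
#'   stopifnot(length(x = a) == length(x = b), length(x = b) == length(x = w))
#'   # Expand the coordinate list into a dense adjacency matrix
#'   graph <- matrix(data = 0, nrow = 4, ncol = 4)
#'   graph[cbind(a, b)] <- w
#' }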
#'
#' @section Chunk-based iteration:
#' As loom files can theoretically hold million-cell datasets, analyzing these datasets can be impossible because of the memory
#' required to hold them all at once. To combat this problem, \code{loom} objects offer native chunk-based iteration through
#' the \code{batch.scan}, \code{batch.next}, \code{map}, and \code{apply} methods. This section will cover the former two methods; the latter
#' two are covered in the \href{http://satijalab.org/loomR/loomR_tutorial.html}{loomR tutorial}.
#'
#' \code{batch.scan} and \code{batch.next} are the heart of all chunk-based iteration in the \code{loom} object. These two methods make
#' use of an \code{\link[itertools]{ichunk}} object to chunk through the data in a loom file. Due to the way that R works, \code{batch.scan}
#' initializes the iterator and \code{batch.next} moves through the iterator.
#'
#' The \code{batch.scan} method will break a dataset in the loom file into chunks, based on a chunk size given to it. \code{batch.scan} will
#' work on any dataset, except for two-dimensional attributes and any graph dataset. When iterating over \code{matrix} and the layers, the \code{MARGIN}
#' argument tells the \code{loom} object which way to chunk the data. A \code{MARGIN} of 1 will chunk over genes while a \code{MARGIN} of 2 will chunk
#' over cells. For one-dimensional attributes, \code{MARGIN} is ignored. \code{batch.scan} returns an integer vector whose length is the number of iterations
#' it takes to iterate over the dataset selected.
#'
#' Pulling data in chunks is done by \code{batch.next}. This method simply returns the next chunk of data. If \code{return.data = FALSE} is passed,
#' \code{batch.next} will instead return the indices of the next chunk. When using these methods, we recommend storing the results of \code{batch.scan}
#' and iterating through this vector to keep track of where the \code{loom} object is in the iteration.
#' \preformatted{
#'   # Set up the iterator on the `loom` object lfile
#'   batch <- lfile$batch.scan(dataset.use = 'matrix', MARGIN = 2)
#'   # Iterate through the dataset, pulling data
#'   # If `return.data = FALSE` is passed, the indices
#'   # of the next chunk will be returned instead
#'   for (i in batch) {
#'     data.use <- lfile$batch.next()
#'   }
#' }
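#'
#' The idea behind chunk-based iteration can be sketched in base R without a loom file. This is an illustration of the
#' general technique only, not loomR's implementation: \code{mat} is a made-up in-memory stand-in for \code{matrix}
#' (cells as rows, genes as columns) and \code{chunk.size} is a hypothetical chunk size.
#' \preformatted{
#'   # Stand-in data: 10 cells (rows) by 5 genes (columns)
#'   set.seed(seed = 42)
#'   mat <- matrix(data = rpois(n = 50, lambda = 2), nrow = 10, ncol = 5)
#'   chunk.size <- 4
#'   # First cell index of every chunk
#'   starts <- seq(from = 1, to = nrow(x = mat), by = chunk.size)
#'   # Accumulate per-gene totals while holding only one chunk at a time
#'   gene.sums <- numeric(length = ncol(x = mat))
#'   for (start in starts) {
#'     end <- min(start + chunk.size - 1, nrow(x = mat))
#'     chunk <- mat[start:end, , drop = FALSE]
#'     gene.sums <- gene.sums + colSums(x = chunk)
#'   }
#' }
#' The accumulated totals match what \code{colSums} gives on the full matrix, while each pass touches at most
#' \code{chunk.size} cells.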
#'
#' @section Extending loomR:
#' The \code{loom} class is the heart of loomR. This class is written in the
#' \href{https://cran.r-project.org/web/packages/R6/vignettes/Introduction.html}{R6} object style and can be extended in three ways.
#' For each of the following, one should be discretionary about when \code{return} is used instead of \code{\link{invisible}}. As \code{loom} objects are merely
#' handles to loom files, any function or method that modifies the file should not need to return anything. However, we recommend always returning
#' the \code{loom} object invisibly, using \code{\link{invisible}}. While not necessary for functionality, it means that objects in a user's environment
#' won't get overwritten if they try to reassign their \code{loom} object to the output of a function. For functions and methods that don't modify the
#' loom file, and instead return data, the \code{return} function should be used.
#'
#' The first way to extend \code{loom} objects is by subclassing the object and making a new R6 class. This allows new classes to
#' declare custom R6 methods and gain access to all of the \code{loom} object's methods, including S3- and S4-style methods.
#' New classes can also overwrite any methods for \code{loom} objects, allowing the extender to change the core behaviour of \code{loom} objects.
#' While this option allows the greatest control and access to the \code{loom} object, it involves the greatest amount of work
#' as one would need to write a new R6 class and all the associated boilerplate code. As such, we recommend subclassing \code{loom} objects
#' when a new class is needed, but would advise developers to use the other methods of extending \code{loom} objects for simpler tasks.
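#'
#' A minimal sketch of subclassing, assuming the R6 package is installed. So that it runs without loomR or a loom file,
#' it inherits from a made-up stand-in generator, \code{mock.loom}; the subclass \code{noted.loom}, its \code{notes} field,
#' and its \code{add.note} method are likewise hypothetical names used only for illustration.
#' \preformatted{
#'   # Stand-in generator; when extending loomR, the package's own
#'   # class generator would presumably take its place (assumption)
#'   mock.loom <- R6::R6Class(
#'     classname = 'loom',
#'     public = list(notes = character(length = 0))
#'   )
#'
#'   # Hypothetical subclass adding one custom method
#'   noted.loom <- R6::R6Class(
#'     classname = 'noted.loom',
#'     inherit = mock.loom,
#'     public = list(
#'       add.note = function(note) {
#'         self$notes <- c(self$notes, note)
#'         # Methods that modify the object return it invisibly
#'         invisible(x = self)
#'       }
#'     )
#'   )
#'
#'   lfile <- noted.loom$new()
#'   lfile$add.note(note = 'normalized')
#' }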
#'
#' The second way is by writing S4-style methods for \code{loom} objects. loomR exports the \code{loom} class as an S4 class, allowing
#' one to write highly-specialized methods that enforce class-specificity and can change behaviour based on the classes of other objects provided to
#' a function. S4 methods look like normal functions to the end user, but can do different things based on the class, or classes, of objects passed to them.
#' This allows for highly-customized routines without cluttering a package's namespace, as only the generic function is exported. S4 methods can also be
#' written for generics exported by other packages, assuming the said generic has been imported before writing new methods. Furthermore, generics
#' and methods can be kept internally, and R will dispatch the appropriate method as if the generic was exported. However, S4 methods have the drawback
#' of not autocompleting arguments in the terminal or RStudio. This means that the user may need to keep documentation open while using these methods,
#' which detracts from the user-friendliness of these methods. Finally, while there is less boilerplate in declaring S4 generics and methods than
#' declaring R6 classes and methods, there is still more to write than our last method. As such, we recommend S4 methods for anyone who needs method
#' dispatch for internal functions only.
#' \preformatted{
#'   #' @export SomeFunction
#'   methods::setGeneric(
#'     name = 'SomeFunction',
#'     def = function(object, ...) {
#'       return(standardGeneric(f = 'SomeFunction'))
#'     }
#'   )
#'
#'   # Note, no extra Roxygen notes needed
#'   methods::setMethod(
#'     f = 'SomeFunction',
#'     signature = c('object' = 'loom'),
#'     definition = function(object, loom.param, ...) {
#'       # do something
#'     }
#'   )
#' }
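#'
#' S4 dispatch on a class that was not defined through S4 relies on registering that class with \code{setOldClass}.
#' The following self-contained sketch shows the mechanism on a made-up class; \code{mockloom} and \code{Describe} are
#' hypothetical names, and no loom file is involved.
#' \preformatted{
#'   # Register a non-S4 class name so S4 signatures can refer to it
#'   methods::setOldClass(Classes = 'mockloom')
#'
#'   methods::setGeneric(
#'     name = 'Describe',
#'     def = function(object, ...) {
#'       standardGeneric(f = 'Describe')
#'     }
#'   )
#'
#'   methods::setMethod(
#'     f = 'Describe',
#'     signature = c('object' = 'mockloom'),
#'     definition = function(object, ...) {
#'       return('dispatched on mockloom')
#'     }
#'   )
#'
#'   # An S3-style object carrying the registered class
#'   obj <- structure(.Data = list(), class = 'mockloom')
#'   result <- Describe(object = obj)
#' }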
#'
#' As R6 objects are based on S3 objects, the final way to extend \code{loom} objects is by writing S3-style methods. These methods involve the
#' least amount of boilerplate to set up. S3 generics are written just like normal functions, albeit with a few differences. First, they have
#' two arguments: the argument that determines the class for dispatching and \code{...} to pass other arguments to methods. Second, the only
#' thing an S3 generic needs to do is call \code{UseMethod} to allow R to dispatch based on the class of whatever the object is. Unlike S4 methods,
#' S3 methods provide tab-autocompletion for method-specific arguments, providing help messages along the way. This means that S3 methods are more
#' user-friendly than S4 methods. Like S4 methods, S3 methods can use S3 generics declared by other packages, with the same assumptions about
#' imports applying here as well. However, S3 methods cannot be kept internally, and must be exported for R to properly dispatch the method. This means
#' that a package's namespace will have n + 1 functions declared for every S3 generic, where n is the number of classes a method is declared for and the
#' one extra is for the generic. Furthermore, as the methods themselves are exported, anyone can simply use the method directly rather than go through
#' the generic and have R dispatch a method based on object class. Despite these drawbacks, S3 methods are how we recommend one extends loomR unless
#' one needs the specific features of R6 classes or S4-style methods.
#' \preformatted{
#'   #' @export somefunction
#'   somefunction <- function(object, ...) {
#'     UseMethod('somefunction', object)
#'   }
#'
#'   #' @export somefunction.loom
#'   #' @method somefunction loom
#'   somefunction.loom <- function(object, loom.param, ...) {
#'     # do something
#'   }
#' }
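#'
#' To make the dispatch behaviour described above concrete, here is a runnable sketch on a made-up class. The class
#' \code{mockloom} and the body of the method are hypothetical; the point is that calling the generic and calling the
#' method directly give the same result.
#' \preformatted{
#'   somefunction <- function(object, ...) {
#'     UseMethod(generic = 'somefunction', object = object)
#'   }
#'
#'   # Method-specific argument loom.param, doubled for illustration
#'   somefunction.mockloom <- function(object, loom.param = 1, ...) {
#'     return(loom.param * 2)
#'   }
#'
#'   obj <- structure(.Data = list(), class = 'mockloom')
#'   # R picks the method based on the class of obj
#'   via.generic <- somefunction(object = obj, loom.param = 3)
#'   # The method can also be called directly, bypassing dispatch
#'   direct <- somefunction.mockloom(object = obj, loom.param = 3)
#' }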
#'
#' @docType package
#' @name loomR-package
#'
NULL


# Hooks to set loom as an S4 class upon
# loadNamespace or library/require
.onLoad <- function(libname, pkgname) {
  setOldClass(Classes = 'loom')
}
mojaveazure/loomR documentation built on May 29, 2019, 5:44 a.m.