runOnlineINMF | R Documentation |
Perform online integrative non-negative matrix factorization to represent
multiple single-cell datasets in terms of H, W, and V matrices. It optimizes
the iNMF objective function (see runINMF) using online learning (non-negative
least squares for the H matrices, and hierarchical alternating least squares
(HALS) for the V matrices and W), where the number of factors is set by k.
The function allows online learning in 3 scenarios:
1. Fully observed datasets;
2. Iterative refinement using continually arriving datasets;
3. Projection of new datasets without updating the existing factorization.
All three scenarios require fixed memory independent of the number of cells.
For each dataset, this factorization produces an H matrix (k by cells), a V
matrix (genes by k), and a shared W matrix (genes by k). The H matrices
represent the cell factor loadings. W is identical among all datasets, as it
represents the shared components of the metagenes across datasets. The V
matrices represent the dataset-specific components of the metagenes.
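The shapes described above can be illustrated with a small base R sketch. The random matrices, the dataset names and the helper inmfObjective are hypothetical and only stand in for real factorization output. The objective shown is our reading of the iNMF formulation referenced under runINMF (squared reconstruction error plus lambda times the squared norm of the dataset-specific part), not code taken from the package.

```r
set.seed(1)
nGenes <- 30                       # m, number of genes (toy value)
k <- 4                             # number of factors (toy value)
nCells <- c(ctrl = 50, stim = 80)  # hypothetical dataset sizes

# Shared metagenes: genes by k
W <- matrix(runif(nGenes * k), nGenes, k)
# Dataset-specific metagenes: one genes-by-k matrix per dataset
V <- lapply(nCells, function(n) matrix(runif(nGenes * k), nGenes, k))
# Cell factor loadings: one k-by-cells matrix per dataset
H <- lapply(nCells, function(n) matrix(runif(k * n), k, n))
# Toy non-negative input data: genes by cells
X <- lapply(nCells, function(n) matrix(runif(nGenes * n), nGenes, n))

# Assumed objective:
#   sum_i ||X_i - (W + V_i) H_i||_F^2 + lambda * ||V_i H_i||_F^2
inmfObjective <- function(X, W, V, H, lambda) {
  sum(vapply(names(X), function(d) {
    recon <- (W + V[[d]]) %*% H[[d]]
    sum((X[[d]] - recon)^2) + lambda * sum((V[[d]] %*% H[[d]])^2)
  }, numeric(1)))
}

obj <- inmfObjective(X, W, V, H, lambda = 5)
# Each reconstruction has the same shape as its input (genes by cells)
reconDims <- lapply(names(X), function(d) dim((W + V[[d]]) %*% H[[d]]))
```

A larger lambda puts more weight on the dataset-specific term, which is why increasing it tends to push structure into the shared W matrix.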
runOnlineINMF(object, k = 20, lambda = 5, ...)
## S3 method for class 'liger'
runOnlineINMF(
  object,
  k = 20,
  lambda = 5,
  newDatasets = NULL,
  projection = FALSE,
  maxEpochs = 5,
  HALSiter = 1,
  minibatchSize = 5000,
  WInit = NULL,
  VInit = NULL,
  AInit = NULL,
  BInit = NULL,
  seed = 1,
  nCores = 2L,
  verbose = getOption("ligerVerbose", TRUE),
  ...
)
## S3 method for class 'Seurat'
runOnlineINMF(
  object,
  k = 20,
  lambda = 5,
  datasetVar = "orig.ident",
  layer = "ligerScaleData",
  assay = NULL,
  reduction = "onlineINMF",
  maxEpochs = 5,
  HALSiter = 1,
  minibatchSize = 5000,
  seed = 1,
  nCores = 2L,
  verbose = getOption("ligerVerbose", TRUE),
  ...
)
object |
liger object. Scaled data required. |
k |
Inner dimension of factorization, i.e. the number of metagenes. A value in
the range 20-50 works well for most analyses. Default 20. |
lambda |
Regularization parameter. Larger values penalize
dataset-specific effects more strongly (i.e. alignment should increase as
lambda increases). We recommend always using the default value except
possibly for analyses with relatively small differences (biological
replicates, male/female comparisons, etc.) in which case a lower value such
as 1.0 may improve reconstruction quality. Default 5. |
... |
Arguments passed to other S3 methods of this function. |
newDatasets |
Named list of dgCMatrix-class objects. New
datasets for scenario 2 or scenario 3. Default NULL. |
projection |
Whether to perform data integration with scenario 3 when
newDatasets is specified. Default FALSE. |
maxEpochs |
The number of epochs to iterate through. See details.
Default 5. |
HALSiter |
Maximum number of block coordinate descent (HALS
algorithm) iterations to perform for each update of W and V. Default 1. |
minibatchSize |
Total number of cells in each minibatch. See details.
Default 5000. |
WInit , VInit , AInit , BInit |
Optional initialization for the W, V, A, and B matrices, respectively. See details. Default NULL. |
seed |
Random seed to allow reproducible results. Default 1. |
nCores |
The number of parallel tasks to speed up the computation.
Default 2L. |
verbose |
Logical. Whether to show progress information. Default
getOption("ligerVerbose", TRUE). |
datasetVar |
Metadata variable name that stores the dataset source
annotation. Default "orig.ident". |
layer |
For Seurat>=4.9.9, the name of the layer from which to retrieve the input
non-negative scaled data. Default "ligerScaleData". |
assay |
Name of assay to use. Default NULL. |
reduction |
Name of the reduction to store result. Also used as the
feature key. Default "onlineINMF". |
For performing scenario 2 or 3, a complete set of factorization results from
a run of scenario 1 is required. Given the structure of a liger object, all
of the required information can be retrieved automatically. For cases where
users need to supply customized information for an existing factorization,
the arguments WInit, VInit, AInit and BInit are exposed. The requirements
for these arguments are as follows:
WInit - A matrix object of size m × k (see runINMF for notation).
VInit - A list object of matrices, each of size m × k. The number of matrices should match newDatasets.
AInit - A list object of matrices, each of size k × k. The number of matrices should match newDatasets.
BInit - A list object of matrices, each of size m × k. The number of matrices should match newDatasets.
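The size requirements above can be checked before calling the function. The following is a minimal base R sketch; checkOnlineInit is a hypothetical helper and not part of the package, and nDatasets stands for the number of matrices each list is expected to contain.

```r
# Hypothetical helper (not an rliger function): checks the shapes of
# user-supplied initializations against the sizes listed above.
checkOnlineInit <- function(WInit, VInit, AInit, BInit, m, k, nDatasets) {
  isDim <- function(x, nr, nc) {
    is.matrix(x) && nrow(x) == nr && ncol(x) == nc
  }
  allDim <- function(lst, nr, nc) {
    is.list(lst) && length(lst) == nDatasets &&
      all(vapply(lst, isDim, logical(1), nr = nr, nc = nc))
  }
  c(
    WInit = isDim(WInit, m, k),   # single m x k matrix
    VInit = allDim(VInit, m, k),  # list of m x k matrices
    AInit = allDim(AInit, k, k),  # list of k x k matrices
    BInit = allDim(BInit, m, k)   # list of m x k matrices
  )
}

m <- 30  # number of genes (toy value)
k <- 4   # number of factors (toy value)
ok <- checkOnlineInit(
  WInit = matrix(0, m, k),
  VInit = list(matrix(0, m, k)),
  AInit = list(matrix(0, k, k)),
  BInit = list(matrix(0, m, k)),
  m = m, k = k, nDatasets = 1
)
```

The helper returns a named logical vector, so a failing entry points directly at the argument with the wrong shape or list length.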
Minibatch iterations are performed on small subsets of cells. The exact
minibatch size applied to each dataset is minibatchSize multiplied by the
proportion of cells in that dataset out of all cells. In general,
minibatchSize should be no larger than the number of cells in the smallest
dataset (considering both object and newDatasets). Therefore, a smaller
value may be necessary for analyzing very small datasets.
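The proportional split described above can be worked through in base R. The cell counts are hypothetical, and the rounding step is an assumption, since the text does not say how fractional sizes are handled.

```r
# Hypothetical number of cells per dataset
nCells <- c(ctrl = 3000, stim = 7000)
minibatchSize <- 5000

# Per-dataset minibatch size: minibatchSize times the dataset's share of
# all cells (rounding is an assumption, not documented behavior)
perDataset <- round(minibatchSize * nCells / sum(nCells))

# Rule of thumb from the text: minibatchSize should not exceed the number
# of cells in the smallest dataset
withinLimit <- minibatchSize <= min(nCells)

# A smaller setting that satisfies the rule of thumb
smallerSize <- 2000
perDatasetSmaller <- round(smallerSize * nCells / sum(nCells))
```

Here the default of 5000 exceeds the 3000 cells of the smallest dataset, so withinLimit is FALSE and a value such as 2000 would be the safer choice.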
An epoch is one complete pass over all cells, which takes a number of
minibatch iterations. Therefore, the total number of iterations is
determined by maxEpochs, the total number of cells, and minibatchSize.
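As a rough illustration of how these three quantities interact, the iteration count can be estimated in base R. The formula and the use of floor() are assumptions based on the description above, not the package's exact internals, and the numbers are hypothetical.

```r
totalCells <- 10000   # hypothetical total number of cells over all datasets
minibatchSize <- 500
maxEpochs <- 5

# Assumed: each iteration consumes minibatchSize cells in total, so one
# epoch takes about totalCells / minibatchSize iterations
itersPerEpoch <- totalCells / minibatchSize
totalIters <- floor(maxEpochs * totalCells / minibatchSize)
```

Under these assumptions, halving minibatchSize doubles the number of iterations for the same number of epochs.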
Currently, the Seurat S3 method does not support scenarios 2 and 3, because there is no simple solution for organizing a number of miscellaneous matrices within a single Seurat object. We strongly recommend that users create a liger object, which has the specific structure required.
liger method - Returns updated input liger object.
A list of all H matrices can be accessed with getMatrix(object, "H")
A list of all V matrices can be accessed with getMatrix(object, "V")
The W matrix can be accessed with getMatrix(object, "W")
Meanwhile, the intermediate matrices A and B produced in the HALS update can
also be accessed similarly.
Seurat method - Returns updated input Seurat object.
H matrices for all datasets will be concatenated and transposed (all cells
by k), and form a DimReduc object in the reductions slot named by argument
reduction.
W matrix will be presented as feature.loadings in the same DimReduc object.
V matrices, A matrices, B matrices, an objective error value and the dataset
variable used for the factorization are currently stored in the misc slot of
the same DimReduc object.
Chao Gao et al., Iterative single-cell multi-omic integration using online learning, Nat Biotechnol., 2021
pbmc <- normalize(pbmc)
pbmc <- selectGenes(pbmc)
pbmc <- scaleNotCenter(pbmc)
if (requireNamespace("RcppPlanc", quietly = TRUE)) {
    # Scenario 1
    pbmc <- runOnlineINMF(pbmc, minibatchSize = 200)
    # Scenario 2
    # Fake new dataset by increasing all non-zero value in "ctrl" by 1
    ctrl2 <- rawData(dataset(pbmc, "ctrl"))
    ctrl2@x <- ctrl2@x + 1
    colnames(ctrl2) <- paste0(colnames(ctrl2), 2)
    pbmc2 <- runOnlineINMF(pbmc, k = 20, newDatasets = list(ctrl2 = ctrl2),
                           minibatchSize = 100)
    # Scenario 3
    pbmc3 <- runOnlineINMF(pbmc, k = 20, newDatasets = list(ctrl2 = ctrl2),
                           projection = TRUE)
}