Description

An overview of the available options when performing SVD with any algorithm.

Computing the cross-product
If the dimensions of the input matrix are very different, it may be faster to compute the cross-product and perform the SVD on the resulting square matrix,
rather than performing SVD directly on a very fat or tall input matrix.
The cross-product can often be computed very quickly due to good data locality, yielding a small square matrix that is easily handled by any SVD algorithm.
This is especially true in cases where the input matrix is not held in memory.
Calculation of the cross-product only involves one read across the entire data set, while direct application of approximate methods like irlba or rsvd would need to access the data multiple times.
The various BiocSingular SVD functions allow users to specify the minimum fold difference (via the fold argument) at which a cross-product should be computed. Setting fold=1 will always compute the cross-product for any matrix; this is probably unwise. By contrast, setting fold=Inf means that the cross-product is never computed. This is currently the default in all functions, to provide the most expected behaviour unless specifically instructed otherwise.
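As a rough sketch of how this might look (assuming the runExactSVD() function with its k= and fold= arguments; the simulated matrix below is purely illustrative):

library(BiocSingular)

set.seed(100)
y <- matrix(rnorm(20 * 10000), nrow=20)  # a very "fat" 20-by-10000 matrix

## With fold=5, the cross-product is computed whenever one dimension is at
## least 5-fold larger than the other; the default fold=Inf never computes it.
out <- runExactSVD(y, k=5, fold=5)
str(out)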
Centering and scaling

In general, each SVD function performs the SVD on t((t(x) - C)/S), where C and S are numeric vectors of length equal to ncol(x). The values of C and S are defined according to the center and scale options.
If center=TRUE, C is defined as the column means of x. If center=NULL or FALSE, all elements of C are set to zero. If center is a numeric vector with length equal to ncol(x), it is used to directly define C.
If scale=TRUE, the ith element of S is defined as the square root of sum((x[,i] - C[i])^2)/(nrow(x) - 1), for whatever C was defined above. This mimics the behaviour of scale(). If scale=NULL or FALSE, all elements of S are set to unity. If scale is a numeric vector with length equal to ncol(x), it is used to directly define S.
Setting center or scale is more memory-efficient than modifying the input x directly. This is because the function will avoid constructing intermediate centered (possibly non-sparse) matrices.
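For example, a sketch with a sparse input (assuming runIrlbaSVD() accepts a numeric center= vector, per the definitions above):

library(Matrix)
library(BiocSingular)

set.seed(300)
x <- rsparsematrix(10000, 50, density=0.01)

## Centering the matrix ourselves would produce a large dense copy:
## x2 <- sweep(as.matrix(x), 2, Matrix::colMeans(x))

## Supplying 'center' instead lets the function handle the centering,
## without us having to build that dense intermediate up front.
out <- runIrlbaSVD(x, k=5, center=Matrix::colMeans(x))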
Deferred centering and scaling

Many of the SVD algorithms (and computation of the cross-product) involve repeated matrix multiplications. The BiocSingular package has a specialized DeferredMatrix class that defers centering (and to some extent, scaling) during matrix multiplication. The matrix multiplication is performed on the original matrix, and then the centering/scaling operations are applied to the matrix product. This allows direct use of the %*% method for each matrix representation, to exploit features of the underlying representation (e.g., sparsity) for greater speed.
Unfortunately, the speed-up with deferred centering comes at the cost of increasing the risk of catastrophic cancellation. The procedure requires subtraction of one large intermediate number from another to obtain the values of the final matrix product. This could result in a loss of numerical precision that compromises the accuracy of the various SVD algorithms.
The default approach is to explicitly create a dense in-memory centered/scaled matrix via block processing (see blockGrid in the DelayedArray package). This avoids problems with numerical precision, as large intermediate values are not formed. In doing so, we consistently favour accuracy over speed unless the functions are specifically instructed to do otherwise, i.e., with deferred=TRUE.
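A minimal sketch of opting in (again assuming runIrlbaSVD() and the deferred= argument behave as described above; the sparse matrix is simulated for illustration):

library(Matrix)
library(BiocSingular)

set.seed(400)
x <- rsparsematrix(5000, 100, density=0.05)

## Default: an explicit centered matrix is formed by block processing.
safe <- runIrlbaSVD(x, k=10, center=TRUE)

## Deferred centering: multiplication uses the original sparse matrix,
## at some risk of catastrophic cancellation.
fast <- runIrlbaSVD(x, k=10, center=TRUE, deferred=TRUE)

## In well-behaved cases the singular values should still agree closely.
all.equal(safe$d, fast$d, tolerance=1e-6)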
Author(s)

Aaron Lun