
An overview of the available options when performing SVD with any algorithm.

If the dimensions of the input matrix are very different, it may be faster to compute the cross-product and perform the SVD on the resulting square matrix,
rather than performing SVD directly on a very fat or tall input matrix.
The cross-product can often be computed very quickly due to good data locality, yielding a small square matrix that is easily handled by any SVD algorithm.
This is especially true in cases where the input matrix is not held in memory.
Calculation of the cross-product only involves one read across the entire data set,
while direct application of approximate methods like `irlba` or `rsvd` would need to access the data multiple times.
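
The relationship between the SVD of a tall matrix and the SVD of its cross-product can be sketched in base R. This is only an illustration of the underlying algebra on simulated data, not the code path used inside BiocSingular; the variable names are chosen for this example.

```r
set.seed(100)
x <- matrix(rnorm(2000 * 5), nrow = 2000, ncol = 5)  # tall: 2000 rows, 5 columns

# The cross-product t(x) %*% x is only 5 x 5 and needs a single pass over x.
xtx <- crossprod(x)

# If x = U D V', then t(x) %*% x = V D^2 V'. The SVD of the cross-product
# therefore yields V directly, along with the squared singular values of x.
cp <- svd(xtx)
d <- sqrt(cp$d)
v <- cp$v

# Recover the left singular vectors as U = x V D^{-1}.
u <- sweep(x %*% v, 2, d, "/")

# Compare against a direct SVD of the tall matrix.
direct <- svd(x)
max_d_diff <- max(abs(d - direct$d))          # difference in singular values
recon_err <- max(abs(x - u %*% (d * t(v))))   # error in reconstructing x
```

Note that forming the cross-product squares the condition number of the problem, so small singular values may be recovered less accurately than with a direct SVD.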

The various BiocSingular SVD functions allow users to specify the minimum fold difference between the matrix dimensions (via the `fold` argument) at which a cross-product should be computed.
Setting `fold=1` will always compute the cross-product for any matrix - this is probably unwise.
By contrast, setting `fold=Inf` means that the cross-product is never computed.
This is currently the default in all functions, to provide the most expected behaviour unless specifically instructed otherwise.
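
One plausible reading of this rule is sketched below. The helper `should_use_crossprod` is hypothetical and written only for this illustration; the exact comparison used internally by BiocSingular is an assumption here and may differ. It assumes a non-empty matrix.

```r
# Hypothetical helper: decide whether the larger dimension exceeds the
# smaller one by at least a factor of 'fold'.
should_use_crossprod <- function(x, fold) {
    d <- dim(x)
    max(d) / min(d) >= fold
}

tall <- matrix(0, nrow = 1000, ncol = 10)   # 100-fold difference in dimensions
square <- matrix(0, nrow = 10, ncol = 10)   # 1-fold difference

should_use_crossprod(tall, fold = 5)      # large enough difference
should_use_crossprod(tall, fold = Inf)    # never triggered
should_use_crossprod(square, fold = 1)    # always triggered, even when square
```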

In general, each SVD function performs the SVD on `t((t(x) - C)/S)` where `C` and `S` are numeric vectors of length equal to `ncol(x)`.
The values of `C` and `S` are defined according to the `center` and `scale` options.

If `center=TRUE`, `C` is defined as the column means of `x`. If `center=NULL` or `FALSE`, all elements of `C` are set to zero. If `center` is a numeric vector with length equal to `ncol(x)`, it is used to directly define `C`.

If `scale=TRUE`, the `i`th element of `S` is defined as the square root of `sum((x[,i] - C[i])^2)/(nrow(x)-1)`, for whatever `C` was defined above. This mimics the behaviour of `scale`. If `scale=NULL` or `FALSE`, all elements of `S` are set to unity. If `scale` is a numeric vector with length equal to `ncol(x)`, it is used to directly define `S`.
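
As a rough check of these definitions, the sketch below builds `C` and `S` by hand for the `center=TRUE`, `scale=TRUE` case, taking `C` as the column means and using a denominator of `nrow(x) - 1` as in base R's `scale`. The data are simulated and the variable names are chosen for this example only.

```r
set.seed(42)
x <- matrix(rnorm(50 * 4, mean = 3, sd = 2), nrow = 50, ncol = 4)

# C: one centering value per column (the column means).
C <- colMeans(x)

# S: one scaling value per column, computed around C.
S <- sqrt(colSums(sweep(x, 2, C)^2) / (nrow(x) - 1))

# The matrix on which the SVD is performed.
# t(x) has one row per column of x, so C and S recycle correctly.
y <- t((t(x) - C) / S)

# Reference result from base R.
ref <- scale(x, center = TRUE, scale = TRUE)
max_diff <- max(abs(y - ref))
```

Each column of `y` should have a mean of zero and a standard deviation of one, matching the output of `scale(x)` up to floating-point error.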

Setting `center` or `scale` is more memory-efficient than modifying the input `x` directly.
This is because the function will avoid constructing intermediate centered (possibly non-sparse) matrices.
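
The loss of sparsity after explicit centering can be seen in a small base R sketch. It uses an ordinary dense matrix of simulated counts purely to count zeros; a real sparse representation (e.g., from the Matrix package) is not needed to make the point.

```r
set.seed(7)
# Simulated count matrix in which most entries are zero.
x <- matrix(rpois(1000 * 20, lambda = 0.1), nrow = 1000, ncol = 20)
prop_zero_before <- mean(x == 0)

# Explicit centering subtracts a non-zero column mean from every entry,
# so the former zeros become non-zero values.
centered <- sweep(x, 2, colMeans(x))
prop_zero_after <- mean(centered == 0)
```

A sparse matrix that is centered in this manner would need to store nearly every entry, which is the memory cost that the `center` and `scale` arguments avoid.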

Many of the SVD algorithms (and computation of the cross-product) involve repeated matrix multiplications.
The BiocSingular package has a specialized DeferredMatrix class that defers centering (and to some extent, scaling) during matrix multiplication.
The matrix multiplication is performed on the original matrix, and then the centering/scaling operations are applied to the matrix product.
This allows direct use of the `%*%` method for each matrix representation, to exploit features of the underlying representation (e.g., sparsity) for greater speed.
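
The algebra behind deferral can be sketched in base R as follows. This is an illustration of the identity only, not the DeferredMatrix implementation; `x`, `y`, `C` and `S` are simulated for this example.

```r
set.seed(1)
x <- matrix(rpois(200 * 6, lambda = 0.3), nrow = 200, ncol = 6)  # mostly zeros
y <- matrix(rnorm(6 * 3), nrow = 6, ncol = 3)
C <- colMeans(x)
S <- apply(x, 2, sd)

# Explicit approach: build the centered and scaled matrix, then multiply.
explicit <- t((t(x) - C) / S) %*% y

# Deferred approach: fold the scaling into y (row i of y is divided by S[i]),
# multiply by the ORIGINAL x, then subtract the centering contribution,
# which is the same row vector for every row of the product.
ys <- y / S
raw_product <- x %*% ys
correction <- drop(C %*% ys)               # one value per column of the product
deferred <- sweep(raw_product, 2, correction)

max_diff <- max(abs(explicit - deferred))
```

Only `x %*% ys` touches the large matrix, so any fast multiplication method for the representation of `x` can be used without modifying `x` itself.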

Unfortunately, the speed-up with deferred centering comes at the cost of increasing the risk of catastrophic cancellation. The procedure requires subtraction of one large intermediate number from another to obtain the values of the final matrix product. This could result in a loss of numerical precision that compromises the accuracy of the various SVD algorithms.
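
The effect can be provoked in base R with a simulated matrix whose column means are very large relative to the spread of its values. The magnitudes below are chosen to exaggerate the problem and are not representative of typical data.

```r
set.seed(2)
# Values near 1e9 with unit variance: large means, small deviations.
x <- matrix(1e9 + rnorm(100 * 5), nrow = 100, ncol = 5)
y <- matrix(runif(5 * 3), nrow = 5, ncol = 3)
C <- colMeans(x)

# Explicit centering: the subtraction happens entry by entry, so the
# product is formed from small, accurately represented deviations.
explicit <- sweep(x, 2, C) %*% y

# Deferred centering: both intermediate quantities are of order 1e9,
# and their difference retains only a few significant digits.
deferred <- sweep(x %*% y, 2, drop(C %*% y))

# Discrepancy between the two approaches, far above machine epsilon.
discrepancy <- max(abs(deferred - explicit))
```

The entries of the final product are of order one, so a discrepancy many orders of magnitude above machine epsilon represents a real loss of precision.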

The default approach is to explicitly create a dense in-memory centred/scaled matrix via block processing (see `blockGrid` in the DelayedArray package).
This avoids problems with numerical precision as large intermediate values are not formed.
In doing so, we consistently favour accuracy over speed unless the functions are specifically instructed to do otherwise, i.e., with `deferred=TRUE`.

Aaron Lun
