# options: Global SVD options in LTLA/BiocSingular: Singular Value Decomposition for Bioconductor Packages

## Description

An overview of the available options when performing SVD with any algorithm.

## Computing the cross-product

If the input matrix has very different numbers of rows and columns, it may be faster to compute the cross-product and perform the SVD on the resulting square matrix, rather than performing the SVD directly on a very fat or tall input matrix. The cross-product can often be computed very quickly due to good data locality, yielding a small square matrix (with dimensions equal to the smaller dimension of the input) that is easily handled by any SVD algorithm. This is especially true in cases where the input matrix is not held in memory. Calculation of the cross-product only involves one read across the entire data set, while direct application of approximate methods like `irlba` or `rsvd` would need to access the data multiple times.
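The idea can be sketched in base R for a tall matrix. This is only an illustration of the general technique, not the BiocSingular implementation: it uses `eigen` on the cross-product and recovers the singular values and vectors from the eigendecomposition. All variable names are chosen for this example.

```r
set.seed(100)
x <- matrix(rnorm(20000), nrow = 2000, ncol = 10)  # tall matrix: 2000 x 10

# Direct SVD, kept only as a reference for comparison.
ref <- svd(x)

# crossprod(x) is t(x) %*% x, a small 10 x 10 symmetric matrix.
cp <- crossprod(x)
e <- eigen(cp, symmetric = TRUE)

d <- sqrt(pmax(e$values, 0))    # singular values of x
v <- e$vectors                  # right singular vectors of x
u <- sweep(x %*% v, 2, d, "/")  # left singular vectors: u = x v / d

all.equal(d, ref$d)
```

The singular vectors are only defined up to sign, so `u` and `v` may differ from `ref$u` and `ref$v` by a sign flip per column. Forming the cross-product squares the condition number of the problem, so very small singular values may be recovered less accurately than with a direct SVD.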

The various BiocSingular SVD functions allow users to specify the minimum fold difference between the matrix dimensions (via the `fold` argument) at which a cross-product should be computed. Setting `fold=1` will always compute the cross-product for any matrix; this is probably unwise. By contrast, setting `fold=Inf` means that the cross-product is never computed. This is currently the default in all functions, to provide the least surprising behaviour unless specifically instructed otherwise.
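One plausible reading of the `fold` threshold is sketched below. The helper `use_crossprod_sketch` is hypothetical and written for this example; in particular, the use of `>=` and of the ratio of the larger to the smaller dimension are assumptions that are consistent with the behaviour described for `fold=1` and `fold=Inf`, not a statement of how the package decides.

```r
# Hypothetical helper, not part of BiocSingular: decide whether the
# cross-product route would be taken for a given 'fold' threshold.
use_crossprod_sketch <- function(x, fold) {
    dims <- dim(x)
    max(dims) / min(dims) >= fold
}

tall <- matrix(0, nrow = 1000, ncol = 10)  # 100-fold difference in dimensions

use_crossprod_sketch(tall, fold = 1)    # TRUE: always use the cross-product
use_crossprod_sketch(tall, fold = 5)    # TRUE: 100 >= 5
use_crossprod_sketch(tall, fold = 200)  # FALSE: 100 < 200
use_crossprod_sketch(tall, fold = Inf)  # FALSE: never use the cross-product
```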

## Centering and scaling

In general, each SVD function performs the SVD on `t((t(x) - C)/S)` where `C` and `S` are numeric vectors of length equal to `ncol(x)`. The values of `C` and `S` are defined according to the `center` and `scale` options.

• If `center=TRUE`, `C` is defined as the column means of `x`. If `center=NULL` or `FALSE`, all elements of `C` are set to zero. If `center` is a numeric vector with length equal to `ncol(x)`, it is used to directly define `C`.

• If `scale=TRUE`, the `i`th element of `S` is defined as the square root of `sum((x[,i] - C[i])^2)/(nrow(x)-1)`, for whatever `C` was defined above. This mimics the behaviour of the base R `scale` function. If `scale=NULL` or `FALSE`, all elements of `S` are set to unity. If `scale` is a numeric vector with length equal to `ncol(x)`, it is used to directly define `S`.
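The definitions of `C` and `S` can be checked against base R's `scale` function. The sketch below assumes that `center=TRUE` corresponds to column means and that the scaling denominator is the number of rows minus one, which is what `scale` itself does; the matrix `y` is the quantity on which the SVD would then be performed.

```r
set.seed(42)
x <- matrix(rnorm(50, mean = 5), nrow = 10, ncol = 5)

# center=TRUE: one centering value per column.
C <- colMeans(x)

# scale=TRUE: root of the summed squared deviations over (nrow(x) - 1).
S <- sqrt(colSums(sweep(x, 2, C)^2) / (nrow(x) - 1))

# The matrix that is decomposed: each column is centered, then scaled.
y <- t((t(x) - C) / S)

# Reference from base R.
ref <- scale(x, center = TRUE, scale = TRUE)
all.equal(y, ref, check.attributes = FALSE)
```

The double transposition works because R recycles `C` and `S` down the rows of `t(x)`, so that element `i` of each vector is applied to column `i` of `x`.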

Setting `center` or `scale` is more memory-efficient than modifying the input `x` directly. This is because the function will avoid constructing intermediate centered (possibly non-sparse) matrices.

## Deferred centering and scaling

Many of the SVD algorithms (and computation of the cross-product) involve repeated matrix multiplications. The BiocSingular package has a specialized DeferredMatrix class that defers centering (and to some extent, scaling) during matrix multiplication. The matrix multiplication is performed on the original matrix, and then the centering/scaling operations are applied to the matrix product. This allows direct use of the `%*%` method for each matrix representation, to exploit features of the underlying representation (e.g., sparsity) for greater speed.
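The algebra behind deferred centering can be illustrated in base R. This is a minimal sketch of the identity `(x - 1 C') y = x y - 1 (C' y)` using ordinary matrices; it does not use or reproduce the DeferredMatrix class, and all names are chosen for this example.

```r
set.seed(1)
x <- matrix(rpois(60, lambda = 1), nrow = 12, ncol = 5)  # counts with many zeros
C <- colMeans(x)
y <- matrix(rnorm(15), nrow = 5, ncol = 3)

# Explicit approach: center first (the centered matrix is no longer
# sparse), then multiply.
explicit <- sweep(x, 2, C) %*% y

# Deferred approach: multiply the original matrix, then subtract a
# per-column correction from the product.
correction <- drop(C %*% y)  # one value per column of the product
deferred <- sweep(x %*% y, 2, correction)

all.equal(explicit, deferred)
```

In the deferred version, the only multiplication involving `x` is `x %*% y` on the uncentered data, which is where a sparse or otherwise specialized representation could use its own `%*%` method.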

Unfortunately, the speed-up with deferred centering comes at the cost of increasing the risk of catastrophic cancellation. The procedure requires subtraction of one large intermediate number from another to obtain the values of the final matrix product. This could result in a loss of numerical precision that compromises the accuracy of the various SVD algorithms.
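A small base R example shows how this loss of precision can arise. The data are contrived for illustration: a single column with a very large mean and a small spread, multiplied by a vector `y`. The exact centered product is 0.2.

```r
x <- 1e12 + c(1, 2, 3)  # one column: large mean, small spread
y <- c(0.1, 0.2, 0.3)
C <- mean(x)

# Explicit centering: the subtraction yields -1, 0, 1 before multiplying.
explicit <- sum((x - C) * y)

# Deferred centering: two intermediate values near 6e11 are formed
# and then subtracted from each other.
deferred <- sum(x * y) - C * sum(y)

abs(explicit - 0.2)  # error of the explicit approach
abs(deferred - 0.2)  # error of the deferred approach
```

The deferred result inherits rounding errors on the scale of the large intermediates, so its error is typically many orders of magnitude larger than that of the explicit result, although the exact values depend on the platform.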

The default approach is to explicitly create a dense in-memory centered/scaled matrix via block processing (see `blockGrid` in the DelayedArray package). This avoids problems with numerical precision as large intermediate values are not formed. In doing so, we consistently favour accuracy over speed unless the functions are specifically instructed to do otherwise, i.e., with `deferred=TRUE`.
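The general shape of block-wise centering and scaling can be sketched in base R. The function `center_scale_blocks` is a hypothetical helper written for this example; it processes fixed-size column blocks with a plain loop and does not use DelayedArray or reflect how the package chooses its blocks. It assumes `x` has at least one column.

```r
# Hypothetical helper, not part of BiocSingular or DelayedArray.
center_scale_blocks <- function(x, C, S, block_size = 2L) {
    out <- matrix(0, nrow = nrow(x), ncol = ncol(x))  # dense result
    starts <- seq(1L, ncol(x), by = block_size)
    for (s in starts) {
        idx <- s:min(s + block_size - 1L, ncol(x))
        # Only one block of columns is realized as a dense matrix at a time.
        block <- as.matrix(x[, idx, drop = FALSE])
        block <- sweep(block, 2, C[idx], "-")
        out[, idx] <- sweep(block, 2, S[idx], "/")
    }
    out
}

set.seed(7)
x <- matrix(rnorm(30, mean = 3), nrow = 6, ncol = 5)
res <- center_scale_blocks(x, C = colMeans(x), S = apply(x, 2, sd))
all.equal(res, scale(x), check.attributes = FALSE)
```

Because each block is centered before any multiplication takes place, no large intermediate values are formed, at the cost of holding the full dense result in memory.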

## Author(s)

Aaron Lun

LTLA/BiocSingular documentation built on Feb. 25, 2020, 7:31 p.m.