options: Global SVD options

BiocSingular optionsR Documentation

Global SVD options

Description

An overview of the available options when performing SVD with any algorithm.

Computing the cross-product

If the dimensions of the input matrix are very different, it may be faster to compute the cross-product and perform the SVD on the resulting square matrix, rather than performing SVD directly on a very fat or tall input matrix. The cross-product can often be computed very quickly due to good data locality, yielding a small square matrix that is easily handled by any SVD algorithm. This is especially true in cases where the input matrix is not held in memory. Calculation of the cross-product only involves one read across the entire data set, while direct application of approximate methods like irlba or rsvd would need to access the data multiple times.

The various BiocSingular SVD functions allow users to specify the minimum fold difference (via the fold argument) at which a cross-product should be computed. Setting fold=1 will always compute the cross-product for any matrix - this is probably unwise. By contrast, setting fold=Inf means that the cross-product is never computed. This is currently the default in all functions, to provide the most expected behaviour unless specifically instructed otherwise.

Centering and scaling

In general, each SVD function performs the SVD on t((t(x) - C)/S) where C and S are numeric vectors of length equal to ncol(x). The values of C and S are defined according to the center and scale options.

  • If center=TRUE, C is defined as the column means of x. If center=NULL or FALSE, all elements of C are set to zero. If center is a numeric vector with length equal to ncol(x), it is used to directly define C.

  • If scale=TRUE, the ith element of S is defined as the square root of sum((x[,i] - C[i])^2)/(ncol(x)-1), for whatever C was defined above. This mimics the behaviour of scale. If scale=NULL or FALSE, all elements of S are set to unity. If scale is a numeric vector with length equal to ncol(x), it is used to directly define S.

Setting center or scale is more memory-efficient than modifiying the input x directly. This is because the function will avoid constructing intermediate centered (possibly non-sparse) matrices.

Deferred centering and scaling

Many of the SVD algorithms (and computation of the cross-product) involve repeated matrix multiplications. We speed this up by using the ScaledMatrix class to defer centering (and to some extent, scaling) during matrix multiplication. The matrix multiplication is performed on the original matrix, and then the centering/scaling operations are applied to the matrix product. This allows direct use of the %*% method for each matrix representation, to exploit features of the underlying representation (e.g., sparsity) for greater speed.

Unfortunately, the speed-up with deferred centering comes at the cost of increasing the risk of catastrophic cancellation. The procedure requires subtraction of one large intermediate number from another to obtain the values of the final matrix product. This could result in a loss of numerical precision that compromises the accuracy of the various SVD algorithms.

The default approach is to explicitly create a dense in-memory centred/scaled matrix, possibly via block processing (see blockGrid in the DelayedArray package). This avoids problems with numerical precision as large intermediate values are not formed. In doing so, we consistently favour accuracy over speed unless the functions are specifically instructed to do otherwise, i.e., with deferred=TRUE.

Author(s)

Aaron Lun


LTLA/BiocSingular documentation built on March 5, 2024, 5:19 a.m.