library(knitr) library(rmarkdown) library(BiocStyle) options(tinytex.verbose = TRUE) knitr::opts_chunk$set(collapse = TRUE, comment = "", cache=FALSE, message = FALSE, width = 180, crop = NULL)
This package implements several matrix operations using Matrix
and DelayMatrix
objects as well as HDF5 data files. Some basic algebra operations that can also be computed that are useful to implement statistical analyses using standard methodologies such as principal component analyses (PCA) or least squares estimation. The package also contains specific statistical methods mainly used in omic
data analysis such as lasso regression. All procedures referred to HDF5 can be found in BigDataStatMeth_hdf5 vignette.
The package requires other packages from CRAN
and Bioconductor
to be installed.
CRAN
: Matrix
, RcppEigen
and RSpectra
.Bioconductor
: DelayedArray
As the package can also deal with hdf5 files [See Vignette BigDataStatMeth_hdf5], these other packages are required: HDF5Array
, rhdf5
. The user can execute this code to install the required packages:
# Install BiocManager (if not previously installed) install.packages("BiocManager") # Install required packages BiocManager::install(c("Matrix", "RcppEigen", "RSpectra", "DelayedArray", "HDF5Array", "rhdf5"))
Our package needs to be installed from source code. In such cases, a collection of software (e.g. C, C++, Fortran, etc.) are required, mainly for Windows users. These programs can be installed using Rtools.
Once required packages and Rtools are installed, BigDataStatMeth
package can be installed from our GitHub repository as follows:
# Install devtools and load library (if not previously installed) install.packages("devtools") library(devtools) # Install BigDataStatMeth install_github("isglobal-brge/BigDataStatMeth")
First, let us start by loading the required packages to describe the main capabilities of the package
library(Matrix) library(DelayedArray) library(BigDataStatMeth) library(ggplot2)
This packages is also required to reproduce this vignette
library(microbenchmark)
In this section, different products of matrices and vectors are introduced. The methods implement different strategies including block multiplication algorithms and the use of parallel implementations.
A block matrix or a partitioned matrix is a matrix that is interpreted as having been broken into sections called blocks or submatrices. Intuitively, a block matrix can be visualized as the original matrix with a collection of horizontal and vertical lines, which break it up, or partition it, into a collection of smaller matrices. the implementation has been made from the adaptation of the Fox algorithm [1].
[AB=\begin{pmatrix} {A}{11}&{A}{12} \ {A}{21}&{A}{22}\end{pmatrix}\begin{pmatrix}{B}{11}&B{12}\B_{21}&B_{22}\end{pmatrix}=\begin{pmatrix}{C}{11}&C{12}\C_{21}&C_{22}\end{pmatrix}=C]
Let us simulate to set of matrices to illustrate the use of the function accross the entire documment. First, we simulate a simple case with to matrices A and B with dimensions 500x500 and 500x750, respectively. Second, another example with dimensions 1000x10000 are use to evaluate the performance in large matrices. The examples with big datasets will be illustrated using real data belonging to different omic settings. We can simulate to matrices with the desired dimensions by
# Define small matrix A and B set.seed(123456) n <- 500 p <- 750 A <- matrix(rnorm(n*n), nrow=n, ncol=n) B <- matrix(rnorm(n*p), nrow=n, ncol=p) # Define big matrix Abig and Bbig n <- 1000 p <- 10000 Abig <- matrix(rnorm(n*n), nrow=n, ncol=n) Bbig <- matrix(rnorm(n*p), nrow=n, ncol=p)
Matrix multiplication using block matrices is implemented in the bdblockmult()
function. It only requires to setting the argument block_size
, by default block_size = 128
. An optimal block size is important to better performance but it is difficult to asses what is the optimum block size for each matrix.
# Use 10x10 blocks AxB <- bdblockmult(A, B, block_size = 10)
As expected the results obtained using this procedure are the correct ones
all.equal(AxB, A%*%B)
Note that when the argument block_size
is larger than any of the dimensions of matrix A or B the blocks_size
is set to min(cols(A), rows(A), cols(B), rows(B))
.
The process can be speed up by making computations in parallel using paral=TRUE
an optional parameter threads can be used to indicate the number of threads
to launch simultaneously, if threads=NULL
the function takes the available threads - 1, leaving one available for user.
AxB <- bdblockmult(A, B, block_size = 10, paral = TRUE) all.equal(AxB,A%*%B)
To work with big matrices bdblockmult()
can save matrices in hdf5 file format in order to be able to operate with them in blocks and not overload the memory, by default are considered large matrices if the number of rows or columns is greater than 5000, but it can be changed with bigmatrix
argument. Or we can also force to execute the matrix multiplication with data on memory setting the parameter onmemory = TRUE
# We want to force it to run in memory AxBNOBig <- bdblockmult(Abig, Bbig, onmemory = TRUE) # Run matrix multiplication with data on memory using submatrices of 256x256 AxBBig3000 <- bdblockmult(Abig, Bbig, block_size = 256 , onmemory = TRUE)
Here, we compare the performance of the block method with the different options.
bench1 <- microbenchmark( # Parallel block size = 128 Paral128Mem = bdblockmult(Abig, Bbig, paral = TRUE), # On disk parallel block size = 256 Paral256Disk = bdblockmult(Abig, Bbig, block_size=256, paral=TRUE), Paral256Mem = bdblockmult(Abig, Bbig, block_size=256, paral=TRUE, onmemory=TRUE), Paral1024Mem = bdblockmult(Abig, Bbig, block_size=1024, paral=TRUE, onmemory=TRUE), times = 3 ) bench1
We can observe that the shortest execution time is achieved with block_size = 256
.
The same information is depicted in the next Figure which illustrates the comparison between the different assessed methods
ggplot2::autoplot(bench1)
A sparse matrix or sparse array is a matrix in which most of the elements are zero. BigDataStatMeth
allows to perform matrix multiplication with sparse matrices using the function bdblockmult_sparse
. It is necessary that at least one of the the two matrices is defined as sparse in R using dgCMatrix
class. An example of a sparse matrix could be :
\begin{equation} \begin{bmatrix} a_{11}&a_{12} & 0 & \cdots & \cdots & \cdots & \cdots & 0 \ a_{21} & a_{22} & a_{23} & \ddots & && & \vdots \ 0 & a_{32} & a_{33} & a_{34} & \ddots & & & \vdots \ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & & \vdots \ \vdots & & \ddots & \ddots & \ddots & \ddots & \ddots& \vdots\ \vdots & & & & \ddots & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n}\ 0 & \cdots & \cdots & \cdots & \cdots & 0 & a_{n,n-1} & a_{n,n} \ \end{bmatrix} \end{equation}
k <- 1e3 # Generate 2 sparse matrix x_sparse and y_sparse set.seed(1) x_sparse <- sparseMatrix( i = sample(x = k, size = k), j = sample(x = k, size = k), x = rnorm(n = k) ) set.seed(2) y_sparse <- sparseMatrix( i = sample(x = k, size = k), j = sample(x = k, size = k), x = rnorm(n = k) ) d <- bdblockmult_sparse(x_sparse, y_sparse) f <- x_sparse%*%y_sparse all.equal(d,f)
Here, we compare the performace using sparse matrix multiplication with sparse matrices and the blockmult multiplication with the same matrices not declared as sparce.
res <- microbenchmark( sparse_mult = bdblockmult_sparse(x_sparse, y_sparse), matrix_mult = bdblockmult(as.matrix(x_sparse), as.matrix(y_sparse)), RSparse_mult = x_sparse%*% y_sparse, times = 3 ) res
The same information is depicted in the next Figure which illustrates the comparison between the different assessed methods
ggplot2::autoplot(res)
We can observe the huge difference in execution time
To perform a cross-product and tcrossproduct with BigDataStatMeth we use the same function bdcrossprod()
setting parameter transposed
to TRUE or FALSE. Like other functions implemented in BigDataStatMeth, we can work with R Objects. Setting parameter transposed for Crossproduct and tCrossProduct
if transposed = FALSE
(default), we perform a Crossproduct [\left( C \right) ={ \left( A \right) }^{ t }\left( A \right) ]
if transposed = TRUE
: we perform a transposed-Crossproduct [\left( C \right) = \left( A \right){ \left( A \right) }^{ t } ]
Here we shown some examples using bdcrossprod
function
n <- 500 m <- 250 A <- matrix(rnorm(n*m), nrow=n, ncol=m) # Cross Product of a standard R matrix cpA <- bdCrossprod(A)
We obtain the expected values computed using crossprod
R function
all.equal(cpA, crossprod(A))
you may also set transposed=TRUE
# Transposed Cross Product R matrices tcpA <- bdtCrossprod(A)
We obtain the expected values computed using tcrossprod
function
all.equal(tcpA, tcrossprod(A))
We can show that the implemented version really improves the R implementation computational speed.
res <- microbenchmark( bdcrossp_tr = bdtCrossprod(A), rcrossp_tr = tcrossprod(A), bdcrossp = bdCrossprod(A), rcrossp = crossprod(A), times = 3) res
ggplot2::autoplot(res)
You can perform a weighted cross-product $C = X^ t w X$ with bdwcrossprod()
given a matrix X as argument and a vector or matrix of weights, w.
n <- 250 X <- matrix(rnorm(n*n), nrow=n, ncol=n) u <- runif(n) w <- u * (1 - u) wcpX <- bdwproduct(X, w,"xwxt") wcpX[1:5,1:5]
Those are the expected values as it is indicated by executing:
all.equal( wcpX, X%*%diag(w)%*%t(X) )
With argument transposed=TRUE
, we can perform a transposed weighted cross-product $C = A w A^t$
wtcpX <- bdwproduct(X, w,"xtwx") wtcpX[1:5,1:5]
Those are the expected values as it is indicated by executing:
all.equal(wtcpX, t(X)%*%diag(w)%*%X)
The Cholesky factorization is widely used for solving a system of linear equations whose coefficient matrix is symmetric and positive definite.
[A = LL^t = U^tU]
where $L$ is a lower triangular matrix and U is an upper triangular matrix. To get the Inverse Cholesky we can use the function bdInvCholesky()
. Let us start by illustrating how to do this computations in a simulated data:
# Generate a positive definite matrix Posdef <- function (n, ev = runif(n, 0, 10)) { Z <- matrix(ncol=n, rnorm(n^2)) decomp <- qr(Z) Q <- qr.Q(decomp) R <- qr.R(decomp) d <- diag(R) ph <- d / abs(d) O <- Q %*% diag(ph) Z <- t(O) %*% diag(ev) %*% O return(Z) } A <- Posdef(n = 500, ev = 1:500) invchol <- bdInvCholesky(A) round(invchol[1:5,1:5],8)
We can check whether this function returns the expected values obtained with the standard R function solve
:
all.equal(invchol, solve(A))
We can show that the implemented version really improves the R implementation computational speed.
res <- microbenchmark(invchol = bdInvCholesky(A), invcholR = solve(A), times = 3) res
ggplot2::autoplot(res)
The Moore-Penrose pseudoinverse is a direct application of the SVD. The inverse of a matrix A can be used to solve the equation ${Ax}={b}$. But in the case where the set of equations have 0 or many solutions the inverse cannot be found and the equation cannot be solved. The following formula is used to find the pseudoinverse:
[{A}^+= {VD}^+{U}^T]
In BigDataStatMeth we implemented Moore-Penrose pseudoinverse in function bdpseudoinv
, now we get the pseudoinverse from a simulated data
m <- 1300 n <- 1200 A <- matrix(rnorm(n*m), nrow=n, ncol=m) pseudoinv <- bdpseudoinv(A) zapsmall(pseudoinv)[1:5,1:5]
QR decomposition, also known as a QR factorization or QU factorization is a decomposition of a matrix A into a product : [A = QR] of an orthogonal matrix Q and an upper triangular matrix R. QR decomposition is often used to solve the linear least squares problems.
In BigDataStatMeth we implemented QR decomposition in function bdQR
, the QR decomposition can be performed in R objects. To show how to use this function we performa a QR decomposition from the previous simulated data in matrix A.
QR_A <- bdQR(A) QR_R <- qr(A) # Show results for Q zapsmall(QR_A$Q[1:5,1:5]) # Show results for R zapsmall(QR_A$R[1:5,1:5]) # Test Q all.equal(QR_A$Q, qr.Q(QR_R), check.attributes=FALSE)
In BigDataStatMeth we implemented the function bdSolve
that computes the solution to a real system of linear equations
[A*X = B] where A is an N-by-N matrix and X and B are N-by-K matrices.
Here we solve a matrix equation with a squared matrix A (1000 by 1000) and B matrix (1000 by 2)
# Simulate data m <- 1000 n <- 1000 A <- matrix(runif(n*m), nrow = n, ncol = m) B <- matrix(runif(n*2), nrow = n) # Solve matrix equation X <- bdSolve(A, B) # Show results X[1:5,]
Now we check results multiplying matrix A by results X, if all is correct $A*X = B$
testB <- bdblockmult(A,X) B[1:5,] testB[1:5,] all.equal(B, testB)
The SVD of an $m \times n$ real or complex matrix $A$ is a factorization of the form:
$$U\Sigma { V }^{ T }$$
where :
-$U$ is a $m \times m$ real or complex unitary matrix -$\Sigma$ is a $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal -$V$ is a $n \times n$ real or complex unitary matrix.
Notice that:
He have implemented the SVD for R matrices in the function bdSVD()
. The method, so far, only allows SVD of real matrices $A$. This code illustrate how to perform such computations:
# Matrix simulation set.seed(413) n <- 500 A <- matrix(rnorm(n*n), nrow=n, ncol=n) # SVD bsvd <- bdSVD(A) # Singular values, and right and left singular vectors bsvd$d[1:5] bsvd$u[1:5,1:5] bsvd$v[1:5,1:5]
We get the expected results obtained with standard R functions:
all.equal( sqrt( svd( tcrossprod( scale(A) ) )$d[1:10] ), bsvd$d[1:10] )
you get the $\sigma_i$, $U$ and $V$ of normalized matrix $A$, if you want to perform the SVD from not normalized matrix $A$ then you have to set the parameter bcenter = false
and bscale = false
.
bsvd <- bdSVD(A, bcenter = FALSE, bscale = FALSE) bsvd$d[1:5] bsvd$u[1:5,1:5] bsvd$v[1:5,1:5] all.equal( sqrt(svd(tcrossprod(A))$d[1:10]), bsvd$d[1:10] )
A method developed by M. A. Iwen and B. W. Ong uses a distributed and incremental SVD algorithm that is useful for agglomerative data analysis on large networks. The algorithm calculates the singular values and left singular vectors of a matrix A, by first, partitioning it by columns. This creates a set of submatrices of A with the same number of rows, but only some of its columns. After that, the SVD of each of the submatrices is computed. The final step consists of combining the results obtained by merging them again and computing the SVD of the resulting matrix.
This approach is only implemented using HDF5 files. This method is implemented in bdSVD_hdf5
function, this function works directly on hdf5 data format, loading in memory only the data to perform calculations and saving the results again in the hdf5 file for later use. The user is referred to read section 7.3.1 from this vignette.
Let us illustrate how to perform a PCA using miRNA data obtained from TCGA corresponding to 3 different tumors: melanoma (ME), leukemia (LEU) and centran nervous system (CNS). Data is available at BigDataStatMeth
and can be loaded by simply:
data(miRNA) data(cancer) dim(miRNA)
We observe that there are a total of 21 individuals and 537 miRNAs. The vector cancer
contains the type of tumor of each individual. For each type we have:
table(cancer)
Now, the typical principal component analysis on the samples
can be run on the miRNA
matrix since it has miRNAs in columns and individuals in rows
pc <- prcomp(miRNA)
We can plot the two first components with:
plot(pc$x[, 1], pc$x[, 2], main = "miRNA data on tumor samples", xlab = "PC1", ylab = "PC2", type="n") abline(h=0, v=0, lty=2) points(pc$x[, 1], pc$x[, 2], col = cancer, pch=16, cex=1.2) legend("topright", levels(cancer), pch=16, col=1:3)
The PCA is equivalent to performing the SVD on the centered data, where the centering occurs on the columns. In that case the function bdSVD
requires to set the argument bcenter
and bscale
equal to TRUE, the dafault values.
miRNA.c <- sweep(miRNA, 2, colMeans(miRNA), "-") svd.da <- bdSVD(miRNA.c, bcenter = FALSE, bscale = FALSE)
The PCA plot for the two principal components can then be be obtained by:
plot(svd.da$u[, 1], svd.da$u[, 2], main = "miRNA data on tumor samples", xlab = "PC1", ylab = "PC2", type="n") abline(h=0, v=0, lty=2) points(svd.da$u[, 1], svd.da$u[, 2], col = cancer, pch=16, cex=1.2) legend("topright", levels(cancer), pch=16, col=1:3)
We can observe that both figures are equal irrespective to a sign change of second component (that can happen in SVD).
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.