Matrix operations, algebra and Statistical Methods for Big Data. Part II: working with HDF5 files

library(knitr)
library(BiocStyle)

knitr::opts_chunk$set(collapse = TRUE, comment = "", cache = FALSE, message = FALSE, width = 180, crop = NULL)
if( isTRUE(file.exists('delayed.hdf5'))) {
    file.remove('delayed.hdf5')
}
if( isTRUE(file.exists('robject.hdf5'))){
    file.remove('robject.hdf5')
}
if( isTRUE(file.exists('colesterol_file.hdf5'))){
    file.remove('colesterol_file.hdf5')
}
if( isTRUE(file.exists('tmp_blockmult.hdf5'))){
    file.remove('tmp_blockmult.hdf5')
}

Overview

This package implements several matrix operations using Matrix and DelayedMatrix objects as well as HDF5 data files. It also provides basic algebra operations that are useful for implementing statistical analyses based on standard methodologies such as principal component analysis (PCA) or least squares estimation. The package also contains specific statistical methods mainly used in omic data analysis, such as lasso regression. All procedures for working with objects loaded in memory can be found in the BigDataStatMeth_memory vignette.

Prerequisites

The package requires other packages from CRAN and Bioconductor to be installed.

The user can execute this code to install the required packages:

# Install BiocManager (if not previously installed)
install.packages("BiocManager") 

# Install required packages
BiocManager::install(c("Matrix", "RcppEigen", "RSpectra",
                       "DelayedArray", "HDF5Array", "rhdf5"))

Our package needs to be installed from source code. In that case, a collection of build tools (C, C++ and Fortran compilers) is required, mainly for Windows users. On Windows, these programs can be installed using Rtools.

Install package

Once the required packages and Rtools are installed, the BigDataStatMeth package can be installed from our GitHub repository as follows:

# Install devtools and load library (if not previously installed)
install.packages("devtools") 
library(devtools)

# Install BigDataStatMeth 
install_github("isglobal-brge/BigDataStatMeth")

Getting started

First, let us load the packages required to describe the main capabilities of the package:

library(rhdf5)
library(BigDataStatMeth)

Previous knowledge

HDF5 data files

Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. It is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

HDF5 is a technology suite that makes possible the management of extremely large and complex data collections. A single file can accommodate virtually every kind of data: sequences, images, SNP matrices, and any other type of data and metadata associated with an experiment.

There is no limit on the number or size of data objects in the collection, which gives great flexibility for omic data. HDF5 also provides high-performance I/O with a rich set of integrated features that allow optimization of access time and storage space.

The HDF5 file structure includes only two major types of object: groups (container structures that can hold datasets and other groups) and datasets (multidimensional arrays of data elements).

This results in a truly hierarchical, filesystem-like data format.

knitr::include_graphics("imgs/hdf5_squema.jpg")

Basics in rhdf5

Create hdf5 file

We have implemented the bdCreate_hdf5_matrix_file() function to create an hdf5 file with a group and a dataset in a single step. This function allows datasets to be created from standard R matrices or from Matrix objects.

library(rhdf5)

set.seed(5234)
n <- 500
m <- 600
A <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

# We also can create a dataset from R matrix object
bdCreate_hdf5_matrix_file(filename = "robject.hdf5", 
                          object = A, 
                          group = "INPUT", 
                          dataset = "A")

We see 0 in the console, indicating that no errors were found when creating the hdf5 file. Notice that a file called "robject.hdf5" has been created in the working directory.

list.files()

Add datasets in hdf5 file

The function bdAdd_hdf5_matrix() allows a dataset to be added to an existing hdf5 file. We can create the dataset in any group; if the group doesn't exist in the file, it is created before appending the dataset.

set.seed(5234)
n <- 100
m <- 10000
A <- matrix(rnorm(n*m,mean=0,sd=1), n, m)

set.seed(5234)
n <- 50
m <- 12000
B <- matrix(rnorm(n*m,mean=3,sd=0.5), n, m)

# We create another data file (delayed.hdf5) with a matrix A.
# The group is called INPUT
bdCreate_hdf5_matrix_file(filename = "delayed.hdf5", 
                        object = A, 
                        group = "INPUT", 
                        dataset = "A")

# And them, we add another matrix B to the same group
bdAdd_hdf5_matrix(object = B, 
                filename = "delayed.hdf5", 
                group = "INPUT", 
                dataset = "B")

Create transposed matrix in hdf5 file

The function bdAdd_hdf5_matrix() also allows a dataset to be created with a transposed matrix using the parameter transp = TRUE. For example, we can add a dataset with the transposed B matrix as follows:

# Create dataset BTransposed with B transposed matrix in delayed.hdf5 file at INPUT group
bdAdd_hdf5_matrix(B, "delayed.hdf5", "INPUT", "BTransposed", transp = TRUE);

Open and get hdf5 content file

We can open an existing file, show its contents, and access the data using functions from the rhdf5 package. rhdf5 is an R interface for HDF5. The file must always be opened before working with it.

With the h5ls() function we get a listing of an hdf5 file's content without the need to open the file.

# Examine hierarchy before open file
h5ls("delayed.hdf5")

Once opened, the file contents can be seen just by typing its name. In that case, we only see the current level of the hierarchical tree.

# Open file
h5fdelay <- H5Fopen("delayed.hdf5")
# Show hdf5 hierarchy (groups)
h5fdelay

NOTE: We can also use HDFView to inspect the file.

Access datasets data

The $ operator can be used to access the next group level; this operator reads the object from disk. We can assign the dataset contents to an R object in order to work with it.

Bdata <- h5fdelay$INPUT$B
Bdata[1:3,1:5]

Or we can get the data for the transposed version of this matrix created before

BdataTrans <- h5fdelay$INPUT$BTransposed
BdataTrans[1:5,1:3]

IMPORTANT NOTE: This is a very interesting feature. Imagine you want to compute the mean of each column in the matrix A. To do this in R, you would first need to load the file and then use one of the existing R functions. If A is big, loading the whole matrix into R can be inefficient, since you actually only need the information in one column at a time.

out <- sapply(1:100, function(i){
  mean(h5fdelay$INPUT$A[,i])
  })
head(out)

There are efficient ways of reading a subset of data without reading the entire file; see the 'Practical tips' section of the rhdf5 documentation.
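For instance, rhdf5's h5read() accepts an index argument so that only the requested slice is read from disk; a minimal sketch computing the mean of a single column:

# Read only the first column of INPUT/A without loading the full matrix
col1 <- h5read("delayed.hdf5", "INPUT/A", index = list(NULL, 1))
mean(col1)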

Close hdf5 file

After working with an hdf5 file, the user must always close it. We can close a single file with the H5Fclose() function, or close all opened hdf5 files with the h5closeAll() function.

# Close delayed.hdf5 file
H5Fclose(h5fdelay)

# Open 2 files and close all
h5fdelay <- H5Fopen("delayed.hdf5")
h5fr <- H5Fopen("robject.hdf5")

h5closeAll()

More information about working with hdf5 files in R can be found in the rhdf5 package vignettes.

Convert text file to HDF5

Usually we have files with a large amount of data to be loaded into R. For big data problems, R cannot hold all of the data in memory, so we cannot work with it efficiently. To resolve this issue, you can work directly from data on disk, with datasets stored in an HDF5 data file.

You can easily convert your text files to HDF5 using the function bdImport_text_to_hdf5(). To do that, we only have to define the input file with the filename parameter and the HDF5 destination with the parameters outputfile for the output file name, outGroup for the group to store the dataset in, and outDataset for the dataset name.

In this example we convert the "colesterol.csv" file and save the data to the file 'colesterol_file.hdf5' under the group 'COLESTEROL' in the dataset 'COLESTEROLDATA'. The text file contains column names, so we set the parameter header = TRUE.

bdImport_text_to_hdf5(filename = "colesterol.csv", 
                      sep=',', 
                      outputfile = "colesterol_file.hdf5", 
                      outGroup = "COLESTEROL", 
                      outDataset = "COLESTEROLDATA", 
                      header = TRUE)

Now we can examine the new file colesterol_file.hdf5 and its content:

h5ls("colesterol_file.hdf5")

# We can open the file and have access to the data
h5Colesterol <- H5Fopen("colesterol_file.hdf5")

# Show hdf5 content dataset
h5Colesterol$COLESTEROL$COLESTEROLDATA[1:5, 1:6]

# Show colnames 
head(h5Colesterol$COLESTEROL$.COLESTEROLDATA_dimnames$`2`)

H5Fclose(h5Colesterol)

NOTE: We can overwrite an existing hdf5 file with new data by setting overwrite = TRUE.
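For example, a sketch that re-runs the import above, replacing the previous file (the only addition to the earlier call is the overwrite parameter):

bdImport_text_to_hdf5(filename = "colesterol.csv", sep = ',',
                      outputfile = "colesterol_file.hdf5",
                      outGroup = "COLESTEROL", outDataset = "COLESTEROLDATA",
                      header = TRUE, overwrite = TRUE)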

Operation with HDF5 files

Once our data are available as hdf5 files, we can operate directly on them without having the data in memory (i.e., we access them from disk). In BigDataStatMeth we have implemented most of the common matrix operations and algebra functions to help users who are not familiar with this type of implementation. The next sections provide several examples.

HDF5 Matrix Multiplication

In this section, different products of matrices and vectors are introduced. The methods implement different strategies including block multiplication algorithms and the use of parallel implementations with hdf5 files datasets.

BigDataStatMeth allows hdf5 matrix multiplication in two ways: from data in memory with bdblockmult(), and from data already stored in an hdf5 file with bdblockmult_hdf5().

Hdf5 Matrix multiplication from data in memory

To work with big matrices, bdblockmult() saves the matrices in hdf5 file format in order to operate on them in blocks without overloading memory. By default, matrices are considered large if the number of rows or columns is greater than 5000, but this can be changed with the bigmatrix argument.

In that case, by default, bdblockmult() creates the tmp_blockmult.hdf5 file with two groups: the INPUT group with datasets A and B containing the original matrices, and the OUTPUT group with dataset C containing the results. The file name can be set with the outputfile parameter.

n <- 1000
p <- 10000
Abig <- matrix(rnorm(n*n), nrow=n, ncol=n)
Bbig <- matrix(rnorm(n*p), nrow=n, ncol=p)


# We want to force it to run in memory
AxBNOBig <- bdblockmult(Abig, Bbig, bigmatrix = 100000)

# We consider a matrix big if the number of rows or columns is > 500
AxBBig3000 <- bdblockmult(Abig, Bbig, bigmatrix = 500)

Depending on whether the calculation has been performed directly in memory or from an hdf5 file, the returned object type is different.

If we work in memory, results are returned as a regular R matrix object:

class(AxBNOBig)
AxBNOBig[1:5,1:5]

If we work on disk, a list is returned with the file name and the dataset that contains the results inside the hdf5 file. Then we can open the hdf5 file and get the result matrix as well as the original matrices.

To work in R with hdf5 data objects we can use the rhdf5 library from Bioconductor; we can open the file and read its content:

h5data <- H5Fopen(AxBBig3000$file)

# We can get where data is stored
AxBBig3000$dataset

# We observe that the data is in folder OUTPUT dataset C
reshdf5 <- h5data$OUTPUT$C

reshdf5[1:5,1:5]

all.equal(reshdf5, AxBNOBig)

Remember that it is important to close all dataset, group, and file handles when they are not used anymore:

# Close file
H5Fclose(h5data)

Hdf5 Matrix multiplication from data stored in a file

If we have the matrices stored in an hdf5 data file, we can use the function bdblockmult_hdf5(). This function performs a matrix multiplication of two matrices stored in a file.

To use bdblockmult_hdf5(), we have to set the parameter filename with the file that contains the matrices, group with the group under which the matrices are stored, and the dataset names where the data are stored with the parameters a and b. The results group can be set with the outgroup parameter; by default, the output group is OUTPUT.

To show how this function works, we will use the matrices created previously with the bdblockmult() function and store the results under the 'HDF5_RES' group:

res <- bdblockmult_hdf5(filename = 'tmp_blockmult.hdf5', group="INPUT",
                        a="A", b="B", outgroup = 'HDF5_RES')

# We show the hdf5 content
h5ls(res$file)

Now, we compare the results obtained with the bdblockmult_hdf5 and bdblockmult functions.

# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the results with bdblockmult (previous step)
resblockmult <- h5data$OUTPUT$C

# and the results obtained from bdblockmult_hdf5
resblockmult_hdf5 <- h5data$HDF5_RES$A_x_B
resblockmult_hdf5[1:5,1:5]

# Close the file
h5closeAll()

all.equal(resblockmult, resblockmult_hdf5)

We obtain the same result using bdblockmult and bdblockmult_hdf5.

Cross-product and Transposed Cross-product with hdf5 files

To perform a cross-product $C = A^t A$ you can call the bdCrossprod_hdf5() function. We just have to set the parameters filename, group and A with the file, group and dataset that we want to use, and optionally the outgroup parameter to set the output group where we want to store the results.

The result dataset is stored using the CrossProd_<A>x<B> naming pattern (for example, CrossProd_AxA).

res <- bdCrossprod_hdf5(filename = 'robject.hdf5', 
                        group="INPUT", 
                        A="A", 
                        outgroup = "RESULTS")

We obtain the values expected from R's crossprod() function:

# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the Crossprod Results and the original matrix
resCProd <- h5data$RESULTS$CrossProd_AxA
A <- h5data$INPUT$A

# Close the file
h5closeAll()

# Show results
resCProd[1:5,1:5]

all.equal(resCProd, crossprod(A))

You may also compute the cross-product of two different matrices, $C = A^t B$:

set.seed(5234)
n <- 500
m <- 600
B <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

bdAdd_hdf5_matrix(B, filename = 'robject.hdf5', group="INPUT", dataset = "B2")

# Get Crossprod with two matrix
res <- bdCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", 
                        groupB = "INPUT", B = "B2", outgroup = "RESULTS")

We obtain the values expected from R's crossprod() function:

# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the Crossprod Results and the original matrix
resCProd <- h5data$RESULTS$CrossProd_AxB2
A <- h5data$INPUT$A
B <- h5data$INPUT$B2

# Close the file
h5closeAll()

# Show results
resCProd[1:5,1:5]

all.equal(resCProd, crossprod(A,B))

You may also use the bdtCrossprod_hdf5() function to compute a transposed cross-product $C = A A^t$. In that case the result dataset is stored using the tCrossProd_<A>x<B> naming pattern.

res <- bdtCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", outgroup = "RESULTS")

We obtain the values expected from R's tcrossprod() function:

# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the Crossprod Results and the original matrix
restCProd <- h5data$RESULTS$tCrossProd_AxA
A <- h5data$INPUT$A

# Close the file
h5closeAll()

# Show results
restCProd[1:5,1:5]

all.equal(restCProd, tcrossprod(A))

You may also compute the transposed cross-product of two different matrices, $C = A B^t$:

set.seed(5234)
n <- 500
m <- 600
B <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

bdAdd_hdf5_matrix(B, filename = 'robject.hdf5', group="INPUT", dataset = "B3")

# Get Crossprod with two matrix
res <- bdtCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", 
                         groupB = "INPUT", B = "B3", outgroup = "RESULTS")

We obtain the values expected from R's tcrossprod() function:

# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the Crossprod Results and the original matrix
restCProd <- h5data$RESULTS$tCrossProd_AxB3
A <- h5data$INPUT$A
B <- h5data$INPUT$B3

# Close the file
h5closeAll()

# Show results
restCProd[1:5,1:5]

all.equal(restCProd, tcrossprod(A,B))

Singular Value Decomposition (SVD)

The SVD of an $m \times n$ real or complex matrix $A$ is a factorization of the form:

$$A = U \Sigma V^{T}$$

where $U$ is an $m \times m$ orthogonal (or unitary) matrix whose columns are the left singular vectors of $A$, $\Sigma$ is an $m \times n$ rectangular diagonal matrix whose non-negative diagonal entries are the singular values of $A$, and $V$ is an $n \times n$ orthogonal (or unitary) matrix whose columns are the right singular vectors of $A$.
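As a quick sanity check in plain R, the factorization can be verified on a small matrix with the base svd() function (R returns the compact form of the decomposition):

set.seed(1)
X <- matrix(rnorm(20), 5, 4)
s <- svd(X)
# Reconstruct X from its factors: X = U %*% diag(d) %*% t(V)
all.equal(X, s$u %*% diag(s$d) %*% t(s$v))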

Block Singular Values Decomposition

This method was developed by M. A. Iwen and B. W. Ong (2016). The authors introduced a distributed and incremental SVD algorithm that is useful for agglomerative data analysis on large networks. The algorithm calculates the singular values and left singular vectors of a matrix A by first partitioning it by columns. This creates a set of submatrices of A with the same number of rows but only some of its columns. After that, the SVD of each submatrix is computed. The final step consists of merging the results and computing the SVD of the resulting matrix.

knitr::include_graphics("imgs/blocksvd.png")

This method is implemented in the bdSVD_hdf5() function, which works directly on the hdf5 data format, loading into memory only the data needed to perform calculations and saving the results back to the hdf5 file for later use.

We have to indicate the file to work with, the dataset name and the group where the dataset is located:

# Create dataframe data with 'odata' matrix in delayed hdf5 file at OMIC group
set.seed(5234)
n <- 100
m <- 150000
odata <- matrix(rnorm(n*m, mean=0, sd=1), n,m)

bdAdd_hdf5_matrix(odata, "delayed.hdf5", "OMICS", "data")

# Direct from hdf5 data file
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data")

# with R implementation from data in memory
test <- H5Fopen("delayed.hdf5")
# get results svd (d)
svd_hdf5_d <- test$SVD$data$d[1:7]
# get data
omdata <- test$OMICS$data
h5closeAll()

# Results in hdf5 file for d
svd_hdf5_d[1:7]

svd <- svd(scale(omdata))
svd$d[1:7]

As with the simple singular value decomposition, we can center or scale the data before proceeding with the SVD using the bcenter and bscale parameters. By default these parameters are TRUE, so data are normalized before the SVD decomposition. To proceed with SVD without normalization:

# Direct from hdf5 data file
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data", 
                    bcenter = FALSE, bscale = FALSE,
                     k = 2, q = 1)


# get results svd (d)
test <- H5Fopen("delayed.hdf5")
svd_hdf5_d <- test$SVD$data$d[1:7] 
h5closeAll()

# SVD (d) from file - data not normalized
svd_hdf5_d

# with R implementation from data in memory
svd <- svd(omdata)
svd$d[1:7]

In the block SVD decomposition we can indicate the number of decomposition levels and the number of local SVDs to concatenate at each level with the parameters q and k respectively; by default q = 1 (one level) with k = 2.
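For instance, a sketch calling the decomposition explicitly with one level and four local SVDs per level (using the k and q parameters shown earlier):

# Block decomposition with q = 1 level and k = 4 local SVDs at each level
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data", k = 4, q = 1)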

# Block decomposition using the default parameters (q = 1, k = 2), on 2 threads
svdh5 <- bdSVD_hdf5(file = "delayed.hdf5", group = "OMICS", dataset = "data",
                    threads = 2)

# get results svd (d)
fprova <- H5Fopen("delayed.hdf5")
svd_hdf5_d <- fprova$SVD$data$d[1:7]
h5closeAll()

# SVD (d) from file - data normalized (default settings)
svd_hdf5_d

# with R implementation from data in memory
svd <- svd(scale(omdata))
svd$d[1:7]

Utilities for genomic data analysis working directly with hdf5 files

We have also implemented a set of functions to deal with genomic data, a setting where big data is normally present. These are described in the next sections.

Genomic data imputation

Imputation in genetics refers to the statistical inference of unobserved genotypes. In genetic epidemiology and quantitative genetics, researchers aim to identify genomic locations where variation between individuals is associated with variation in traits of interest.

At the moment, BigDataStatMeth implements SNP imputation following the encoding used in the Genomic Data Structure (GDS) file format, where SNPs are coded as 0: two B alleles, 1: one A allele and one B allele, 2: two A alleles, and 3: missing data. BigDataStatMeth imputes values where the SNP value is '3' (missing data).

The implemented imputation method uses the information in the SNP that contains the missing data: a random value is generated following the observed discrete distribution. If in the SNP we find 70% '0', 25% '1' and 5% '2', the missing value will be imputed as '0' with probability 70%, as '1' with probability 25% and as '2' with probability 5%.
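The sampling step can be sketched in plain R (illustrative only, not the package's internal implementation):

snp <- c(0, 0, 0, 1, 1, 2, 3, 3)   # 3 encodes missing values
obs <- snp[snp != 3]
# Observed genotype frequencies for 0, 1 and 2
freqs <- table(factor(obs, levels = 0:2)) / length(obs)
# Impute each missing value by sampling from the observed distribution
snp[snp == 3] <- sample(0:2, sum(snp == 3), replace = TRUE,
                        prob = as.numeric(freqs))
snp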

We first simulate a genotype matrix with values 0, 1, 2 and 3, store it in the hdf5 file, and show the data stored in the file:

set.seed(108432)
geno.sim <- matrix(sample(0:3, 10000, replace = TRUE), byrow = TRUE, ncol = 10)
bdAdd_hdf5_matrix(geno.sim, "delayed.hdf5", "OMICS", "geno")

# Get data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
geno <- h5fsvd$OMICS$geno
h5closeAll()

geno[1:5,1:10]

Remember, we always have to close the hdf5 data file. Now we impute values where the genotype equals '3' (missing data in GDS format). To apply imputation, we only have to indicate the file where the dataset is, the group and the dataset name; optionally, we indicate where we want to store the results with outgroup and outdataset (by default the results are stored in the input dataset). An important parameter is bycols: with it we indicate whether we want to impute data by columns (bycols = TRUE, the default) or by rows (bycols = FALSE).

In the next example we store the imputed data in the "imputedgeno" dataset inside the group "OMICS", and the imputation method applied takes into account the probabilities in columns (the default).

bdImpute_snps_hdf5("delayed.hdf5", group="OMICS", dataset="geno",
                outgroup="OMICS", outdataset="imputedgeno")

Now we get the imputation results

# Get imputed data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
imputedgeno <- h5fsvd$OMICS$imputedgeno
h5closeAll()

imputedgeno[1:5,1:10]

Normalization (center and scale) of genomic data

Normalization is one of the most important procedures in genomic data analysis. A typical dataset contains more than one sample, and we are almost always interested in making comparisons between samples; to make these comparisons, we need to normalize the data.

In BigDataStatMeth we implemented a normalization method that works directly with datasets stored in hdf5 file format. As with the other BigDataStatMeth functions that work with hdf5 data files, we have to indicate where the data are: the filename, the name of the group the dataset is under, and the dataset name. In the normalization function, we can indicate whether we want to center the data, scale it, or center and scale (the default option). The formula applied by default to normalize the data is: $$ \frac{X - \mu}{\sigma}$$
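For comparison, the same column-wise normalization of an in-memory matrix in plain R is:

# Center and scale each column: (X - mu) / sigma
X <- matrix(rnorm(20), 5, 4)
Xnorm <- scale(X, center = TRUE, scale = TRUE)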

The function implemented to normalize data directly from an hdf5 data file is bdNormalize_hdf5().

To show an example, we will use the imputed geno dataset imputedgeno created in the imputation example. To normalize the data, we have to indicate where the dataset is stored; there are two optional parameters, bcenter and bscale, with default value TRUE. To normalize (center and scale) the data, we don't need to modify these optional parameters. If we only want to center the data, we set bscale = FALSE; if we only want to scale the data, we set bcenter = FALSE.

Important: The normalization results are stored under the "NORMALIZED" group, which reproduces the same structure found under the root. For example, if we normalize the dataset imputedgeno under the OMICS group, the normalized results will be in "NORMALIZED/OMICS/imputedgeno".

Let us show an example, normalizing the imputed geno dataset under group "OMICS" in the delayed.hdf5 file:

bdNormalize_hdf5("delayed.hdf5", group="OMICS", dataset="imputedgeno")

The results will be under the groups "NORMALIZED" - "OMICS", in the dataset imputedgeno:

# Geno after normalization
h5fsvd = H5Fopen("delayed.hdf5")
genonormalized <- h5fsvd$NORMALIZED$OMICS$imputedgeno
h5closeAll()

genonormalized[1:5,1:10]

QC - Remove low data

Missing data in meta-analyses of genetic association studies are unavoidable. In BigDataStatMeth we implemented a function to remove those SNPs with a high percentage of missing data: the function bdRemovelowdata().

As in the imputation method, we have to specify the input data by indicating the filename, the group and the dataset name, and we also have to specify where the data will be stored after removing the SNPs, with the parameters outgroup and outdataset.

To remove low data, we have to indicate whether we want to remove data by columns (bycols = TRUE) or by rows (bycols = FALSE), and the percentage of missing data with the pcent parameter. pcent refers to the missing percentage in columns if bycols = TRUE, or in rows if bycols = FALSE.
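A plain R sketch of the column-wise criterion (illustrative only; missing values are coded as 3, as above, and geno is the matrix read earlier):

# Fraction of missing values (coded 3) per column
miss_frac <- colMeans(geno == 3)
# Keep only the SNPs (columns) at or below the 39% threshold
geno_filtered <- geno[, miss_frac <= 0.39]
dim(geno_filtered)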

To show how removing low data works, we use the previous dataset stored in the delayed.hdf5 file. In this case we assume that SNPs are in columns, and we remove those SNPs with more than 39% missing data.

bdRemovelowdata("delayed.hdf5", group="OMICS", dataset="geno",
                outgroup="OMICS", outdataset="removedgeno",
                bycols = TRUE, pcent = 0.39)

After removing the SNPs with more than 39% missing data, the results are saved in the dataset "removedgeno" under the group "OMICS". We can examine the resulting dataset:

# Get imputed data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
removedgeno <- h5fsvd$OMICS$removedgeno
h5closeAll()

removedgeno[1:5,1:10]

Session information

sessionInfo()

