library(knitr)
library(BiocStyle)
knitr::opts_chunk$set(collapse = TRUE, comment = "", cache = FALSE,
                      message = FALSE, width = 180, crop = NULL)
if (isTRUE(file.exists('delayed.hdf5'))) { file.remove('delayed.hdf5') }
if (isTRUE(file.exists('robject.hdf5'))) { file.remove('robject.hdf5') }
if (isTRUE(file.exists('colesterol_file.hdf5'))) { file.remove('colesterol_file.hdf5') }
if (isTRUE(file.exists('tmp_blockmult.hdf5'))) { file.remove('tmp_blockmult.hdf5') }
This package implements several matrix operations using Matrix and DelayedMatrix objects as well as HDF5 data files. It also provides basic algebra operations that are useful for implementing statistical analyses based on standard methodologies such as principal component analysis (PCA) or least squares estimation. The package also contains specific statistical methods mainly used in omic data analysis, such as lasso regression. All procedures to work with objects loaded in memory can be found in the BigDataStatMeth_memory vignette.
The package requires other packages from CRAN and Bioconductor to be installed.

- CRAN: Matrix, RcppEigen and RSpectra.
- Bioconductor: HDF5Array and rhdf5.
The user can execute this code to install the required packages:
# Install BiocManager (if not previously installed)
install.packages("BiocManager")

# Install required packages
BiocManager::install(c("Matrix", "RcppEigen", "RSpectra",
                       "DelayedArray", "HDF5Array", "rhdf5"))
Our package needs to be installed from source code. In this case, a collection of build tools (e.g., C, C++ and Fortran compilers) is required, mainly for Windows users. These programs can be installed using Rtools.
Once the required packages and Rtools are installed, the BigDataStatMeth package can be installed from our GitHub repository as follows:
# Install devtools and load library (if not previously installed)
install.packages("devtools")
library(devtools)

# Install BigDataStatMeth
install_github("isglobal-brge/BigDataStatMeth")
First, let us load the required packages to describe the main capabilities of the package:
library(rhdf5)
library(BigDataStatMeth)
Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. It is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
HDF5 is a technology suite that makes possible the management of extremely large and complex data collections. It can accommodate virtually every kind of data in a single file: sequences, images, SNP matrices, and any other type of data and metadata associated with an experiment.
There is no limit on the number or size of data objects in the collection, giving great flexibility for omic data. It provides high-performance I/O with a rich set of integrated features that allow optimizing access time and storage space.
The HDF5 file structure includes only two major types of object:

- Datasets, which are multidimensional arrays of a homogeneous type. For example, datasets for omics data could correspond to genomics, transcriptomics, epigenomics, proteomics and/or metabolomics experiments.
- Groups, which are container structures that can hold datasets and other groups.
This results in a truly hierarchical, filesystem-like data format.
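As a quick illustration of this hierarchy, here is a minimal sketch using basic rhdf5 functions (the file name toy.hdf5 is just an example):

library(rhdf5)

# Create a file, a group and a dataset inside that group
h5createFile("toy.hdf5")
h5createGroup("toy.hdf5", "GRP")
h5write(matrix(1:6, nrow = 2), "toy.hdf5", "GRP/dat")

# The listing mirrors a small filesystem tree
h5ls("toy.hdf5")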
knitr::include_graphics("imgs/hdf5_squema.jpg")
Create hdf5 file
We have implemented the bdCreate_hdf5_matrix_file() function to create an hdf5 file with a group and a dataset in a single step. This function allows creating datasets from standard R matrices or Matrix objects.
library(rhdf5)

set.seed(5234)
n <- 500
m <- 600
A <- matrix(rnorm(n*m, mean = 0, sd = 1), n, m)

# We also can create a dataset from R matrix object
bdCreate_hdf5_matrix_file(filename = "robject.hdf5",
                          object = A,
                          group = "INPUT",
                          dataset = "A")
We see 0 in the console, indicating that no errors were found when creating the hdf5 file. Notice that a file called "robject.hdf5" will be created in the working directory:
list.files()
Add datasets to an hdf5 file
The function bdAdd_hdf5_matrix() allows adding a dataset to an existing hdf5 file. We can create the dataset in any group; if the group does not exist in the file, it is created before appending the dataset.
set.seed(5234)
n <- 100
m <- 10000
A <- matrix(rnorm(n*m, mean = 0, sd = 1), n, m)

set.seed(5234)
n <- 50
m <- 12000
B <- matrix(rnorm(n*m, mean = 3, sd = 0.5), n, m)

# We create another data file (delayed.hdf5) with a matrix A.
# The group is called INPUT
bdCreate_hdf5_matrix_file(filename = "delayed.hdf5",
                          object = A,
                          group = "INPUT",
                          dataset = "A")

# And then, we add another matrix B to the same group
bdAdd_hdf5_matrix(object = B,
                  filename = "delayed.hdf5",
                  group = "INPUT",
                  dataset = "B")
Create transposed matrix in hdf5 file
The function bdAdd_hdf5_matrix() also allows creating a dataset with a transposed matrix via the parameter transp = TRUE. For example, we can add a dataset to the hdf5 file with the transposed matrix B:
# Create dataset BTransposed with B transposed matrix in delayed.hdf5 file at INPUT group
bdAdd_hdf5_matrix(B, "delayed.hdf5", "INPUT", "BTransposed", transp = TRUE)
Open and get hdf5 file content
We can open an existing file, show its contents and access its data using functions from the rhdf5 package, an R interface for HDF5. The file must always be opened before working with it.
With the h5ls() function we get the list of the contents of an hdf5 file without the need to open it:
# Examine hierarchy before opening the file
h5ls("delayed.hdf5")
Once opened, the file can be inspected just by typing its name. In that case, we only get the current level in the hierarchical tree:
# Open file
h5fdelay <- H5Fopen("delayed.hdf5")

# Show hdf5 hierarchy (groups)
h5fdelay
NOTE: We can also use HDFView to inspect hdf5 files.
Access dataset data
The $ operator can be used to access the next group level; this operator reads the object from disk. We can assign the dataset contents to an R object in order to work with it:
Bdata <- h5fdelay$INPUT$B
Bdata[1:3, 1:5]
Or we can get the data for the transposed version of this matrix created before:
BdataTrans <- h5fdelay$INPUT$BTransposed
BdataTrans[1:5, 1:3]
IMPORTANT NOTE: This is a very interesting feature. Imagine you want to compute the mean of each column in the matrix A. To do this in R, you would first need to load the file and then use any of the existing R functions. If A is big, loading the whole matrix into R can be inefficient, since you actually only need the information in one column at a time:
out <- sapply(1:100, function(i) {
  mean(h5fdelay$INPUT$A[, i])
})
head(out)
There are efficient ways of reading a subset of the data without reading the entire file. To this end, see these practical tips.
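For instance, a minimal sketch using rhdf5's h5read(), whose index argument reads only the selected rows and columns from disk:

# Read only the first column of dataset A (all rows, column 1)
col1 <- h5read("delayed.hdf5", "INPUT/A", index = list(NULL, 1))
mean(col1)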
Close hdf5 file
After working with an hdf5 file, the user must always close it. We can close a single file with the H5Fclose() function, or close all open hdf5 files with h5closeAll():
# Close delayed.hdf5 file
H5Fclose(h5fdelay)

# Open 2 files and close all
h5fdelay <- H5Fopen("delayed.hdf5")
h5fr <- H5Fopen("robject.hdf5")
h5closeAll()
More information about working with hdf5 files in R can be found in several vignettes.
Usually we have files with a large amount of data to be loaded into R. For big data problems, R cannot hold all the data in memory, so we cannot work with it efficiently. To overcome this issue, you can work directly on disk with datasets stored in an HDF5 data file.
You can easily convert your text files to HDF5 using the function bdImport_text_to_hdf5(). To do so, we only have to define the input file with the filename parameter and the HDF5 destination with the parameters outputfile (the output file name), outGroup (the group to store the dataset) and outDataset (the dataset name). It should be considered that:
- This function only allows importing numeric data, except for row names and column names, where character values are allowed.
- Data can be delimited by different characters. By default, we use tabs ("\t"), but this can easily be changed with the sep parameter.
In this example we convert the "colesterol.csv" file and save the data to the file 'colesterol_file.hdf5' under the group 'COLESTEROL' in the dataset 'COLESTEROLDATA'. The text file contains column names, so we set the parameter header = TRUE.
bdImport_text_to_hdf5(filename = "colesterol.csv", sep = ',',
                      outputfile = "colesterol_file.hdf5",
                      outGroup = "COLESTEROL",
                      outDataset = "COLESTEROLDATA",
                      header = TRUE)
We can now examine the new file colesterol_file.hdf5 and its content:
h5ls("colesterol_file.hdf5") # We can open the file and have access to the data h5Colesterol <- H5Fopen("colesterol_file.hdf5") # Show hdf5 content dataset h5Colesterol$COLESTEROL$COLESTEROLDATA[1:5, 1:6] # Show colnames head(h5Colesterol$COLESTEROL$.COLESTEROLDATA_dimnames$`2`) H5Fclose(h5Colesterol)
NOTE: We can overwrite an existing hdf5 file with new data by setting overwrite = TRUE.
Once our data are available as hdf5 files, we can operate on them directly without loading the data into memory (i.e., we access them from disk). In BigDataStatMeth we have implemented most of the common matrix operations and algebra functions to help users who are not familiar with this type of implementation. The next sections provide several examples.
In this section, different products of matrices and vectors are introduced. The methods implement different strategies, including block multiplication algorithms and parallel implementations for hdf5 file datasets. The idea behind block multiplication is sketched below.
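To give the intuition, here is a minimal plain-R sketch of multiplication by blocks; it only illustrates the idea and is not the package's parallel C++ implementation (the names blockmult and bsize are hypothetical):

# C = A %*% B computed block-wise: sum over column blocks of A
# times the matching row blocks of B
blockmult <- function(A, B, bsize = 250) {
  C <- matrix(0, nrow(A), ncol(B))
  idx <- split(seq_len(ncol(A)), ceiling(seq_len(ncol(A)) / bsize))
  for (k in idx) {
    C <- C + A[, k, drop = FALSE] %*% B[k, , drop = FALSE]
  }
  C
}

# Quick check against the dense product
A <- matrix(rnorm(100 * 500), 100, 500)
B <- matrix(rnorm(500 * 80), 500, 80)
all.equal(blockmult(A, B), A %*% B)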
BigDataStatMeth allows performing hdf5 matrix multiplications with the functions bdblockmult() and bdblockmult_hdf5(), described next.
To work with big matrices, bdblockmult() saves the matrices in hdf5 file format in order to operate on them in blocks without overloading memory. By default, matrices are considered large if the number of rows or columns is greater than 5000, but this can be changed with the bigmatrix argument.
In that case, by default bdblockmult() creates the file tmp_blockmult.hdf5 with two groups: the INPUT group with datasets A and B containing the original matrices, and the OUTPUT group with dataset C containing the results. The file name can be set with the outputfile parameter.
n <- 1000
p <- 10000
Abig <- matrix(rnorm(n*n), nrow = n, ncol = n)
Bbig <- matrix(rnorm(n*p), nrow = n, ncol = p)

# We want to force it to run in memory
AxBNOBig <- bdblockmult(Abig, Bbig, bigmatrix = 100000)

# We consider a big matrix if the number of rows or columns is > 500
AxBBig3000 <- bdblockmult(Abig, Bbig, bigmatrix = 500)
Depending on whether the calculation has been performed directly in memory or from an hdf5 file, the returned object type is different.
If we work in memory, results are returned as a regular R matrix object:
class(AxBNOBig)
AxBNOBig[1:5, 1:5]
If we work on disk, a list is returned with the file name and the dataset that contains the results inside the hdf5 file. We can then open the hdf5 file and get the result matrix and the original matrices.
To work in R with hdf5 data objects we can use the rhdf5 library from Bioconductor; we can open the file and read all its content:
h5data <- H5Fopen(AxBBig3000$file)

# We can get where data is stored
AxBBig3000$dataset

# We observe that the data is in group OUTPUT, dataset C
reshdf5 <- h5data$OUTPUT$C
reshdf5[1:5, 1:5]

all.equal(reshdf5, AxBNOBig)
Remember that it is important to close all dataset, group and file handles when they are no longer used:
# Close file
H5Fclose(h5data)
If we have the matrices stored in an hdf5 data file, we can use the function bdblockmult_hdf5(). This function performs a matrix multiplication of two matrices stored in a file.
To use bdblockmult_hdf5(), we have to set the parameter filename with the file that contains the matrices, the parameter group to indicate under which group the matrices are stored, and the parameters a and b with the names of the datasets where the data are stored. The results group can be set with the outgroup parameter; by default, the output group is OUTPUT.
To show how this function works, we will use the matrices created previously with the bdblockmult() function and store the results under the 'HDF5_RES' group:
res <- bdblockmult_hdf5(filename = 'tmp_blockmult.hdf5', group = "INPUT",
                        a = "A", b = "B", outgroup = 'HDF5_RES')

# We show the hdf5 content
h5ls(res$file)
Now, we compare the results obtained with the bdblockmult_hdf5() and bdblockmult() functions:
# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the results with bdblockmult (previous step)
resblockmult <- h5data$OUTPUT$C

# and the results obtained from bdblockmult_hdf5
resblockmult_hdf5 <- h5data$HDF5_RES$A_x_B
resblockmult_hdf5[1:5, 1:5]

# Close the file
h5closeAll()

all.equal(resblockmult, resblockmult_hdf5)
We obtain the same result using bdblockmult() and bdblockmult_hdf5().
To perform a cross-product $C = A^T A$ you can call the bdCrossprod_hdf5() function. We just have to set the parameters filename, group and A with the file, group and dataset that we want to use, and optionally the outgroup parameter to set the output group where we want to store the results. The result dataset is stored with the prefix CrossProd_.
res <- bdCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", outgroup = "RESULTS")
We obtain the expected values, as computed by the crossprod() function:
# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the cross-product results and the original matrix
resCProd <- h5data$RESULTS$CrossProd_AxA
A <- h5data$INPUT$A

# Close the file
h5closeAll()

# Show results
resCProd[1:5, 1:5]
all.equal(resCProd, crossprod(A))
You may also compute the cross-product with two different matrices, $C = A^T B$:
set.seed(5234)
n <- 500
m <- 600
B <- matrix(rnorm(n*m, mean = 0, sd = 1), n, m)
bdAdd_hdf5_matrix(B, filename = 'robject.hdf5', group = "INPUT", dataset = "B2")

# Get cross-product with two matrices
res <- bdCrossprod_hdf5(filename = 'robject.hdf5', group = "INPUT", A = "A",
                        groupB = "INPUT", B = "B2", outgroup = "RESULTS")
We obtain the expected values, as computed by the crossprod() function:
# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the cross-product results and the original matrices
resCProd <- h5data$RESULTS$CrossProd_AxB2
A <- h5data$INPUT$A
B <- h5data$INPUT$B2

# Close the file
h5closeAll()

# Show results
resCProd[1:5, 1:5]
all.equal(resCProd, crossprod(A, B))
You may also use the bdtCrossprod_hdf5() function to get a transposed cross-product $C = A A^T$. In that case, the result dataset is stored with the prefix tCrossProd_:
res <- bdtCrossprod_hdf5(filename = 'robject.hdf5', group="INPUT", A="A", outgroup = "RESULTS")
We obtain the expected values, as computed by the tcrossprod() function:
# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the transposed cross-product results and the original matrix
restCProd <- h5data$RESULTS$tCrossProd_AxA
A <- h5data$INPUT$A

# Close the file
h5closeAll()

# Show results
restCProd[1:5, 1:5]
all.equal(restCProd, tcrossprod(A))
You may also compute the transposed cross-product with two different matrices, $C = A B^T$:
set.seed(5234)
n <- 500
m <- 600
B <- matrix(rnorm(n*m, mean = 0, sd = 1), n, m)
bdAdd_hdf5_matrix(B, filename = 'robject.hdf5', group = "INPUT", dataset = "B3")

# Get transposed cross-product with two matrices
res <- bdtCrossprod_hdf5(filename = 'robject.hdf5', group = "INPUT", A = "A",
                         groupB = "INPUT", B = "B3", outgroup = "RESULTS")
We obtain the expected values, as computed by the tcrossprod() function:
# Get content
h5data <- H5Fopen(res$file)

# We can get where data is stored
res$dataset

# We get the transposed cross-product results and the original matrices
restCProd <- h5data$RESULTS$tCrossProd_AxB3
A <- h5data$INPUT$A
B <- h5data$INPUT$B3

# Close the file
h5closeAll()

# Show results
restCProd[1:5, 1:5]
all.equal(restCProd, tcrossprod(A, B))
The SVD of an $m \times n$ real or complex matrix $A$ is a factorization of the form:

$$A = U \Sigma V^T$$

where $U$ is an $m \times m$ orthogonal (or unitary) matrix whose columns are the left singular vectors, $\Sigma$ is an $m \times n$ rectangular diagonal matrix with the non-negative singular values on its diagonal, and $V$ is an $n \times n$ orthogonal (or unitary) matrix whose columns are the right singular vectors.
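As a quick check in R, the factorization can be verified with the built-in svd() function:

A <- matrix(rnorm(12), 4, 3)
s <- svd(A)

# Reconstruct A from its factors (svd() returns the compact form)
all.equal(A, s$u %*% diag(s$d) %*% t(s$v))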
This method was developed by M. A. Iwen and B. W. Ong (2016). The authors introduced a distributed and incremental SVD algorithm that is useful for agglomerative data analysis on large networks. The algorithm calculates the singular values and left singular vectors of a matrix A by first partitioning it by columns. This creates a set of submatrices of A with the same number of rows but only some of its columns. After that, the SVD of each submatrix is computed. The final step consists of merging the partial results and computing the SVD of the resulting matrix, as sketched below.
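Here is a minimal plain-R sketch of one level of the block scheme described above; it illustrates the idea only and is not the package's implementation:

set.seed(1)
A <- matrix(rnorm(100 * 400), 100, 400)

# Partition the columns into 4 blocks of 100
blocks <- split(seq_len(ncol(A)), ceiling(seq_len(ncol(A)) / 100))

# Local SVDs: keep U_i %*% diag(d_i) for each column block
partial <- lapply(blocks, function(j) {
  s <- svd(A[, j, drop = FALSE])
  s$u %*% diag(s$d)
})

# Merge the partial results and take the SVD of the combined matrix:
# it has the same singular values and left singular vectors as A
merged <- svd(do.call(cbind, partial))
all.equal(merged$d[1:5], svd(A)$d[1:5])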
knitr::include_graphics("imgs/blocksvd.png")
This method is implemented in the bdSVD_hdf5() function. This function works directly on the hdf5 data format, loading into memory only the data needed to perform the calculations and saving the results back to the hdf5 file for later use. We have to indicate the file to work with, the dataset name and the group where the dataset is located:
# Create dataset 'data' with the 'odata' matrix in the delayed.hdf5 file at the OMICS group
set.seed(5234)
n <- 100
m <- 150000
odata <- matrix(rnorm(n*m, mean = 0, sd = 1), n, m)
bdAdd_hdf5_matrix(odata, "delayed.hdf5", "OMICS", "data")

# Directly from the hdf5 data file
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data")

test <- H5Fopen("delayed.hdf5")
# get results svd (d)
svd_hdf5_d <- test$SVD$data$d[1:7]
# get data
omdata <- test$OMICS$data
h5closeAll()

# Results in hdf5 file for d
svd_hdf5_d[1:7]

# with the R implementation from data in memory
svd <- svd(scale(omdata))
svd$d[1:7]
As in the simple singular value decomposition, we can center and/or scale the data before proceeding with the SVD using the bcenter and bscale parameters. By default, both parameters are TRUE and the data is normalized before the SVD decomposition. To proceed with the SVD without normalization:
# Directly from the hdf5 data file
svdh5 <- bdSVD_hdf5("delayed.hdf5", "OMICS", "data",
                    bcenter = FALSE, bscale = FALSE, k = 2, q = 1)

# get results svd (d)
test <- H5Fopen("delayed.hdf5")
svd_hdf5_d <- test$SVD$data$d[1:7]
h5closeAll()

# SVD (d) from file - data not normalized
svd_hdf5_d

# with the R implementation from data in memory
svd <- svd(omdata)
svd$d[1:7]
In the SVD decomposition by blocks, we can indicate the number of decomposition levels and the number of local SVDs to concatenate at each level with the parameters q and k respectively. By default, q = 1 (one level) with k = 2.
# Block decomposition with the default parameters (q = 1, k = 2), using 2 threads
svdh5 <- bdSVD_hdf5(file = "delayed.hdf5", group = "OMICS",
                    dataset = "data", threads = 2)

# get results svd (d)
fprova <- H5Fopen("delayed.hdf5")
svd_hdf5_d <- fprova$SVD$data$d[1:7]
h5closeAll()

# SVD (d) from file - normalized data
svd_hdf5_d

# with the R implementation from data in memory
svd <- svd(scale(omdata))
svd$d[1:7]
We have also implemented a set of functions to deal with genomic data, a setting where big data is normally present. These are described in the next sections.
Imputation in genetics refers to the statistical inference of unobserved genotypes. In genetic epidemiology and quantitative genetics, researchers aim at identifying genomic locations where variation between individuals is associated with variation in traits of interest.
At the moment, BigDataStatMeth implements SNP imputation following the encoding used in the Genomic Data Structure (GDS) file format, where SNPs are coded as 0 (two B alleles), 1 (one A allele and one B allele), 2 (two A alleles) and 3 (missing data). BigDataStatMeth imputes data where SNP values are '3' (missing data).
The imputation is based on the information of the SNP that contains the missing data: missing values are drawn at random from the observed discrete distribution. For example, if for a SNP we find 70% of '0', 25% of '1' and 5% of '2', the missing value is imputed as '0' with probability 70%, as '1' with probability 25% and as '2' with probability 5%. A minimal sketch of this idea is shown below.
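The following is a minimal plain-R sketch of this sampling scheme for a single SNP; impute_snp is a hypothetical helper, not the package's implementation:

# Impute missing genotypes ('3') from the observed empirical distribution
impute_snp <- function(x) {
  miss <- x == 3
  # Sampling the observed values with replacement draws from
  # the empirical discrete distribution of the SNP
  x[miss] <- sample(x[!miss], sum(miss), replace = TRUE)
  x
}

set.seed(1)
impute_snp(c(0, 0, 0, 1, 2, 3, 0, 1, 3))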
We first simulate a genotype matrix with values 0, 1, 2 and 3, store it in the hdf5 file and show the stored data:
set.seed(108432)
geno.sim <- matrix(sample(0:3, 10000, replace = TRUE), byrow = TRUE, ncol = 10)
bdAdd_hdf5_matrix(geno.sim, "delayed.hdf5", "OMICS", "geno")

# Get data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
geno <- h5fsvd$OMICS$geno
h5closeAll()

geno[1:5, 1:10]
Remember, we always have to close the hdf5 data file. Now we impute the values where genotype = '3' (missing data in GDS format). To apply the imputation we only have to indicate the file where the dataset is, the group and the dataset name. Optionally, we indicate where we want to store the results with outgroup and outdataset; by default, the results are stored in the input dataset.
An important parameter is bycols: it indicates whether we want to impute data by columns (bycols = TRUE) or by rows (bycols = FALSE). The default imputation is by columns.
In the next example we perform the imputation into the "imputedgeno" dataset inside the group "OMICS"; the imputation method takes into account the probabilities in columns (the default):
bdImpute_snps_hdf5("delayed.hdf5", group = "OMICS", dataset = "geno",
                   outgroup = "OMICS", outdataset = "imputedgeno")
Now we get the imputation results:
# Get imputed data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
imputedgeno <- h5fsvd$OMICS$imputedgeno
h5closeAll()

imputedgeno[1:5, 1:10]
Normalization is one of the most important procedures in genomics data analysis. A typical dataset contains more than one sample, and we are almost always interested in making comparisons between them; in order to do so, we need to normalize the data.
In BigDataStatMeth we have implemented a normalization method that works directly with datasets stored in hdf5 file format. As with the other functions in BigDataStatMeth that work with hdf5 data files, we have to indicate where the data is: the file name, the name of the group under which the dataset is stored and the dataset name. In the normalization function, we can indicate whether we want to center the data, scale it, or both center and scale (the default option). By default, the formula applied to normalize the data is:

$$\frac{X - \mu}{\sigma}$$
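As a quick sanity check, this default normalization corresponds to R's scale() applied by columns:

X <- matrix(rnorm(20), 5, 4)

# Column-wise (X - mean) / sd, as in the formula above
Xnorm <- sweep(sweep(X, 2, colMeans(X)), 2, apply(X, 2, sd), "/")
all.equal(Xnorm, scale(X), check.attributes = FALSE)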
The function implemented to normalize data directly from an hdf5 data file is bdNormalize_hdf5().
To show an example, we will use the imputed geno dataset imputedgeno created before in the imputation example. To normalize the data, we have to indicate where the dataset is stored; there are two optional parameters, bcenter and bscale, with default value TRUE. To normalize the data we do not need to modify these optional parameters. If we only want to center the data, we set bscale = FALSE; if we only want to scale the data, the bcenter parameter must be FALSE.
Important: The normalization results are stored under the "NORMALIZED" group; under this group, we find all normalized datasets with the same structure as under the root. For example, if we normalize the dataset imputedgeno under the OMICS group, the normalized results will be in "NORMALIZED/OMICS/imputedgeno".
Let us show an example, normalizing the imputed geno dataset under group "OMICS" in the delayed.hdf5 file:
bdNormalize_hdf5("delayed.hdf5", group="OMICS", dataset="imputedgeno")
The results will be under the groups "NORMALIZED" - "OMICS" in the dataset imputedgeno:
# Geno after normalization
h5fsvd = H5Fopen("delayed.hdf5")
genonormalized <- h5fsvd$NORMALIZED$OMICS$imputedgeno
h5closeAll()

genonormalized[1:5, 1:10]
Missing data in meta-analyses of genetic association studies are unavoidable. In BigDataStatMeth we have implemented a function to remove SNPs with a high percentage of missing data: bdRemovelowdata().
As in the imputation method, we have to specify the input data by indicating the file name, the group and the dataset name, and we also have to specify where the data will be stored after removing the SNPs, with the parameters outgroup and outdataset.
To remove low data we have to indicate whether we want to remove data by columns (bycols = TRUE) or by rows (bycols = FALSE), and the maximum percentage of missing data with the pcent parameter. pcent refers to the missing percentage in columns if we define bycols = TRUE, or in rows if bycols = FALSE. A minimal sketch of this criterion is shown below.
To show how removing low data works, we use the previous dataset stored in the delayed.hdf5 file. In this case, we assume that SNPs are in columns and we remove those SNPs where the proportion of missing data is greater than 39%:
bdRemovelowdata("delayed.hdf5", group = "OMICS", dataset = "geno",
                outgroup = "OMICS", outdataset = "removedgeno",
                bycols = TRUE, pcent = 0.39)
After removing the SNPs with more than 39% missing data, the results are saved in the dataset "removedgeno" under the group "OMICS". We examine the resulting dataset:
# Get the filtered data and show the first 5 rows
h5fsvd = H5Fopen("delayed.hdf5")
removedgeno <- h5fsvd$OMICS$removedgeno
h5closeAll()

removedgeno[1:5, 1:10]
sessionInfo()