Working with Large Datasets


Avoid $N \times N$ matrices

The best advice for using the clustering functions in clusterExperiment on large datasets is to avoid calculating any $N \times N$ distance or similarity matrix, where $N$ is the number of samples. Such matrices take a long time to calculate and a large amount of memory to store.

Choice of Clustering Routine

The most likely reason to calculate such a matrix is the choice of clustering routine. Methods like PAM or hierarchical clustering operate on a distance matrix and are therefore not good choices for large datasets.

Prior to version 2.5.5, our functions would internally calculate a distance matrix whenever the clustering algorithm needed one, making it hard for the user to realize that they had selected a clustering routine requiring such a matrix. We have since added the argument makeMissingDiss which, if FALSE, will not calculate any needed distance matrices but will instead return an error. We recommend setting this argument to FALSE with large datasets as a precaution. If you then hit an error, select a different clustering algorithm that does not need an $N \times N$ distance matrix.
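
For example, the following is a minimal sketch of passing makeMissingDiss=FALSE to clusterSingle. The toy data and the kmeans settings are illustrative assumptions, and the exact argument layout may differ across versions (see ?clusterSingle):

```r
library(clusterExperiment)

## Toy data: 100 features (rows) x 1000 samples (columns)
x <- matrix(rnorm(100 * 1000), nrow = 100)

## kmeans works directly on the data, so no dissimilarity is needed;
## with makeMissingDiss = FALSE, a routine that *did* need one would
## error out instead of silently building the N x N matrix.
ce <- clusterSingle(x,
                    subsample = FALSE, sequential = FALSE,
                    mainClusterArgs = list(clusterFunction = "kmeans",
                                           clusterArgs = list(k = 5)),
                    makeMissingDiss = FALSE)
```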

Note that this may not work with PAM, because pam takes as input a matrix $x$ (see ?pam). If a matrix $x$ is given as input, the pam function simply calculates the distance matrix internally! The option makeMissingDiss=FALSE may not catch this, since the actual clustering function allows an input matrix $x$. (This is unfortunate for large datasets, and we may in the future reclassify PAM as a method that only accepts distance matrices, so that it can be caught by makeMissingDiss=FALSE.)
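
The following sketch, using only the cluster package, illustrates the issue: given a data matrix, pam silently builds the $N \times N$ dissimilarity matrix itself, whereas passing a precomputed dist object makes the cost explicit:

```r
library(cluster)

x <- matrix(rnorm(500 * 10), nrow = 500)  # 500 samples x 10 features

## pam builds the 500 x 500 dissimilarity matrix internally:
fit1 <- pam(x, k = 3)

## Equivalent clustering, but the N x N cost is now explicit upfront:
d <- dist(x)
fit2 <- pam(d, k = 3, diss = TRUE)
```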

Similarly, any option involving the silhouette width will create an $N \times N$ distance matrix as part of the silhouette computation in the cluster package. This includes the findBestK=TRUE argument. These options should only be considered for moderately sized datasets where the calculation (and storage) of the $N \times N$ matrix is not a problem.
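
As a reminder of why, the silhouette function in the cluster package requires the full set of pairwise dissimilarities (a sketch on toy data):

```r
library(cluster)

x <- matrix(rnorm(500 * 10), nrow = 500)
cl <- kmeans(x, centers = 3)$cluster

## silhouette() needs the full N x N dissimilarity matrix:
sil <- silhouette(cl, dist(x))
summary(sil)
```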

Subsampling and Consensus clustering

Unfortunately, subsampling and consensus clustering (with makeConsensus) operate by clustering based on the proportion of times each pair of samples is clustered together, which past versions of clusterExperiment stored in an $N \times N$ matrix (see the main tutorial vignette for an explanation of these methods). We are working on methods to avoid calculating this matrix, but they are not yet fully operational.
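
For reference, here is a minimal sketch of the workflow in question; the toy data and parameter values are illustrative assumptions (see ?clusterMany and ?makeConsensus):

```r
library(clusterExperiment)

x <- matrix(rnorm(100 * 300), nrow = 100)  # 100 features x 300 samples

## Many clusterings, then a consensus across them; the consensus step
## is what has historically required the N x N co-clustering matrix.
ce <- clusterMany(x, ks = 2:4, clusterFunction = "kmeans",
                  subsample = FALSE, sequential = FALSE)
ce <- makeConsensus(ce, whichClusters = "clusterMany", proportion = 0.7)
```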

In version 2.5.5, however, we made some infrastructure changes that allow the $N \times N$ matrix to be avoided for subsampling and consensus clustering, provided the user has defined a clustering function that does so (see details below).

Also in version 2.5.5, we changed the clustering functions to allow the user to request clustering of only the unique representations of the combinations of clusterings from subsampling or in makeConsensus, significantly reducing the size of the $N \times N$ matrix used in the actual clustering step (see below).

Technical details

Here we document some infrastructure changes made to allow avoidance of the $N \times N$ matrix for subsampling and consensus clustering. These changes do not yet, by themselves, avoid the $N \times N$ calculation during clustering, but they set up an infrastructure whereby the user can provide an appropriate clustering routine that does.
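
As an illustration, user-defined routines are supplied via the package's ClusterFunction class. The sketch below wraps stats::kmeans; the argument names are based on ?ClusterFunction but may differ across versions, so treat them as assumptions to check against your installed version:

```r
library(clusterExperiment)

## A wrapper with the argument names expected of a type "K" routine
## that takes the data matrix (samples in columns) rather than an
## N x N dissimilarity:
myKmeans <- function(inputMatrix, k, cluster.only, checkArgs, ...) {
  stats::kmeans(t(inputMatrix), centers = k)$cluster
}

myCF <- ClusterFunction(clusterFUN = myKmeans,
                        inputType = "X",       # takes data, not a diss matrix
                        algorithmType = "K",   # requires a number of clusters k
                        outputType = "vector")
```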

Data in Memory versus HDF5

The package is compatible with HDF5-backed matrices, meaning that the package will run if the data given is a reference to an HDF5 file. However, the code may achieve this compatibility by bringing the full matrix into memory. In particular, the default clustering routines are not compatible with the HDF5 implementation, meaning that they must bring the full dataset into memory for their calculations.

The only exception is the method "mbkmeans", which calls the clustering routine from the package of the same name. This package implements a version of k-means ("mini-batch k-means") that genuinely works with the structure of HDF5 datasets to avoid bringing the full dataset into memory. "Mini-batch k-means" refers to using only a subset of the data (a "batch") at each iteration of the clustering. The mbkmeans package integrates this with HDF5 files, among other formats: it is written (in C code) so as to bring into memory only the subset (or batch) needed for any particular calculation, never the entire dataset.
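
A minimal sketch of selecting this option on HDF5-backed data follows; the use of HDF5Array here and the exact argument layout are illustrative assumptions (see ?clusterSingle and the mbkmeans documentation):

```r
library(clusterExperiment)
library(HDF5Array)

## Toy HDF5-backed matrix: 100 features x 2000 samples, stored on disk
hdf5Mat <- writeHDF5Array(matrix(rnorm(100 * 2000), nrow = 100))

## "mbkmeans" is the built-in option that clusters without pulling
## the full matrix into memory:
ce <- clusterSingle(hdf5Mat,
                    subsample = FALSE, sequential = FALSE,
                    mainClusterArgs = list(clusterFunction = "mbkmeans",
                                           clusterArgs = list(k = 5)))
```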

Unlike the mbkmeans package itself, however, the integration in clusterExperiment has not been tested to ensure that the full dataset is not inadvertently brought into memory by other components of the clusterExperiment infrastructure; this is an ongoing area of improvement. (So far, the integration of mbkmeans as a built-in option in clusterExperiment has only been tested to the extent that it successfully runs the clustering routine.)
