knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, results = "hide")
LIGER supports integrating datasets based on HDF5 files, for examples, the 10X CellRanger output file located at sample/outs/filtered_feature_bc_matrix.h5
can be used with LIGER. Here's the link to learn more about CellRanger output HDF5 files. With the feature that we read one chunk of data into memory at a time, large-scale datasets stored on disk can be analyzed with a rather small memory specification.
Large datasets are often generated over multiple 10X runs (for example, multiple biological replicates). In such cases it may be necessary to merge the HDF5 files from each run into a single HDF5 file. We provide the
mergeH5()
function for this purpose.
To integrate datasets stored in HDF5 files, we first create a liger object using the file names of the HDF5 files. We prepared the following example datasets for demonstration:
library(rliger) dataSource <- data.frame( filenames = c("pbmcs_ctrl_vignette.h5", "pbmcs_stim_vignette.h5"), urls = c("https://figshare.com/ndownloader/files/43108963", "https://figshare.com/ndownloader/files/43108966") ) for (i in seq_len(nrow(dataSource))) { if (!file.exists(dataSource$filenames[i])) { download.file(dataSource$urls[i], destfile = dataSource$filenames[i]) } } pbmcs <- createLiger(list(ctrl = dataSource$filenames[1], stim = dataSource$filenames[2]))
Similar to other iNMF integration tutorial, we also need to preprocee the data by 1. normalizing each cell by its library size, 2. select variable genes for each dataset, and 3. scale the gene expression within each dataset in a non-negative way.
pbmcs <- pbmcs %>% normalize() %>% selectGenes(var.thresh = 0.2) %>% scaleNotCenter()
Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default mini-batch size = 5000). Sufficient number of iterations is crucial for obtaining ideal factorization result. If the size of the mini-batch is set to be close to the size of the whole dataset (i.e. an epoch only contains one iteration), max.epochs
needs to be increased accordingly for more iterations.
pbmcs <- runIntegration(pbmcs, k = 20, method = "online")
After performing the factorization, we can perform quantile normalization to align the datasets, and then use graph based community detection algorithm to find clusters.
pbmcs <- quantileNorm(pbmcs) pbmcs <- runCluster(pbmcs, nNeighbors = 30, resolution = .6, nRandomStarts = 2)
We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.
pbmcs <- runUMAP(pbmcs, nNeighbors = 30, minDist = .3) plotByDatasetAndCluster(pbmcs)
For downstream analysis that requires expression data, such as differential expression test, we currently do not support direct computation using on-disk HDF5 data. In order to conduct such analyses, users need to create downsampled subset from the large-scale on-disk data and read it into memory. By default, downsample()
function randomly samples maxCells
cells. Alternatively, we can set balanced sampling basing on given categorical variables. Note that newH5 = FALSE
has to be set in order to take the subsample into memory, other wise another downsampled HDF5 file would be created and still cannot be used for those downstream analysis method.
pbmcs.sample <- downsample(pbmcs, maxCells = 5000, useSlot = "rawData", newH5 = FALSE)
Using the sampled data stored in memory, we can now compare clusters or datasets to identify differentially expressed genes. The runMarkerDEG()
function performs differential expression analysis, by default with Wilcoxon rank-sum test. The default mode of mark detection goes for finding the marker of each cluster by performing the test between a cluster against all the rest. Here, we demonstrate finding the dataset-specific markers, within each cluster.
markerTable <- runMarkerDEG(pbmcs.sample, conditionBy = "dataset", splitBy = "leiden_cluster")
The returned object is a data.frame
with all test result combined in one, without filtering or sorting. The group
column contains the information for splitting the tests. For example, when the value is "2.stim"
, it indicates the result statistics is done within cluster 2 comparing "stim" dataset against others. Here we show how to use dplyr commands to get the top 20 dataset markers of "stim" within cluster 2.
library(dplyr) markerTable %>% filter(group == "2.stim", logFC > 0, padj < 0.05) %>% arrange(padj, -logFC) %>% slice_head(n = 20)
We can show UMAP coordinates of sampled cells by their loadings on each factor (Factor 4 as an example). The lower part of the figure shows the gene loading ranking of the factor, with the middle showing the ranking of the shared gene loading, and the dataset-specific gene loading ranking on two sides. Note that this part of test requires normalized gene expression loaded in memory, whereas we only down-sampled the raw count data.
pbmcs.sample <- normalize(pbmcs.sample) factorMarker <- getFactorMarkers(pbmcs.sample, dataset1 = "ctrl", dataset2 = "stim") plotGeneLoadings(pbmcs.sample, markerTable = factorMarker, useFactor = 4)
With the subsampled data loaded into memory, we can then also generate plots of dimensional reduction coordinates colored by expression of specified gene. For example, the gene ISG15 has expression visualized below.
plotGeneDimRed(pbmcs.sample, "ISG15")
Furthermore, we can make violin plots of expression of specified gene. Taking ISG15 as an example again, we can make a violin plot for each dataset and group each by cluster.
plotGeneViolin(pbmcs.sample, "ISG15", byDataset = TRUE, title = names(pbmcs.sample))
The second scenario of online learning algorithm is where we have performed factorization on existing dataset(s), and we add on continually arriving datasets to the pool and update the factorization with existing result.
For example, still with the same PBMC example, we first perform online iNMF on the control dataset.
pbmcs <- createLiger(list(ctrl = "pbmcs_ctrl_vignette.h5")) %>% normalize() %>% selectGenes() %>% scaleNotCenter() %>% runIntegration(method = "online")
Then, we obtain the new dataset, and then run online iNMF with adding it as a new arriving dataset. There are a couple of points to note.
newDatasets
. Users need to be aware of the second point above, because we remove missing features by default and can cause error when variable features obtained from existing dataserts is identified as missing when creating the object for new datasets.```{R, results="hide"} pbmcs.new <- runOnlineINMF(pbmcs, newDatasets = list(stim = "pbmcs_stim_vignette.h5"))
After the factorization, we can go through the same procedure to align the factor loading and jointly cluster the cells, as well as visualize the integration result. ```r pbmcs.new <- quantileNorm(pbmcs.new) %>% runCluster(nNeighbors = 30, resolution = .3) %>% runUMAP(nNeighbors = 30, minDist = .3) plotByDatasetAndCluster(pbmcs.new)
The third scenario serves as a fast inspection for new datasets. We do not update the factorization with newly arriving data, but instead, compute the factor loading in cells (i.e. $H$ matrices for new datasets) using the existing gene loading in the factors ($W$ matrix from previous factorization). Therefore, no dataset-specific gene loading in factors ($V$ matrices) can be produced for the new datasets in this approach. Scenario 3 can be triggered with specifying the new datasets, following the same guidance as in scenario 2 above, and set projection = TRUE
.
pbmcs <- createLiger(list(ctrl = "pbmcs_ctrl_vignette.h5")) %>% normalize() %>% selectGenes() %>% scaleNotCenter() %>% runIntegration(method = "online") pbmcs.new <- runOnlineINMF(pbmcs, newDatasets = list(stim = "pbmcs_stim_vignette.h5"), projection = TRUE) pbmcs.new <- quantileNorm(pbmcs.new) %>% runCluster(nNeighbors = 30, resolution = .3) %>% runUMAP(nNeighbors = 30, minDist = .3) plotByDatasetAndCluster(pbmcs.new)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.