knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns. Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell, select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is recommended. dge2 = readRDS("dge2.RDS") ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets ligerex = normalize(ligerex) ligerex = selectGenes(ligerex, var.thresh = 0.1) ligerex = scaleNotCenter(ligerex)
liger
also has functions for reading data generated by 10X's cellranger count
pipeline (including
from 10X V3). We can merge data by data type (most commonly Gene Expression) across multiple samples and
then use this as a single dataset in a new object for integration.
# 10X data to be combined for sample 1 sample1_data1 = "/path/to/10X/output/dir1" sample1_data2 = "/path/to/10X/output/dir2" sample1_dge = read10X(sample.dirs = list(sample1_data1, sample1_data2), sample.names = c("s1_data1", "s1_data2"), min.umis = 500) # 10X data for sample 2 sample2_data = "/path/to/10X/output/dir3" sample2_dge = read10X(sample.dirs = list(sample2_data), sample.names = c("s2_data"), min.umis = 500) liger10X = createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge)) # continue with other preprocessing steps
Next we perform the factorization using an alternating least squares algorithm. After performing the factorization, we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets. The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. In most cases, the default settings of lambda=5.0 and resolution=1.0 provide reasonable results.
ligerex = optimizeALS(ligerex, k = 20) ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment
We can visualize the results by using dimensionality reduction techniques like t-SNE or UMAP (recommended
for larger datasets). Visualizations can be colored by dataset of origin, cluster assignment, or
any feature included in or added to the cell metadata (cell.data
).
plotWordClouds
and plotGeneLoadings
are useful ways to visualize the most highly loading genes (both shared and dataset specific) for each factor, in conjunction with the factor loadings across cells.
These functions are very similar, but plotGeneLoadings
displays the top loading genes in ordered scatter plots instead of word clouds.
ligerex = runTSNE(ligerex) # for larger datasets, may want to use UMAP instead ligerex = runUMAP(ligerex) plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot plotFeature(ligerex, "nUMI") pdf("word_clouds.pdf") plotWordClouds(ligerex) dev.off() pdf("gene_loadings.pdf") plotGeneLoadings(ligerex) dev.off()
Another way to examine factor loadings across cells, and to help visualize the alignment process is
through plotFactors
; this can be helpful for seeing significant scale differences between the
datasets. We can also visualize the correspondence between clusters and factors in the data with
plotClusterProportions
and plotClusterFactors
. These can be especially useful for identifying
clusters associated with certain factors and corresponding marker genes.
plotFactors(ligerex) plotClusterProportions(ligerex) plotClusterFactors(ligerex, use.aligned = T)
We can use the factorization to more explicitly identify shared and dataset-specific markers. The function getFactorMarkers
returns a list, where the first element contains dataset-specific markers for dataset 1, the second
element contains sharedmarkers, the third element contains dataset-specific markers for dataset 2,
and the last 2 elements indicate the number of factors in which each marker is found. This information
allows the identification of ubiquitous vs. cell-type-specific dataset differences.
markers = getFactorMarkers(ligerex, num.genes = 10) plotGene(ligerex, gene = "Malat1") plotGeneViolin(ligerex, gene = "Malat1")
We can compare and visualize liger
cluster assignments with existing clusterings (if cluster assignments
are available for the individual datasets).
# published cluster assignments for all cells in dataset 1 and 2 clusters_published1 = readRDS("clusters1.RDS") clusters_published2 = readRDS("clusters2.RDS") # calculate adjusted Rand Index between liger cluster assignments and another assignment calcARI(ligerex, c(clusters_published1, clusters_published2)) # calculate purity between liger cluster assignments and another assignment for just dataset 1 calcPurity(ligerex, clusters_published1) # visualize joint cluster assignment as related to individual dataset cluster assignments makeRiverplot(ligerex, clusters_published1, clusters_published2)
liger
includes methods for estimating the level of integration (alignment) between datasets and
the level of distortion of structure for each individual dataset after factorization and alignment
(compared to before). In general, datasets which have been factorized and aligned in a meaningful way
should show a high degree of integration (high alignment metric), while maintaining a low degree of distortion
(high agreement metric).
calcAlignment(ligerex) calcAgreement(ligerex) # see if certain clusters are more integrated than others calcAlignmentPerCluster(ligerex)
The suggestK
and suggestLambda
functions can aid in selecting k and lambda. We want to find the smallest
k for which the increase in entropy metric begins to level off (an "elbow" in the plot). Similarly, we want the
smallest lambda for which the alignment metric stabilizes.
suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda
If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions below provide an efficient method of updating. This is much faster than the naive approach of simply re-running the optimizeALS algorithm.
ligerex = optimizeNewK(ligerex, k = 15) #Can also decrease K #Add new batches from the same condition/technology/species/protocol ligerex = optimizeNewData(ligerex, newdata = list(name1 = dge1.new, name2 = dge2.new), which.datasets = list(name1, name2), add.to.existing = T) #Add completely new datasets. Specify which existing datasets are most similar. ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4), which.datasets = list(name1, name2), add.to.existing = F) #cell.subset is a list of cells to retain from each dataset ligerex = optimizeSubset(ligerex, cell.subset)
We can easily create liger
objects from Seurat (V2 or V3) objects (and vice versa), while keeping
calculated features from the original objects.
# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable # genes from each object ligerex = seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000) # Create liger object from single integrated Seurat V2 object, keeping CCA factorization, # splitting datasets by Seurat meta.var column "original" ligerex = seuratToLiger(seurat_obj, combined.seurat = T, meta.var = 'original', cca.to.H = T) # Create liger object from single integrated Seurat V3 object, splitting datasets by two # available assays in Seurat ligerex = seuratToLiger(seurat_obj, combined.seurat = T, assays.use = c('RNA', 'ADT')) # Create Seurat object from liger object, keeping liger highly variable genes seurat_obj = ligerToSeurat(ligerex, use.liger.genes = T)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.