knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

This article only focuses on creating a liger object using data presented in various forms and provides detailed examples and explanations. Currently, we support importing the following forms of input data:

Whichever type of data you want to bring to a LIGER analysis, we generally recommend that it represents the raw counts information. LIGER performs optimally with its own normalization (library-size normalization only) and scaling (no centering) procedures, and other types of transformation may not be compatible with LIGER's assumptions.

The data imported from flat MTX files will be read into memory, internally converted into dgCMatrix objects. Data from other R object classes will also be converted into dgCMatrix objects. Data stored in HDF5 files can either be loaded into dgCMatrix or HDF5-backed DelayedMatrix representation. The latter is particularly useful when the datasets are large and cannot be loaded into memory.

The main liger object constructor function, createLiger(), requires a named list as its first argument, where each element is a dataset to be integrated. The elements of the dataset list must be all of the same type, i.e. all in memory (dgCMatrix) or all on disk (DelayedMatrix). In the case where users have mixed types of data in hand, we provide conversion functions to coerce everything to one side.

If you need to import data presented in other forms but encounter problems coercing them to supported formats, please feel free to open an Issue on our GitHub repository to request for support.

Importing 10X Cellranger Output

Given the prevalence of 10X Genomics' single-cell sequencing technology, the cellranger output has been the most commonly seen data source for real-world analyses. We provide the following examples showing how to read multi-sample datasets from cellranger output into a liger object.

Data presented with MTX files

MTX files are proposed for efficiently storing sparse matrix data into flat text files. Cellranger output MEX format data that contains a MTX file together with a text file with barcodes and a text file with feature names. For example, the following is the structure of the cellranger output directory of one certain sample.

# shell commands in unix terminal
$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
    ├── barcodes.tsv.gz
    ├── features.tsv.gz
    └── matrix.mtx.gz
0 directories, 3 files

We provide a function read10X() for identifying proper files to use and loading them into memory. The function works for either just one sample or the root directory that conatins multiple samples.

Reading a single sample

To load a single sample:

```{R importMTX_single, eval=FALSE} sample1 <- read10X("path/to/filtered_feature_bc_matrix")

Depending on whether the cellranger version is before 3.0 or after, the returned object `sample1` varies in structure. For cellranger version 3.0 and later, the returned object is a named list where each element is for a sample, even if we opt to only read one sample. And this element would be another named list for available feature types. For example when doing scRNAseq, it will be `"Gene Expression"`.

sample1 $sampleName $sampleName$Gene Expression [... dgCMatrix]

For older cellranger, `sample1` will a named list, named by sample names, where each element is directly a dgCMatrix object.

sample1 $sampleName [... dgCMatrix]

#### Reading multiple samples from root directory

The root directory that `read10X()` expects is the directory where you can see subdirecories named by sample names.

shell commands in unix terminal

$ cd /home/jdoe/ $ tree runs runs ├── sample123 │ └── outs │ └── filtered_feature_bc_matrix │ ├── barcodes.tsv.gz │ ├── features.tsv.gz │ └── matrix.mtx.gz └── sample456 └── outs └── filtered_feature_bc_matrix ├── barcodes.tsv.gz ├── features.tsv.gz └── matrix.mtx.gz

```{R importMTX_multi, eval=FALSE}
multiSample <- read10X("path/to/runs")

The returned object multiSample is a named list for each sample available in the runs/ folder. Similar to the case of reading a single sample, the element for each sample will be a list for available feature types for cellranger>=3.0 or a dgCMatrix object for older cellranger.

Creating a liger object from loaded data

The required structure for the input of createLiger() is a named list of matrices. For cellranger>=3.0 case where the loaded data contains substructure for potentially multiple feature types, we need to extract the desired data and organize them a bit. The following example shows how to take only the gene expression out of the loaded data and organize them into the desired format to finally create a liger object.

```{R createLiger, eval=FALSE} multiSample <- read10X("path/to/runs") multiSample

$sample123

$sample123$Gene Expression

[... dgCMatrix]

$sample123$Antibody Capture

[... dgCMatrix]

$sample456

$sample456$Gene Expression

[... dgCMatrix]

$sample456$Antibody Capture

[... dgCMatrix]

multiSample <- lapply(multiSample, [[, i = "Gene Expression") multiSample

$sample123

[... dgCMatrix]

$sample456

[... dgCMatrix]

ligerObj <- createLiger(multiSample)

For convenience, if you only want to work on RNA data, you can directly load with `read10XRNA()`.

```{R importMTX_RNA, eval=FALSE}
multiSample <- read10XRNA("path/to/runs")
multiSample
## $sample123
##   [... dgCMatrix]
## $sample456
##   [... dgCMatrix]
ligerObj <- createLiger(multiSample)

We also provide read10XATAC() for loading only the ATAC data from cellranger output. Note here that 10X provides different platforms that contain ATAC modality, namingly "cellranger atac" and "cellranger arc". Users need to specify argument pipeline for the function to correctly infer the file structure.

Data presented with H5 files

Cellranger outputs identical information into H5 files, which contain the same sparse matrix, barcodes and feature names. H5 files are based on the HDF5 library, which is designed for storing and managing large and complex data. They can be found at the same folder where you see the filtered_feature_bc_matrix/ folder, and the file would be named by filtered_feature_bc_matrix.h5.

# shell commands in unix terminal
$ cd /home/jdoe/runs/sample345/
$ tree outs
outs
    ├── filtered_feature_bc_matrix
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    └── filtered_feature_bc_matrix.h5
    ...

Function read10XH5() is provided to load the raw count matrix of a single sample into a sparse matrix. The example below loads the data into memory and allows for the best speed in computation. This is typically recommended for transcriptomic data of at most a total of a few hundred thousand single cells.

```{R importH5, eval=FALSE} sample1Matrix <- read10XH5("/home/jdoe/runs/sample123/outs/filtered_feature_bc_matrix.h5") sample2Matrix <- read10XH5("/home/jdoe/runs/sample456/outs/filtered_feature_bc_matrix.h5") matrixList <- list(sample123 = sample1Matrix, sample456 = sample2Matrix) ligerObj <- createLiger(matrixList)

For projects with a larger number of cells, e.g. a million or more, we recommend loading the data into a HDF5-backed DelayedMatrix representation. This is done by setting `inMemory = FALSE`. Computation will be executed by loading affordable chunks of data into memory at a time. This can scale to millions of cells, at the cost of slightly slower computation speed. The **limitation** here is that one H5 file can only contain one QC'ed dataset, as the subset handling feature of DelayedMatrix is not supported yet. This will be updated in the future. Users can currently perform QC on individual in-memory data and write to equivalent format with `writeH5()`.

```{R importH5Delayed, eval=FALSE}
sample1Matrix <- read10XH5("/home/jdoe/runs/sample123/outs/filtered_feature_bc_matrix.h5", inMemory = FALSE)
sample2Matrix <- read10XH5("/home/jdoe/runs/sample456/outs/filtered_feature_bc_matrix.h5", inMemory = FALSE)
matrixList <- list(sample123 = sample1Matrix, sample456 = sample2Matrix)
ligerObj <- createLiger(matrixList)

In rliger version before 2.2.0, HDF5 backed data are imported with simply passing the file paths to createLiger(). This usage is still supported though, it is not recommended anymore and will be deprecated in the future.

Importing R object - dgCMatrix

"dgCMatrix", supported by package Matrix, is the core class we adopt for holding the cell feature matrix in rliger. It represents CSC (Compressed Sparse Column) matrix, which is the most common form of sparse matrix in R given its efficiency in both memory usage and computation.

We expect that users have each dgCMatrix object representing a single dataset to be integrated. The main object constructor function, createLiger(), expects datasets are organized in a named list and being passed to the first argument.

```{importDGCMATRIX, eval=FALSE}

NOT RUN, for demonstration only

datasetList <- list( "name1" = dgCMatrix1, "name2" = dgCMatrix2, "name3" = dgCMatrix3 ) ligerObj <- createLiger(datasetList)

If your matrix is a combination of multiple datasets to be integrated, we suggest using `as.liger()` with argument `datasetVar` supplied with a factor object as the labeling to indicate the proper splitting behavior.

```{importDGCMATRIX2, eval=FALSE}
# NOT RUN, for demonstration only
datasetVar
##  [1] sample1 sample1 sample1 sample1 ...
## Levels: sample1 sample2 sample3
ligerObj <- as.liger(dgCMatrix, datasetVar = datasetVar)

Importing R object - Seurat

Seurat has been the most popular R package for single-cell analysis. We directly support running rliger functions on a Seurat object without having to convert a Seurat object into a liger object. Please click on the link to see a detailed example, together with many use cases of interoperation between Seurat and rliger.

Importing R object - SingleCellExperiment

SingleCellExperiment (SCE) has also been a popular class for single-cell analysis. Here's how we bring raw data from SCEs.

Multiple SCEs, each a dataset

This is the most preferable case. Most of the times, people do QC and filterings to remove unexpressing genes and low quality cells. If a SCE is presented as a combined representation of QC'ed datasets, it's likely that unwanted zero entries would be introduced to keep the dimensionality consistent, and this would unnecessarily increase the memory usage and computation time. If you have multiple SCEs, each representing a dataset, we recommend to pass them as a named list to createLiger().

```{importSCE, eval=FALSE}

NOT RUN, for demonstration only

sceList <- list( "name1" = sce1, "name2" = sce2, "name3" = sce3 ) ligerObj <- createLiger(sceList)

Note that rliger only takes `counts(sce)` for use and coerce it to a dgCMatrix.

### A single SCE, including many datasets

If it happens that your single SCE object includes multiple datasets that need to be integrated. We suggest using `as.liger()` to convert it into a liger object. In this case, the second argument `datasetVar` must be provided for inferring how to split the data. Assume that the example `sce` object has a column `sample` in its `colData`:

```{importSCE2, eval=FALSE}
# NOT RUN, for demonstration only
ligerObj <- as.liger(sce, datasetVar = "sample")

Importing from H5AD file, Python AnnData object

In rliger version before 2.2.0, HDF5 backed data are imported with simply passing the file paths to createLiger(). This usage is still supported though, it is not recommended anymore and will be deprecated in the future.

WARNING: We strongly suggest that you make a COPY of the H5AD file for rliger analyis if you create objects with passing the file paths (an old syntax). This mode loads the HDF5 file with read-and-write mode and occasionally writes to the file, which makes the file unable to be loaded back to Python AnnData object.

AnnData is a popular Python class for single-cell analysis. When storing the object, it is written to an H5AD file, which is also based on HDF5 library, and is supported to be loaded into a liger object. If you encounter issues of incompatible encoding, please try writing the data to H5AD file with upgraded AnnData package or submit an issue on our GitHub repository.

Different than other types of classes where the raw counts are stored at a fixed location, it is important to explicitly specify where the raw counts can be found in an H5AD file. Please see the function help.

There are currently a few limitations with H5AD files:

Each H5AD file a dataset

H5AD files with single dataset can be easily loaded into memory with readH5AD(). Please make sure the layer specified contains sparse data.

```{importH5AD, eval=FALSE} sample1Matrix <- readH5AD("path/to/sample1.h5ad", layer = "X") sample2Matrix <- readH5AD("path/to/sample2.h5ad", layer = "layers/counts") matrixList <- list(sample1 = sample1Matrix, sample2 = sample2Matrix) ligerObj <- createLiger(matrixList)

### One H5AD file, multiple datasets

For H5AD files containing multiple batches, it can be loaded into one in-memory matrix first, and then converted to a liger object with `as.liger()`. The metadata variable specifying the batch information can be loaded at the same time.

```{importH5AD2, eval=FALSE}
bigMatrixWithObs <- readH5AD('path/to/project.h5ad', layer = 'layers/counts', obs = TRUE)
bigMatrix <- bigMatrixWithObs$matrix
metadata <- bigMatrixWithObs$obs
ligerObj <- as.liger(bigMatrix, datasetVar = metadata$batch)
# You can also add other necessary variables afterwards.
# metadata <- metadata[colnames(ligerObj),]
# ligerObj$varName <- metadata$varName

As stated in the limitations above, large datasets typically providing scaled expression of variable features in dense form is not supported.

Scalable on-disk computation with H5AD files

Similar to how we deal with 10X H5 files, data backed in H5AD files can also be loaded into a DelayedMatrix representation. This is also done by setting inMemory = FALSE. Computation will be executed by loading affordable chunks of data into memory at a time. This can scale to millions of cells, at the cost of slightly slower computation speed.

The limitation here is that one H5AD file can only contain one QC'ed dataset, as the subset handling feature of DelayedMatrix is not supported yet. This will be updated in the future. Users can currently perform QC on individual in-memory data and write to equivalent format with writeH5().

```{importH5ADDelayed, eval=FALSE} sample1Matrix <- readH5AD("path/to/sample1.h5ad", layer = "X", inMemory = FALSE) sample2Matrix <- readH5AD("path/to/sample2.h5ad", layer = "layers/counts", inMemory = FALSE) matrixList <- list(sample1 = sample1Matrix, sample2 = sample2Matrix) ligerObj <- createLiger(matrixList)

## Working with mixed types of data

**LIGER requires that the data to be co-factorized are either all in memory or all on disk. ** 

### Mixing all kinds of R in-memory objects

If each of your R objects contains a single dataset to be integrated, you can simply pass them to the named list. Each element of the list is processed separately.

```{importMixed2, eval=FALSE}
# NOT RUN, for demonstration only
ligerObj <- createLiger(
  list(
    sample1 = dgCMatrix1, 
    seurat2 = seuratObj,
    sce3 = sce, 
  )
)

A single dgCMatrix object can be loaded from various sources which are not demonstrated in example above (e.g. read10X(), read10XH5(), readH5AD() or any other methods).

Function as.ligerDataset is called under the hood to convert each R object to a ligerDataset object, and it directly fetches the 'counts' layer/slot of the default assay from a Seurat object, or the 'counts' assay from an SCE object. If this is not the desired behavior, you can manually extract the correct matrix.

```{importMixed3, eval=FALSE}

NOT RUN, for demonstration only

ligerObj <- createLiger( list( sample1 = dgCMatrix1, seurat2 = SeuratObject::LayerData(seuratObj, layer = "somelayer", assay = "someassay"), sce3 = SummarizedExperiment::assay(sce, "someassay"), ) )

If the given Seurat or SCE objects contain multiple datasets, you can use `as.liger()` to convert them into a liger object, with `datasetVar` properly specified to have the datasets split. Then `c()` method can be applied to merge multiple liger objects into one, analogous to how one would concatenate R list objects.

```{importMixed4, eval=FALSE}
# NOT RUN, for demonstration only
ligerObj1 <- as.liger(seuratObj, datasetVar = "sample")
ligerObj1
## An object of class liger with XXXXX cells
## datasets(2): patient001 (XXXX cells), patient002 (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0): 
## dimReds(0):
ligerObj2 <- as.liger(sce, datasetVar = "batch")
ligerObj2
## An object of class liger with XXXXX cells
## datasets(2): someid101 (XXXX cells), someid103 (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):
ligerObj <- c(ligerObj1, ligerObj2)
# You can even insert more datasets with `dataset<-()`
dataset(ligerObj, "pbmc") <- dgCMatrix1
ligerObj
## An object of class liger with XXXXX cells
## datasets(5): patient001 (XXXX cells), patient002 (XXXX cells), someid101 (XXXX cells), someid103 (XXXX cells), pbmc (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):

Mixing on-disk representation from cellranger H5 files and H5AD files

For enabling scalable on-disk computation, we utilize DelayedArray repsentation to hold on-disk sparse data without loading it into memory. This type of representation should be created with either read10XH5() or readH5AD() with setting inMemory = FALSE. Note that not all types of DelayedArray objects are supported. Please also see the limitation stated in sections above.

To use on-disk data from both 10X H5 files and H5AD files, users can simply create the DelayedArray object of each, and normally create a liger object.

```{importMixed, eval=FALSE}

NOT RUN, for demonstration only

tenxSample1 <- read10XH5('path/sample/out/filtered_feature_bc_matrix.h5', inMemory = FALSE) adataSample2 <- readH5AD('path/sample2.h5ad', layer = "layers/counts", inMemory = FALSE)

ligerObj <- createLiger(list(sample1 = tenxSample1, sample2 = adataSample2))

### Mixing H5/H5AD files with in-memory objects

It is not supported to mix on-disk and in-memory data in LIGER integration, though, we provide approaches to coerce the data to either side.

#### H5/H5AD to in-memory

Functions `read10XH5()` or `readH5AD()` can be used to load the data into a dgCMatrix. Users can then follow guides above that cover many situations.

#### In-memory to HDF5

We provide function `writeH5()` to write an in-memory sparse matrix to a 10X-style H5 file. Users can then load it to a DelayedArray object for holding the on-disk information. Note that the written H5 file might not be loadable by some other toolkits, since only gene names are avaiable to be written while some toolkits require both gene name and ID to be avaiable.

```{writeH5, eval=FALSE}
writeH5(dgCMatrix1, 'path/to/sample1.h5')
delayed1 <- read10XH5('path/to/sample1.h5', inMemory = FALSE)
adataSample2 <- readH5AD('path/sample2.h5ad', layer = "layers/counts", inMemory = FALSE)
ligerObj <- createLiger(list(sample1 = delayed1, sample2 = adataSample2))


welch-lab/liger documentation built on July 17, 2025, 5:09 a.m.