knitr::opts_chunk$set(echo = TRUE, results = "hold")

## Structure

Starting from rliger 2.0.0, we introduced a newly designed structure for the main data container object class. The figure below gives an overall idea of the design.

[Figure: liger object structure]

On the left-hand side of the figure, we briefly illustrate what the old structure was like. It had slots for data at each processing stage, and each slot was a named list object containing matrices.

In the new version, on the right-hand side, we first introduce a new class, `ligerDataset`, which serves as a container for all matrices belonging to the same dataset. This gives easy and safe control over data matching, thanks to a number of object-oriented accessor methods and validity checks. As for the liger class, there are some main differences:

We demonstrate the accessors below with the example dataset "pbmc", a minimal subset from the study by Hyun Min Kang et al., Nature Biotechnology, 2018. The data is a ready-to-use (new) liger object containing only raw counts. We quickly process it here so that we can show how to retrieve all kinds of data in the sections below.

```{R, message=FALSE, warning=FALSE, results="hide"}
library(rliger)
data("pbmc")
pbmc <- pbmc %>%
    normalize() %>%
    selectGenes() %>%
    scaleNotCenter() %>%
    runIntegration() %>%
    quantileNorm() %>%
    runCluster() %>%
    runUMAP()
```

## Access a dataset

As introduced above, the dataset-specific information is contained in a [ligerDataset](../reference/ligerDataset-class.html) object. 

- To get the names of all datasets

```{R}
names(pbmc)
# Number of datasets
length(pbmc)
# Number of cells in each dataset
lengths(pbmc)
# Access a single dataset by name
ctrlLD <- dataset(pbmc, dataset = "ctrl")
# Alternatively, using numeric index
ctrlLD <- dataset(pbmc, 1)
```

In any other rliger function where the argument `useDatasets` is exposed, users can always use the exact character name(s) or the numeric index to specify the datasets to be involved in the analysis. Moreover, a logical index vector is also allowed and can be more convenient in some cases.

```{R, eval = FALSE}
# Not run, just for example, assuming we've got the clustering for such an object
names(ligerObj)
## [1] female-1 female-2 male-3 male-4 female-5 ......
femaleIdx <- startsWith(names(ligerObj), "fe")
runMarkerDEG(ligerObj, conditionBy = "dataset", splitBy = "leiden_cluster",
             useDatasets = femaleIdx)
```

In the example above, the `runMarkerDEG()` function is parameterized to detect dataset-specific markers within each cluster, and only within the female samples. For example, cells from condition "female-1 and cluster 1" will be tested against cells belonging to condition "cluster 1 and all other female datasets".
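The three ways of specifying `useDatasets` (character names, numeric indices, or a logical vector) can be pictured with ordinary base R indexing on a named list. This is only a sketch: `toyDatasets` and its contents are made up for illustration and stand in for a real liger object.

```r
# Hypothetical stand-in for a liger object: a named list of datasets
toyDatasets <- list(
  "female-1" = 1, "female-2" = 2, "male-3" = 3, "male-4" = 4, "female-5" = 5
)

# Three equivalent ways to select the female datasets
byName    <- c("female-1", "female-2", "female-5")
byNumber  <- c(1, 2, 5)
byLogical <- startsWith(names(toyDatasets), "fe")

selByName    <- names(toyDatasets[byName])
selByNumber  <- names(toyDatasets[byNumber])
selByLogical <- names(toyDatasets[byLogical])

print(selByLogical)
```

The logical form is handy when the selection follows a naming pattern, since it does not need to be updated when datasets are added or reordered.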

- To access multiple datasets, returned in a list

```{R}
ldList <- datasets(pbmc)
```

## Access feature matrices

We have three main generics for accessing feature matrices, namely `rawData()`, `normData()` and `scaleData()`. For scaled unshared features, used for UINMF, we also have `scaleUnsharedData()`. Additionally, we provide `rawPeak()` and `normPeak()` for accessing the peak counts in an ATAC-seq dataset. These accessors all work in the same way, so we only present the case for raw counts.

```{R}
# Get the raw counts of all datasets as a named list
rawList <- rawData(pbmc)
class(rawList)
length(rawList)
# Alternatively
rawList <- getMatrix(pbmc, "rawData", returnList = TRUE)

# Get the raw counts of a single dataset
stimRaw <- rawData(pbmc, "stim")
class(stimRaw)
dim(stimRaw)
# Alternatively, get the `ligerDataset` object of the dataset first
# and then fetch from there
stim <- dataset(pbmc, "stim")
stimRaw <- rawData(stim)
# Alternatively
stimRaw <- getMatrix(pbmc, "rawData", dataset = "stim")

# Replace the raw counts of a dataset
ctrlRaw <- rawData(pbmc, "ctrl")
# Assume we do some operation on it here
rawData(pbmc, "ctrl") <- ctrlRaw
```

In the new version, strict validity checks are applied whenever the object content is modified. A replacement with mismatched feature names or barcodes is rejected. When a dataset needs to be replaced with one that has a different set of barcodes or features, we suggest creating a new `ligerDataset` object from the new raw counts (or other feature matrix) and then replacing the whole dataset with it.

```{R}
ctrlRaw <- rawData(pbmc, "ctrl")
ctrlRawSubset <- ctrlRaw[1:200, 1:200]
## Not Run, will raise error
# rawData(pbmc, "ctrl") <- ctrlRawSubset
```

```{R, eval=FALSE}
ctrlNew <- createLigerDataset(rawData = ctrlRawSubset)
dataset(pbmc, "ctrl") <- ctrlNew
dim(pbmc)
## [1] NA 500
```
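The idea of rejecting a replacement whose barcodes or features do not match can be illustrated with a small S4 class and a validity function from the base `methods` package. This is a sketch of the general technique only: the class `ToyDataset` and its check are hypothetical and are not the actual `ligerDataset` definition.

```r
library(methods)

# Hypothetical toy class, NOT the real ligerDataset definition
setClass(
  "ToyDataset",
  representation(rawData = "matrix", normData = "matrix"),
  validity = function(object) {
    if (!identical(dimnames(object@rawData), dimnames(object@normData))) {
      return("rawData and normData must share feature names and barcodes")
    }
    TRUE
  }
)

counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3"))
)
toy <- new("ToyDataset", rawData = counts, normData = counts / 2)

# Swap in a matrix with fewer cells, then validate: the check rejects it
modified <- toy
modified@rawData <- counts[, 1:2]
result <- tryCatch(
  validObject(modified),
  error = function(e) "rejected"
)
print(result)
```

Building a fresh object from the new matrix, rather than patching one slot, sidesteps this kind of mismatch because all matrices then start from the same set of names.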

## Access cell metadata

As previously described at the top of this page, cell metadata including dataset origin, study metadata, QC metrics and cluster labeling are all stored in `cellMeta(obj)`.

- To have a look at the full metadata table:

```{R}
cellMeta(pbmc)

# Retrieve one variable
nUMI <- cellMeta(pbmc, "nUMI")
class(nUMI)
length(nUMI)
# Alternatively
nUMI <- pbmc$nUMI
nUMI <- pbmc[["nUMI"]]

# Retrieve multiple variables
cellMeta(pbmc, c("nUMI", "nGene"))

# Retrieve a variable for the cells of one dataset only
nUMI <- cellMeta(pbmc, "nUMI", useDatasets = "ctrl")
length(nUMI)

# Add a new variable covering all cells
foo <- seq_len(ncol(pbmc))
pbmc$foo <- foo
# Alternatively
cellMeta(pbmc, "foo") <- foo
```

If the new variable applies only to a subset of cells (e.g. the original clustering derived from an individual dataset), restrict the assignment to those cells:

```{R}
ctrlBar <- seq_len(ncol(dataset(pbmc, "ctrl")))
cellMeta(pbmc, "ctrl_bar", useDataset = "ctrl") <- ctrlBar
```

## Access dimensionality reductions

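```{R}
# All dimensionality reductions
allDimReds <- dimReds(pbmc)
class(allDimReds)
length(allDimReds)

# A specific dimensionality reduction
umap <- dimRed(pbmc, "UMAP")
class(umap)
dim(umap)

# A specific dimensionality reduction, for the cells of one dataset only
ctrlUMAP <- dimRed(pbmc, "UMAP", useDatasets = "ctrl")
dim(ctrlUMAP)

# Set the default dimensionality reduction
defaultDimRed(pbmc) <- "UMAP"
```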

Every time `runUMAP()` or `runTSNE()` is called, the new result is set as the default. Once a default dimensionality reduction is set, any plotting function that works with one uses it without needing it to be specified explicitly. The dimensionality reduction accessor also returns the default one if no specific one is requested.

```{R}
umap <- dimRed(pbmc)
dimRed(pbmc, "newUMAP") <- umap
```

Cell identifiers in `rownames(value)` are checked for a match when present. The check accounts for the dataset name prefix that is added to the cell IDs in the object.

```{R}
ctrlUMAP <- dimRed(pbmc, "UMAP", useDatasets = "ctrl")
dimRed(pbmc, "ctrlUMAP", useDatasets = "ctrl") <- ctrlUMAP
```
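To picture what a prefix-aware check on cell identifiers might involve, here is a base R sketch. It assumes, purely for illustration, that object-level cell IDs are formed as `<dataset>_<barcode>`; the helper `matchToDataset()` and all IDs are hypothetical and do not reproduce the package's actual implementation.

```r
# Cell IDs as stored in a hypothetical combined object: dataset prefix + barcode
objectCellIDs <- c("ctrl_AAAC", "ctrl_AAAG", "stim_AAAC", "stim_TTTG")

# A user-supplied embedding for "ctrl" whose rownames are bare barcodes,
# deliberately in a different order from the object
embedding <- matrix(
  c(0.1, 0.2, 0.3, 0.4), nrow = 2,
  dimnames = list(c("AAAG", "AAAC"), c("dim1", "dim2"))
)

# Hypothetical helper: align rows to the object's cells of one dataset
matchToDataset <- function(value, cellIDs, dataset) {
  prefix <- paste0(dataset, "_")
  target <- cellIDs[startsWith(cellIDs, prefix)]
  prefixed <- paste0(prefix, rownames(value))
  idx <- match(target, prefixed)
  if (anyNA(idx)) stop("rownames(value) do not match the cells of ", dataset)
  value <- value[idx, , drop = FALSE]
  rownames(value) <- target
  value
}

aligned <- matchToDataset(embedding, objectCellIDs, "ctrl")
print(aligned)
```

A value whose rownames cannot be matched to the target cells makes the helper stop with an error instead of silently misaligning rows.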

## Access factorization result

We suggest using getMatrix() for all matrices involved in the factorization, including:

```{R}
HList <- getMatrix(pbmc, "H")
lapply(HList, dim)
```

## Subsetting the data

For cell-level subsetting, barcode names, numeric indices and logical indices all work. Cells are indexed by `rownames(cellMeta(object))`, which is the concatenation of the barcodes from each dataset, with datasets ordered as shown by `names(object)`.

```{R}
pbmcSmall <- pbmc[, 1:100]
pbmcCluster1 <- pbmc[, pbmc$leiden_cluster == 1]
```
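How a single cell index spans several datasets can be sketched in base R with made-up barcodes. The names below are hypothetical; the point is only that numeric, logical and character indices all address the same concatenated vector.

```r
# Hypothetical barcodes for two datasets, in the order names(object) would show
barcodes <- list(
  ctrl = c("ctrl_c1", "ctrl_c2", "ctrl_c3"),
  stim = c("stim_c1", "stim_c2")
)

# Global cell index: datasets concatenated in order
allCells  <- unlist(barcodes, use.names = FALSE)
datasetOf <- rep(names(barcodes), lengths(barcodes))

# Numeric, logical and character indices address the same global vector
byNumber  <- allCells[3:4]
byLogical <- allCells[datasetOf == "stim"]
byName    <- allCells[allCells %in% c("ctrl_c3", "stim_c1")]

print(byNumber)
print(datasetOf[3:4])
```

Note that the numeric range `3:4` crosses the boundary between the two datasets, which is why a numeric cell index should be read against the concatenated order rather than per dataset.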

For gene-level subsetting, only gene names are allowed, because different datasets are assumed to have different sets of genes. In addition, only genes shared by all datasets can be used.

```{R}
pbmcVarOnly <- pbmc[varFeatures(pbmc), ]
ctrlUnsharedGenes <- c("P2RY1", "GFI1B", "HDGFRP2", "TUBGCP6", "CELA1")
# Not run, will raise error
# pbmc[ctrlUnsharedGenes, ]
```

For a `ligerDataset` object, cell-level subsetting works in exactly the same way as for a liger object.

```{R}
ctrlLD <- dataset(pbmc, "ctrl")
ctrlLDSmall <- ctrlLD[, 1:100]
```

Gene-level subsetting on a `ligerDataset` object can be achieved with any type of index.

```{R}
ctrlLDsmall <- ctrlLD[1:100, ]
ctrlLDsmall <- ctrlLD[1:100, 1:100]
```

Note that `scaleData(ctrlLD)` and `scaleUnsharedData(ctrlLD)` contain only the variable genes identified upstream, whereas gene subsetting on a `ligerDataset` object is specified against its raw input data. Therefore, the subset of the scaled data keeps only those requested genes that are available in the scaled data.
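The relationship between a gene request made against the raw data and what survives in the scaled data can be sketched with a set intersection in base R. The gene names and sizes here are invented for illustration.

```r
# Hypothetical feature sets for one dataset
rawGenes    <- paste0("gene", 1:10)          # all genes in the raw counts
scaledGenes <- c("gene2", "gene5", "gene9")  # variable genes kept in scaled data

# User requests the first six genes by numeric index on the raw data
requested <- rawGenes[1:6]

# Raw data keeps everything requested; scaled data keeps only the overlap
rawSubset    <- requested
scaledSubset <- intersect(scaledGenes, requested)

print(scaledSubset)
```

In this toy case the raw subset has six genes while the scaled subset has only two, because `gene9` was not part of the request and the other requested genes were never in the scaled data.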

## Check the records of run commands

We implemented an analysis tracking feature to keep a record of which functions are called and which parameters are used.

```{R}
commands(pbmc)
```

A unique suffix is added to each function name to keep track of calls of the same function with different parameters.

```{R}
commands(pbmc, "runINMF")
```

A function can be applied to an object several times with different parameters, for example with different lambda values for iNMF integration. If `runINMF()` is called several times, `commands(pbmc, "runINMF")` returns a list of records of all such calls, because every record name starting with "runINMF" is matched. To get the information for one specific call, list the record names first and then use the unique record name. As another example, given that `runCluster()` and `runUMAP()` are also in the record, the following result is returned if we match with only "run":

```{R}
commands(pbmc, "run")
```
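The prefix matching on record names described in this section can be mimicked in base R with `startsWith()`. The record names and the helper `matchRecords()` below are hypothetical; in particular, the real suffix format may differ.

```r
# Hypothetical record names; the real unique-suffix format may differ
recordNames <- c("normalize_a1", "runINMF_b2", "runINMF_c3",
                 "runCluster_d4", "runUMAP_e5")

# Hypothetical helper mimicking prefix matching on record names
matchRecords <- function(recNames, pattern) {
  recNames[startsWith(recNames, pattern)]
}

inmfCalls <- matchRecords(recordNames, "runINMF")     # both runINMF calls
runCalls  <- matchRecords(recordNames, "run")         # every "run*" record
oneCall   <- matchRecords(recordNames, "runINMF_c3")  # one specific call

print(inmfCalls)
print(runCalls)
```

A short prefix is convenient for browsing, while the full unique record name is the way to pin down a single call.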


MacoskoLab/liger documentation built on Jan. 19, 2025, 8:22 a.m.