clonoStats: Assign cell-level clonotypes and calculate abundances
In kstreet13/VDJdive: Analysis Tools for 10X V(D)J Data

clonoStats

R Documentation

Description

Assign clonotype labels to cells and produce two summary tables: the clonotypes x samples table of abundances and the counts x samples table of clonotype frequencies.

Usage

clonoStats(x, ...)

## S4 method for signature 'SplitDataFrameList'
clonoStats(
  x,
  group = "sample",
  type = NULL,
  assignment = FALSE,
  method = "EM",
  lang = c("cpp", "r"),
  thresh = 0.01,
  iter.max = 1000,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SingleCellExperiment'
clonoStats(x, contigs = "contigs", group = "sample", ...)

## S4 method for signature 'clonoStats'
clonoStats(x, group = NULL, lang = c("cpp", "r"))

Arguments

`x`	A `SplitDataFrameList` object containing V(D)J contig information, split by cell barcodes, as created by `readVDJcontigs`. Alternatively, a `SingleCellExperiment` object with such a `SplitDataFrameList` in the `colData`, as created by `addVDJtoSCE`.
`...`	additional arguments.
`group`	character. The name of the column in `x` (or the `colData` of `x`, for `SingleCellExperiment` objects) that stores each cell's group identity, typically either its sample of origin or cluster label. Alternatively, a vector of length equal to `x` (or `ncol(x)`) indicating the group identity. Providing this information can dramatically speed up computation. When running `clonoStats` for the first time on a dataset, we highly recommend setting the group identity to sample of origin to avoid unwanted cross-talk between samples.
`type`	character. The type of V(D)J data (one of `"TCR"` or `"BCR"`). If `NULL`, this is determined by the most prevalent `chain` types in `x`.
`assignment`	logical. Whether or not to return the full `nCells x nClonotypes` sparse matrix of clonotype assignments (default = `FALSE`).
`method`	character. Which method to use for assigning cell-level clonotypes. Options are `"EM"` (default), `"unique"`, or `"CellRanger"`. Alternatively, this may be the name of a numeric column of the contig data or any `chain` type contained therein. See Details.
`lang`	character. Indicates which implementation of certain methods to use. The EM algorithm is implemented in both pure R (`'r'`) and mixed R and C++ (`'cpp'`, default) versions. Similarly, clonotype summarization is implemented in two ways, which can impact speed, regardless of choice of `method`.
`thresh`	Numeric threshold for convergence of the EM algorithm. Indicates the maximum allowable deviation in a count between updates. Only used if `method = "EM"`.
`iter.max`	Maximum number of iterations for the EM algorithm. Only used if `method = "EM"`.
`BPPARAM`	A BiocParallelParam object specifying the parallel backend for distributed clonotype assignment operations (split by `group`). Default is `BiocParallel::SerialParam()`.
`contigs`	character. When `x` is a `SingleCellExperiment`, this is the name of the column in the `colData` of `x` that contains the VDJ contig data.

Details

Assign cells (with at least one V(D)J contig) to clonotypes and produce summary tables that can be used for downstream analysis. Clonotype assignment can be handled in multiple ways depending on the choice of "method":

"EM": Cells are assigned probabilistically to their most likely clonotype(s) with the Expectation-Maximization (EM) algorithm. For ambiguous cells, this leads to proportional (non-integer) assignment across multiple clonotypes and a frequency table of (non-integer) expected counts.
"unique": Cells are assigned a clonotype if (and only if) they can be uniquely assigned a single clonotype. For a T cell, this means having exactly one alpha chain and one beta chain.
"CellRanger": Clonotype labels are taken from contig data and matched across samples.
column name in contig data: Similar to "unique", but additionally, cells with multiples of a particular chain are assigned a "dominant" clonotype based on which contig has the higher value in this column (typical choices being "umis" or "reads").
type of chain in contig data: Clonotypes are based entirely on this type of chain (e.g., "TRA" or "TRB"), and a cell may be assigned to multiple clonotypes if it has more than one contig of that chain.
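
The "unique" and column-based rules above can be sketched in base R on a toy contig table. This is only an illustration of the logic as described here, not the package's implementation; the data frame and its column names (`barcode`, `chain`, `cdr3`, `umis`) are hypothetical.

```r
# Toy contig table: one row per contig (hypothetical data, not VDJdive output).
contigs <- data.frame(
  barcode = c("c1", "c1", "c2", "c2", "c2", "c3"),
  chain   = c("TRA", "TRB", "TRA", "TRA", "TRB", "TRB"),
  cdr3    = c("CAVR", "CASS", "CAVR", "CALS", "CASS", "CASG"),
  umis    = c(5, 9, 2, 7, 4, 6),
  stringsAsFactors = FALSE
)

# "unique"-style rule: keep cells with exactly one TRA and one TRB contig.
n_alpha <- tapply(contigs$chain == "TRA", contigs$barcode, sum)
n_beta  <- tapply(contigs$chain == "TRB", contigs$barcode, sum)
unique_cells <- names(n_alpha)[n_alpha == 1 & n_beta == 1]

# Column-based rule: within each cell and chain, keep the contig with the
# highest value in the chosen column (here, "umis").
ord <- order(contigs$barcode, contigs$chain, -contigs$umis)
dominant <- contigs[ord, ]
dominant <- dominant[!duplicated(dominant[c("barcode", "chain")]), ]

print(unique_cells)
print(dominant)
```

In this toy data, cell `c2` has two alpha contigs, so it is skipped by the "unique"-style rule but keeps a single dominant alpha contig under the column-based rule.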

The "EM", "unique", and UMI/read-based quantification methods all define a clonotype as a pair of specific chains (alpha and beta for T cells, heavy and light for B cells). Unlike the other methods, the EM algorithm assigns clonotypes probabilistically, which can lead to non-integer counts for cells with ambiguous information (i.e., only an alpha chain, or two alphas and one beta chain).
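
To see how probabilistic assignment produces non-integer counts, here is a minimal EM sketch in base R. It is a generic toy under stated assumptions, not the algorithm as implemented in the package: the function `toy_em` and the compatibility matrix are hypothetical, and the `thresh` and `iter.max` arguments only mirror the roles described under Arguments.

```r
# Compatibility matrix: rows = cells, columns = candidate clonotypes.
# A 1 means the cell's observed chains are consistent with that clonotype.
compat <- rbind(
  cell1 = c(1, 0),  # one alpha, one beta: unambiguously clonotype A
  cell2 = c(1, 0),
  cell3 = c(0, 1),  # unambiguously clonotype B
  cell4 = c(1, 1)   # two alphas, one beta: could be A or B
)
colnames(compat) <- c("A", "B")

# Hypothetical helper: generic EM over clonotype counts.
toy_em <- function(compat, thresh = 0.01, iter.max = 1000) {
  # Start by splitting each cell evenly across its compatible clonotypes.
  counts <- colSums(compat / rowSums(compat))
  for (i in seq_len(iter.max)) {
    # E-step: weight each compatible clonotype by its current expected count.
    w <- sweep(compat, 2, counts, "*")
    w <- w / rowSums(w)
    # M-step: expected counts are the column sums of the weights.
    new_counts <- colSums(w)
    converged <- max(abs(new_counts - counts)) < thresh
    counts <- new_counts
    if (converged) break
  }
  list(assignment = w, counts = counts, iterations = i)
}

fit <- toy_em(compat)
print(fit$assignment)
print(fit$counts)
```

The ambiguous cell is split between clonotypes A and B, with more weight on A because A is better supported by the unambiguous cells, so the expected counts are non-integer while still summing to the number of cells.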

We highly recommend providing information on each cell's sample of origin, as this can speed up computation and provide more accurate results. This is particularly important for the EM algorithm, which shares information across cells in the same group, so splitting by sample can improve accuracy by removing extraneous clonotypes from the set of possibilities for a particular cell.

Value

Returns an object of class clonoStats, containing group-level clonotype summaries. May optionally include a sparse matrix of cell-level assignment information, if assignment = TRUE. If x is a SingleCellExperiment object, this output is added to the metadata.
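
The relationship between the two summary tables named in the Description can be illustrated with base R alone. This sketch assumes integer abundances and hypothetical clonotype and sample names; it does not reproduce the internal layout of a clonoStats object, and whether the real frequency table includes a row for zero counts is not stated here.

```r
# Toy clonotypes x samples abundance table (hypothetical values).
abund <- matrix(
  c(3, 1, 0, 1,   # sample1
    0, 2, 1, 1),  # sample2
  ncol = 2,
  dimnames = list(paste0("clonotype", 1:4), c("sample1", "sample2"))
)

# Counts x samples frequency table: for each sample, how many clonotypes
# were observed exactly 1, 2, ..., max_count times.
max_count <- max(abund)
freq <- sapply(colnames(abund), function(s) {
  observed <- abund[abund[, s] > 0, s]
  tabulate(observed, nbins = max_count)
})
rownames(freq) <- seq_len(max_count)

print(abund)
print(freq)
```

Here the first row of `freq` counts singletons and the second counts doubletons in each sample, which is the kind of summary commonly used for downstream diversity estimates.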