countCellsPerClonotype: Count cells per clonotype
In LTLA/RandomGrabBag: Utility Functions for Analyzing Repertoire Sequencing Data


Count the number of cells that exhibit each clonotype.

countCellsPerClonotype(x, ...)

## S4 method for signature 'ANY'
countCellsPerClonotype(
  x,
  clone.field,
  group = NULL,
  cov.field = NULL,
  downsample = FALSE,
  down.ncells = NULL
)

## S4 method for signature 'CompressedSplitDataFrameList'
countCellsPerClonotype(x, clone.field, cov.field, group = NULL, ...)

`x`	Any data.frame-like object where each row corresponds to a single cell and contains its representative sequence. Rows with any `NA` values in the specified `clone.field` columns are ignored. Alternatively, a SplitDataFrameList where each DataFrame corresponds to a cell and each row in that DataFrame is a sequence in that cell.
`...`	For the generic, further arguments to pass to individual methods. For the `CompressedSplitDataFrameList` method, further arguments to pass to the ANY method.
`clone.field`	String specifying the columns of `x` containing the clonotype identity.
`group`	Factor of length equal to the number of cells in `x`, indicating the group to which each cell belongs.
`cov.field`	String specifying the column of `x` containing the read/UMI coverage.
`downsample`	Logical scalar indicating whether downsampling should be performed.
`down.ncells`	Integer scalar indicating the number of cells to downsample each group to. Defaults to the smallest number of sequence-containing cells across all levels in `group`.

The aim of this function is to quantify clonal expansion based on the number of cells of a particular clonotype. Clonal expansion is of interest as it serves as a proxy for the strength of the immune response to antigens; we can then compare the degree of expansion between experimental conditions or cell states to gain some biological insights.

Greater expansion manifests in the form of (i) more clonotypes with multiple cells and (ii) clonotypes with a greater number of cells. The exact effect probably depends on the nature of the antigen, e.g., the number of exposed epitopes, which determines whether the expansion is spread across a larger number of clones. See the summarizeClonalExpansion function for more details.

When cov.field is specified, only the highest-abundance sequence (the one with the greatest coverage in cov.field) is used from each cell. In contrast, setting cov.field=NULL will count each sequence separately, such that one cell may contribute multiple times. It is probably safest to set cov.field to some non-NULL value to avoid complications from dependencies between counts, though any such problems are probably minor.
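The difference between the two modes can be illustrated with a minimal base R sketch. This is not the package's implementation: the `seqs` table, the `pick_top` helper and the tie-breaking rule (first row wins) are all assumptions made for illustration.

```r
# Hypothetical per-sequence table: one row per sequence, several rows per cell.
seqs <- data.frame(
    cell.id   = c("A", "A", "B", "C", "C", "C"),
    clonotype = c("c1", "c2", "c1", "c3", "c1", "c3"),
    umi       = c(10, 3, 7, 2, 4, 9),
    stringsAsFactors = FALSE
)

# Keep only the highest-coverage sequence in each cell.
# Assumption: ties are broken by taking the first row.
pick_top <- function(d) d[which.max(d$umi), , drop = FALSE]
top <- do.call(rbind, lapply(split(seqs, seqs$cell.id), pick_top))

# One count per cell (analogous to a non-NULL cov.field) ...
per.cell <- table(top$clonotype)

# ... versus one count per sequence (analogous to cov.field=NULL).
per.seq <- table(seqs$clonotype)
```

In `per.seq`, cell C contributes three times, twice to the same clonotype, which is the kind of dependency between counts that the per-cell mode avoids.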

Cells without any clonotype are completely ignored within this function, as they do not contribute to any of the clonotype counts.

An IntegerList containing one integer vector per level of group (or all cells, if group=NULL). Each entry of the vector corresponds to a clonotype and contains the number of cells with that clonotype. Each vector is also sorted in decreasing order.
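The shape of the return value can be mimicked in base R, with a plain list standing in for the IntegerList. The input vectors and the `count_sorted` helper below are hypothetical, and names are kept on the counts only for readability; whether the real output is named is not stated here.

```r
# Hypothetical input: one clonotype label and one group label per cell.
clonotype <- c("c1", "c1", "c2", "c3", "c1", "c2", "c3", "c3", "c3")
group     <- factor(c("a", "a", "a", "a", "b", "b", "b", "b", "b"))

# Count cells per clonotype and sort the counts in decreasing order.
count_sorted <- function(cl) {
    tab <- table(cl)
    sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)
}

# One integer vector per level of 'group'.
counts <- lapply(split(clonotype, group), count_sorted)
counts
```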

One difficulty with quantification is that the average number of cells per clonotype and the number of multiple-cell clonotypes are not linear functions of the number of cells. Increasing the number of cells may result in more new clonotypes or more cells assigned to previously observed clonotypes, depending on the (unknown) clonotype composition of the population. This complicates comparisons between groups that contain different numbers of cells, e.g., diversity metrics in summarizeClonotypeCounts cannot be directly compared between groups.
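A small worked calculation makes the non-linearity concrete. The sketch below assumes a made-up population (one dominant clonotype and 70 rare ones) and cells sampled independently from it. It uses the ratio of expectations as a rough stand-in for the average number of cells per clonotype, so it is only an approximation.

```r
# Hypothetical clonotype frequencies: one dominant clone, 70 rare ones.
p <- c(0.3, rep(0.7 / 70, 70))

# Expected number of distinct clonotypes observed among n independently
# sampled cells: each clonotype is seen with probability 1 - (1 - p)^n.
expected_distinct <- function(n, p) sum(1 - (1 - p)^n)

n <- c(10, 100, 1000)
distinct <- vapply(n, expected_distinct, numeric(1), p = p)

# Rough cells-per-clonotype at each depth (ratio of expectations).
ratio <- n / distinct
data.frame(n = n, distinct = distinct, cells.per.clonotype = ratio)
```

Each tenfold increase in the number of cells changes the ratio by a different factor, so two groups with the same underlying composition but different cell numbers would yield different summaries.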

We solve this problem by simply downsampling so that all levels of group have the same number of cells. This eliminates uninteresting technical differences in, e.g., cell capture rates when comparing between groups, without making any assumptions about the clonotype composition of the vast majority of unobserved cells. Of course, this also eliminates the biological effect of the increase in the number of cells upon expansion, but any such expansion should hopefully still be detectable via changes in clonotype composition among the remaining cells.
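The downsampling idea can be sketched in base R as follows. This is an illustration rather than the function's actual code: the simulated data are made up, and sampling cells without replacement down to the smallest group size is an assumption about how the downsampling is performed.

```r
set.seed(100)

# Hypothetical data: a large group of 40 cells and a small group of 15.
clonotype <- c(sample(paste0("c", 1:5), 40, replace = TRUE),
               sample(paste0("c", 1:5), 15, replace = TRUE))
group <- factor(rep(c("big", "small"), c(40, 15)))

by.group <- split(clonotype, group)

# Target size: the smallest number of cells in any group.
target <- min(lengths(by.group))

# Assumption: cells are drawn without replacement within each group.
downsampled <- lapply(by.group, function(cl) cl[sample(length(cl), target)])

# Count cells per clonotype on the equalized groups.
counts <- lapply(downsampled, function(cl) {
    sort(as.integer(table(cl)), decreasing = TRUE)
})
counts
```

After this step every group contributes the same number of cells, so summaries computed from `counts` are no longer confounded by group size.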

As discussed in countCellsPerGeneCombo, we do not adjust for differences in sequencing depth across groups. Our assumption is that any change in coverage manifests as a scaling of the probability of detecting a clonotype where the magnitude of scaling is the same across all clonotypes. If so, differences in coverage translate to differences in the number of cells with a clonotype, allowing us to use the same downsampling solution above.

Aaron Lun

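# Mock up per-sequence data: cell of origin, clonotype and UMI coverage.
df <- data.frame(
    cell.id=sample(LETTERS, 30, replace=TRUE),
    clonotype=sample(paste0("clonotype_", 1:5), 30, replace=TRUE),
    umi=pmax(1, rpois(30, 5))
)

# Group the sequences by cell, then count cells per clonotype using
# only the highest-coverage sequence in each cell.
y <- splitDataFrameByCell(df, field="cell.id")
out <- countCellsPerClonotype(y, "clonotype", cov.field="umi")
out

# Repeat with cells randomly assigned to one of three groups.
out2 <- countCellsPerClonotype(y, "clonotype", cov.field="umi",
   group=sample(3, length(y), replace=TRUE))
out2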