countSharedClonotypes: Count shared clonotypes across groups
In LTLA/RandomGrabBag: Utility Functions for Analyzing Repertoire Sequencing Data


Count the number of clonotypes that are shared across groups, usually different cell types.

countSharedClonotypes(x, ...)

## S4 method for signature 'ANY'
countSharedClonotypes(
  x,
  clone.field,
  group,
  metric = c("none", "jaccard", "maximum"),
  collapse.cells = FALSE
)

## S4 method for signature 'CompressedSplitDataFrameList'
countSharedClonotypes(x, clone.field, cov.field, group, ...)

`x`	Any data.frame-like object where each row corresponds to a single cell and contains its representative sequence. Rows with any `NA` values in the specified `clone.field` columns are ignored. Alternatively, a SplitDataFrameList where each DataFrame corresponds to a cell and each row in that DataFrame is a sequence in that cell.
`...`	For the generic, further arguments to pass to individual methods. For the `CompressedSplitDataFrameList` method, further arguments to pass to the ANY method.
`clone.field`	String specifying the columns of `x` containing the clonotype identity.
`group`	Factor of length equal to `x` indicating the group to which each cell belongs.
`metric`	String specifying the type of sharing metric to return.
`collapse.cells`	Logical scalar indicating whether each clonotype should be counted only once.
`cov.field`	String specifying the column of `x` containing the read/UMI coverage.

This function quantifies the sharing of clonotypes across different groups. The most obvious application is to identify clonotypes shared across different cell types (as represented by cluster identity), allowing us to infer that those cell types share a common ancestor. Examples include clonotypes that are shared across various B or T cell states (e.g., activation, memory), indicating that there is an active transition between states.

If metric="none", we return the number of clonotypes that are shared between each pair of groups. If collapse.cells=FALSE, we instead return the total number of cells across both groups that exhibit a shared clonotype.
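The two counting modes can be illustrated with a minimal base R sketch. This is not the package's implementation; the toy vectors `clonotype` and `group` and the helper names `tab`, `present`, `shared.clonotypes` and `shared.cells` are hypothetical and exist only for this illustration.

```r
# Toy data: one entry per cell (hypothetical values, for illustration only).
clonotype <- c("A", "A", "B", "B", "B", "C", "D", "D")
group     <- c("g1", "g1", "g1", "g2", "g2", "g2", "g2", "g3")

# Clonotype-by-group matrix of cell counts.
tab <- unclass(table(clonotype, group))

# 0/1 indicator of whether each clonotype is observed in each group.
present <- (tab > 0) + 0L

# Counting each clonotype once: number of clonotypes observed in both groups.
shared.clonotypes <- crossprod(present)

# Counting cells: cells in group i whose clonotype also occurs in group j,
# plus cells in group j whose clonotype also occurs in group i.
one.way <- crossprod(tab, present)
shared.cells <- one.way + t(one.way)

shared.clonotypes
shared.cells
```

Only the off-diagonal entries are meaningful here. In this toy data, groups g1 and g2 share a single clonotype (B), which is carried by three cells across the two groups.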

If metric="jaccard", the Jaccard index is computed between every pair of groups. On a practical level, this adjusts for differences in the size of the groups so that large groups do not dominate the output. On a theoretical level, we interpret this value by considering the progenitor population that gives rise to the two groups; the Jaccard index represents the proportion of cells in the progenitor population that can develop into either group. If collapse.cells=FALSE, a weighted version of the Jaccard index is used, i.e., the Ruzicka similarity. This involves summing the minimum and maximum frequencies of each clonotype in the numerator and denominator, respectively.
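As a rough illustration of the two indices, the base R sketch below computes the Jaccard index from clonotype presence/absence and the Ruzicka similarity from per-clonotype cell counts. It is not the package's implementation; the toy data and the helper `ruzicka()` are hypothetical, and using raw cell counts (rather than normalized proportions) as the "frequencies" is an assumption made for this example.

```r
# Toy data: one entry per cell (hypothetical values, for illustration only).
clonotype <- c("A", "A", "B", "B", "B", "C", "D", "D")
group     <- c("g1", "g1", "g1", "g2", "g2", "g2", "g2", "g3")

# Clonotype-by-group matrix of cell counts, and its 0/1 presence indicator.
tab <- unclass(table(clonotype, group))
present <- (tab > 0) + 0L

# Jaccard index: |intersection| / |union| of the clonotype sets of two groups.
inter <- crossprod(present)            # clonotypes observed in both groups
n <- diag(inter)                       # clonotypes observed in each group
union.size <- outer(n, n, "+") - inter
jaccard <- inter / union.size

# Ruzicka similarity: sum of per-clonotype minima over sum of maxima.
# ASSUMPTION: raw cell counts are used as the frequencies.
ruzicka <- function(a, b) sum(pmin(a, b)) / sum(pmax(a, b))

groups <- colnames(tab)
weighted <- matrix(1, length(groups), length(groups),
    dimnames=list(groups, groups))
for (i in groups) {
    for (j in groups) {
        weighted[i, j] <- ruzicka(tab[, i], tab[, j])
    }
}

jaccard
weighted
```

In this toy data, g1 and g2 share one of four distinct clonotypes, so their Jaccard index is 1/4, while the count-weighted version is 1/6 because the shared clonotype accounts for a smaller fraction of the cells.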

If metric="maximum", we compute the larger of the proportions of shared clonotypes across the two groups. For example, if 10 clonotypes are shared and one group has 20 clonotypes and the other group has 40 clonotypes, the output value will be 0.5. This is designed to detect high shared proportions in smaller groups; for example, if a rare subpopulation is derived from the same progenitors as a much larger population, the Jaccard index would be small even if the sharing in the small subpopulation is very high. If collapse.cells=FALSE, we instead compute the proportion of cells in each group that exhibit a shared clonotype.
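The arithmetic for the clonotype-level case can be sketched in base R as follows. This is an illustration on hypothetical toy data, not the package's implementation; the names `tab`, `present`, `inter`, `prop` and `max.prop` are introduced here only for the example.

```r
# Worked example from the text: 10 shared clonotypes, groups of 20 and 40.
max(10 / 20, 10 / 40)  # 0.5

# Toy data: one entry per cell (hypothetical values, for illustration only).
clonotype <- c("A", "A", "B", "B", "B", "C", "D", "D")
group     <- c("g1", "g1", "g1", "g2", "g2", "g2", "g2", "g3")

# 0/1 indicator of whether each clonotype is observed in each group.
tab <- unclass(table(clonotype, group))
present <- (tab > 0) + 0L

# Clonotypes shared by each pair of groups, and clonotypes per group.
inter <- crossprod(present)
n <- diag(inter)

# prop[i, j]: fraction of the clonotypes in group i that are also in group j.
prop <- sweep(inter, 1, n, "/")

# Take the larger of the two directional proportions for each pair.
max.prop <- pmax(prop, t(prop))
max.prop
```

Here g3 contains a single clonotype that is also present in g2, so the value for that pair is 1 even though g2 has three clonotypes; the Jaccard index for the same pair would only be 1/3.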

A numeric matrix where the off-diagonal entries contain metrics of clonotype sharing between each pair of groups.

Aaron Lun

countCellsPerClonotype, to compare between groups.

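# Simulate a per-sequence table: each row is one sequence, assigned to a
# cell (cell.id) with a clonotype label and a UMI count.
df <- data.frame(
    cell.id=sample(LETTERS, 30, replace=TRUE),
    clonotype=sample(paste0("clonotype_", 1:5), 30, replace=TRUE),
    umi=pmax(1, rpois(30, 5))
)

# Split into one DataFrame per cell, then count sharing across three
# randomly assigned groups.
y <- splitDataFrameByCell(df, field="cell.id")
out <- countSharedClonotypes(y, "clonotype", cov.field="umi",
    group=sample(3, length(y), replace=TRUE))
out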