countCellsPerGeneCombo: Count gene combinations
In LTLA/RepertoireUtils: Utility Functions for Analyzing Repertoire Sequencing Data


Count the number of cells that express each unique combination of genes.

countCellsPerGeneCombo(x, ...)

## S4 method for signature 'ANY'
countCellsPerGeneCombo(
  x,
  gene.field,
  group = NULL,
  downsample = FALSE,
  down.ncells = NULL,
  row.names = TRUE
)

## S4 method for signature 'CompressedSplitDataFrameList'
countCellsPerGeneCombo(x, gene.field, cov.field, group = NULL, ...)

`x`	Any data.frame-like object where each row corresponds to a single cell and contains its representative sequence. Rows with any `NA` values in the specified `gene.field` columns are ignored. Alternatively, a SplitDataFrameList where each DataFrame corresponds to a cell and each row in that DataFrame is a sequence in that cell.
`...`	For the generic, further arguments to pass to individual methods. For the `CompressedSplitDataFrameList` method, further arguments to pass to the ANY method.
`gene.field`	Character vector of names of columns of `x` containing the genes of interest (e.g., VDJ components).
`group`	Factor with one entry per cell in `x`, indicating the group to which each cell belongs. If `NULL`, all cells are treated as a single group.
`downsample`	Logical scalar indicating whether downsampling should be performed.
`down.ncells`	Integer scalar indicating the number of cells to downsample each group to. Defaults to the number of cells in the smallest group in `group`.
`row.names`	Logical scalar indicating whether row names should be added by concatenating all gene names per combination.
`cov.field`	String specifying the column of `x` containing the read/UMI coverage.

The aim of this function is to generate a count matrix for use in differential “expression” analyses, i.e., does one particular group of cells express a particular gene combination more frequently than another group? This can be useful to examine the effect of particular experimental conditions or the behavior of different cell states, especially if the specific biological function (e.g., antigen) of each gene combination is known in advance.

If cov.field is set, only the most abundant sequence (i.e., the one with the highest coverage) is used from each cell. In contrast, setting cov.field=NULL counts each sequence separately, such that one cell may contribute multiple times. It is probably safest to set cov.field to some non-NULL value to avoid complications from dependencies between counts, though any such problems are probably minor.
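
The effect of cov.field can be illustrated with a minimal base R sketch. This is not the package's implementation; the toy data, column names and tie-breaking rule (first maximum wins, via which.max) are assumptions made for illustration only.

```r
# Toy data: one row per sequence, with several sequences in some cells.
# All values and column names here are illustrative.
df <- data.frame(
    cell.id = c("A", "A", "B", "C", "C", "C"),
    v_gene  = c("TRAV1", "TRAV2", "TRAV1", "TRAV3", "TRAV2", "TRAV1"),
    umi     = c(5, 2, 1, 3, 7, 4)
)

# Split the row indices by cell, then keep the row with the
# highest coverage in each cell (ties go to the first such row).
by.cell <- split(seq_len(nrow(df)), df$cell.id)
keep <- vapply(by.cell, function(i) i[which.max(df$umi[i])], integer(1))
top <- df[keep, ]

# One row per cell remains; without this filter, all six
# sequences would be counted and cell C would contribute three times.
top
```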

A SummarizedExperiment where each row corresponds to a unique gene combination and each column corresponds to a level of group (or all cells, if group=NULL). The assays contain a single matrix containing the number of cells for each gene combination and grouping level, while the rowData contains information about the gene combination.
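
To give a sense of the layout of the count matrix, the following base R sketch tabulates gene combinations against groups. It only mimics the shape of the assay and is not how the function is implemented; the ":" separator used to build combination names is an assumption, as are the toy values.

```r
# One entry per cell: its V gene, J gene and group (illustrative values).
v <- c("TRAV1", "TRAV1", "TRAV2", "TRAV1", "TRAV2")
j <- c("TRAJ4", "TRAJ4", "TRAJ5", "TRAJ5", "TRAJ5")
group <- factor(c("ctrl", "ctrl", "ctrl", "treat", "treat"))

# Concatenate the gene names per cell to label each combination.
# The ":" separator is an arbitrary choice for this sketch.
combo <- paste(v, j, sep = ":")

# Rows are unique combinations, columns are levels of 'group',
# and each entry is the number of cells.
counts <- unclass(table(combo, group))
counts
```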

Here, expression is defined in terms of number of cells expressing the gene, rather than the more typical quantity of the number of reads or UMIs assigned to that gene. If the sequencing coverage varies between groups, we assume that such changes have the same scaling effect on the probability of detecting each gene combination, which cancels out after normalizing by the total number of cells.
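
The cancellation argument can be checked on a small hypothetical count matrix. In this sketch, which uses made-up numbers, the second group has twice the detection rate of the first for every combination; dividing by the total number of cells per group removes that difference.

```r
# Hypothetical counts of cells per gene combination (rows) and group (columns).
# Group g2 detects every combination at twice the rate of g1.
counts <- matrix(
    c(10, 30, 60,
      20, 60, 120),
    ncol = 2,
    dimnames = list(c("comboA", "comboB", "comboC"), c("g1", "g2"))
)

# Scaling normalization: divide each column by its total number of cells.
props <- sweep(counts, 2, colSums(counts), "/")

# The proportions agree between groups, so the uniform
# scaling effect has cancelled out.
props
```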

However, the above assumption only works for differential expression analyses between groups. When comparing other metrics such as diversity values (see summarizeGeneComboCounts), scaling normalization is not sufficient and we instead resort to downsampling all groups to the same total cell number. This is achieved by setting downsample=TRUE and leaving down.ncells at its default, which is the number of cells in the smallest group. Doing so eliminates uninteresting technical differences between groups due to cell capture efficiency or sample size.
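
As a rough illustration of the downsampling idea, the base R sketch below subsamples every group to the size of the smallest one. It is a sketch under stated assumptions rather than the function's actual code: the group sizes are made up, and sampling is done without replacement.

```r
set.seed(100)

# Hypothetical group assignment for 105 cells.
group <- factor(rep(c("g1", "g2", "g3"), c(50, 20, 35)))

# Default target: the number of cells in the smallest group.
down.ncells <- min(table(group))

# Within each group, randomly keep 'down.ncells' cells without replacement.
# Sampling positions with sample.int() avoids the surprising behavior
# of sample() when a group has a single index.
by.group <- split(seq_along(group), group)
keep <- unlist(lapply(by.group, function(i) {
    i[sample.int(length(i), down.ncells)]
}))

# Every group now contributes the same number of cells.
table(group[keep])
```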

Author(s): Aaron Lun

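library(RepertoireUtils)
library(SummarizedExperiment) # for rowData() and assay()

# Simulate 30 sequences, each assigned at random to one of up to 26 cells.
df <- data.frame(
    cell.id=sample(LETTERS, 30, replace=TRUE),
    v_gene=sample(c("TRAV1", "TRAV2", "TRAV3"), 30, replace=TRUE),
    j_gene=sample(c("TRAJ4", "TRAJ5", "TRAJ6"), 30, replace=TRUE),
    umi=pmax(1, rpois(30, 1))
)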

y <- splitDataFrameByCell(df, field="cell.id")
out <- countCellsPerGeneCombo(y, c("v_gene", "j_gene"), cov.field="umi")
rowData(out)
assay(out)

out2 <- countCellsPerGeneCombo(y, c("v_gene", "j_gene"), cov.field="umi",
   group=sample(10, length(y), replace=TRUE))
rowData(out2)
assay(out2)