countCellsPerGeneCombo: Count gene combinations

Description

Count the number of cells that express each unique combination of genes.

Usage

countCellsPerGeneCombo(x, ...)

## S4 method for signature 'ANY'
countCellsPerGeneCombo(
  x,
  gene.field,
  group = NULL,
  downsample = FALSE,
  down.ncells = NULL,
  row.names = TRUE
)

## S4 method for signature 'CompressedSplitDataFrameList'
countCellsPerGeneCombo(x, gene.field, cov.field, group = NULL, ...)

Arguments

x

Any data.frame-like object where each row corresponds to a single cell and contains its representative sequence. Rows with any NA values in the specified gene.field columns are ignored.

Alternatively, a SplitDataFrameList where each DataFrame corresponds to a cell and each row in that DataFrame is a sequence in that cell.

...

For the generic, further arguments to pass to individual methods.

For the CompressedSplitDataFrameList method, further arguments to pass to the ANY method.

gene.field

Character vector of names of columns of x containing the genes of interest (e.g., VDJ components).

group

Factor of length equal to the number of cells in x, indicating the group to which each cell belongs.

downsample

Logical scalar indicating whether downsampling should be performed.

down.ncells

Integer scalar indicating the number of cells to downsample each group to. Defaults to the number of cells in the smallest group in group.

row.names

Logical scalar indicating whether row names should be added by concatenating all gene names per combination.

cov.field

String specifying the column of x containing the read/UMI coverage.

Details

The aim of this function is to generate a count matrix for use in differential “expression” analyses, i.e., does one particular group of cells express a particular gene combination more frequently than another group? This can be useful to examine the effect of particular experimental conditions or the behavior of different cell states, especially if the specific biological function (e.g., antigen) of each gene combination is known in advance.

If cov.field is set, only the most abundant sequence (i.e., the one with the highest read/UMI coverage) is used from each cell. In contrast, setting cov.field=NULL counts each sequence separately, so that one cell may contribute multiple times. It is probably safest to set cov.field to some non-NULL value to avoid complications from dependencies between counts, though any such problems are probably minor.
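The selection of one sequence per cell can be pictured with a minimal base R sketch. This is only an illustration under assumptions: the toy data frame seqs and its column names are hypothetical, and the way ties in coverage are resolved here (first row wins) is not necessarily how the package handles them.

```r
# Hypothetical toy data: one row per sequence, several sequences per cell.
seqs <- data.frame(
    cell.id = c("A", "A", "B", "C", "C", "C"),
    v_gene  = c("TRAV1", "TRAV2", "TRAV1", "TRAV3", "TRAV1", "TRAV2"),
    umi     = c(5, 2, 1, 3, 7, 4)
)

# Sort by decreasing coverage, then keep the first row seen for each cell,
# i.e., the sequence with the highest UMI count in that cell.
by.cov <- seqs[order(seqs$umi, decreasing = TRUE), ]
top <- by.cov[!duplicated(by.cov$cell.id), ]

# Restore a stable cell order for readability.
top <- top[order(top$cell.id), ]
top
```

After this step each cell contributes exactly one row, so the downstream counts are counts of cells rather than counts of sequences.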

Value

A SummarizedExperiment where each row corresponds to a unique gene combination and each column corresponds to a level of group (or to all cells, if group=NULL). The assays hold a single matrix with the number of cells for each gene combination and grouping level, while the rowData contains information about each gene combination.
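To make the shape of the count matrix concrete, here is a rough base R sketch that tabulates gene combinations against groups. It is not the package's implementation: the toy data, the group labels and the ":" separator used to label each combination are all assumptions made for illustration.

```r
# Hypothetical toy data: one row per cell, already reduced to one sequence each.
genes <- data.frame(
    v_gene = c("TRAV1", "TRAV1", "TRAV2", "TRAV1"),
    j_gene = c("TRAJ4", "TRAJ4", "TRAJ5", "TRAJ5")
)
group <- factor(c("ctrl", "treat", "treat", "ctrl"))

# Label each cell by its gene combination (the ":" separator is an assumption).
combo <- paste(genes$v_gene, genes$j_gene, sep = ":")

# Cross-tabulate: rows are unique combinations, columns are group levels.
counts <- unclass(table(combo, group))
counts
```

Each entry is the number of cells in that group carrying that combination, which is the quantity stored in the assay matrix described above.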

Normalization for cell number

Here, expression is defined in terms of the number of cells expressing a gene combination, rather than the more typical number of reads or UMIs assigned to a gene. If the sequencing coverage varies between groups, we assume that such changes have the same scaling effect on the probability of detecting each gene combination, which cancels out after normalizing by the total number of cells.
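The cancellation argument can be checked on a small made-up example. In the sketch below, the counts matrix and its "deep" and "shallow" column names are hypothetical; the second group detects every combination at one fifth of the rate of the first, and dividing by the column totals removes that common factor.

```r
# Hypothetical counts: the "shallow" group detects each combination at
# one fifth of the rate of the "deep" group.
counts <- matrix(
    c(30, 10, 10,
       6,  2,  2),
    ncol = 2,
    dimnames = list(c("combo1", "combo2", "combo3"), c("deep", "shallow"))
)

# Scaling normalization: divide each column by its total number of cells.
props <- sweep(counts, 2, colSums(counts), "/")
props
```

Both columns of props are identical, so a comparison between groups based on these proportions is unaffected by the uniform difference in detection rate.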

However, the above assumption only works for differential expression analyses between groups. When comparing other metrics such as diversity values (see summarizeGeneComboCounts), scaling normalization is not sufficient and we instead resort to downsampling all groups to the same total cell number. This is achieved by setting downsample=TRUE, with down.ncells determined automatically by default, which eliminates uninteresting technical differences between groups caused by cell capture efficiency or sample size.
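As a rough picture of what downsampling to a common cell number involves, the following base R sketch draws the same number of cells from every group without replacement. It is an illustration under assumptions, not the package's own procedure: the toy group factor is made up, and only the default for down.ncells (the size of the smallest group) is taken from the argument description.

```r
set.seed(100)

# Hypothetical grouping with unequal numbers of cells per group.
group <- factor(rep(c("a", "b", "c"), c(10, 4, 7)))

# Default target: the number of cells in the smallest group.
down.ncells <- min(table(group))

# For each group, randomly keep 'down.ncells' cells without replacement.
# sample.int() is applied to positions to avoid surprises with sample()
# when a group contains a single index.
keep <- unlist(lapply(split(seq_along(group), group), function(idx) {
    idx[sample.int(length(idx), down.ncells)]
}), use.names = FALSE)

table(group[keep])
```

After subsetting to keep, every group contributes the same number of cells, so metrics that depend on the total cell number become comparable across groups.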

Author(s)

Aaron Lun

Examples

df <- data.frame(
    cell.id=sample(LETTERS, 30, replace=TRUE),
    v_gene=sample(c("TRAV1", "TRAV2", "TRAV3"), 30, replace=TRUE),
    j_gene=sample(c("TRAJ4", "TRAJ5", "TRAJ6"), 30, replace=TRUE),
    umi=pmax(1, rpois(30, 1))
)

y <- splitDataFrameByCell(df, field="cell.id")
out <- countCellsPerGeneCombo(y, c("v_gene", "j_gene"), cov.field="umi")
rowData(out)
assay(out)

out2 <- countCellsPerGeneCombo(y, c("v_gene", "j_gene"), cov.field="umi",
   group=sample(10, length(y), replace=TRUE))
rowData(out2)
assay(out2)

LTLA/RepertoireUtils documentation built on Feb. 9, 2020, 12:51 p.m.