In rnabioco/djvdj: A collection of single-cell V(D)J tools

# Chunk opts
knitr::opts_chunk$set(
  collapse   = TRUE,
  comment    = "#>",
  warning    = FALSE,
  message    = FALSE,
  fig.width  = 8,
  fig.height = 4
)

This vignette provides detailed examples for quantifying differences in clonotype frequencies. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.

library(djvdj)
library(Seurat)
library(ggplot2)

# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")

gex_dirs <- c(
  BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
  MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)

so <- gex_dirs |>
  Read10X() |>
  CreateSeuratObject() |>
  AddMetaData(splen_meta)

# Add V(D)J data to object
vdj_dirs <- c(
  BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
  MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)

so <- so |>
  import_vdj(vdj_dirs, define_clonotypes = "cdr3_gene")

Calculating clonotype frequencies

To quantify clonotype frequencies and store the results in the object meta.data, use the calc_frequency() function. This adds columns showing the number of occurrences of each clonotype ('freq'), the percentage of cells sharing the clonotype ('pct'), and a label that can be used for plotting ('grp'). By default, these calculations are performed for all cells in the object.

so_vdj <- so |>
  calc_frequency(data_col = "clonotype_id")
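To make the 'freq' and 'pct' columns concrete, the following base R sketch reproduces the idea on a toy data frame. It is an illustration only, not the djvdj implementation: the toy data are made up, and the choice to compute percentages over cells that have a clonotype is an assumption.

```r
# Toy metadata: one row per cell; NA marks a cell without V(D)J data
meta <- data.frame(
  cell         = paste0("cell", 1:6),
  clonotype_id = c("clonotype1", "clonotype1", "clonotype1",
                   "clonotype2", "clonotype3", NA)
)

# Count how many cells carry each clonotype (table() drops NAs)
counts <- table(meta$clonotype_id)

# Map the per-clonotype counts back onto each cell
meta$freq <- as.vector(counts)[match(meta$clonotype_id, names(counts))]

# Percentage of cells sharing the clonotype
# (assumption: the denominator is the number of cells with V(D)J data)
meta$pct <- 100 * meta$freq / sum(counts)

meta
```

Cells without a clonotype keep NA in both columns, so they can be told apart from clonotypes observed only once.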

To calculate clonotype frequencies separately for samples or clusters, pass the name of the meta.data column containing the sample or cluster IDs to the cluster_col argument.

so_vdj <- so |>
  calc_frequency(
    data_col = "clonotype_id",
    cluster_col = "sample"
  )

When cluster_col is specified, an additional meta.data column ('shared') will be added indicating whether the clonotype is shared between multiple clusters.

so_vdj |>
  slot("meta.data") |>
  head(2)
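The per-cluster calculation and the 'shared' flag can be sketched in base R as follows. The sample names and clonotypes are made up for illustration, and the logic is a plausible reading of the description above rather than the code djvdj actually runs.

```r
# Toy metadata with two samples
meta <- data.frame(
  sample       = c("BL6-1", "BL6-1", "BL6-1", "MD4-1", "MD4-1", "MD4-1"),
  clonotype_id = c("clonotype1", "clonotype2", "clonotype2",
                   "clonotype2", "clonotype3", "clonotype3")
)

# Number of cells sharing each clonotype within each sample
meta$freq <- ave(
  seq_len(nrow(meta)), meta$sample, meta$clonotype_id,
  FUN = length
)

# Percentage of cells in the sample that share the clonotype
n_cells  <- ave(seq_len(nrow(meta)), meta$sample, FUN = length)
meta$pct <- 100 * meta$freq / n_cells

# A clonotype counts as shared when it appears in more than one sample
n_samples   <- tapply(meta$sample, meta$clonotype_id,
                      function(x) length(unique(x)))
meta$shared <- as.vector(n_samples[meta$clonotype_id] > 1)

meta
```

Here only clonotype2 is found in both samples, so it is the only clonotype flagged as shared.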

Plotting clonotype frequencies

djvdj includes the plot_clone_frequency() function to visualize differences in clonotype frequency between samples or clusters. By default, this will produce bar graphs. Plot colors can be adjusted using the plot_colors argument.

so |>
  plot_clone_frequency(
    data_col = "clonotype_id",
    plot_colors = "#3182bd"
  )

Frequencies can be calculated and plotted separately for each sample or cluster using the cluster_col argument. The panel_nrow argument adjusts the number of rows used to arrange the plots, and the panel_scales argument can be used to add separate scales for each sample.

As expected, we see that most MD4 B cells share the same clonotype, while BL6 cells have a diverse repertoire.

so |>
  plot_clone_frequency(
    data_col     = "clonotype_id",
    cluster_col  = "orig.ident",
    panel_scales = "free"
  )

Rank-abundance plots can also be generated by setting the method argument to 'line'. Most djvdj plotting functions return ggplot objects that can be further modified with ggplot2 functions. Here we further modify plot aesthetics using the ggplot2::theme() function. Most djvdj plotting functions also include the ability to transform the axis using the trans argument.

so |>
  plot_clone_frequency(
    data_col    = "clonotype_id",
    cluster_col = "orig.ident",
    method      = "line",
    plot_colors = c(MD4 = "#fec44f", BL6 = "#3182bd"),
    trans       = "log10"         # log-transform axis
  ) +
  theme(aspect.ratio = 0.8)
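For readers unfamiliar with rank-abundance curves, this base R sketch shows the underlying table: clonotypes are ordered from most to least abundant, and abundance is then plotted against rank. The counts are hypothetical, and the log10 column only mimics the effect of a log-scaled axis.

```r
# Hypothetical clonotype counts for a single sample
counts <- c(clonotype1 = 3, clonotype2 = 120, clonotype3 = 15,
            clonotype4 = 1, clonotype5 = 15)

# Order clonotypes from most to least abundant
abund <- sort(counts, decreasing = TRUE)

rank_abund <- data.frame(
  clonotype = names(abund),
  rank      = seq_along(abund),
  pct       = 100 * as.vector(abund) / sum(abund),
  row.names = NULL
)

# Values as they would be positioned on a log10-scaled axis
rank_abund$log10_pct <- log10(rank_abund$pct)

rank_abund
```

A monoclonal sample gives a curve that drops steeply after the first rank, while a diverse repertoire gives a flat curve.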

UMAP projections

By default, calc_frequency() will divide clonotypes into groups based on frequency and add a column to the meta.data containing these group labels. Clonotype frequencies can be summarized on a UMAP projection by plotting the added 'grp' column using the generic plotting function plot_scatter().

Cells that lack BCR data will be plotted as NAs. The color of these points can be adjusted using the na_color argument.

# Create UMAP summarizing samples
mouse_gg <- so |>
  plot_scatter(data_col = "orig.ident")

# Create UMAP summarizing clonotype frequencies
abun_gg <- so |>
  calc_frequency(
    data_col = "clonotype_id",
    cluster_col = "sample"
  ) |>
  plot_scatter(data_col = "clonotype_id_grp")

mouse_gg + abun_gg
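Grouping clonotypes by frequency amounts to binning the counts. The sketch below uses base R's cut() with hypothetical breakpoints and labels; the groups that calc_frequency() actually chooses may differ.

```r
# Hypothetical clonotype frequencies, one value per cell
freq <- c(1, 1, 2, 4, 9, 35, 120)

# Bin frequencies into labeled groups
# (assumption: these breakpoints are for illustration only)
grp <- cut(
  freq,
  breaks = c(0, 1, 5, 20, Inf),
  labels = c("1", "2-5", "6-20", ">20")
)

table(grp)
```

Because the result is a factor with ordered levels, it maps naturally onto a discrete color scale when plotted.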

Highly abundant clonotypes can also be specifically labeled on a UMAP projection. To do this, pass a vector of top clonotypes to highlight to the top argument of plot_scatter().

top_gg <- so |>
  plot_scatter(
    data_col    = "clonotype_id",
    top         = "clonotype56",
    plot_colors = c(other = "#fec44f", clonotype56 = "#3182bd")
  )

mouse_gg + top_gg
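Conceptually, highlighting top clonotypes means keeping their labels and collapsing everything else into a single group. The base R sketch below illustrates that relabeling on made-up data. The 'other' label matches the name used in plot_colors above, but the handling of NA values is an assumption.

```r
# Made-up clonotype labels; NA marks a cell without V(D)J data
clonotype_id <- c("clonotype56", "clonotype12", "clonotype56", NA, "clonotype7")
top          <- "clonotype56"

# Keep the labels for top clonotypes and collapse the rest into "other"
highlight <- ifelse(clonotype_id %in% top, clonotype_id, "other")

# Assumption: cells without V(D)J data stay NA rather than becoming "other"
highlight[is.na(clonotype_id)] <- NA

highlight
```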

Other frequency calculations

In addition to clonotype abundance, calc_frequency() can be used to summarize the frequency of any cell label present in the object. In this example we count the number of cells present for each cell type in each sample.

so_vdj <- so |>
  calc_frequency(
    data_col = "cell_type",
    cluster_col = "sample"
  )

To plot the fraction of cells present for each cell type, we can use the generic plotting function plot_frequency(). This will create stacked bar graphs summarizing each cell label present in the data_col column. The color of each group can be specified with the plot_colors argument.

so |>
  plot_frequency(
    data_col    = "cell_type",
    cluster_col = "sample",
    plot_colors = c("#3182bd", "#fec44f")
  )

To summarize the number of cells present for each cell type, set the units argument to 'frequency'. To create grouped bar graphs, set the stack argument to FALSE.

so |>
  plot_frequency(
    data_col    = "cell_type",
    cluster_col = "sample",
    units       = "frequency",
    stack       = FALSE
  )
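The difference between plotting counts and plotting fractions can be seen with a small contingency table in base R. The samples and cell types are invented for illustration, and this is not how plot_frequency() computes its values internally.

```r
# Toy metadata: one row per cell
meta <- data.frame(
  sample    = c("BL6-1", "BL6-1", "BL6-1", "BL6-1", "MD4-1", "MD4-1"),
  cell_type = c("B cells", "B cells", "B cells", "T cells",
                "B cells", "T cells")
)

# Number of cells for each cell type in each sample
counts <- table(meta$cell_type, meta$sample)

# Percentage of each sample made up by each cell type (columns sum to 100)
pct <- 100 * prop.table(counts, margin = 2)

counts
pct
```

Counts depend on how many cells were captured per sample, so percentages are usually the better choice when samples differ in size.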

Clusters can also be grouped based on an additional variable such as treatment group (e.g. placebo vs drug) or disease status (e.g. healthy vs disease). This will generate bar graphs (or boxplots) showing the mean and standard deviation for each group. In this example, we compare the 3 BL6 and 3 MD4 samples. Note that one group is labeled NA; these are cells that lacked V(D)J data and thus did not have an assigned isotype.

so |>
  plot_frequency(
    data_col    = "isotype",
    cluster_col = "sample",
    group_col   = "orig.ident",
    plot_colors = c(MD4 = "#fec44f", BL6 = "#3182bd")
  )
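The grouped summary described above reduces to a per-group mean and standard deviation over the samples. A minimal base R sketch, using hypothetical percentages for one isotype:

```r
# Hypothetical percentage of cells with one isotype in each sample
pcts <- data.frame(
  sample = c("BL6-1", "BL6-2", "BL6-3", "MD4-1", "MD4-2", "MD4-3"),
  group  = rep(c("BL6", "MD4"), each = 3),
  pct    = c(62, 58, 66, 91, 95, 93)
)

# Mean and standard deviation across the samples in each group
grp_mean <- tapply(pcts$pct, pcts$group, mean)
grp_sd   <- tapply(pcts$pct, pcts$group, sd)

stats <- data.frame(
  group = names(grp_mean),
  mean  = as.vector(grp_mean),
  sd    = as.vector(grp_sd)
)

stats
```

Each sample contributes one value per group, which is why the cluster_col values need to identify individual samples rather than whole groups.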

p-values

p-values can be calculated and shown on plots generated by plot_frequency() and plot_gene_usage(). To do this, you must pass a grouping variable to the group_col argument, which is used to group the clusters found in cluster_col. This is best used when you have a set of samples that can be divided into distinct groups. The cluster names should be unique for each treatment group, e.g. healthy: healthy-1, healthy-2; disease: disease-1, disease-2.

The method used to calculate p-values can be specified with the p_method argument. By default, a t-test will be performed; if more than two groups are compared, the Kruskal-Wallis test will be used. A summary table of the calculated p-values can also be saved by passing a path to the p_file argument.

The p_label argument can be used to modify which p-values are shown on the plot. By default, only significant p-values are shown. In this example we display all p-values calculated using a t-test.

so |>
  plot_frequency(
    data_col    = "isotype",
    cluster_col = "sample",
    group_col   = "orig.ident",
    p_label     = "all",
    p_method    = "t"
  )
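The two tests named above are available in base R's stats package, so the comparison can be sketched outside of djvdj. The percentages and the third group ("KO") are hypothetical, and the exact settings djvdj uses (for example Welch versus pooled variance) are not stated here, so treat this as an approximation of the idea.

```r
# Hypothetical per-sample percentages for two groups
pcts <- data.frame(
  group = rep(c("BL6", "MD4"), each = 3),
  pct   = c(62, 58, 66, 91, 95, 93)
)

# Two groups: t-test (R defaults to the Welch variant)
tt <- t.test(pct ~ group, data = pcts)
tt$p.value

# More than two groups: Kruskal-Wallis rank sum test
# ("KO" is a made-up third group for illustration)
pcts3 <- rbind(pcts, data.frame(group = "KO", pct = c(70, 74, 78)))
kw    <- kruskal.test(pct ~ group, data = pcts3)
kw$p.value
```

With only three samples per group, both tests have limited power, so non-significant results should be interpreted with care.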

Custom labels for different p-value cutoffs can be specified by passing a named vector to the p_label argument. To display the actual p-value when it is below a certain threshold, use the keyword 'value'. Symbols can also be displayed by including the Unicode escape code for the symbol. In this example, p-values <0.05 are displayed as numbers, p-values <0.1 are labeled with a soccer ball, and all others are labeled 'ns'.

so |>
  plot_frequency(
    data_col    = "isotype",
    cluster_col = "sample",
    group_col   = "orig.ident",
    p_label     = c(value = 0.05, "\\u26BD" = 0.1, ns = Inf)
  )
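One way to read a named cutoff vector is as a lookup in which each p-value receives the label of the smallest cutoff it falls below. The helper below, label_p(), is a hypothetical function written for this illustration and is not part of djvdj. The strict less-than comparison and the rounding to two significant digits are also assumptions.

```r
# Cutoffs named by the label to display; "*" stands in for any symbol
p_label <- c(value = 0.05, "*" = 0.1, ns = Inf)

# Hypothetical helper: return the label for a single p-value
label_p <- function(p, cutoffs) {
  cutoffs <- sort(cutoffs)

  # Label of the smallest cutoff that the p-value falls below
  lab <- names(cutoffs)[which(p < cutoffs)[1]]

  # The keyword "value" means the p-value itself is printed
  if (lab == "value") lab <- format(signif(p, 2))

  lab
}

labs <- vapply(c(0.012, 0.08, 0.4), label_p, character(1), cutoffs = p_label)
labs
```

Ending the vector with a cutoff of Inf guarantees that every p-value matches some label.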

Label aesthetics can be modified by passing a named list of aesthetic parameters to the label_params argument. These parameters will also modify the n-label. To modify only the p-value label, prefix each parameter with 'p.', e.g. p.size = 14.

so |>
  plot_frequency(
    data_col     = "isotype",
    cluster_col  = "sample",
    group_col    = "orig.ident",
    n_label      = "corner",
    label_params = list(p.color = "red")
  )