In rnabioco/djvdj: A collection of single-cell V(D)J tools

# Chunk opts
knitr::opts_chunk$set(
  collapse   = TRUE,
  comment    = "#>",
  warning    = FALSE,
  message    = FALSE,
  fig.width  = 8,
  fig.height = 3
)

This vignette provides detailed examples for calculating and visualizing V(D)J gene usage. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.

library(djvdj)
library(Seurat)
library(ggplot2)
library(RColorBrewer)

# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")

gex_dirs <- c(
  BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
  MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)

so <- gex_dirs |>
  Read10X() |>
  CreateSeuratObject() |>
  AddMetaData(splen_meta)

# Add V(D)J data to object
vdj_dirs <- c(
  BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
  MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)

so <- so |>
  import_vdj(vdj_dirs, define_clonotypes = "cdr3_gene")

Calculating gene usage

The calc_gene_usage() function will calculate the number of cells ('freq') and percentage of cells ('pct') with each gene in the data_cols column(s). The 'n_cells' column shows the total number of cells used for calculating percentages. By default these results are added to the object meta.data, to return a data.frame set return_df to TRUE.

so |>
  calc_gene_usage(
    data_cols = "v_gene",
    return_df = TRUE
  )

To perform gene usage calculations separately for cell clusters (or samples), provide the meta.data column containing cluster labels to the cluster_col argument. Here we see that the MD4 samples almost exclusively use a single V segment (IGKV5-43), which is expected since MD4 B cells are monoclonal.

so |>
  calc_gene_usage(
    data_cols   = "v_gene",
    cluster_col = "sample",
    return_df   = TRUE
  )

To only perform calculations for a specific chain, use the chain argument. In this example we are only returning results for the IGK chain. Here we see some values in the 'v_gene' column labeled as 'None', this shows the number of cells that did not have a V gene segment identified.

so |>
  calc_gene_usage(
    data_cols   = "v_gene",
    cluster_col = "sample",
    chain       = "IGK",
    return_df   = TRUE
  )

If two columns are provided to the data_cols argument, the number of cells containing each combination of genes is returned.

so |>
  calc_gene_usage(
    data_cols   = c("v_gene", "j_gene"),
    cluster_col = "sample",
    return_df   = TRUE
  )

Plotting gene usage

The plot_gene_usage() function will summarize the frequency of each gene segment. By default if a single column is passed to the data_cols argument, a bargraph will be returned. The number of top genes to include in the plot can be specified with the genes argument.

so |>
  plot_gene_usage(
    data_cols   = "v_gene",
    cluster_col = "orig.ident",
    genes       = 20
  )

By default, percentages are shown on the y-axis, to instead plot the frequency, set the units argument to 'frequency'.

so |>
  plot_gene_usage(
    data_cols   = "v_gene",
    cluster_col = "orig.ident",
    units       = "frequency"
  )

Plot colors can be adjusted using the plot_colors argument. In addition, plot_gene_usage() returns a ggplot object that can be modified with ggplot2 functions such as ggplot2::theme(). Plots can be further adjusted by passing aesthetic parameters directly to ggplot2, e.g. alpha, linetype, color, etc.

so |>
  plot_gene_usage(
    data_cols   = "v_gene",
    cluster_col = "orig.ident",
    plot_colors = c(BL6 = "#3288BD", MD4 = "#D53E4F"),

    color = "black",  # parameters to pass to ggplot2
    alpha = 0.7
  ) +
  theme(axis.text.x = element_text(angle = 90))

If two columns are passed to the data_cols argument, a heatmap will be generated summarizing the usage of different pairs of segments. If a column is provided to the cluster_col argument, a separate heatmap will be generated for each cluster.

In this example we are plotting the frequency that different heavy chain V and J segments appear together.

so |>
  plot_gene_usage(
    data_cols   = c("v_gene", "j_gene"),
    cluster_col = "orig.ident",
    chain       = "IGH",
    genes       = 15
  )

The paired gene usage for two chains can also be plotted using plot_gene_pairs(). In this example we are plotting the frequency that different heavy and light chain V segments appear together.

so |>
  plot_gene_pairs(
    data_col    = "v_gene",
    chains      = c("IGH", "IGK"),
    cluster_col = "orig.ident",
    genes       = 12
  )

Circos plot

A circos plot can be created by setting the method argument to 'circos'. This plot will summarize the number of cells containing different gene pairs, which is shown as the axis labels for each sample. This requires the circlize package to be installed.

In this example, we are summarizing the segment usage for the entire dataset (BL6 and MD4 cells combined). The cluster_col argument can be used to create a separate plot for each sample. Labels can be rotated to eliminate overlapping text using the rotate_labels argument.

so |>
  plot_gene_usage(
    data_cols     = c("v_gene", "j_gene"),
    method        = "circos",
    genes         = 6,
    rotate_labels = TRUE
  )

Plot colors can be modified using the plot_colors argument, additional parameters can be passed directly to circlize::chordDiagram(). In this example we add a border around the links and scale the plot so each sample is the same width.

so |>
  plot_gene_usage(
    data_cols   = c("v_gene", "j_gene"),
    method      = "circos",
    genes       = 8,
    plot_colors = brewer.pal(10, "Spectral"),

    link.border = "black",  # parameters to pass to chordDiagram()
    scale       = TRUE
  )

Gene segment usage can be plotted separately for cell clusters (or samples) using the cluster_col argument. The number of rows used to arrange plots can be modified using the panel_nrow argument.

so |>
  plot_gene_usage(
    data_cols   = c("v_gene", "j_gene"),
    method      = "circos",
    cluster_col = "sample",
    genes       = 5,
    panel_nrow  = 2,
    scale       = TRUE
  )