Gene usage analysis

# knitr::knit_hooks$set(optipng = knitr::hook_optipng)
# knitr::opts_chunk$set(optipng = '-o7')

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.align = "center")
knitr::opts_chunk$set(fig.width = 12)
knitr::opts_chunk$set(fig.height = 6)

library(immunarch)
# source("../R/testing.R")
# immdata = load_test_data()
data(immdata)

Gene usage computation

immunarch comes with a gene segments data table containing known gene segments for several species following the IMGT nomenclature. In order to get the current statistics of genes, call the gene_stats() function:

gene_stats()

To compute the distribution of genes, immunarch includes the geneUsage function. It receives a repertoire or a list of repertoires as input and genes and species for which you want to get the statistics. E.g., if you plan to use TRBV genes of Homo Sapiens, you need to use the hs.trbv string in the function, where hs comes from the alias column and trbv is the gene name. In case you plan to use IGHJ genes of Mus Musculus, you need to use musmus.ighj:

# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu <- geneUsage(immdata$data, "hs.trbv")
# imm_gu = geneUsage(immdata$data, "HomoSapiens.trbv")
# imm_gu = geneUsage(immdata$data, "hs.TRBV")
# imm_gu = geneUsage(immdata$data, "HomoSapiens.TRBV")

imm_gu

Gene distributions could be computed either using counts of individual clonotypes (.quant = "count") or not using them (.quant = NA).

In order to compute allele-level or family-level distributions, change the .type parameter.

Parameter .norm controls whether immunarch will normalise the data to ensure the sum of all frequencies to be equal 1 or not.

You can visualise the histogram of gene usage in different ways:

# Compute the distribution of the first two samples
imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)

vis(imm_gu)
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)

vis(imm_gu, .by = "Status", .meta = immdata$meta)
vis(imm_gu, .grid = T)

Another practical approach to the visualisation of group distributions are box plots:

vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")

Ambiguity of gene segment names

Due to the ambiguity of gene alignments for some clonotypes, geneUsage has the following options to deal with ambiguous data:

Gene usage analysis

To analyse the gene usage immunarch introduces the geneUsageAnalysis function. The .method parameter controls how the data is going to be preprocessed and analysed. geneUsageAnalysis includes following methods for preprocessing:

And a few methods for the actual analysis:

You can call several methods in a single line of code, which is probably the most powerful feature of the package. For instance, "js+hclust" first computes Jensen-Shannon divergence and then applies hierarchical clustering on the resulting distance matrix, whereas "anova" computes ANOVA on each gene separately after repertoires have been grouped:

imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)

imm_gu_js <- geneUsageAnalysis(imm_gu, .method = "js", .verbose = F)
imm_gu_cor <- geneUsageAnalysis(imm_gu, .method = "cor", .verbose = F)

p1 <- vis(imm_gu_js, .title = "Gene usage JS-divergence", .leg.title = "JS", .text.size = 1.5)
p2 <- vis(imm_gu_cor, .title = "Gene usage correlation", .leg.title = "Cor", .text.size = 1.5)

p1 + p2

Now let us visualise the output after both preprocessing and analysis:

imm_gu_js[is.na(imm_gu_js)] <- 0

vis(geneUsageAnalysis(imm_gu, "cosine+hclust", .verbose = F))

# vis(geneUsageAnalysis(imm_gu, "js+dbscan", .verbose = F))

On top of that you can add clustering:

imm_cl_pca <- geneUsageAnalysis(imm_gu, "js+pca+kmeans", .verbose = F)
imm_cl_mds <- geneUsageAnalysis(imm_gu, "js+mds+kmeans", .verbose = F)
imm_cl_tsne <- geneUsageAnalysis(imm_gu, "js+tsne+kmeans", .perp = .01, .verbose = F)

p1 <- vis(imm_cl_pca, .plot = "clust")
p2 <- vis(imm_cl_mds, .plot = "clust")
p3 <- vis(imm_cl_tsne, .plot = "clust")
p1 + p2 + p3

You can regulate the number of clusters as well:

imm_cl_pca2 <- geneUsageAnalysis(imm_gu, "js+pca+kmeans", .k = 3, .verbose = F)
vis(imm_cl_pca2)

Spectratyping

Spectratype is a useful way to represent distributions of genes per sequence length. Parameter .quant controls the quantity that used to compute proportions of genes - either by clonotype (id) or by number of clones per clonotype (count). Parameter .col controls which column to choose, e.g., "nt" for lengths of CDR3 nucleotide sequences only (without grouping by gene segments), "aa+v" for lengths of CDR3 amino acid sequences (grouped by V gene segments).

p1 <- vis(spectratype(immdata$data[[1]], .quant = "id", .col = "nt"))
p2 <- vis(spectratype(immdata$data[[1]], .quant = "count", .col = "aa+v"))

p1 + p2


Try the immunarch package in your browser

Any scripts or data that you put into this service are public.

immunarch documentation built on Dec. 28, 2022, 2:59 a.m.