abseqR: reporting and data analysis functionalities for Rep-Seq datasets of antibody libraries
In abseqR: Reporting and data analysis functionalities for Rep-Seq datasets of antibody libraries

library(abseqR)
library(png)
library(plotly)
library(grid)
library(gridExtra)
library(BiocStyle)
# knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(message = FALSE)

sabseqR <- function() { return(Biocpkg("abseqR")); }
#sabseqR <- function() { return("`abseqR`"); }
sabseqPy <- function() { return("[abseqPy](https://github.com/malhamdoosh/abseqPy)"); }
sabseq <- function() { return("`AbSeq`"); }
read_png <- function(path) {
    if (file.exists(path)) {
        #rasterGrob(as.raster(readPNG(path)), interpolate = TRUE)
        # in windows, the path is \ but LaTEX does not appreciate these slashes
        knitr::include_graphics(gsub("\\", "/",
                                     normalizePath(path),
                                     fixed = TRUE))
    } else {
        stop("File at ", path, " not found, fatal.")
    }
}

Introduction

A plethora of high throughput sequencing (HTS) analysis pipelines are available as open source tools to analyze and validate the quality of Rep-seq [^Repseq] datasets. OmicTools provides a summary of repertoire sequencing tools that implements different techniques and algorithms in analyzing and visualizing datasets from B-cell receptors (BCR) and T-cell receptors (TCR). However, high throughput analysis pipelines of antibody library sequencing datasets are scarce.

r sabseq() is a comprehensive bioinformatic pipeline for the analysis of sequencing datasets generated from antibody libraries and r sabseqR() is one of its packages. The r sabseq() suite is implemented as a set of functions in R and Python that can be used together to provide insights into the quality of antibody libraries. r sabseqPy() processes paired-end or single-end FASTA/FASTQ files generated from NGS sequencers and converts them into CSV and HDF files. r sabseqR() visualizes the output of r sabseqPy() and generates a self-contained HTML report. Furthermore, r sabseqR() provides additional functionalities to explicitly compare multiple samples and perform further downstream analyses.

r sabseqR() provides the following functionalities:

Visualizations: the output from r sabseqPy() is summarized succintly into static and interactive plots. The plots are also stored in Rdata object files that provide flexibility for users to easily customize the aesthetics of any plot generated by r sabseqR().
Interactive reports: the plots generated by r sabseqR() can be collated and presented in a self-contained HTML document for convenience and ease of sharing.
Sample comparison: r sabseqR() overloads the + operator via the S4 classes AbSeqCRep and AbSeqRep to compare multiple samples with each other. The comparative reports include additional plots, for example, sample similarity clustering, overlapping clonotypes, etc. The usual plots are also generated for all the compared samples by adding an extra layer of aesthetic.

AbSeq core analyses

r sabseq() includes, but is not limited to, merging paired-end reads, annotating V-(D)-J germlines, calculating unique clonotypes, analyzing primer specificity, facilitating the selection of best restriction enzymes, predicting frameshifts, identifying functional clones, and calculating diversity indices and estimations. These analyses are seamlessly extrapolated to analyze multiple library repertoires simultaneously when multiple samples are present. Figure \@ref(fig:abseq-wf) depicts the complete r sabseq() workflow. Sequencing files are taken as input to be annotated and analyzed by r sabseqPy() before they are further analyzed and visualized by r sabseqR().

read_png("abseq_workflow_MA.jpg")

Installation

r sabseqR() can be installed from bioconductor.org or its GitHub repository at https://github.com/malhamdoosh/abseqR.

Bioconductor

To install r sabseqR() via the BiocManager, type in R console:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("abseqR")

GitHub

To install the development version of r sabseqR() from GitHub, type in R console:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("malhamdoosh/abseqR")

System prerequisites

r sabseqR() requires pandoc version 1.19.2.1 or higher to render the HTML reports. If pandoc cannot be detected while executing r sabseqR(), the HTML report will not be generated. r sabseqR() is a cross-platform library and will work on any major operating system [^windows_biocparallel].

R package dependencies

r sabseqR() depends on several packages from the CRAN and Bioconductor repositories:

r CRANpkg("RColorBrewer") provides colour schemes for maps and graphics. To install it, type in R console install.packages("RColorBrewer")
r CRANpkg("VennDiagram") provides a set of functions to generate Venn diagrams. To install it, type in R console install.packages("VennDiagram")
r CRANpkg("circlize") is a visualization tool used to summarize the distributions of associations between V-J gene segments. To install it, type in R console install.packages("circlize")
r CRANpkg("flexdashboard") is a package that provides a template for RMarkdown that resembles a grid oriented dashboard and is used to generate the HTML reports. To install it, type in R console install.packages("flexdashboard")
r CRANpkg("ggplot2") is an implementation of the "Grammar of Graphics" in R. It is used extensively to generate plots. To install it, type in R console install.packages("ggplot2")
r CRANpkg("ggcorrplot") is used to visualize a correlation matrix using r CRANpkg("ggplot2"). To install it, type in R console install.packages("ggcorrplot")
r CRANpkg("ggdendro") provides a set of tools for drawing dendrograms and tree plots using r CRANpkg("ggplot2"). To install it, type install.packages("ggdendro")
r CRANpkg("grid") is used to arrange plots. It has been integrated into the base R package.
r CRANpkg("gridExtra") provides functions to work with "grid" graphics and used to arrange grid-based plots in the HTML reports. To install it, type in R console install.packages("gridExtra")
r CRANpkg("knitr") provides the capability to dynamically generate reports in R. To install it, type in R console install.packages("knitr")
r CRANpkg("plotly") is used to translate r CRANpkg("ggplot2") graphs to interactive web-based plots. To install it, type in R console install.packages("plotly")
r CRANpkg("plyr") offers a set of tools used in this package to apply operations on subsets of data in manageable pieces. To install it, type in R console install.packages("plyr")
r CRANpkg("png") is used to read and display PNG images. To install it, type in R console install.packages("png")
r CRANpkg("reshape2") allows this package to restructure and aggregate dataframes. To install it, type in R console install.packages("reshape2")
r CRANpkg("rmarkdown") converts R Markdown documents into a variety of formats. To install it, type in R console install.packages("rmarkdown")
r CRANpkg("vegan") provides a suite of functions to calculate diversity and distance statistics between repertoires. To install it, type in R console install.packages("vegan")
r Biocpkg("BiocParallel") is a package from Bioconductor used to enable parallel computing. To install it, type in R console

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("BiocParallel")

Quick start

To leverage all the functionalities provided by r sabseqR(), the main functions to note are abseqR::abseqReport, abseqR::report, and +. This section uses a small simulated dataset to walk through the use cases of each function.

Datasets

The example dataset is packaged with r sabseqR(). For the sake of brevity, the dataset generation is described under the Appendices section.

Briefly, the dataset includes three samples, namely PCR1, PCR2, and PCR3, that was generated using in silico simulations. r sabseqPy() was then used to analyze the datasets and the output directory argument --outdir specified in r sabseqPy() was initiated with the value "./ex/".

read_png("simulation_flowchart_full.jpg")

Fetch data files {#obtain-dataset}

The output of r sabseqPy() on the simulated datasets is first fetched into a local directory as follows:

# substitute with any directory that you have read/write access to
sandboxDirectory <- tempdir()

# path to provided data (comes installed with abseqR)
exdata <- system.file("extdata", "ex", package = "abseqR")

# copy the provided data to sandboxDirectory
file.copy(exdata, sandboxDirectory, recursive = TRUE)

Then, the following commands can be executed in R console to verify that the three PCR datasets are fetched successfully:

# dataPath now holds the path to a local copy of the data
dataPath <- file.path(sandboxDirectory, "ex")
# the sample names can be found inside the auxiliary directory
list.files(path = file.path(dataPath, "auxiliary"))

Basic analysis {#basic-analysis}

After obtaining the datasets, the abseqReport function from r sabseqR() is invoked to visualize the different analysis results as follows:

# This section will visualize all the datasets individually
# and compare PCR1 with PCR2 with PCR3

# Interim solution
if (Sys.info()["sysname"] == "Darwin") {
    BPPARAM <- BiocParallel::SerialParam()
} else {
    BPPARAM <- BiocParallel::bpparam()
}

# you should use report = 3 to generate a HTML report
samples <- abseqReport(dataPath,
                       compare = c("PCR1, PCR2, PCR3"),
                       report = 1,
                       BPPARAM = BPPARAM)

# ignore the message:
# "Sample output directory <path> is different from provided path
#   <path> assuming directory was moved"
# This warning message tells us that the directory has
# been moved (we copied the provided examples to "dataPath")

This creates plots for all samples included in dataPath. In addition, The compare = c("PCR1, PCR2, PCR3") argument specifies that samples PCR1, PCR2, and PCR3 are explicitly compared against each other. Other possible values for compare, report , and BPPARAM will be discussed in detail in later sections (here, here, and here).

HTML reports' directory structure

Figure \@ref(fig:abseq-final-folder-structure) shows the folder structure of ./ex/ after r sabseqR() completes.

grid.raster(readPNG("post_report_fs.png"))

Invoking abseqReport generates plots in the same folder as the corresponding data files within the auxiliary directory. They are then collated together in an HTML document found in the report directory.

The report directory is structured as follows:

index.html file is the entry point to browse AbSeq's HTML reports. It summarizes the r sabseq() analysis and provides a convenient way for navigating individual and comparative analysis results. For example, within this file, there are links to the reports generated for PCR1, PCR2, PCR3 and PCR1 vs PCR2 vs PCR3.
html_files directory contains HTML files that are used build the individual and comparative reports. They can be accessed directly or via the main page index.html.

In conclusion, the individual sample reports in html_files can be shared as-is, but index.html must be accompanied by the html_files directory and thus it is recommended to share the entire report folder.

Comparative analysis

Comparative analysis using a single dataset {#abseqReport-compare}

This section describes the possible values for abseqReport's compare parameter. In the previous section, abseqReport was called with compare = c("PCR1, PCR2, PCR3"). This compares the three samples all together. However, it is also possible to compare only a subset of samples in the dataPath folder, multiple subsets of samples, or none at all.

The compare parameter accepts a vector of one or more strings. Each string denotes a comparison between samples separated by commas, for example, compare = c("PCR1, PCR2, PCR3")[^compare-ws].

If sample comparison is not required, then the following can be simply invoked samples <- abseqReport(dataPath).

Example of other combinations:

# Example of 1 comparison
# Analyze all samples, but only compare PCR1 with PCR2
pcr1.pcr2 <- abseqReport(dataPath,
                         compare = c("PCR1, PCR2"),
                         report = 0)

# Example of 2 comparisons
# Analyze all samples. In addition, compare:
#  * PCR1 with PCR2
#  * PCR2 with PCR3
multiComparison <- abseqReport(dataPath,
                               compare = c("PCR1, PCR2", "PCR2, PCR3"),
                               report = 0)

Note, abseqReport always returns S4 objects of the class AbSeqRep for each sample in the dataPath directory regardless of the value of the compare argument as illustrated next:

# compare = c("PCR1,PCR2")
names(pcr1.pcr2)

# compare = c("PCR1, PCR2", "PCR2 ,PCR3")
names(multiComparison)

The next subsection explains the motivation behind this behaviour.

Comparative analysis using multiple datasets {#customize-cmp}

When constructing antibody libraries, users might be interested in comparing Ab repertoires from different stages of the construction process. Usually, each stage has its own sequencing runs and thus would be analyzed indepedent of others. The report function in r sabseqR() was written to enable this behaviour as illustrated next.

Previously, the S4 objects of three samples of our toy example loaded into a variable named samples. As a hypothetical example, if the reports show an interesting observation between PCR1 and PCR3, it might be of interest to isolate the two samples from the rest.

This code shows how the + operator can be used to create customized comparisons as follows:

# recall that samples is a named list where each element's name
# is the sample's own name (see names(samples))

# use report = 3 if the results should be collated in a HTML document
pcr1.pcr3 <- samples[["PCR1"]] + samples[["PCR3"]]
refinedComparison <- report(pcr1.pcr3,
                            outputDir = file.path(sandboxDirectory,
                                                  "refined_comparison"),
                            report = 1)

Here, the + operator creates a new comparison between PCR1 and PCR3 of class AbSeqCRep. S4 objects of this class can be passed to the report function to generate a standalone HTML report of this particular comparison. Similar to abseqReport, this function returns S4 objects of the individual samples - PCR1 and PCR3.

names(refinedComparison)

Similarly, samples can be loaded from two different r sabseqPy()'s directories as illustrated in the following example:

```{bash abseqR_different_dirs, eval=FALSE}

first abseqPy run on SRR dataset from control group

abseq --file1 fasta/SRR_ACGT_CTRL.fasta --outdir SRR_CTRL

second abseqPy run on SRR dataset from experiment group

abseq --file1 fasta/SRR_ACGT_EXP.fasta --outdir SRR_EXP

analyzing these samples in `r sabseqR()`:

```r
SRRControl <- abseqReport("SRR_CTRL")
SRRExp <- abseqReport("SRR_EXP")

then comparing all samples in SRRControl with all samples in SRRExp can be done using + and report.

# short for SRRControl[[1]] + SRRControl[[2]] + ... + SRRExp[[1]] + ...
all.samples <- Reduce("+", c(SRRControl, SRRExp))
report(all.samples, outputDir = "SRRControl_vs_SRRExp")

Important: The sample names in SRR_CTRL and SRR_EXP must be unique.

In conclusion, the + operator along with the report function enables users to carry out a wide range of customized downstream analyses on the output of r sabseqPy().

Advanced Examples

Lazy loading

In the previous section, the SRRControl dataset was loaded using SRRControl <- abseqReport("SRR_CTRL"), which will generate all plots and reports by default. However, this dataset might have already been analyzed and users are interested in only loading the S4 objects of the samples. This can be efficiently carried out by using the report=0 argument as follows:

# in essence, this is identical to SRRControl <- abseqReport("SRR_CTRL"),
# but will not regenerate plots and reports!
SRRControl.lazy <- abseqReport("SRR_CTRL", report = 0)

# exactly the same return value
stopifnot(names(SRRControl.lazy) == names(SRRControl))

Alternative reporting options {#fine-tune}

In the previous section, the report parameter of abseqReport was used to load the samples in SRRControl without actually plotting any data.

The report argument can accept one of four possible values as follows:

abseqReport("SRR_CTRL", report = 0) does not generate plots and HTLM reports and only returns a named list of S4 objects.
abseqReport("SRR_CTRL", report = 1) generates static plots in PNG format but does not generate HTML reports.
abseqReport("SRR_CTRL", report = 2) generates static plots in PNG format in addition to HTML reports.
abseqReport("SRR_CTRL", report = 3) generates static plots in PNG format and interactive plots in the HTML reports using plotly. This is the default behaviour when the report argument is not specified.

Parallelization {#parallelization}

One of abseqReport's parameters is BPPARAM, which is used to pass arguments into the BiocParallel::bplapply function for customizing parallelization. More information regarding the accepted values to BPPARAM can be found at BiocParallel's page.

Below is a simplified example of using 4 cores and serializing the loop.

# use 4 CPU cores
nproc <- 4
samples <- abseqReport(dataPath,
                       BPPARAM = BiocParallel::MulticoreParam(nproc))


# run sequentially - no multiprocessing
samples <- abseqReport(dataPath,
                       BPPARAM = BiocParallel::SerialParam())

Interpretation of report's figures {#abseqR-interpret}

# [s]ingle sample [dir]ectory
sdir <- file.path(dataPath, "auxiliary", "PCR1")
# [m]ulti sample [dir]ectory
mdir <- file.path(dataPath, "auxiliary", "PCR1_vs_PCR2_vs_PCR3")

This section presents the plots generated by r sabseqR() on the dataset discussed above and provides some insights on how to interpret them.

The visualizations and analyses can be broken down into 5 categories:

Sequence quality analysis
Abundance analysis
Productivity analysis
Diversity analysis
Comparative analysis

Sequence quality analysis

The plots described in this section can be found in the Summary and Quality tabs of the HTML report.

Sequence length

The sequence length distribution is a simple way of validating the quality of a sequencing run. The histogram is expected to show a large proportion of sequences falling within the expected lengths.

read_png(file.path(sdir, "annot", "PCR1_all_clones_len_dist.png"))

Figure \@ref(fig:seq-len) plots sequence length (x-axis) against proportion (y-axis). A similar plot with outliers removed can be found in the Summary tab.

Alignment quality

r sabseqPy() filters low quality sequences based on the quality of the sequence alignment against germline sequence databases and thus the following parameters can be used:

Alignment bitscore
Subject start index
Query start index
Alignment length

However, setting the optimal cut-off thresholds for these parameters is challenging. Stringent values could filter too many sequences while lenient values could retain low quality sequences.

The alignment quality heatmaps generated in the Quality tab of the HTML report shows the relationship between alignment lengths and these alignment parameters to help determine the percentage of sequences falling within a given range and thus inform the selection of cut-off thresholds.

For example, Figure \@ref(fig:align-qual) shows one of the 5 heatmaps: bitscore against V germline alignment length. r sabseqPy() has some recommendations on the values of these parameters to retain good quality sequences.

read_png(file.path(sdir, "abundance", "PCR1_igv_align_quality_bitscore_hm_static.png"))

Insertions, deletions, and mismatches

Indels (Insertions or deletions) and mismatches can be used as an indicator of sequence quality.

Figure \@ref(fig:indel) shows the proportions of indels in PCR1. This graph plots the rate of indels (y-axis) in the V germlines of PCR1 against the number of indels (x-axis). A similar plot for rate of mismatches in V germlines can be found in the same directory. A high rate of indels in the germline sequences might indicate a quality problem with the library because this would affect the functionality of the sequenced clones. However, this could be due to the sequencing quality depending on which sequencing technology is used. For example, long read sequencing technologies tend to produce more indel errors than short read sequencing technologies.

read_png(file.path(sdir, "abundance", "PCR1_igv_gaps_dist.png"))

Abundance and association analysis

The plots generated in this section can be found in the Abundance tab of the HTML report.

V-(D)-J germline abundance

The proportions of V-(D)-J germlines is essential in some experiment designs. For example, it can be used to validate that the germline abundance of an in-house antibody library is in-line with the donor antibody repertoire. Experiments that artificially amplify certain germline families can also be validated similarly using this analysis.

read_png(file.path(sdir, "abundance", "PCR1_igv_dist_family_level.png"))

Figure \@ref(fig:vgermline) shows the distribution of IGHV families in PCR1. Similar plots can be generated for individual V germlines genes and for the D and J germlines.

V-J germline associations

This plot visualizes the recombination process of V and J germlines. Figure \@ref(fig:vjassoc) summarizes the associations between V and J family germlines in a plot generated using r CRANpkg("circlize").

read_png(file.path(sdir, "abundance", "PCR1_vjassoc.png"))

This plot can be used to check whether the Ab library is biased towards a particular combination of germline genes due to cloning errors.

Productivity analysis

The plots described in this section can be found in the Productivity tab in PCR1's HTML report. The main factors affecting the productiveness of a clone by r sabseq()'s interpretation are:

Concordance of the coding frames of V and J germline genes
Presence of stop codons
Presence of indels within a framework or complementarity-determining region

Any sequence that satisfies at least one of the above condition will be classified as unproductive and thus it is unlikely that it will express a functional antibody.

Figure \@ref(fig:prod-summ) summarizes the productivity analysis results of PCR1. Factors that cause sequences to be non-functional are colour coded as follows:

Green -- sequences without stop codons but have a frameshift
Cyan -- sequences that are in-frame but contain at least one stop codon
Purple -- sequences that contain at least one stop codon and have a frameshift

read_png(file.path(sdir, "productivity", "PCR1_productivity.png"))

A good antibody library should have as low unproductive clones as possible. Cloning strategies that are used to clone sequences from the donor libraries or used to construct the Ab library play a key role in this aspect of the library quality.

V-J frameshifts

Figure \@ref(fig:frameshift) shows the percentage of clones that are out-of-frame due to either misaligned V-J coding frames or to the presence of non-multiple of three-indels in one of the framework or complementarity-determining regions.

read_png(file.path(sdir, "productivity", "PCR1_vjframe_dist.png"))

Stop codons

Figure \@ref(fig:stopcod) shows the hot spots for stop codons segregated by framework and complementarity-determining regions.

read_png(file.path(sdir, "productivity", "PCR1_stopcodon_region_outframe.png"))

The figure above shows the percentage of stop codons in the FR and CDR regions of out-of-frame sequences. As discussed earlier, these stop codons may be introduced due to cloning or sequencing errors, hence a similar plot for in-frame sequences can also be found within the same tab.

Indels and mismatches

Some sequences are productive despite having indels and mismatches. This occurs when indels are multiple of three and mismatches do not introduce stop codons. The following figures show the proportion of indels and mismatches for each germline, framework region, and complementarity-determining region on productive sequences (unless specified otherwise).

Figure \@ref(fig:mismatches), Figure \@ref(fig:gaps), and Figure \@ref(fig:gaps-out) plots the proportion of mismatches in productive sequences, indels in productive sequences, and indels in out-of-frame (unproductive) sequences for framework region 3 (FR3). The motivation behind these plots is to quickly identify the quality of productive sequences.

read_png(file.path(sdir, "productivity", "PCR1_fr3_mismatches_dist.png"))

read_png(file.path(sdir, "productivity", "PCR1_fr3_gaps_dist.png"))

read_png(file.path(sdir, "productivity", "PCR1_fr3_gaps_dist_out_of_frame.png"))

Ideally, the number of mismatches in framework regions and IGJ are expected to be low because they are relatively conserved regions. Similarly, the number of indels in productive sequences are expected to be low or some multiple of 3.

Similar plots are generated for other FR and CDR regions, IGV, IGD, and IGJ.

Diversity analysis

The plots described in this section can be found in the Diversity tab of the HTML report. r sabseq() only conducts diversity analysis on clones that are productive.

Clonotype-based analysis

To estimate repertoire diversity, r sabseqR() employs three commonly used techniques:

Duplication-level analysis in which the number of times a clone appears in the sequenced sample is calculated. Figure \@ref(fig:duplication) plots the proportion of sequences (y-axis) that appear once (singletons), twice (doubletons), and at higher-orders (x-axis). The higher the percentage of singletones and doubletones, the more diverse the library would likely be.

Rarefaction analysis in which bootstrapping is used to estimate the richness of a library by calculating the proportion of unique sequences at different sample sizes. Figure \@ref(fig:rarefaction) plots the number of deduplicated clonotypes (y-axis) againse sample sizes (x-axis). For each sample size, five samples are drawn and the mean with confidence intervals are calculated. In a highly diverse library, the percentage of unique clones should significantly increase as the sample size increases.

Capture-recapture analysis in Figure \@ref(fig:recapture) plots the percentage of clonotypes recaptured (y-axis) in a capture-recapture experiment at different sample sizes (x-axis). For each sample size, the percentage of recaptured clonotypes is calculated for five repeated capture-recapture experiments and the mean and confidence intervals are reported.

In below figures, the complementarity-determining regions (CDRs) are used to define a "clonotype". Similar plots can be generated for the framework regions (FRs) and the entire variable domain sequences.

read_png(file.path(sdir, "diversity", "PCR1_cdr_duplication.png"))

read_png(file.path(sdir, "diversity", "PCR1_cdr_rarefaction.png"))

read_png(file.path(sdir, "diversity", "PCR1_cdr_recapture.png"))

Spectratype analysis

Spectratypes are histograms of the clonotype lengths calculated based on the amino acid sequences. Figure \@ref(fig:spectra) shows a CDR3 spectratype. Spectratypes for other FRs and CDRs are available in the same tab. In a good quality library, the framework regions would have quite conserved lengths while CDRs show high length diversity. CDR3 spectratype tends to follow a normal distribution in libraries cloned from naive repertoires.

read_png(file.path(sdir, "diversity", "spectratypes", "PCR1_cdr3_spectratype_no_outliers.png"))

Position-specific analysis

This analysis examines the diversity of Ab library at each amino acid position of the variable domain by estimating the utilization of each of the 20 amino acids at each position. Position-specific frequency matrices (PSFMs) are calculated by aligning all the sequences of a region of interest to anchor at the first position and then the frequency of each amino acid is calculated accordingly. Two PSFMs are calculated: (1) the unscaled PSFM, in which the frequencies are calculated based on the total number of observed sequences per sample at each position and (2) the sacled PSFM, in which the frequencies are calculated based on the total number of observed sequences per sample.

Figure \@ref(fig:comp) shows the PSFM of CDR3 in PCR1. Amino acids are coloured based on their physiochemical properties. The left plot shows the unscaled composition logo and the right plot shows the scaled composition logo. Similar plots for other FRs and CDRs are available in the same tab.

unscaled <- file.path(sdir, "diversity", "composition_logos", "CDR3", "PCR1_cumulative_logo.png")
scaled <- file.path(sdir, "diversity", "composition_logos", "CDR3", "PCR1_cumulative_logo_scaled.png")
img1 <-  rasterGrob(as.raster(readPNG(unscaled)), interpolate = TRUE)
img2 <- rasterGrob(as.raster(readPNG(scaled)), interpolate = TRUE)
grid.arrange(img1, img2, ncol = 2)

Comparative analysis

The plots described in this section can be found in the Clonotypes tab of PCR1 vs PCR2 vs PCR3's HTML report.

Since comparative analysis deals with overlapping clonotypes, this analysis only applies when compare was supplied with at least one sample comparison. Earlier, the call to abseqReport had compare = c("PCR1, PCR2, PCR3"), therefore PCR1, PCR2, and PCR3 are compared against each other.

Throughout this analysis, a clonotype is synonymous to its CDR3 amino acid sequence.

Over-representation analysis

Figure \@ref(fig:topclones) offers a simple overview of the top 10 over-represented clones found in each sample. Since the clonotypes are colour coded, overlapping clonotypes can easily be identified within the top 10 of each samples. Note that the proportions are scaled relative to the top 10 clones in the respective samples.

This plot complements the scatter plot mentioned above. It displays the most abundant clonotypes in each sample with the amino acid sequence in the legend.

read_png(file.path(mdir, "clonotype_analysis", "PCR1_PCR2_PCR3_top10Clonotypes.png"))

Overlapping analysis

Multi-sample analysis

While Figure \@ref(fig:topclones) is capable of showing overlapping clones, it is restricted to the top 10 over-represented clones from each sample. Figure \@ref(fig:overlap) aims to overcome the restriction by using a venn diagram to visualize the number of overlapping (and non-overlapping) clones from each sample. Each number within the venn diagram shows the number of unique clonotypes that are overlapping (in an intersection) or are non-overlapping (not in any intersection). That is, by taking the sum of all the numbers in a sample segment, it becomes the number of unique clonotypes found in that sample.

read_png(file.path(mdir, "clonotype_analysis", "PCR1_PCR2_PCR3_cdr3_clonotypeIntersection.png"))

Note that this venn diagram will not be plotted if there are more than 5 samples.

Two-sample analysis

In order to visualize the correlation between any pair of samples, r sabseqR() plots a scatter plot of every possible combination. Figure \@ref(fig:scatter) shows one of them, plotting the clonotype frequencies in PCR2 against PCR1.

The scatter plot:

has $log_{10}$-scaled clonotype frequencies
has a point for each clonotype
- the coordinate of each point denotes the $log_{10}(frequency)$ in the sample on the x and y axis respectively
- point sizes are mapped to the mean frequency of a clonotype
has marginal density plots colour coded as such:
- blue: density of overlapping clonotypes
- purple: density of non-overlapping clonotypes
- grey: density of all clonotypes

read_png(file.path(mdir, "clonotype_analysis", "PCR1_vs_PCR2_clone_scatter.png"))

This plot is heavily inspired by VDJTools.

Correlation analysis

In addition to Figure \@ref(fig:scatter), the linear correlation of clonotype frequencies between samples can be directly quantified using Pearson's correlation coefficient. Figure \@ref(fig:corr) shows the plot generated by r CRANpkg("ggcorrplot") used to visualize pearson coefficients. A similar plot using Spearman's correlation coefficient (rank-based) is also available in the same directory.

read_png(file.path(mdir, "clonotype_analysis", "pearson.png"))

Clustering analysis

The r CRANpkg("vegan") package was used to calculate distances between samples. The distances between samples are calculated using its clonotype frequencies by applying methods from Morisita-Horn's overlap index, Jaccard index, and Dice's coefficient.

Figure \@ref(fig:cluster) shows a dendrogram plotted using Morisita-Horn's overlap index. The length of each line denotes the distance between the 2 samples or clusters it is connected to. Other dendrograms using Jaccard and Dice's formula are available in the same directory.

read_png(file.path(mdir, "clonotype_analysis", "morisita_horn.png"))

Appendices

Datasets {#example-data}

The datasets used in the above examples was obtained from a combination of synthetic sample datasets generated using MiXCR's program here. Firstly, three distinct samples were generated, each simulated with the following parameters in MiXCR:

tabl <- "
| Parameter     | sample 1      | sample 2      | sample 3      |
|---------------|---------------|---------------|---------------|
| reads         |  10000        |  10000        | 10000         |
| clones        |  5000         |  5000         | 2000          |
| seed          |  4228         |  2428         | 2842          |
| conf          | MiSeq-300-300 | MiSeq-300-300 | MiSeq-300-300 |
| loci          |  IGH          | IGH           | IGH           |
| species       | hsa           | hsa           | hsa           |
"
cat(tabl)

Following that, an arbitrary number of sequences were randomly drawn from each of the three samples and randomly amplified. This process was repeated 3 times, resulting in a final repertoire of three samples, named PCR1, PCR2, and PCR3.

Finally, these three samples were analyzed by r sabseqPy(). The command used to analyze these samples are as follows:

```{bash abseqR__data_abseqPy_run, eval=FALSE} abseq -y params.yml

where the contents of `params.yml` is:

```yaml
# params.yml
defaults:
    bitscore: 300
    sstart: 1-3
    alignlen: 250
    outdir: ex
    task: all
    threads: 1
---
file1: PCR1.fasta
name: PCR1
---
file1: PCR2.fasta
name: PCR2
---
file1: PCR3.fasta
name: PCR3

r sabseqPy()'s analysis output on these three samples are contained within the dataset described in this vignette.

Session Info

This vignette was rendered in the following environment:

sessionInfo()

References

[^Repseq]: Repertoire sequencing. [^windows_biocparallel]: The performace offered by BiocParallel may differ across different operating systems. [^compare-ws]: Trailing and leading whitespace between sample names are trimmed. That is, "PCR1,PCR2" is identical to "PCR1 , PCR2".

Any scripts or data that you put into this service are public.

abseqR documentation built on Nov. 8, 2020, 8:28 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.