library(abseqR) library(png) library(plotly) library(grid) library(gridExtra) library(BiocStyle) # knitr::opts_chunk$set(tidy = TRUE) knitr::opts_chunk$set(message = FALSE)
sabseqR <- function() { return(Biocpkg("abseqR")); } #sabseqR <- function() { return("`abseqR`"); } sabseqPy <- function() { return("[abseqPy](https://github.com/malhamdoosh/abseqPy)"); } sabseq <- function() { return("`AbSeq`"); } read_png <- function(path) { if (file.exists(path)) { #rasterGrob(as.raster(readPNG(path)), interpolate = TRUE) # in windows, the path is \ but LaTEX does not appreciate these slashes knitr::include_graphics(gsub("\\", "/", normalizePath(path), fixed = TRUE)) } else { stop("File at ", path, " not found, fatal.") } }
A plethora of high throughput sequencing (HTS) analysis pipelines are available as open source tools to analyze and validate the quality of Rep-seq [^Repseq] datasets. OmicTools provides a summary of repertoire sequencing tools that implements different techniques and algorithms in analyzing and visualizing datasets from B-cell receptors (BCR) and T-cell receptors (TCR). However, high throughput analysis pipelines of antibody library sequencing datasets are scarce.
r sabseq()
is a comprehensive bioinformatic pipeline for the analysis of sequencing datasets generated from antibody libraries and r sabseqR()
is one of its packages. The r sabseq()
suite is implemented as a set of functions
in R
and Python
that can be used together to provide insights into the quality of antibody libraries. r sabseqPy()
processes paired-end
or single-end FASTA/FASTQ files generated from NGS sequencers and converts them into
CSV and HDF files.
r sabseqR()
visualizes the output of r sabseqPy()
and generates a self-contained
HTML report. Furthermore, r sabseqR()
provides additional functionalities to
explicitly compare multiple samples and perform further downstream analyses.
r sabseqR()
provides the following functionalities:
Visualizations: the output from r sabseqPy()
is summarized succintly into
static and interactive plots. The plots are also stored in Rdata
object files that provide flexibility for
users to easily customize the aesthetics of any plot generated by r sabseqR()
.
Interactive reports: the plots generated by r sabseqR()
can be collated and presented
in a self-contained HTML document for convenience and ease of sharing.
Sample comparison: r sabseqR()
overloads the +
operator
via the S4 classes AbSeqCRep
and AbSeqRep
to compare
multiple samples with each other. The comparative reports
include additional
plots, for example, sample similarity clustering, overlapping clonotypes, etc. The usual plots are also generated for all the compared samples by adding an extra layer of aes
thetic.
r sabseq()
includes, but is not limited to, merging paired-end reads,
annotating V-(D)-J germlines, calculating unique clonotypes,
analyzing primer specificity, facilitating the selection of best restriction enzymes,
predicting frameshifts, identifying functional clones, and
calculating diversity indices and estimations. These analyses are seamlessly
extrapolated to analyze multiple library repertoires simultaneously when
multiple samples are present. Figure \@ref(fig:abseq-wf) depicts the complete
r sabseq()
workflow. Sequencing files are taken as input to be annotated and
analyzed by r sabseqPy()
before they are further analyzed and visualized
by r sabseqR()
.
read_png("abseq_workflow_MA.jpg")
r sabseqR()
can be installed from bioconductor.org or
its GitHub repository at https://github.com/malhamdoosh/abseqR.
To install r sabseqR()
via the BiocManager
, type in R console:
if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("abseqR")
To install the development version of r sabseqR()
from GitHub,
type in R console:
if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("malhamdoosh/abseqR")
r sabseqR()
requires pandoc version 1.19.2.1 or higher
to render the HTML reports. If pandoc
cannot be detected while executing r sabseqR()
,
the HTML report will not be generated. r sabseqR()
is a cross-platform
library and will work on any major operating system [^windows_biocparallel].
r sabseqR()
depends on several packages from the CRAN
and Bioconductor repositories:
r CRANpkg("RColorBrewer")
provides colour
schemes for maps and graphics. To install it, type in R console
install.packages("RColorBrewer")
r CRANpkg("VennDiagram")
provides a set of
functions to generate Venn diagrams. To install it, type in R console install.packages("VennDiagram")
r CRANpkg("circlize")
is a visualization tool
used to summarize the distributions of associations between V-J gene segments. To install it, type in R console install.packages("circlize")
r CRANpkg("flexdashboard")
is a package
that provides a template for RMarkdown that resembles a grid oriented dashboard and is used to generate the HTML reports. To
install it, type in R console install.packages("flexdashboard")
r CRANpkg("ggplot2")
is an implementation of
the "Grammar of Graphics" in R. It is used extensively to
generate plots. To install it, type in R console install.packages("ggplot2")
r CRANpkg("ggcorrplot")
is used to visualize
a correlation matrix using r CRANpkg("ggplot2")
.
To install it, type in R console install.packages("ggcorrplot")
r CRANpkg("ggdendro")
provides a set of tools
for drawing dendrograms and tree plots using r CRANpkg("ggplot2")
. To install it, type
install.packages("ggdendro")
r CRANpkg("grid")
is used to arrange plots.
It has been integrated into the base R package.
r CRANpkg("gridExtra")
provides functions
to work with "grid" graphics and used to arrange grid-based plots
in the HTML reports. To install it, type in R console install.packages("gridExtra")
r CRANpkg("knitr")
provides the capability to
dynamically generate reports in R. To install it, type in R console install.packages("knitr")
r CRANpkg("plotly")
is used to translate
r CRANpkg("ggplot2")
graphs to interactive web-based plots. To install it, type in R console
install.packages("plotly")
r CRANpkg("plyr")
offers a set of tools used in
this package to apply operations on subsets of data in manageable pieces. To
install it, type in R console install.packages("plyr")
r CRANpkg("png")
is used to read and display PNG images.
To install it, type in R console install.packages("png")
r CRANpkg("reshape2")
allows this package to
restructure and aggregate dataframes. To install it, type in R console install.packages("reshape2")
r CRANpkg("rmarkdown")
converts R Markdown
documents into a variety of formats. To install it, type in R console install.packages("rmarkdown")
r CRANpkg("vegan")
provides a suite of functions
to calculate diversity and distance statistics between repertoires. To install it,
type in R console install.packages("vegan")
r Biocpkg("BiocParallel")
is
a package from Bioconductor used to enable parallel computing. To install it, type in R console
if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("BiocParallel")
To leverage all the functionalities provided by r sabseqR()
, the main
functions to note are abseqR::abseqReport
, abseqR::report
, and +
.
This section uses a small simulated dataset to walk through the use cases of each function.
The example dataset is packaged with r sabseqR()
.
For the sake of brevity, the dataset generation is described under the
Appendices section.
Briefly, the dataset includes three samples, namely PCR1, PCR2, and PCR3, that was generated using in silico simulations.
r sabseqPy()
was then used to analyze the datasets and
the output directory argument --outdir
specified in r sabseqPy()
was initiated with the value "./ex/"
.
read_png("simulation_flowchart_full.jpg")
The output of r sabseqPy()
on the simulated datasets is first fetched into a local directory as follows:
# substitute with any directory that you have read/write access to sandboxDirectory <- tempdir() # path to provided data (comes installed with abseqR) exdata <- system.file("extdata", "ex", package = "abseqR") # copy the provided data to sandboxDirectory file.copy(exdata, sandboxDirectory, recursive = TRUE)
Then, the following commands can be executed in R console to verify that the three PCR
datasets are fetched successfully:
# dataPath now holds the path to a local copy of the data dataPath <- file.path(sandboxDirectory, "ex") # the sample names can be found inside the auxiliary directory list.files(path = file.path(dataPath, "auxiliary"))
After obtaining the datasets, the abseqReport
function from r sabseqR()
is invoked to visualize the different analysis results as follows:
# This section will visualize all the datasets individually # and compare PCR1 with PCR2 with PCR3 # Interim solution if (Sys.info()["sysname"] == "Darwin") { BPPARAM <- BiocParallel::SerialParam() } else { BPPARAM <- BiocParallel::bpparam() } # you should use report = 3 to generate a HTML report samples <- abseqReport(dataPath, compare = c("PCR1, PCR2, PCR3"), report = 1, BPPARAM = BPPARAM) # ignore the message: # "Sample output directory <path> is different from provided path # <path> assuming directory was moved" # This warning message tells us that the directory has # been moved (we copied the provided examples to "dataPath")
This creates plots for all samples included in dataPath
.
In addition, The compare = c("PCR1, PCR2, PCR3")
argument specifies
that samples PCR1
, PCR2
, and PCR3
are explicitly
compared against each other. Other possible values for
compare
, report
, and BPPARAM
will be discussed in detail
in later sections (here, here,
and here).
Figure \@ref(fig:abseq-final-folder-structure) shows the folder structure of ./ex/
after r sabseqR()
completes.
grid.raster(readPNG("post_report_fs.png"))
Invoking abseqReport
generates plots in the same folder as the
corresponding data files within the auxiliary
directory. They are then
collated together in an HTML document found in the report
directory.
The report
directory is structured as follows:
index.html
file is the entry point to browse AbSeq's HTML reports. It summarizes the r sabseq()
analysis and provides a convenient way
for navigating individual and comparative analysis results. For example, within this file, there are links to
the reports generated for PCR1
, PCR2
, PCR3
and PCR1 vs PCR2 vs PCR3
.
html_files
directory contains HTML files that are used build the individual and comparative reports. They can be accessed directly or via the main page index.html
.
In conclusion, the individual sample reports in html_files
can
be shared as-is, but index.html
must be accompanied by the html_files
directory and thus it is recommended to share the entire report
folder.
This section describes the possible values for abseqReport
's
compare
parameter. In the previous section, abseqReport
was called with
compare = c("PCR1, PCR2, PCR3")
. This
compares the three samples all together. However,
it is also possible to compare only a subset of samples in the dataPath
folder, multiple subsets of samples, or none at all.
The compare
parameter accepts a vector of one or more strings. Each string
denotes a comparison between samples separated by commas, for example,
compare = c("PCR1, PCR2, PCR3")
[^compare-ws].
If sample comparison is not required, then the following can be simply invoked samples <- abseqReport(dataPath)
.
Example of other combinations:
# Example of 1 comparison # Analyze all samples, but only compare PCR1 with PCR2 pcr1.pcr2 <- abseqReport(dataPath, compare = c("PCR1, PCR2"), report = 0) # Example of 2 comparisons # Analyze all samples. In addition, compare: # * PCR1 with PCR2 # * PCR2 with PCR3 multiComparison <- abseqReport(dataPath, compare = c("PCR1, PCR2", "PCR2, PCR3"), report = 0)
Note, abseqReport
always returns S4 objects of the class AbSeqRep for each sample in the dataPath
directory regardless of the value of the compare
argument as illustrated next:
# compare = c("PCR1,PCR2") names(pcr1.pcr2) # compare = c("PCR1, PCR2", "PCR2 ,PCR3") names(multiComparison)
The next subsection explains the motivation behind this behaviour.
When constructing antibody libraries, users might be interested in comparing Ab repertoires from different stages of the construction process. Usually, each stage has its own sequencing runs and thus would be analyzed indepedent of others. The report
function in r sabseqR()
was written to enable this behaviour as illustrated next.
Previously, the S4 objects of three samples of our toy example loaded into a variable named samples
. As a hypothetical example,
if the reports show an interesting observation between PCR1
and PCR3
,
it might be of interest to isolate the two samples from the rest.
This code shows how the +
operator can be used to create customized comparisons as follows:
# recall that samples is a named list where each element's name # is the sample's own name (see names(samples)) # use report = 3 if the results should be collated in a HTML document pcr1.pcr3 <- samples[["PCR1"]] + samples[["PCR3"]] refinedComparison <- report(pcr1.pcr3, outputDir = file.path(sandboxDirectory, "refined_comparison"), report = 1)
Here, the +
operator creates a new comparison between PCR1
and PCR3
of class AbSeqCRep. S4 objects of this class can be passed to the report
function to generate a standalone HTML report of this particular comparison.
Similar to abseqReport
, this function returns S4 objects of the individual samples - PCR1
and PCR3
.
names(refinedComparison)
Similarly, samples can be loaded from two different r sabseqPy()
's directories as illustrated in the following example:
```{bash abseqR_different_dirs, eval=FALSE}
abseq --file1 fasta/SRR_ACGT_CTRL.fasta --outdir SRR_CTRL
abseq --file1 fasta/SRR_ACGT_EXP.fasta --outdir SRR_EXP
analyzing these samples in `r sabseqR()`: ```r SRRControl <- abseqReport("SRR_CTRL") SRRExp <- abseqReport("SRR_EXP")
then comparing all samples in SRRControl
with all samples in SRRExp
can be done using +
and report
.
# short for SRRControl[[1]] + SRRControl[[2]] + ... + SRRExp[[1]] + ... all.samples <- Reduce("+", c(SRRControl, SRRExp)) report(all.samples, outputDir = "SRRControl_vs_SRRExp")
Important: The sample names in
SRR_CTRL
andSRR_EXP
must be unique.
In conclusion, the +
operator along with the report
function enables users to carry out a wide range of customized downstream analyses on the output of r sabseqPy()
.
In the previous section, the SRRControl
dataset was loaded using
SRRControl <- abseqReport("SRR_CTRL")
, which will generate all plots and reports by default. However, this dataset might have already been analyzed and users are interested in only loading the S4 objects of the samples. This can be efficiently carried out by using the report=0
argument as follows:
# in essence, this is identical to SRRControl <- abseqReport("SRR_CTRL"), # but will not regenerate plots and reports! SRRControl.lazy <- abseqReport("SRR_CTRL", report = 0) # exactly the same return value stopifnot(names(SRRControl.lazy) == names(SRRControl))
In the previous section, the report
parameter of abseqReport
was used to
load the samples in SRRControl
without actually plotting any data.
The report
argument can accept one of four possible values as follows:
abseqReport("SRR_CTRL", report = 0)
does not generate plots and HTLM reports and only returns a named list of S4 objects.
abseqReport("SRR_CTRL", report = 1)
generates static plots in PNG format but does not generate HTML reports.
abseqReport("SRR_CTRL", report = 2)
generates static plots in PNG format in addition to HTML reports.
abseqReport("SRR_CTRL", report = 3)
generates static plots in PNG format and interactive plots in the HTML reports using plotly. This is the default behaviour when the report
argument is not specified.
One of abseqReport
's parameters is BPPARAM
, which is used to pass arguments into the BiocParallel::bplapply
function for customizing parallelization. More
information regarding the accepted values to BPPARAM
can be found at
BiocParallel's
page.
Below is a simplified example of using 4 cores and serializing the loop.
# use 4 CPU cores nproc <- 4 samples <- abseqReport(dataPath, BPPARAM = BiocParallel::MulticoreParam(nproc)) # run sequentially - no multiprocessing samples <- abseqReport(dataPath, BPPARAM = BiocParallel::SerialParam())
# [s]ingle sample [dir]ectory sdir <- file.path(dataPath, "auxiliary", "PCR1") # [m]ulti sample [dir]ectory mdir <- file.path(dataPath, "auxiliary", "PCR1_vs_PCR2_vs_PCR3")
This section presents the plots generated by r sabseqR()
on the
dataset discussed above and provides some insights on how to interpret them.
The visualizations and analyses can be broken down into 5 categories:
The plots described in this section can be found in the Summary
and Quality
tabs of the HTML report.
The sequence length distribution is a simple way of validating the quality of a sequencing run. The histogram is expected to show a large proportion of sequences falling within the expected lengths.
read_png(file.path(sdir, "annot", "PCR1_all_clones_len_dist.png"))
Figure \@ref(fig:seq-len) plots sequence length (x-axis) against proportion (y-axis). A similar plot with outliers removed can be found in the Summary
tab.
r sabseqPy()
filters low quality sequences based on the quality of the sequence alignment against germline sequence databases and thus the following parameters can be used:
However, setting the optimal cut-off thresholds for these parameters is challenging. Stringent values could filter too many sequences while lenient values could retain low quality sequences.
The alignment quality heatmaps generated in the Quality
tab of the HTML report shows the relationship
between alignment lengths and these alignment parameters to help determine the percentage of sequences falling within a given range and thus inform the selection of cut-off thresholds.
For example, Figure \@ref(fig:align-qual) shows one of the 5 heatmaps: bitscore
against V germline alignment length. r sabseqPy()
has some recommendations on the values of these parameters to retain good quality sequences.
read_png(file.path(sdir, "abundance", "PCR1_igv_align_quality_bitscore_hm_static.png"))
Indels (Insertions or deletions) and mismatches can be used as an indicator of sequence quality.
Figure \@ref(fig:indel) shows the proportions of indels in PCR1
. This graph plots the
rate of indels (y-axis) in the V germlines of PCR1
against the number of indels (x-axis).
A similar plot for rate of mismatches in V germlines can be found in the same directory. A high rate of indels in the germline sequences might indicate a quality problem with the library because this would affect the functionality of the sequenced clones. However, this could be due to the sequencing quality depending on which sequencing technology is used. For example, long read sequencing technologies tend to produce more indel errors than short read sequencing technologies.
read_png(file.path(sdir, "abundance", "PCR1_igv_gaps_dist.png"))
The plots generated in this section can be found in the Abundance
tab of the HTML report.
The proportions of V-(D)-J germlines is essential in some experiment designs. For example, it can be used to validate that the germline abundance of an in-house antibody library is in-line with the donor antibody repertoire. Experiments that artificially amplify certain germline families can also be validated similarly using this analysis.
read_png(file.path(sdir, "abundance", "PCR1_igv_dist_family_level.png"))
Figure \@ref(fig:vgermline) shows the distribution of IGHV families in PCR1
. Similar plots can be generated for individual V germlines genes and for the D and J germlines.
This plot visualizes the recombination process of V and J germlines.
Figure \@ref(fig:vjassoc) summarizes the associations between V
and J family germlines in a plot generated using r CRANpkg("circlize")
.
read_png(file.path(sdir, "abundance", "PCR1_vjassoc.png"))
This plot can be used to check whether the Ab library is biased towards a particular combination of germline genes due to cloning errors.
The plots described in this section can be found in the Productivity
tab in PCR1
's HTML report. The main factors affecting the productiveness of a clone by
r sabseq()
's interpretation are:
Any sequence that satisfies at least one of the above condition will be classified as unproductive and thus it is unlikely that it will express a functional antibody.
Figure \@ref(fig:prod-summ) summarizes the productivity analysis results of PCR1
. Factors that cause sequences to be non-functional are colour coded as follows:
read_png(file.path(sdir, "productivity", "PCR1_productivity.png"))
A good antibody library should have as low unproductive clones as possible. Cloning strategies that are used to clone sequences from the donor libraries or used to construct the Ab library play a key role in this aspect of the library quality.
Figure \@ref(fig:frameshift) shows the percentage of clones that are out-of-frame due to either misaligned V-J coding frames or to the presence of non-multiple of three-indels in one of the framework or complementarity-determining regions.
read_png(file.path(sdir, "productivity", "PCR1_vjframe_dist.png"))
Figure \@ref(fig:stopcod) shows the hot spots for stop codons segregated by framework and complementarity-determining regions.
read_png(file.path(sdir, "productivity", "PCR1_stopcodon_region_outframe.png"))
The figure above shows the percentage of stop codons in the FR and CDR regions of out-of-frame sequences. As discussed earlier, these stop codons may be introduced due to cloning or sequencing errors, hence a similar plot for in-frame sequences can also be found within the same tab.
Some sequences are productive despite having indels and mismatches. This occurs when indels are multiple of three and mismatches do not introduce stop codons. The following figures show the proportion of indels and mismatches for each germline, framework region, and complementarity-determining region on productive sequences (unless specified otherwise).
Figure \@ref(fig:mismatches), Figure \@ref(fig:gaps), and Figure \@ref(fig:gaps-out) plots the proportion of mismatches in productive sequences, indels in productive sequences, and indels in out-of-frame (unproductive) sequences for framework region 3 (FR3). The motivation behind these plots is to quickly identify the quality of productive sequences.
read_png(file.path(sdir, "productivity", "PCR1_fr3_mismatches_dist.png"))
read_png(file.path(sdir, "productivity", "PCR1_fr3_gaps_dist.png"))
read_png(file.path(sdir, "productivity", "PCR1_fr3_gaps_dist_out_of_frame.png"))
Ideally, the number of mismatches in framework regions and IGJ are expected to be low because they are relatively conserved regions. Similarly, the number of indels in productive sequences are expected to be low or some multiple of 3.
Similar plots are generated for other FR and CDR regions, IGV, IGD, and IGJ.
The plots described in this section can be found in the Diversity
tab of the HTML report. r sabseq()
only conducts diversity analysis on clones that are productive.
To estimate repertoire diversity, r sabseqR()
employs three commonly used techniques:
Duplication-level analysis in which the number of times a clone appears in the sequenced sample is calculated. Figure \@ref(fig:duplication) plots the proportion of sequences (y-axis) that appear once (singletons), twice (doubletons), and at higher-orders (x-axis). The higher the percentage of singletones and doubletones, the more diverse the library would likely be.
Rarefaction analysis in which bootstrapping is used to estimate the richness of a library by calculating the proportion of unique sequences at different sample sizes. Figure \@ref(fig:rarefaction) plots the number of deduplicated clonotypes (y-axis) againse sample sizes (x-axis). For each sample size, five samples are drawn and the mean with confidence intervals are calculated. In a highly diverse library, the percentage of unique clones should significantly increase as the sample size increases.
Capture-recapture analysis in Figure \@ref(fig:recapture) plots the percentage of clonotypes recaptured (y-axis) in a capture-recapture experiment at different sample sizes (x-axis). For each sample size, the percentage of recaptured clonotypes is calculated for five repeated capture-recapture experiments and the mean and confidence intervals are reported.
In below figures, the complementarity-determining regions (CDRs) are used to define a "clonotype". Similar plots can be generated for the framework regions (FRs) and the entire variable domain sequences.
read_png(file.path(sdir, "diversity", "PCR1_cdr_duplication.png"))
read_png(file.path(sdir, "diversity", "PCR1_cdr_rarefaction.png"))
read_png(file.path(sdir, "diversity", "PCR1_cdr_recapture.png"))
Spectratypes are histograms of the clonotype lengths calculated based on the amino acid sequences. Figure \@ref(fig:spectra) shows a CDR3 spectratype. Spectratypes for other FRs and CDRs are available in the same tab. In a good quality library, the framework regions would have quite conserved lengths while CDRs show high length diversity. CDR3 spectratype tends to follow a normal distribution in libraries cloned from naive repertoires.
read_png(file.path(sdir, "diversity", "spectratypes", "PCR1_cdr3_spectratype_no_outliers.png"))
This analysis examines the diversity of Ab library at each amino acid position of the variable domain by estimating the utilization of each of the 20 amino acids at each position. Position-specific frequency matrices (PSFMs) are calculated by aligning all the sequences of a region of interest to anchor at the first position and then the frequency of each amino acid is calculated accordingly. Two PSFMs are calculated: (1) the unscaled PSFM, in which the frequencies are calculated based on the total number of observed sequences per sample at each position and (2) the sacled PSFM, in which the frequencies are calculated based on the total number of observed sequences per sample.
Figure \@ref(fig:comp) shows the PSFM of CDR3 in PCR1
. Amino acids are coloured based on their physiochemical properties. The left plot shows the unscaled composition logo and the right plot
shows the scaled composition logo. Similar plots for other FRs and CDRs are available in the same tab.
unscaled <- file.path(sdir, "diversity", "composition_logos", "CDR3", "PCR1_cumulative_logo.png") scaled <- file.path(sdir, "diversity", "composition_logos", "CDR3", "PCR1_cumulative_logo_scaled.png") img1 <- rasterGrob(as.raster(readPNG(unscaled)), interpolate = TRUE) img2 <- rasterGrob(as.raster(readPNG(scaled)), interpolate = TRUE) grid.arrange(img1, img2, ncol = 2)
The plots described in this section can be found in the Clonotypes
tab of
PCR1 vs PCR2 vs PCR3
's HTML report.
Since comparative analysis deals with overlapping clonotypes,
this analysis only applies when compare
was supplied with at
least one sample comparison. Earlier, the call to abseqReport
had compare = c("PCR1, PCR2, PCR3")
, therefore
PCR1
, PCR2,
and PCR3
are compared against each other.
Throughout this analysis, a clonotype is synonymous to its CDR3 amino acid sequence.
Figure \@ref(fig:topclones) offers a simple overview of the top 10 over-represented clones found in each sample. Since the clonotypes are colour coded, overlapping clonotypes can easily be identified within the top 10 of each samples. Note that the proportions are scaled relative to the top 10 clones in the respective samples.
This plot complements the scatter plot mentioned above. It displays the most abundant clonotypes in each sample with the amino acid sequence in the legend.
read_png(file.path(mdir, "clonotype_analysis", "PCR1_PCR2_PCR3_top10Clonotypes.png"))
While Figure \@ref(fig:topclones) is capable of showing overlapping clones, it is restricted to the top 10 over-represented clones from each sample. Figure \@ref(fig:overlap) aims to overcome the restriction by using a venn diagram to visualize the number of overlapping (and non-overlapping) clones from each sample. Each number within the venn diagram shows the number of unique clonotypes that are overlapping (in an intersection) or are non-overlapping (not in any intersection). That is, by taking the sum of all the numbers in a sample segment, it becomes the number of unique clonotypes found in that sample.
read_png(file.path(mdir, "clonotype_analysis", "PCR1_PCR2_PCR3_cdr3_clonotypeIntersection.png"))
Note that this venn diagram will not be plotted if there are more than 5 samples.
In order to visualize the correlation between any pair of samples, r sabseqR()
plots
a scatter plot of every possible combination. Figure \@ref(fig:scatter) shows
one of them, plotting the clonotype frequencies in PCR2
against PCR1
.
The scatter plot:
read_png(file.path(mdir, "clonotype_analysis", "PCR1_vs_PCR2_clone_scatter.png"))
This plot is heavily inspired by VDJTools.
In addition to Figure \@ref(fig:scatter), the linear correlation of clonotype frequencies
between samples can be directly quantified using Pearson's correlation coefficient.
Figure \@ref(fig:corr) shows the plot generated by r CRANpkg("ggcorrplot")
used to visualize
pearson coefficients. A similar plot using Spearman's correlation coefficient (rank-based)
is also available in the same directory.
read_png(file.path(mdir, "clonotype_analysis", "pearson.png"))
The r CRANpkg("vegan")
package was used to calculate distances between samples.
The distances between samples are calculated using its clonotype frequencies by
applying methods from Morisita-Horn's overlap index, Jaccard index, and Dice's coefficient.
Figure \@ref(fig:cluster) shows a dendrogram plotted using Morisita-Horn's overlap index. The length of each line denotes the distance between the 2 samples or clusters it is connected to. Other dendrograms using Jaccard and Dice's formula are available in the same directory.
read_png(file.path(mdir, "clonotype_analysis", "morisita_horn.png"))
The datasets used in the above examples
was obtained from a combination of synthetic sample datasets generated
using MiXCR's program
here.
Firstly, three distinct samples were generated, each simulated with the following parameters
in MiXCR
:
tabl <- " | Parameter | sample 1 | sample 2 | sample 3 | |---------------|---------------|---------------|---------------| | reads | 10000 | 10000 | 10000 | | clones | 5000 | 5000 | 2000 | | seed | 4228 | 2428 | 2842 | | conf | MiSeq-300-300 | MiSeq-300-300 | MiSeq-300-300 | | loci | IGH | IGH | IGH | | species | hsa | hsa | hsa | " cat(tabl)
Following that, an arbitrary number of sequences were randomly drawn from each of the three samples and randomly amplified. This process was repeated 3 times, resulting in a final repertoire of three samples, named PCR1, PCR2, and PCR3.
Finally, these three samples were analyzed by r sabseqPy()
. The command used to
analyze these samples are as follows:
```{bash abseqR__data_abseqPy_run, eval=FALSE} abseq -y params.yml
where the contents of `params.yml` is: ```yaml # params.yml defaults: bitscore: 300 sstart: 1-3 alignlen: 250 outdir: ex task: all threads: 1 --- file1: PCR1.fasta name: PCR1 --- file1: PCR2.fasta name: PCR2 --- file1: PCR3.fasta name: PCR3
r sabseqPy()
's analysis output on these three samples are contained
within the dataset described in this vignette.
This vignette was rendered in the following environment:
sessionInfo()
[^Repseq]: Repertoire sequencing. [^windows_biocparallel]: The performace offered by BiocParallel may differ across different operating systems. [^compare-ws]: Trailing and leading whitespace between sample names are trimmed. That is, "PCR1,PCR2" is identical to "PCR1 , PCR2".
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.