## For links
library("BiocStyle")
## Track time spent on making the vignette
startTime <- Sys.time()
## Bib setup
library("RefManageR")
## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    DT = citation("DT")[1],
    ggplot2 = citation("ggplot2")[1],
    golem = citation("golem")[1],
    knitr = citation("knitr")[3],
    plotly = citation("plotly")[1],
    RColorBrewer = citation("RColorBrewer")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    shiny = citation("shiny")[1],
    testthat = citation("testthat")[1],
    biocthis = citation("biocthis")[1],
    shinyjs = citation("shinyjs")[1],
    shinyFiles = citation("shinyFiles")[1],
    plyr = citation("plyr")[1],
    data.table = citation("data.table")[1],
    stringr = citation("stringr")[1],
    stringdist = citation("stringdist")[1],
    plot3D = citation("plot3D")[1], 
    gridExtra = citation("gridExtra")[1],
    dplyr = citation("dplyr")[1],
    pryr = citation("pryr")[1],
    shinyBS = citation("shinyBS")[1],
    fs = citation("fs")[1]
)
knitr::opts_chunk$set(
  collapse = TRUE,
    comment = "#>",
    error = FALSE,
    warning = FALSE,
    message = FALSE,
    crop = NULL
)
knitr::include_graphics(path = system.file("app", "www", "tripr.png", 
                                           package="tripr", mustWork=TRUE))

Introduction

tripr is a Bioconductor package, written in shiny that provides analytics services on antigen receptor (B cell receptor immunoglobulin, BcR IG | T cell receptor, TR) gene sequence data. Every step of the analysis can be performed interactively, thus not requiring any programming skills. It takes as input the output files of the IMGT/HighV-Quest tool. Users can select to analyze the data from each of the input samples separately, or the combined data files from all samples and visualize the results accordingly. Functions for an R command-line use are also available.

Installation

tripr is distributed as a Bioconductor package and requires R (version "4.1"), which can be installed on any operating system from CRAN, and Bioconductor (version "3.14").

To install tripr package enter the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("tripr")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Launching the app

Once tripr is successfully installed, it can be loaded as follow:

library(tripr)

Running tripr as a shiny application

In order to start the shiny app, please run the following command:

tripr::run_app()

tripr should be opening in a browser (ideally Chrome, Firefox or Opera). If this does not happen automatically, please open a browser and navigate to the address shown on the R console (for example, Listening on http://127.0.0.1:6134).

Home

In this tab users can import their data by selecting the directory where the data is stored, by pressing the Choose directory button. The tool takes as input the 10 output files of the IMGT/HighV-Quest tool in text format (.txt). Users can also choose only some of the files depending on the type of the downstream analysis.

Note that every sample of the dataset must have its own individual folder and every sample folder must be in one root folder (See example below). For the dataset to be selected for upload, this root folder must be selected and then the button Load Data has to be pressed.

knitr::include_graphics(path = system.file("app", "www", "upload_data.png", 
                                           package="tripr", mustWork=TRUE))

Previous sessions can also be loaded with the Restore Previous Sessions button.

There are 2 options regarding the cell type (T cell and B cell) as well as 2 options based on the amount of available data (High- or Low-Throughput). Concerning the latter, the main difference is the application of the preselection and selection steps. In the case of High-Throughput data, all filters are applied consequentially (i.e. if a sequence fails >1 selection criteria, only the first unsatisfied criterion will be reported), whereas for Low-Throughput data all criteria are applied at the same time.

Preprocessing

tripr offers 2 steps of preprocessing:

Preselection{#preselection}

The Preselection process comprises 4 different criteria:

knitr::include_graphics(path = system.file("app", "www", "Preselection.png", 
                                           package="tripr", mustWork=TRUE))

The execution starts when the Apply button is pressed.

Users can visualize the results of the preselection (first cleaning) process in the Preselection tab. In the case of multi-sample datasets, results are provided for each individual sample separately, or for the combined dataset by scrolling through the Select Dataset option.

The output consists of 4 table files:

  1. Summary: a summary table with both the included and excluded sequences for each different criterion
  2. All Data table: the entire set of data
  3. Clean table: the sequences that meet the preselection criteria and are included in the analysis and
  4. Clean out table: the excluded sequences. The last column of the “Clean out table” refers to the unsatisfied criteria.

The figure below shows an example Clean table from this Tab.

knitr::include_graphics(path = system.file("app", "www", 
                                            "Preselection_Clean_table.png", 
                                           package="tripr", mustWork=TRUE))

All 4 tables can be downloaded as text files.

Selection{#selection}

The sequences that passed through the Preselection process (“Clean table”) are used as input for the data Selection (filtering) process.

This step comprises 6 different filters:

Using the above 3 filters the user can select for sequences that carry one or more particular V, J and D genes or gene alleles, respectively. Different genes/gene alleles should be separated with a vertical line (|), e.g. TRBV11-2|TRBV29-1*03.

The execution starts when the Execute button is pressed.

knitr::include_graphics(path = system.file("app", "www", "Selection.png", 
                                           package="tripr", mustWork=TRUE))

The results of the Selection (filtering) process are presented in the Selection tab.

This process provides 4 output files:

  1. Summary: a summary table with both the included and excluded sequences for each filter
  2. All Data table: the data used as input after the Preselection process
  3. Filter in table: the sequences that passed through the selection filters and
  4. Filter out table: the excluded sequences. The last column of the “Filter out table” refers to the filters that were not passed by each individual sequence.
knitr::include_graphics(path = system.file("app", "www", "Selection_visual.png", 
                                           package="tripr", mustWork=TRUE))

All the tables can be downloaded as text files.

Pipelines & Step dependencies {#pipelines}

Users can select the workflow that they want to apply to their dataset(s).

There are 11 different tools in the pipeline tab. 7 of them can be applied for both T- and B-cells, while the remaining 4 can be applied only for B-cells.


Step Dependencies in Pipeline

  1. In order to apply Highly Similar Clonotypes computation, Clonotypes computation should have been selected previously.
  2. In order to apply Repertoires Extraction, Clonotypes computation should have run previously. If Highly Similar Clonotypes computation has been selected, repertoires will be extracted for both total clonotypes and highly similar clonotypes.
  3. The Somatic hypermutation status is applied using the groups that have been selected at Insert Identity groups.
  4. If both Alignment and Clonotypes computation have been selected, the cluster ID in the alignment table corresponds to the cluster ID in the clonotype table. Otherwise, all elements in the “cluster_ID” column of the alignment table are assigned to zero.
  5. In order to apply Alignment using the Select top N clonotypes option, Clonotypes computation should have run previously.
  6. In order to apply Mutations, Alignment should have run previously, using the corresponding “AA or Nt” option. The Mutation table is computed based on the grouped alignment table.
  7. In order to apply Mutations using the Select top N clonotypes or the Select clonotypes separately option, Clonotypes computation should have previously run.
  8. In order to apply Logo using the Select top N clonotypes option, Clonotypes computation should have run previously.
  9. Ιn order to run the Shared Clonotype computation and the Repertoire comparison steps, the user must have loaded more than one datasets.
knitr::include_graphics(path = system.file("app", "www", 
                                            "pipeline_overview.png", 
                                           package="tripr", mustWork=TRUE))

For both T- and B-cells:

Clonotype computation

The frequencies for all unique clonotypes of each sample are computed. There are 10 different options for clonotype definition.

knitr::include_graphics(path = system.file("app", "www", 
                                            "Clonotypes_pipeline.png", 
                                           package="tripr", mustWork=TRUE))

The results are presented in the Clonotypes tab in the form of a table, where the clonotype, the count, the frequency and the convergent evolution (if feasible) are given. Each clonotype is also a link that provides a table with all relevant immunogenetic data for that particular clonotype, based on the uploaded files. This table consists of all reads/sequences assigned to that clonotype and all relevant information. Each clonotype is given a unique cluster identifier (cluster ID).

knitr::include_graphics(path = system.file("app", "www", 
                                            "clonotypes_tab.png", 
                                           package="tripr", mustWork=TRUE))

Highly similar clonotypes computation

Frequencies for all highly similar clonotypes are computed. The user can set the number of mismatches allowed for each CDR3 length found in the dataset and a clonotype frequency threshold (range: 0-1). Only clonotypes with a frequency above the applied threshold will be used in the subsequent grouping. The whole process can be performed with or without taking into account the rearranged V-gene.

knitr::include_graphics(path = system.file("app", "www", 
                                            "highly_similar_pipeline.png", 
                                           package="tripr", mustWork=TRUE))

The results are presented in the Highly Similar Clonotypes tab as a table. A second table is also provided containing information regarding the clonotype grouping.

knitr::include_graphics(path = system.file("app", "www", 
                                            "highly_sim_clono_tab.png", 
                                           package="tripr", mustWork=TRUE))

Repertoires extraction

The number of clonotypes using each V, J or D gene/allele is computed over the total number of clonotypes based on the clonotype definition given in the previous Clonotype computation step. If multiple samples are analyzed together the tool provides a total repertoire as well as the repertoire for each individual sample.

knitr::include_graphics(path = system.file("app", "www", 
                                            "repertoires_pipeline.png", 
                                           package="tripr", mustWork=TRUE))

Results are provided in the Repertoires tab as tables. Each table includes the gene/allele and information concerning the absolute count and frequency of sequences expressing that particular gene/allele.

knitr::include_graphics(path = system.file("app", "www", 
                                            "repertoires_tab.png", 
                                           package="tripr", mustWork=TRUE))

Highly Similar Repertoires extraction

Same as above except for the fact that the tool uses as input the clonotypes as computed in the Highly Similar Clonotypes computation.

Multiple value comparison

The tool performs cross-tabulation analysis between 2 selected variables. Many different variables can be selected by the user for this type of analysis depending on the selected input files from the Home tab.

knitr::include_graphics(path = system.file("app", "www", 
                                            "mv_pipeline.png", 
                                           package="tripr", mustWork=TRUE))

The results are presented at the Multiple value comparison tab as tables. Each table contains the values that were found to be associated and the relevant frequency.

knitr::include_graphics(path = system.file("app", "www", 
                                            "multiple_tab.png", 
                                           package="tripr", mustWork=TRUE))

CDR3 with 1 amino acid length difference

This tool can be applied for datasets that consist of sequences with highly similar CDR3. The tool is able to align and create sequence logos for sequences with the same length as well as for sequences that differ by a single amino acid in terms of length.

Logo

This tool creates an amino acid frequency table for the selected sequence region (CDR3, VDJ REGION, VJ REGION) of a given length. The frequency table is computed by counting the frequency of appearance of each of the 20 different amino acids at any given position of the sequence. The users have the option to select over the total frequency table or the table of the top clusters according to the clonotype frequencies.

A logo is created using the above frequency table. The color code of the amino acids is created based on the 11 IMGT amino acid physicochemical classes.


Only for B cells:

Insert identity groups

Input sequences are grouped into different categories based on the V-region identity percent. The user can determine the number and the identity percent range of mutational groups. (high limit: <, low limit: ≥)

knitr::include_graphics(path = system.file("app", "www", 
                                            "insert_identity_pipeline.png", 
                                           package="tripr", mustWork=TRUE))

Somatic hypermutation status

The relative frequency of each germline identity group is computed. If the user has not defined any groups based on the somatic hypermutation (SHM) status using the Insert identity groups tool, the tool will group together only sequences that display the exact SHM status (e.g. sequences with an identity percent of 98.6% will be grouped together whereas sequences with 98.7% identity will form a distinct group). Relative frequencies for each SHM group will be computed based on the total number of sequences.

Alignment

An alignment table is created for the user-selected region (VDJ REGION, VJ REGION). Sequences that are identical in terms of amino acid or nucleotide sequence level are grouped together in order to create the grouped alignment table. Alignments for the selected region can be provided at the nucleotide or amino acid level or both. Default reference sequences are extracted from the IMGT reference directory. Reference sequences can be used either at the gene or gene allele level. At the gene level, allele *01 is considered as reference. Users can also submit their own reference sequence. There is also the possibility to align only a number of selected clonotypes through the Select topN clonotype option or select those clonotypes that have an individual frequency above a given percent cutoff.

knitr::include_graphics(path = system.file("app", "www", 
                                            "Alignment_pipeline.png", 
                                           package="tripr", mustWork=TRUE))

Results are presented in the Alignment tab as tables.

knitr::include_graphics(path = system.file("app", "www", 
                                            "alignment_tab.png", 
                                           package="tripr", mustWork=TRUE))

Each table can be downloaded in txt format.

Somatic hypermutations

A table with all somatic hypermutations for all samples together as well as for each individual sample is computed based on the alignment table provided by the previous tool.

The output table includes:

  1. the mutation type,
  2. the position of the change,
  3. the region where the change occurs,
  4. the number of sequences carrying each change and
  5. the frequency of the change for every gene or allele based on the grouped alignment table regardless the clonotype.

There is the possibility to analyze only a number of clonotypes by choosing the Select topN clonotypes or the Select threshold for clonotypes option or even some clonotypes separately by choosing the Select clonotypes separately option. Different clonotype/cluster identifiers (cluster IDs) should be separated by comma (e.g. 1,3,7).

Results are given in the Mutations tab as tables. When different clonotypes are selected separately, different tables are created for each given clonotype.

Each table can be downloaded in text format.

Visualization

In the Visualization tab different types of charts (scatter, plots, bars etc.) are available for the visualization of the analysis results. Clonotypes are presented as bars and the user can select the frequency above which the clonotypes will be presented.

knitr::include_graphics(path = system.file("app", "www", 
                                            "visual_clono.png", 
                                           package="tripr", mustWork=TRUE))

The convergent evolution is also available for visualization with more than one chart type options.

The computed repertoires are presented as pie-charts and the user can again select the minimum frequency of the gene/allele that will be presented.

Regarding the Multiple value comparison tool, a plot of the 2 selected variables is presented.

knitr::include_graphics(path = system.file("app", "www", 
                                            "hist3d_visual.png", 
                                           package="tripr", mustWork=TRUE))

All the tables that are presented to the user can be downloaded in text format, whereas the plots and the graphics can be downloaded in .png format.

Overview

This section provides an overview of the user's total options for the analysis.

knitr::include_graphics(path = system.file("app", "www", 
                                            "overview.png", 
                                           package="tripr", mustWork=TRUE))

Running tripr via R command line

As mentioned before, tripr can also be used via R command line with the run_TRIP() function.

Usage

run_TRIP() works as a wrapper function for the analysis that tripr provides. To see its detailed documentation write:

    ?tripr::run_TRIP

Some of its most important arguments:

Output of Command Line tool

Every output of tripr analysis with run_TRIP() function will be stored in the output_path directory as mentioned before. Therefore, no table or plot will be presented through RStudio or any other graphics device when the analysis is run, on contrary with the shiny app, where the user has access to output tables and plots via the User Interface.

Output Directory contains two folders:

  1. output : Where data tables are stored.
  2. Analysis : Where plots are stored.

The output directory has a unique name for every analysis, that points to the system time that it was run.

Example with run_TRIP()

An example of run_TRIP() analysis, using the example dataset of 2 B-cells that is provided, is presented below.

    datapath <- fs::path_package("extdata/dataset", package="tripr")
    output_path <- fs::path_home("Documents/my_output")
    cell <- "Bcell"
    preselection <- "1,2,3,4C:W"
    selection <- "5"
    filelist <- c("1_Summary.txt", 
                  "2_IMGT-gapped-nt-sequences.txt", 
                  "4_IMGT-gapped-AA-sequences.txt", 
                  "6_Junction.txt")
    throughput <- "High Throughput"
    preselection <- "1,2,3,4C:W"
    selection <- "5"
    identity_range <- "88:100"
    pipeline <- "1"
    select_clonotype <- "V Gene + CDR3 Amino Acids"

    run_TRIP(
        datapath=datapath,
        output_path=output_path,
        filelist=filelist,
        cell=cell, 
        throughput=throughput, 
        preselection=preselection, 
        selection=selection, 
        identity_range=identity_range,
        pipeline=pipeline, 
        select_clonotype=select_clonotype)

Tool dependencies

The r Biocpkg('tripr') package was made possible thanks to:

Citation

We hope that r Biocpkg('tripr') will be useful for your research. Please use the following information to cite the package and the research article. Thank you!

## Citation info
citation("tripr")

Session info {.unnumbered}

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc r rmarkdown::pandoc_version():

sessionInfo()

Bibliography {.unnumbered}

This vignette was generated using r Biocpkg('BiocStyle') r Citep(bib[['BiocStyle']]), r CRANpkg('knitr') r Citep(bib[['knitr']]) and r CRANpkg('rmarkdown') r Citep(bib[['rmarkdown']]) running behind the scenes.

Citations made with r CRANpkg('RefManageR') r Citep(bib[['RefManageR']]).

## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))


iofeidis/tripr documentation built on Dec. 20, 2021, 7:58 p.m.