knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 5,
  fig.width = 5,
  fig.align = "center"
)
set.seed(100)

Introduction

GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) is an algorithm developed by Glanville et al. to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (hamming distance) similarities.(^1)(^,)(^2)

The authors of the paper published a version of GLIPH written in Perl(^2), but its disadvantage is a long runtime, especially for larger sample sizes. Recently, with a new version of the package, GLIPH2, they introduced an improved algorithm that speeds up the analysis by substituting repeated sampling by Fisher's exact test to assess the statistical significance of a given motif.(^3)(^,)(^4)

In this R implementation of GLIPH we followed their Perl script and we tried to include all the options of the original algorithm, but with a constant look on the analysis speed. With an input size of ~ $10^4$, this R implementation is about 50 times faster and for an input size of ~ $10^5$ sequences it is about 500 times faster than the Perl script. In addition, we implemented a function for visualizing the clusters generated by GLIPH, which enables graphical processing and analysis of GLIPH results through numerous customization options. The implementation of GLIPH2 as well as the function gliph_combined, which combines the different features of GLIPH and GLIPH2 in a customizable way, complete this package and allow the user to use the different GLIPH algorithms according to his own needs.

For more details about the method we recommend reading the original papers and the GLIPH/GLIPH2 documentation on github.(^1)(^,)(^2)(^,)(^3)(^,)(^4)

Installation

turboGliph can be installed as part of the Bioconductor repository with the following code from the R terminal. In the first lines, if not already done, the basic Bioconductor packages are installed. Once this is done, the last line can be used to install the turboGliph package.

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
    BiocManager::install()
}

BiocManager::install("turboGliph")

Required input data

In this introduction we will use gliph_input_data, the dataset published with GLIPH.(^1) First, let's take a look on the required format of the input data.

library(turboGliph)
data("gliph_input_data")
gliph_input_data[1:3,]

To use turbo_gliph, gliph2 and gliph_combined, CDR3 beta sequences are required. These must be specified either by a character vector or in a dataframe in a column named CDR3b. Additional information is not required for clustering, but is recommended for automatic cluster scoring. This includes the following information:

Notice: For turbo_gliph, gliph2 and gliph_combined the column names are important, the order of the columns can be arbitrary. In this example, four additional columns (TRBJ, CDR3a, TRAV, TRAJ) are given but will not be used by the GLIPH algorithms. However, this additional information is stored by the functions and will be visible in the returned clusters.The additional columns do not affect the algorithms as long as all column names specified in the input and output in this tutorial are avoided.

Individual reference repertoire

By default, the GLIPH algorithms in this package use the naive reference repertoire from the original GLIPH publication. However, there are several other reference repertoires available on the GLIPH2 website(^4) for different T cell subtypes as well as different species, which can be used in this package as well as user-created reference repertoires. For this, the reference sequences must be passed to the functions as a data frame under the parameter refdb_beta. In any case, a column named CDR3b is required which contains the sequences. Optionally (but mandatory for some parameter settings), a column named TRBV is expected to contain the V genes of the sequences. Additional information is ignored by the functions.

Function: turbo_gliph

The turbo_gliph function is a runtime-efficient implementation of the GLIPH algorithm provided by Glanville et al. in their publication as a Perl script.(^1)(^,)(^2) The choice of the R programming language, a restructuring of some parts of the original code, and the incorporated option of parallelization allow a significant reduction in runtime, especially with respect to larger sample input sizes. The ability to customize the algorithm using a variety of input parameters was retained and extended to include some parameters that are listed on the github website but not included in the Perl script. Details about the algorithm and the input parameters can be found in the original publication, the github website and the documentation of this package.

Brief overview of the algorithm

  1. Preparation of the input and the reference repertoire
  2. Exclusion of CDR3b sequences with characters outside the amino acid one-letter code.
  3. Exclusion of CDR3b sequences with length below a customizable threshold.
  4. If desired, exclusion of CDR3b sequences that do not start with the amino acid cysteine and end with the amino acid phenylalanine.
  5. Identification of all desired motifs as local similarity (default: 2, 3 and 4 kmers)
  6. Identification and summary of all motifs in the sample sequences
  7. Repeated random sampling from the reference repertoire with sample set depth and identification and summary of all motifs in these subsamples.
  8. Identification of significantly enriched motifs in the sample set by the following criteria:
  9. Falling below an adjustable p-value (= probability of encountering a motif in repeated random sampling at least as often as in the sample set).
  10. Exceeding an adjustable fold change (= factor of the overrepresentation of the motif in the sample set compared to the mean value of the repeated random sampling)
  11. Exceeding a minimum frequency to reduce contamination of significant motifs with rare motifs.
  12. Connecting all sequences with a Hamming distance below a customizable threshold as an expression of global similarity
  13. Finding all connections of the sample sequences via identified local and global similarities and clustering based on these connections
  14. Scoring the clusters based on the following quantities:
  15. cluster size
  16. CDR3b length enrichment
  17. Clonal expansion enrichment (optional)
  18. V gene enrichment (optional)
  19. Enrichment of common HLA alleles (optional).
  20. A summary score for all other scores (similar to the product of all scores and an additional factor)

Calling the function and output explanation

The sequences are simply analyzed by calling the function turbo_gliph with the input dataframe as cdr3_sequences parameter.

res_turbogliph <- turboGliph::turbo_gliph(cdr3_sequences = gliph_input_data,
                                          n_cores = 1)

The output of the function is a list containing the identified similarities, the clustering result and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.

$sample_log

In this list element the results of the repeated random sampling are summarized as a data frame. The columns of this data frame represent the individual motifs named in the column name. The rows represent the individual simulations of the repeated random sampling with the reference repertoire, where the first row refers to the sample set. The cells of the data frame contain numerical values that represent the frequency of the motif (indicated by the column) in the respective simulation step or in the sample set (indicated by the row).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: kmer_resample_sim_depth_log.txt

The frequencies of six motifs in the sample set and in the first four simulations are given below as examples:

res_turbogliph$sample_log[1:5, 1:6]

$motif_enrichment

This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing, on the one hand, the summary of all motifs (all_motifs) and, on the other hand, the summary of the significantly enriched motifs (selected_motifs).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : kmer_resample_sim_depthminplcminpovelcminove.txt
File name of all_motifs : kmer_resample_sim_depth_all_motifs.txt

In the following, the statistics of the first three significantly enriched motifs are presented as an example:

res_turbogliph$motif_enrichment$selected_motifs[1:3,]

The columns contain the following information:

$connections:

In this data frame all connections of the sequences are summarized by local and global similarities. The first two columns contain the connected sequences, whereas the third column indicates the type of connection (local, global or singleton for sequences without connections).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt

The following is an example of the first three connections:

res_turbogliph$connections[1:3,]

$cluster_properties:

This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt

The following is an example of the information provided by three clusters:

res_turbogliph$cluster_properties[2:4,]

As shown, several pieces of information are provided for each cluster:

$cluster_list:

The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt

The following is an example of the information of the first two members of the first cluster:

res_turbogliph$cluster_list[[1]][1:2,]

$parameters:

This list contains all input parameters with which the run of the turbo_Gliph function was called to retrieve the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt

Function: gliph2

In April 2020, a second version of GLIPH, called GLIPH2, was released by Huang et al.. This version can be used firstly via a server interface and secondly via two compiled programs (on Centos and IOS). For completeness, we have also implemented this version in R and included it in this package based on the information provided in the paper.(^3)(^,)(^4)

GLIPH2 differs from GLIPH in the following ways, which are explained in more detail in the original publication and on the corresponding website:(^3)(^,)(^4)

Brief overview of the algorithm

  1. Preparation of the input and the reference repertoire
  2. Exclusion of CDR3b sequences with characters outside the amino acid one-letter code
  3. Exclusion of CDR3b sequences with length below a customizable threshold.
  4. If desired, exclusion of CDR3b sequences that do not start with the amino acid cysteine and end with the amino acid phenylalanine
  5. Identification of all motifs as local similarity (default: 2,3 and 4 kmers)
  6. Identification of all motifs in the sample sequences and the reference sequences
  7. Listing the frequency of sequences with respective neighboring motifs in the sample set
  8. Identification of significantly enriched motifs in the sample set by falling below an adjustable p-value of the Fisher's Exact Test, exceeding an adjustable fold change and exceeding a minimum frequency
  9. increasing the significance of the clusters in case of partial overlap of the motif positions with N-nucleotide encoded sequence regions
  10. Identification of all global structures in samples and reference sequences
  11. Partitioning (if desired) by V gene match and interchangeable amino acid pairs according to the BLOSUM62 matrix
  12. Calculation of a p-value using Fisher's Exact Test
  13. Summary of global and local clusters and scoring of clusters based on the following variables:
  14. cluster size
  15. CDR3b length enrichment
  16. Clonal expansion enrichment (optional)
  17. V gene enrichment (optional)
  18. Enrichment of common HLA alleles (optional)
  19. A summary score for all other scores (Similar to the product of all scores and an additional factor)

Calling the function and output explanation

The sequences are simply analyzed by calling the function gliph2 with the input dataframe as cdr3_sequences parameter.

res_gliph2 <- turboGliph::gliph2(cdr3_sequences = gliph_input_data,
                                 n_cores = 1)

The output of the function is a list containing the identified similarities, the result of the clustering and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.

$motif_enrichment

This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing on the one hand the summary of all motifs (all_motifs) and on the other hand the summary of the significantly enriched motifs (selected_motifs).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : local_similarities_minp_lcminpminovelcminovekmer_mindepthkmer_mindepth.txt
File name of all_motifs : all_motifs.txt

In the following, the statistics of the first three significantly enriched motifs are presented as an example:

res_gliph2$motif_enrichment$selected_motifs[1:3,]

The columns contain the following information:

$global_enrichment

This list element contains the summary of the statistical evaluation of the identified global structures. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: global_similarities.txt

The following is an example of the statistics of the first three structures:

res_gliph2$global_enrichment[1:3,]

The columns contain the following information:

$connections:

In this data frame, all connections of the sequences are clustered by local and global similarities. The first two columns contain the connected sequences, the third column the type of connection (local, global or singleton for sequences without connections) and the fourth column the tag of the cluster.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt

The following is an example of the first three connections:

res_gliph2$connections[1:3,]

$cluster_properties:

This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt

The following is an example of the information provided by the first three clusters:

res_gliph2$cluster_properties[1:3,]

As shown, several pieces of information are provided for each cluster:

$cluster_list

The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt

The following is an example of the information of the first two members of the first cluster:

res_gliph2$cluster_list[[1]][1:2,]

$parameters

This list contains all input parameters with which the run of the turbo_Gliph function was called to get the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt

Function: gliph_combined

With the publication of GLIPH2, Fisher's Exact Test was introduced in an attempt to compensate for the runtime disadvantage of GLIPH and to assign a significance value to the global similarities for better assessment. Additional changes, such as the altered clustering to clusters restricted to single motifs, prevent a comparability of the results of GLIPH and GLIPH2. With gliph_combined we provide the user with a function in which he can combine different features of both GLIPH versions for an individualized analysis adapted to his own goals. Additionally, we added the possibility to filter the global similarities similar to the local similarities based on thresholds for their significance for clustering.

Brief overview of the algorithm

  1. Preparation of the input and the reference repertoire
  2. Exclusion of CDR3b sequences with characters outside the amino acid one-letter code
  3. Exclusion of CDR3b sequences with length below a customizable threshold.
  4. If desired, exclusion of CDR3b sequences that do not start with the amino acid cysteine and end with the amino acid phenylalanine
  5. Identification of all motifs as local similarity (default: 2,3 and 4 kmers)
  6. Identification of all motifs in the sample sequences and the reference sequences
  7. Calculation of a p-value for each motif using A) random repeated sampling (as in turbogliph) or B) using Fisher's Exact Test (as in gliph2)
  8. Identification of significantly enriched motifs in the sample set by falling below an adjustable p-value, exceeding an adjustable fold change and exceeding a minimum frequency
  9. Increasing the significance of the clusters in case of partial overlap of the motif positions with N-nucleotide encoded sequence regions
  10. Restriction of local similarities to sequences in which the difference of the start position of a given motif must not exceed an adjustable distance
  11. Identification of global similarities
  12. Global similarities are defined either A) as sequences with a Hamming instance below an adjustable threshold (as in turbogliph) or B) as sequences with an identical global structure (variation at a particular amino acid position, as in gliph2)
  13. Partitioning (if desired) by V gene match and in case of B) interchangeable amino acid pairs (if desired) according to the BLOSUM62 matrix
  14. In case of B) calculation of a p-value using Fisher's Exact Test and identification of significantly enriched global structures in the sample set by falling below an adjustable p-value, exceeding an adjustable fold change and exceeding a minimum frequency
  15. Clustering of sequences
  16. For clustering, the algorithm of turbogliph, in which sequences are clustered by local similarities and global similarities, and the algorithm of gliph2, in which only sequences with one identical motif or global structure are combined in a cluster, are available
  17. Summary of global and local clusters and scoring of clusters based on the following variables:
  18. cluster size
  19. CDR3b length enrichment
  20. Clonal expansion enrichment (optional)
  21. V gene enrichment (optional)
  22. Enrichment of common HLA alleles (optional)
  23. A summary score for all other scores (Similar to the product of all scores and an additional factor)

Calling the function and output explanation

The sequences are simply analyzed by calling the function gliph_combined with the input dataframe as cdr3_sequences parameter.

res_gliph_combined <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data,
                                                 n_cores = 1)

The default settings of gliph_combined are similar to the analysis of turbogliph, except that motifs for local similarities must start within three amino acid positions. The following shows the settings that reflect the gliph2 function:

res_gliph_combined_2 <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data,
                                                 min_seq_length = 0,
                                                 local_method = "fisher",
                                                 boost_local_significance = TRUE,
                                                 global_method = "fisher",
                                                 clustering_method = "GLIPH2.0",
                                                 scoring_method = "GLIPH2.0",
                                                 n_cores = 1)

The intention of the implementation of gliph_combined, was to combine the features of GLIPH and GLIPH2 in one function in an individualizable way. The following settings perform the search for local and global similarities with the Fisher's Exact Test, as in gliph2, but cluster and score the clusters afterwards as in turbogliph.

res_gliph_combined_3 <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data,
                                                 min_seq_length = 0,
                                                 local_method = "fisher",
                                                 boost_local_significance = TRUE,
                                                 global_method = "fisher",
                                                 clustering_method = "GLIPH1.0",
                                                 scoring_method = "GLIPH1.0",
                                                 n_cores = 1)

The output of the function is a list containing the identified similarities, the result of the clustering and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.

$sample_log

In this list element the results of the repeated random sampling, if performed, are summarized as a data frame. The columns of this data frame represent the individual motifs named in the column name. The rows represent the individual simulations of the repeated random sampling with the reference repertoire, where the first row refers to the sample set. The cells of the data frame contain numerical values that represent the frequency of the motif (indicated by the column) in the respective simulation step or in the sample set (indicated by the row).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: kmer_resample_sim_depth_log.txt

The frequencies of six motifs in the sample set and in the first four simulations are given below as examples:

res_gliph_combined$sample_log[1:5, 1:6]

$motif_enrichment

This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing on the one hand the summary of all motifs (all_motifs) and on the other hand the summary of the significantly enriched motifs (selected_motifs).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : local_similarities_minp_lcminpminovelcminovekmer_mindepthkmer_mindepth.txt
File name of all_motifs : all_motifs.txt

In the following, the statistics of the first three significantly enriched motifs are presented as an example:

res_gliph_combined$motif_enrichment$selected_motifs[1:3,]

The columns contain the following information:

$global_enrichment

If Fisher's Exact Test was carried out for global simialrity search, this list element contains the summary of the statistical evaluation of the identified global structures and is itself a list containing on the one hand the summary of all global structures (all_structs) and on the other hand the summary of the significantly enriched global structures (selected_structs). If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_structs : global_similarities_minp_gcminpminovegcminovekmer_mindepthgckmer_mindepth.txt
File name of all_structs : all_global_similarities.txt

The following is an example of the statistics of the first three structures:

res_gliph_combined_2$global_enrichment$selected_structs[1:3,]

The columns contain the following information:

$connections:

In this data frame, all connections of the sequences are clustered by local and global similarities. The first two columns contain the connected sequences, the third column the type of connection (local, global or singleton for sequences without connections) and the fourth column the tag of the cluster.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt

The following is an example of the first three connections:

res_gliph_combined$connections[1:3,]

$cluster_properties:

This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt

The following is an example of the information provided by the first three clusters:

res_gliph_combined$cluster_properties[1:3,]

As shown, several pieces of information are provided for each cluster:

$cluster_list

The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt

The following is an example of the information of the first two members of the first cluster:

res_gliph_combined$cluster_list[[1]][1:2,]

$parameters

This list contains all input parameters with which the run of the turbo_Gliph function was called to get the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt

Function: plot_network

In this package we provide a method to visualize the results of turboGliph, gliph2 and gliph_combined. The default configuration provides the plot shown below.

plot_network(clustering_output = res_turbogliph,
             n_cores = 1)

This plot is interactive. Scrolling zooms, hovering over a node displays additional information about that node, and clicking on a node highlights all direct neighbours. Individual clusters can be selected by the cluster tag using the selection field in the upper left corner.

With the default configuration all nodes have the same size. The nodes are colored according to the log value of their cluster score using the viridis palette. Low scores (and thus more significant clusters) are represented by a high proportion of purple. High scores (and thus less significant clusters) are represented by a high proportion of yellow.

The provided method offers additional options to individualize the graph. For example, the nodes can be automatically colored according to the values of a particular column in the input data frame. For each sequence, individually selected colors can also be specified in a column named color in the input data frame. Also the size of the nodes can be changed automatically by values from a column in the input data frame. For more details, please look in the documentation of plot_network. Additionally the used color palette for nodes, the colors for edges and some more is customizable (read the documentation for more details).

In the following, an example plot is generated in which the nodes are colored according to the pathogen against which these TCRs are directed. Additionally, the color palette is changed and two additional properties of the sequences are shown in the info box. It can be clearly seen that turboGliph accumulates TCRs with identical pathogen specificity, characterized by identical node color, in the convergence groups with low scores.

plot_network(clustering_output = res_turbogliph,
             color_info = "antigen.species",
             color_palette = grDevices::rainbow,
             local_edge_color = "darkgrey",
             global_edge_color = "orange",
             show_additional_columns = c("antigen.species", "antigen"),
             cluster_min_size = 6,
             n_cores = 1)

The parameter show_additional_columns is a powerful tool to assemble the info box displayed by hovering over the nodes with all information you need to see. By default, the cluster tag and total cluster score are shown. For the function gliph2, additionally the fisher score is specified. In the following list, all further package intern values for this parameter are listed. Additional values (as in the example above) refer to columns in the input data frame cdr3_sequences.

Function: de_novo_TCRs

This function is an automated implementation of the method used in the GLIPH paper for in silico prediction of CDR3b sequences with similar specificity based on a specificity cluster identified by GLIPH. The underlying algorithm is based on the method used in the original publication.(^1)

Briefly, depending on the CDR3b sequence length, a global positional weight matrix (PWM) is formed with the positional abundances of amino acids in the cluster. Based on the probabilities of the PWM, a huge number of random sequences are generated. To rank the generated sequences, a score is assigned to each sequence. This score is calculated from the product of the frequencies of the first ten N-terminal amino acids of a sequence in the PWM. In addition, this score can be normalized to the probability of obtaining at most such a low score in a naive reference repertoire.

This function is used by passing the results of the turbo_gliph, gliph2 or gliph_combined function and the tag of the desired specificity cluster to the function. In this case, the first cluster is used as an example. The parameter make_figure is used to specify whether a figure similar to the original publication should be automatically displayed for convergence of the scores of the generated sequences.(^1)

new_seqs <- turboGliph::de_novo_TCRs(convergence_group_tag = res_turbogliph$cluster_properties$tag[1],
                                     clustering_output = res_turbogliph,
                                     sims = 10000,
                                     make_figure = TRUE,
                                     n_cores = 1)

The generated sequences and their scores can be found as a data frame in the first element of the output list:

new_seqs$de_novo_sequences[1:3,]

References

1: Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.
2: https://github.com/immunoengineer/gliph
3: Huang, Huang, et al. "Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening." Nature Biotechnology 38.10 (2020): 1194-1202.
4: http://50.255.35.37:8080/tools

Sessioninfo

packageVersion("turboGliph")

sessionInfo()


HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.