In HetzDra/turboGliph: Find Specificity Groups with GLIPH and GLIPH2 Method

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 5,
  fig.width = 5,
  fig.align = "center"
)
set.seed(100)

Introduction

GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) is an algorithm developed by Glanville et al. to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (hamming distance) similarities.(^1)(^,)(^2)

The authors of the paper published a version of GLIPH written in Perl(^2), but its disadvantage is a long runtime, especially for larger sample sizes. Recently, with a new version of the package, GLIPH2, they introduced an improved algorithm that speeds up the analysis by substituting repeated sampling by Fisher's exact test to assess the statistical significance of a given motif.(^3)(^,)(^4)

In this R implementation of GLIPH we followed their Perl script and we tried to include all the options of the original algorithm, but with a constant look on the analysis speed. With an input size of ~ $10^4$, this R implementation is about 50 times faster and for an input size of ~ $10^5$ sequences it is about 500 times faster than the Perl script. In addition, we implemented a function for visualizing the clusters generated by GLIPH, which enables graphical processing and analysis of GLIPH results through numerous customization options. The implementation of GLIPH2 as well as the function gliph_combined, which combines the different features of GLIPH and GLIPH2 in a customizable way, complete this package and allow the user to use the different GLIPH algorithms according to his own needs.

For more details about the method we recommend reading the original papers and the GLIPH/GLIPH2 documentation on github.(^1)(^,)(^2)(^,)(^3)(^,)(^4)

Installation

turboGliph can be installed as part of the Bioconductor repository with the following code from the R terminal. In the first lines, if not already done, the basic Bioconductor packages are installed. Once this is done, the last line can be used to install the turboGliph package.

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
    BiocManager::install()
}

BiocManager::install("turboGliph")

Required input data

In this introduction we will use gliph_input_data, the dataset published with GLIPH.(^1) First, let's take a look on the required format of the input data.

library(turboGliph)
data("gliph_input_data")
gliph_input_data[1:3,]

To use turbo_gliph, gliph2 and gliph_combined, CDR3 beta sequences are required. These must be specified either by a character vector or in a dataframe in a column named CDR3b. Additional information is not required for clustering, but is recommended for automatic cluster scoring. This includes the following information:

column TRBV : V gene information of the particular sequence
column patient : index number or similar of the patient from which the clone was isolated
column counts : number of clones in the sample
column HLA : HLA alleles of the particular patient, separated by commas

Notice: For turbo_gliph, gliph2 and gliph_combined the column names are important, the order of the columns can be arbitrary. In this example, four additional columns (TRBJ, CDR3a, TRAV, TRAJ) are given but will not be used by the GLIPH algorithms. However, this additional information is stored by the functions and will be visible in the returned clusters.The additional columns do not affect the algorithms as long as all column names specified in the input and output in this tutorial are avoided.

Individual reference repertoire

By default, the GLIPH algorithms in this package use the naive reference repertoire from the original GLIPH publication. However, there are several other reference repertoires available on the GLIPH2 website(^4) for different T cell subtypes as well as different species, which can be used in this package as well as user-created reference repertoires. For this, the reference sequences must be passed to the functions as a data frame under the parameter refdb_beta. In any case, a column named CDR3b is required which contains the sequences. Optionally (but mandatory for some parameter settings), a column named TRBV is expected to contain the V genes of the sequences. Additional information is ignored by the functions.

Function: turbo_gliph

The turbo_gliph function is a runtime-efficient implementation of the GLIPH algorithm provided by Glanville et al. in their publication as a Perl script.(^1)(^,)(^2) The choice of the R programming language, a restructuring of some parts of the original code, and the incorporated option of parallelization allow a significant reduction in runtime, especially with respect to larger sample input sizes. The ability to customize the algorithm using a variety of input parameters was retained and extended to include some parameters that are listed on the github website but not included in the Perl script. Details about the algorithm and the input parameters can be found in the original publication, the github website and the documentation of this package.

Brief overview of the algorithm

Preparation of the input and the reference repertoire
Exclusion of CDR3b sequences with characters outside the amino acid one-letter code.
Exclusion of CDR3b sequences with length below a customizable threshold.
If desired, exclusion of CDR3b sequences that do not start with the amino acid cysteine and end with the amino acid phenylalanine.
Identification of all desired motifs as local similarity (default: 2, 3 and 4 kmers)
Identification and summary of all motifs in the sample sequences
Repeated random sampling from the reference repertoire with sample set depth and identification and summary of all motifs in these subsamples.
Identification of significantly enriched motifs in the sample set by the following criteria:
Falling below an adjustable p-value (= probability of encountering a motif in repeated random sampling at least as often as in the sample set).
Exceeding an adjustable fold change (= factor of the overrepresentation of the motif in the sample set compared to the mean value of the repeated random sampling)
Exceeding a minimum frequency to reduce contamination of significant motifs with rare motifs.
Connecting all sequences with a Hamming distance below a customizable threshold as an expression of global similarity
Finding all connections of the sample sequences via identified local and global similarities and clustering based on these connections
Scoring the clusters based on the following quantities:
cluster size
CDR3b length enrichment
Clonal expansion enrichment (optional)
V gene enrichment (optional)
Enrichment of common HLA alleles (optional).
A summary score for all other scores (similar to the product of all scores and an additional factor)

Calling the function and output explanation

The sequences are simply analyzed by calling the function turbo_gliph with the input dataframe as cdr3_sequences parameter.

res_turbogliph <- turboGliph::turbo_gliph(cdr3_sequences = gliph_input_data,
                                          n_cores = 1)

The output of the function is a list containing the identified similarities, the clustering result and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.

$sample_log

In this list element the results of the repeated random sampling are summarized as a data frame. The columns of this data frame represent the individual motifs named in the column name. The rows represent the individual simulations of the repeated random sampling with the reference repertoire, where the first row refers to the sample set. The cells of the data frame contain numerical values that represent the frequency of the motif (indicated by the column) in the respective simulation step or in the sample set (indicated by the row).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: kmer_resample_sim_depth_log.txt

The frequencies of six motifs in the sample set and in the first four simulations are given below as examples:

res_turbogliph$sample_log[1:5, 1:6]

$motif_enrichment

This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing, on the one hand, the summary of all motifs (all_motifs) and, on the other hand, the summary of the significantly enriched motifs (selected_motifs).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : kmer_resample_sim_depthminplcminpovelcminove.txt
File name of all_motifs : kmer_resample_sim_depth_all_motifs.txt

In the following, the statistics of the first three significantly enriched motifs are presented as an example:

res_turbogliph$motif_enrichment$selected_motifs[1:3,]

The columns contain the following information:

Motif : the current motif
counts : the frequency of the motif in the sample set
avgRef : the average frequency of the motif in a simulation step of the repeated random sampling
topRef : the maximum frequency of the motif in a single simulation step of repeated random sampling
OvE : the fold enrichment of the motif in the sample set compared to the average in repeated random sampling
p-value : the probability of encountering at least the frequency of the motif in the sample set in a simulation step of the repeated random sampling

$connections:

In this data frame all connections of the sequences are summarized by local and global similarities. The first two columns contain the connected sequences, whereas the third column indicates the type of connection (local, global or singleton for sequences without connections).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt

The following is an example of the first three connections:

res_turbogliph$connections[1:3,]

$cluster_properties:

This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt

The following is an example of the information provided by three clusters:

res_turbogliph$cluster_properties[2:4,]

As shown, several pieces of information are provided for each cluster:

cluster_size : number of sequences in the cluster
tag : leader-tag of the cluster
total.score : summarized value of all following scores
network.size.score : score for the probability to obtain a cluster with this size
cdr3.length.score : score for the enrichment of CDR3b lengths
vgene.score : score for the enrichment of V genes (optional)
clonal.expansion.score : Score for enrichment of clonal expansion (optional)
hla.score : score for enrichment of common HLA alleles (optional)
lowest.hlas : list of enriched HLA alleles (optional)
members : string that contains all sequences in the cluster separated by a blank

$cluster_list:

The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt

The following is an example of the information of the first two members of the first cluster:

res_turbogliph$cluster_list[[1]][1:2,]

$parameters:

This list contains all input parameters with which the run of the turbo_Gliph function was called to retrieve the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt

Function: gliph2

In April 2020, a second version of GLIPH, called GLIPH2, was released by Huang et al.. This version can be used firstly via a server interface and secondly via two compiled programs (on Centos and IOS). For completeness, we have also implemented this version in R and included it in this package based on the information provided in the paper.(^3)(^,)(^4)

GLIPH2 differs from GLIPH in the following ways, which are explained in more detail in the original publication and on the corresponding website:(^3)(^,)(^4)

The significance of local and global motifs is assessed by applying Fisher's Exact Test.
Clusters are no longer formed by connections between sequences, but for each local and global motif a separate cluster is generated that includes all sequences with that motif.
For a local similarity of two sequences, the N-terminal position of the motif may only vary within a certain position range (by default three positions).
The significance of local clusters is increased if the motifs are partially based on non-germline encoded N-nucleotides.
For global similarities, two sequences may differ in one position only. It is customizable by the user whether all amino acids are interchangeable or the amino acid pair at this position must additionally have a BLOSUM62 value >= 0.

Brief overview of the algorithm

Preparation of the input and the reference repertoire
Exclusion of CDR3b sequences with characters outside the amino acid one-letter code
Exclusion of CDR3b sequences with length below a customizable threshold.
If desired, exclusion of CDR3b sequences that do not start with the amino acid cysteine and end with the amino acid phenylalanine
Identification of all motifs as local similarity (default: 2,3 and 4 kmers)
Identification of all motifs in the sample sequences and the reference sequences
Listing the frequency of sequences with respective neighboring motifs in the sample set
Identification of significantly enriched motifs in the sample set by falling below an adjustable p-value of the Fisher's Exact Test, exceeding an adjustable fold change and exceeding a minimum frequency
increasing the significance of the clusters in case of partial overlap of the motif positions with N-nucleotide encoded sequence regions
Identification of all global structures in samples and reference sequences
Partitioning (if desired) by V gene match and interchangeable amino acid pairs according to the BLOSUM62 matrix
Calculation of a p-value using Fisher's Exact Test
Summary of global and local clusters and scoring of clusters based on the following variables:
cluster size
CDR3b length enrichment
Clonal expansion enrichment (optional)
V gene enrichment (optional)
Enrichment of common HLA alleles (optional)
A summary score for all other scores (Similar to the product of all scores and an additional factor)

Calling the function and output explanation

The sequences are simply analyzed by calling the function gliph2 with the input dataframe as cdr3_sequences parameter.

res_gliph2 <- turboGliph::gliph2(cdr3_sequences = gliph_input_data,
                                 n_cores = 1)

The output of the function is a list containing the identified similarities, the result of the clustering and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.

$motif_enrichment

This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing on the one hand the summary of all motifs (all_motifs) and on the other hand the summary of the significantly enriched motifs (selected_motifs).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : local_similarities_minp_lcminpminovelcminovekmer_mindepthkmer_mindepth.txt
File name of all_motifs : all_motifs.txt

In the following, the statistics of the first three significantly enriched motifs are presented as an example:

res_gliph2$motif_enrichment$selected_motifs[1:3,]

The columns contain the following information:

motif : the current motif
start : N-terminal position where the range of the motif start begins
stop : N-terminal position where the range of the motif start ends
num_in_sample : the frequency of the motif in the sample set
num_in_ref : the frequency of the motif in the reference repertoire
fisher.score : p-value of the Fisher's Exact Test
num_fold : the fold enrichment of the motif in the sample set compared to the reference repertoire

$global_enrichment

This list element contains the summary of the statistical evaluation of the identified global structures. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: global_similarities.txt

The following is an example of the statistics of the first three structures:

res_gliph2$global_enrichment[1:3,]

The columns contain the following information:

cluster_tag : the current structure, the % indicates the variable position
cluster_size : number of sequences in this cluster
unique_CDR3b : number of unique sample sequences with this structure
num_in_ref : frequency of the structure in the reference repertoire
fisher.score : p-value of the Fisher's Exact Test
aa_at_position : string containing all amino acids at the variable position in this structure in the sample sequences
TRBV : contains (if needed for the users settings) the common V gene of all sequences
CDR3b : string that contains all sequences in the cluster separated by a blank

$connections:

In this data frame, all connections of the sequences are clustered by local and global similarities. The first two columns contain the connected sequences, the third column the type of connection (local, global or singleton for sequences without connections) and the fourth column the tag of the cluster.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt

The following is an example of the first three connections:

res_gliph2$connections[1:3,]

$cluster_properties:

This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt

The following is an example of the information provided by the first three clusters:

res_gliph2$cluster_properties[1:3,]

As shown, several pieces of information are provided for each cluster:

type : similarity type of the cluster, either global or local
tag : tag of the cluster. Composed of the motif, the first and the last N-terminal starting position of the motif for local similarities. For global similarities, it is composed of the global structure, the V gene if necessary and all unique amino acids at the variable position. Information are separated by underscores.
cluster_size : number of members in the cluster
unique_cdr3_sample : number of unique CDR3b sequences in the cluster
unique_cdr3_ref : number of unique CDR3b sequences containing this motif or structure in the reference repertoire
OvE : fold enrichment of the motif or structure in the sample set compared to the reference set.
fisher.score : p-value of the Fisher's Exact Test for this motif/structure
total.score : summarized value of all following scores.
network.size.score : score for the probability to obtain a cluster with this size.
cdr3.length.score : score for the enrichment of CDR3b lengths.
vgene.score : score for the enrichment of the V genes (optional).
clonal.expansion.score : score for enrichment of clonal expansion (optional).
hla.score : score for enrichment of common HLA alleles (optional).
lowest.hlas : list of enriched HLA alleles (optional).
members : string that contains all sequences in the cluster separated by a blank

$cluster_list

The following is an example of the information of the first two members of the first cluster:

res_gliph2$cluster_list[[1]][1:2,]

$parameters

This list contains all input parameters with which the run of the turbo_Gliph function was called to get the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.

If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt

Function: gliph_combined

With the publication of GLIPH2, Fisher's Exact Test was introduced in an attempt to compensate for the runtime disadvantage of GLIPH and to assign a significance value to the global similarities for better assessment. Additional changes, such as the altered clustering to clusters restricted to single motifs, prevent a comparability of the results of GLIPH and GLIPH2. With gliph_combined we provide the user with a function in which he can combine different features of both GLIPH versions for an individualized analysis adapted to his own goals. Additionally, we added the possibility to filter the global similarities similar to the local similarities based on thresholds for their significance for clustering.

Brief overview of the algorithm

Preparation of the input and the reference repertoire
Exclusion of CDR3b sequences with characters outside the amino acid one-letter code
Exclusion of CDR3b sequences with length below a customizable threshold.
If desired, exclusion of CDR3b sequences that do not start with the amino acid cysteine and end with the amino acid phenylalanine
Identification of all motifs as local similarity (default: 2,3 and 4 kmers)
Identification of all motifs in the sample sequences and the reference sequences
Calculation of a p-value for each motif using A) random repeated sampling (as in turbogliph) or B) using Fisher's Exact Test (as in gliph2)
Identification of significantly enriched motifs in the sample set by falling below an adjustable p-value, exceeding an adjustable fold change and exceeding a minimum frequency
Increasing the significance of the clusters in case of partial overlap of the motif positions with N-nucleotide encoded sequence regions
Restriction of local similarities to sequences in which the difference of the start position of a given motif must not exceed an adjustable distance
Identification of global similarities
Global similarities are defined either A) as sequences with a Hamming instance below an adjustable threshold (as in turbogliph) or B) as sequences with an identical global structure (variation at a particular amino acid position, as in gliph2)
Partitioning (if desired) by V gene match and in case of B) interchangeable amino acid pairs (if desired) according to the BLOSUM62 matrix
In case of B) calculation of a p-value using Fisher's Exact Test and identification of significantly enriched global structures in the sample set by falling below an adjustable p-value, exceeding an adjustable fold change and exceeding a minimum frequency
Clustering of sequences
For clustering, the algorithm of turbogliph, in which sequences are clustered by local similarities and global similarities, and the algorithm of gliph2, in which only sequences with one identical motif or global structure are combined in a cluster, are available
Summary of global and local clusters and scoring of clusters based on the following variables:
cluster size
CDR3b length enrichment
Clonal expansion enrichment (optional)
V gene enrichment (optional)
Enrichment of common HLA alleles (optional)
A summary score for all other scores (Similar to the product of all scores and an additional factor)

Calling the function and output explanation

The sequences are simply analyzed by calling the function gliph_combined with the input dataframe as cdr3_sequences parameter.

res_gliph_combined <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data,
                                                 n_cores = 1)

The default settings of gliph_combined are similar to the analysis of turbogliph, except that motifs for local similarities must start within three amino acid positions. The following shows the settings that reflect the gliph2 function:

res_gliph_combined_2 <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data,
                                                 min_seq_length = 0,
                                                 local_method = "fisher",
                                                 boost_local_significance = TRUE,
                                                 global_method = "fisher",
                                                 clustering_method = "GLIPH2.0",
                                                 scoring_method = "GLIPH2.0",
                                                 n_cores = 1)

The intention of the implementation of gliph_combined, was to combine the features of GLIPH and GLIPH2 in one function in an individualizable way. The following settings perform the search for local and global similarities with the Fisher's Exact Test, as in gliph2, but cluster and score the clusters afterwards as in turbogliph.

res_gliph_combined_3 <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data,
                                                 min_seq_length = 0,
                                                 local_method = "fisher",
                                                 boost_local_significance = TRUE,
                                                 global_method = "fisher",
                                                 clustering_method = "GLIPH1.0",
                                                 scoring_method = "GLIPH1.0",
                                                 n_cores = 1)

$sample_log

In this list element the results of the repeated random sampling, if performed, are summarized as a data frame. The columns of this data frame represent the individual motifs named in the column name. The rows represent the individual simulations of the repeated random sampling with the reference repertoire, where the first row refers to the sample set. The cells of the data frame contain numerical values that represent the frequency of the motif (indicated by the column) in the respective simulation step or in the sample set (indicated by the row).

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: kmer_resample_sim_depth_log.txt

The frequencies of six motifs in the sample set and in the first four simulations are given below as examples:

res_gliph_combined$sample_log[1:5, 1:6]

$motif_enrichment

In the following, the statistics of the first three significantly enriched motifs are presented as an example:

res_gliph_combined$motif_enrichment$selected_motifs[1:3,]

The columns contain the following information:

Motif : the current motif
counts : the frequency of the motif in the sample set
avgRef : the average frequency of the motif in a simulation step of the repeated random sampling
topRef : the maximum frequency of the motif in a single simulation step of repeated random sampling (0 for Fisher's Exact Test)
OvE : the fold enrichment of the motif in the sample set compared to the average in repeated random sampling
p.value : the probability of encountering at least the frequency of the motif in the sample set in a simulation step of the repeated random sampling or the p-value of the Fisher's Exact Test

$global_enrichment

If Fisher's Exact Test was carried out for global simialrity search, this list element contains the summary of the statistical evaluation of the identified global structures and is itself a list containing on the one hand the summary of all global structures (all_structs) and on the other hand the summary of the significantly enriched global structures (selected_structs). If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_structs : global_similarities_minp_gcminpminovegcminovekmer_mindepthgckmer_mindepth.txt
File name of all_structs : all_global_similarities.txt

The following is an example of the statistics of the first three structures:

res_gliph_combined_2$global_enrichment$selected_structs[1:3,]

The columns contain the following information:

struct : the current structure, the % indicates the variable position
counts : number of unique sample sequences with this structure
num_in_ref : frequency of the structure in the reference repertoire
avgRef : the average frequency of the global structure in a subset of the reference repertoire with the same size as the sample set
OvE : the fold enrichment of the global structure in the sample set compared to the average in the reference repertoire
p.value : p-value of the Fisher's Exact Test

$connections:

If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt

The following is an example of the first three connections:

res_gliph_combined$connections[1:3,]

$cluster_properties:

The following is an example of the information provided by the first three clusters:

res_gliph_combined$cluster_properties[1:3,]

As shown, several pieces of information are provided for each cluster:

type : similarity type of the cluster, either global or local
tag : tag of the cluster. Composed of the motif, the first and the last N-terminal starting position of the motif for local similarities. For global similarities, it is composed of the global structure, the V gene if necessary and all unique amino acids at the variable position. Information are separated by underscores.
cluster_size : number of members in the cluster
unique_cdr3_sample : number of unique CDR3b sequences in the cluster
unique_cdr3_ref : number of unique CDR3b sequences containing this motif or structure in the reference repertoire
OvE : fold enrichment of the motif or structure in the sample set compared to the reference set.
p.value : p-value of the repeated random sampling or Fisher's Exact Test for this motif/structure, if clustering was performed as in gliph2.
total.score : summarized value of all following scores.
network.size.score : score for the probability to obtain a cluster with this size.
cdr3.length.score : score for the enrichment of CDR3b lengths.
vgene.score : score for the enrichment of the V genes (optional).
clonal.expansion.score : score for enrichment of clonal expansion (optional).
hla.score : score for enrichment of common HLA alleles (optional).
lowest.hlas : list of enriched HLA alleles (optional).
members : string that contains all sequences in the cluster separated by a blank

$cluster_list

The following is an example of the information of the first two members of the first cluster:

res_gliph_combined$cluster_list[[1]][1:2,]

$parameters

If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt

Function: plot_network

In this package we provide a method to visualize the results of turboGliph, gliph2 and gliph_combined. The default configuration provides the plot shown below.

plot_network(clustering_output = res_turbogliph,
             n_cores = 1)

This plot is interactive. Scrolling zooms, hovering over a node displays additional information about that node, and clicking on a node highlights all direct neighbours. Individual clusters can be selected by the cluster tag using the selection field in the upper left corner.

With the default configuration all nodes have the same size. The nodes are colored according to the log value of their cluster score using the viridis palette. Low scores (and thus more significant clusters) are represented by a high proportion of purple. High scores (and thus less significant clusters) are represented by a high proportion of yellow.

The provided method offers additional options to individualize the graph. For example, the nodes can be automatically colored according to the values of a particular column in the input data frame. For each sequence, individually selected colors can also be specified in a column named color in the input data frame. Also the size of the nodes can be changed automatically by values from a column in the input data frame. For more details, please look in the documentation of plot_network. Additionally the used color palette for nodes, the colors for edges and some more is customizable (read the documentation for more details).

In the following, an example plot is generated in which the nodes are colored according to the pathogen against which these TCRs are directed. Additionally, the color palette is changed and two additional properties of the sequences are shown in the info box. It can be clearly seen that turboGliph accumulates TCRs with identical pathogen specificity, characterized by identical node color, in the convergence groups with low scores.

plot_network(clustering_output = res_turbogliph,
             color_info = "antigen.species",
             color_palette = grDevices::rainbow,
             local_edge_color = "darkgrey",
             global_edge_color = "orange",
             show_additional_columns = c("antigen.species", "antigen"),
             cluster_min_size = 6,
             n_cores = 1)

The parameter show_additional_columns is a powerful tool to assemble the info box displayed by hovering over the nodes with all information you need to see. By default, the cluster tag and total cluster score are shown. For the function gliph2, additionally the fisher score is specified. In the following list, all further package intern values for this parameter are listed. Additional values (as in the example above) refer to columns in the input data frame cdr3_sequences.

GLIPH :
cluster_size : number of sequences in the cluster
tag : leader-tag of the cluster
total.score : summarized value of all following scores
network.size.score : score for the probability to get a cluster with this size
cdr3.length.score : score for the enrichment of CDR3b lengths
vgene.score : score for the enrichment of the V genes (optional)
clonal.expansion.score : score for enrichment of clonal expansion (optional)
hla.score : score for enrichment of common HLA alleles (optional)
lowest.hlas : listing of enriched HLA alleles (optional)
GLIPH2 :
type : similarity type of the cluster, either global or local
tag : tag of the cluster. Composed of the motif, the first and the last N-terminal starting position of the motif for local similarities. For global similarities, it is composed of the global structure, the V gene if necessary and all unique amino acids at the variable position. Information are separated by underscores.
cluster_size : number of members in the cluster
unique_cdr3_sample : number of unique CDR3b sequences in the cluster
unique_cdr3_ref : number of unique CDR3b sequences containing this motif or structure in the reference repertoire
OvE : fold enrichment of the motif or structure in the sample set compared to the reference set
fisher.score : p-value of the Fisher's Exact Test for this motif/structure
total.score : summarized value of all following scores
network.size.score : score for the probability to get a cluster with this size
cdr3.length.score : score for the enrichment of CDR3b lengths
vgene.score : score for the enrichment of the V genes (optional)
clonal.expansion.score : score for enrichment of clonal expansion (optional)
hla.score : score for enrichment of common HLA alleles (optional)
lowest.hlas : list of enriched HLA alleles (optional)

Function: de_novo_TCRs

This function is an automated implementation of the method used in the GLIPH paper for in silico prediction of CDR3b sequences with similar specificity based on a specificity cluster identified by GLIPH. The underlying algorithm is based on the method used in the original publication.(^1)

Briefly, depending on the CDR3b sequence length, a global positional weight matrix (PWM) is formed with the positional abundances of amino acids in the cluster. Based on the probabilities of the PWM, a huge number of random sequences are generated. To rank the generated sequences, a score is assigned to each sequence. This score is calculated from the product of the frequencies of the first ten N-terminal amino acids of a sequence in the PWM. In addition, this score can be normalized to the probability of obtaining at most such a low score in a naive reference repertoire.

This function is used by passing the results of the turbo_gliph, gliph2 or gliph_combined function and the tag of the desired specificity cluster to the function. In this case, the first cluster is used as an example. The parameter make_figure is used to specify whether a figure similar to the original publication should be automatically displayed for convergence of the scores of the generated sequences.(^1)

new_seqs <- turboGliph::de_novo_TCRs(convergence_group_tag = res_turbogliph$cluster_properties$tag[1],
                                     clustering_output = res_turbogliph,
                                     sims = 10000,
                                     make_figure = TRUE,
                                     n_cores = 1)

The generated sequences and their scores can be found as a data frame in the first element of the output list:

new_seqs$de_novo_sequences[1:3,]

References

1: Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.
2: https://github.com/immunoengineer/gliph
3: Huang, Huang, et al. "Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening." Nature Biotechnology 38.10 (2020): 1194-1202.
4: http://50.255.35.37:8080/tools

Sessioninfo

packageVersion("turboGliph")

sessionInfo()

HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

HetzDra/turboGliph Find Specificity Groups with GLIPH and GLIPH2 Method

In HetzDra/turboGliph: Find Specificity Groups with GLIPH and GLIPH2 Method

Introduction

Installation

Required input data

Individual reference repertoire

Function: turbo_gliph

Brief overview of the algorithm

Calling the function and output explanation

$sample_log

$motif_enrichment

$connections:

$cluster_properties:

$cluster_list:

$parameters:

Function: gliph2

Brief overview of the algorithm

Calling the function and output explanation

$motif_enrichment

$global_enrichment

$connections:

$cluster_properties:

$cluster_list

$parameters

Function: gliph_combined

Brief overview of the algorithm

Calling the function and output explanation

$sample_log

$motif_enrichment

$global_enrichment

$connections:

$cluster_properties:

$cluster_list

$parameters

Function: plot_network

Function: de_novo_TCRs

References

Sessioninfo

R Package Documentation

Browse R Packages

We want your feedback!

HetzDra/turboGliph
Find Specificity Groups with GLIPH and GLIPH2 Method