turbo_gliph: Grouping of Lymphocyte Interactions by Paratope Hotspots

View source: R/turbo_gliph_function_foreach.R

turbo_gliph R Documentation

Grouping of Lymphocyte Interactions by Paratope Hotspots

Description

Identification of specificity groups in the T cell repertoire based on local and global similarities between sample sequences. This R implementation is based on the GLIPH algorithm described by Glanville et al. Depending on the input sample size, it is approximately 100 times faster than the original Perl script.

Usage

turbo_gliph(
  cdr3_sequences,
  result_folder = "",
  refdb_beta = "gliph_reference",
  v_usage_freq = NULL,
  cdr3_length_freq = NULL,
  ref_cluster_size = "original",
  sim_depth = 1000,
  lcminp = 0.01,
  lcminove = c(1000, 100, 10),
  kmer_mindepth = 3,
  accept_sequences_with_C_F_start_end = TRUE,
  min_seq_length = 8,
  gccutoff = NULL,
  structboundaries = TRUE,
  boundary_size = 3,
  motif_length = base::c(2, 3, 4),
  discontinuous = FALSE,
  make_depth_fig = FALSE,
  local_similarities = TRUE,
  global_similarities = TRUE,
  global_vgene = FALSE,
  positional_motifs = FALSE,
  cdr3_len_stratify = FALSE,
  vgene_stratify = FALSE,
  public_tcrs = TRUE,
  cluster_min_size = 2,
  hla_cutoff = 0.1,
  n_cores = 1
)

Arguments

cdr3_sequences

vector or data frame. A data frame must contain the CDR3 sequences and may contain optional additional information. The columns must be named as specified in the following list; their order is arbitrary.

  • "CDR3b": cdr3 sequences of beta chains

  • "TRBV": optional. V-genes of beta chains

  • "patient": optional. Index of the donor from which the respective sequence was obtained. The value consists of the donor index and an optional experimental condition, separated by a colon (example: 09/0410:MtbLys). Only the index before the colon is used to calculate the HLA scores.

  • "HLA": optional. HLA alleles of the respective donor, separated by commas. The standard notation of HLA alleles is expected (example: DPA1*01:03). For the calculation of HLA scores, information after the colon is ignored.

  • "counts": optional. Frequency of occurrence of the respective clone.
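As an illustration of the column layout described above, the following sketch builds a small input data frame and shows which parts of the "patient" and "HLA" values are relevant for HLA scoring. The sequences, V-genes, the second donor index and the HLA alleles are made up for this example, and the two helper vectors are not part of the package.

```r
# Hypothetical toy input; all values are invented for illustration only.
toy_input <- data.frame(
  CDR3b   = c("CASSLAPGATNEKLFF", "CASSLAPGTTNEKLFF", "CASSIRSSYEQYF"),
  TRBV    = c("TRBV7-9", "TRBV7-9", "TRBV19"),
  patient = c("09/0410:MtbLys", "09/0410:MtbLys", "03/0492:MtbLys"),
  HLA     = c("DPA1*01:03,DRB1*15:01",
              "DPA1*01:03,DRB1*15:01",
              "DPA1*01:03,DRB1*07:01"),
  counts  = c(12, 3, 1),
  stringsAsFactors = FALSE
)

# Donor index: the part of "patient" before the colon.
donor_index <- sub(":.*$", "", toy_input$patient)

# HLA alleles per row, split at commas, with everything after the colon dropped.
hla_short <- lapply(strsplit(toy_input$HLA, ",", fixed = TRUE),
                    function(alleles) sub(":.*$", "", alleles))
```

A data frame of this shape could then be passed as cdr3_sequences; a realistic analysis needs far more sequences than this toy example.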

result_folder

character. By default "". Path to the folder in which the output files should be stored. If the value is "" the results will not be saved in files.

refdb_beta

character or data frame. By default "gliph_reference". Specifies the reference database to be used. For an individual reference database, a data frame is expected as input. In its first column, the CDR3b sequences must be specified and, if required, the V genes must be specified in the second column. Additional reference databases were provided for download by the developers of GLIPH2 in the web tool (http://50.255.35.37:8080/tools). To use the predefined database, the following keyword must be specified:

  • "gliph_reference": Reference database of 162,165 CDR3b sequences of naive human CD4+ or CD8+ T cells of two individuals used for the GLIPH paper.
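A minimal sketch of an individual reference database in the layout described above (sequences in the first column, V genes in the second). The column names and all values are assumptions of this example; this page only specifies the column order.

```r
# Hypothetical custom reference; a usable reference needs many more sequences.
custom_refdb <- data.frame(
  CDR3b = c("CASSLGQGAYEQYF", "CASSPTGGYEQYF", "CASSLRDRGNTEAFF"),
  TRBV  = c("TRBV5-1", "TRBV7-2", "TRBV12-3"),
  stringsAsFactors = FALSE
)

# Column order matters: CDR3b sequences first, V genes second.
# The data frame would be supplied through the refdb_beta argument.
```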

v_usage_freq

data frame. By default NULL. This data frame contains the frequency of V-genes in a naive T cell repertoire required for scoring. The first column provides the V-gene alleles and the second column the frequencies. If the value is NULL, default frequencies are used.

cdr3_length_freq

data frame. By default NULL. This data frame contains the frequency of CDR3 lengths in a naive T cell repertoire required for scoring. The first column provides the CDR3 lengths and the second column the frequencies. If the value is NULL, default frequencies are used.
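The two-column layout of v_usage_freq and cdr3_length_freq can be sketched as follows. The toy reference values and the column names are assumptions of this example. Whether the function expects fractions or percentages is not stated on this page; fractions are used here.

```r
# Hypothetical naive reference repertoire (invented values).
ref_cdr3   <- c("CASSLGQGAYEQYF", "CASSPTGGYEQYF", "CASSLRDRGNTEAFF", "CASSIRSSYEQYF")
ref_vgenes <- c("TRBV5-1", "TRBV7-2", "TRBV7-2", "TRBV19")

# CDR3 length frequencies: first column lengths, second column frequencies.
len_tab <- table(nchar(ref_cdr3))
cdr3_length_freq <- data.frame(
  cdr3_length = as.integer(names(len_tab)),
  frequency   = as.numeric(len_tab) / sum(len_tab),
  stringsAsFactors = FALSE
)

# V-gene usage frequencies: first column V-genes, second column frequencies.
vg_tab <- table(ref_vgenes)
v_usage_freq <- data.frame(
  v_gene    = names(vg_tab),
  frequency = as.numeric(vg_tab) / sum(vg_tab),
  stringsAsFactors = FALSE
)
```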

ref_cluster_size

character. Either "original" or "simulated", by default "original". Defines the probabilities used for calculating the cluster size score. In the case of "original", the standard probabilities of the original algorithm which are constant for all sample sizes are used. However, since the distribution of cluster sizes depends on the sample size, we estimated the probabilities for different sample sizes in a 500-step simulation using random sequences from the reference database. To use these probabilities, the keyword "simulated" must be specified.

sim_depth

numeric. By default 1000. Simulated resampling depth for non-parametric convergence significance tests. This defines the number of random repeat samplings into the reference distribution that GLIPH performs. A higher number will take longer to run but will produce more reproducible and reliable results.

lcminp

numeric. By default 0.01. Local convergence maximum probability score cutoff. The score reports the probability that a random sample drawn from the reference set (i.e. naive repertoire), of the same size as the sample set, would generate an enrichment of the given motif at least as high as has been observed in the sample set.

lcminove

numeric. By default c(1000, 100, 10). Local convergence minimum observed vs. expected fold change. This is the cutoff for the minimum fold enrichment over a reference distribution that a given motif must have in the sample set in order to be considered for further evaluation. By default, the minimum fold enrichment (1000, 100, 10) depends on the motif length (2, 3, 4 amino acids). lcminove must be either a single numeric value or a numeric vector of the same length as motif_length, giving the minimum fold enrichment for each respective motif length.
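To make the roles of sim_depth, lcminp and lcminove concrete, here is a minimal resampling sketch. It is not the package's implementation: the helper functions, the toy sequences, the counting rule (one count per sequence containing the motif) and the strict comparison against lcminp are assumptions of this example.

```r
# Count the sequences that contain a motif at least once.
count_motif <- function(seqs, motif) sum(grepl(motif, seqs, fixed = TRUE))

# Compare the observed count with counts in random draws from the reference.
motif_enrichment <- function(sample_seqs, ref_seqs, motif, sim_depth = 100) {
  observed <- count_motif(sample_seqs, motif)
  simulated <- vapply(seq_len(sim_depth), function(i) {
    count_motif(sample(ref_seqs, length(sample_seqs)), motif)
  }, integer(1))
  list(
    observed = observed,
    ove      = observed / mean(simulated),  # Inf if the motif never shows up in the draws
    p_value  = mean(simulated >= observed)  # share of draws at least as enriched
  )
}

# Toy data: the motif "GGG" is rare in the reference and frequent in the sample.
set.seed(1)
ref_seqs    <- c(rep("CASSGGGEQYF", 10), rep("CASSLAPEQYF", 990))
sample_seqs <- c(rep("CASSGGGEQYF", 20), rep("CASSLAPEQYF", 30))
res_ggg <- motif_enrichment(sample_seqs, ref_seqs, "GGG")

# Pick the fold-change cutoff that belongs to the motif's length.
motif_length <- c(2, 3, 4)
lcminove     <- c(1000, 100, 10)
lcminp       <- 0.01
min_ove <- lcminove[match(nchar("GGG"), motif_length)]
passes  <- res_ggg$p_value < lcminp && res_ggg$ove >= min_ove
```

A larger sim_depth gives a finer resolution for the empirical p-value, which is why a higher value makes results more reproducible at the cost of runtime.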

kmer_mindepth

numeric. By default 3. Minimum observations of a kmer for it to be evaluated. This is the minimum number of times a kmer must be observed in the sample set in order to be considered for further evaluation. The number can be set higher to obtain fewer motif-based clusters with higher confidence, which may be advisable if the sample set is greater than 5000 reads. Lowering the value to 2 will identify more groups but likely at the cost of an increased false discovery rate.
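The effect of kmer_mindepth can be sketched with a simple k-mer count. The helper extract_kmers and the toy sequences are hypothetical, and counting each k-mer at most once per sequence is an assumption of this example rather than documented behaviour.

```r
# All distinct contiguous k-mers of one sequence.
extract_kmers <- function(seq, k) {
  n <- nchar(seq)
  if (n < k) return(character(0))
  unique(substring(seq, 1:(n - k + 1), k:n))
}

toy_seqs      <- c("SLAPGA", "SLAPGT", "TLAPGS", "SIRSSY")
kmer_mindepth <- 3

# Number of sequences in which each 3-mer occurs.
kmer_counts <- table(unlist(lapply(toy_seqs, extract_kmers, k = 3)))

# Only k-mers seen at least kmer_mindepth times are evaluated further.
kept_kmers <- names(kmer_counts)[kmer_counts >= kmer_mindepth]
```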

accept_sequences_with_C_F_start_end

logical. By default TRUE. If TRUE, only sequences with amino acid C at the start position and amino acid F at the end position are accepted. Set this flag to FALSE to analyze sequences of a different origin, for example B cells.
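The filter described above amounts to a simple pattern check. The following sketch uses a hypothetical helper and made-up sequences; it is not taken from the package source.

```r
# TRUE for sequences that start with C and end with F.
has_c_f_ends <- function(seqs) grepl("^C.*F$", seqs)

toy_seqs <- c("CASSLAPGATNEKLFF", "ASSLAPGATNEKLF", "CARDYW")
accepted <- toy_seqs[has_c_f_ends(toy_seqs)]
```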

min_seq_length

numeric. By default 8. All the sequences with a length less than this value will be filtered out in input and reference database. If structboundaries is TRUE, it is recommended not to go below the default. In this case, the min_seq_length will be set to the maximum of 2*boundary_size+2 and min_seq_length.

gccutoff

numeric. By default NULL. Global convergence distance cutoff. This is the maximum CDR3 Hamming mutation distance between two clones sharing the same CDR3 length in order for them to be considered likely to bind the same antigen.

This number will change depending on sample depth, as with more reads, the odds of finding a similar sequence increase even in a naive repertoire.

This number will also change depending on the species evaluated and even the choice of reference database (memory TCRs will be more likely to have similar TCRs than naive TCR repertoires). Thus, if not specified, the cutoff is calculated at runtime: it is set to 2 if the sample depth is less than 125 and to 1 otherwise.
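A short sketch of the Hamming distance and of the default rule stated above. Both helpers are hypothetical, and reading "sample depth" as the number of input sequences is an assumption. For simplicity the distance is computed on whole sequences here.

```r
# Number of mismatching positions between two sequences of equal length.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Default cutoff as described: 2 below a sample depth of 125, otherwise 1.
default_gccutoff <- function(sample_depth) if (sample_depth < 125) 2 else 1

d <- hamming_distance("CASSLAPGATNEKLFF", "CASSLAPGTTNEKLFF")
globally_similar <- d <= default_gccutoff(200)
```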

structboundaries

logical. By default TRUE. By setting this flag to TRUE the first boundary_size and the last boundary_size amino acids of each sequence will not be considered in the analysis for computing the Hamming distance and motif enrichment in input and reference database.

boundary_size

numeric. By default 3. Specifies the boundary size if structboundaries is active.
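The boundary handling and its interaction with min_seq_length can be sketched as follows. The helpers are hypothetical and only mirror the description given on this page.

```r
# Drop the first and last boundary_size residues of each sequence.
trim_boundaries <- function(seqs, boundary_size = 3) {
  substr(seqs, boundary_size + 1, nchar(seqs) - boundary_size)
}

# Effective minimum length when structboundaries is TRUE.
effective_min_seq_length <- function(min_seq_length = 8, boundary_size = 3) {
  max(2 * boundary_size + 2, min_seq_length)
}

core <- trim_boundaries("CASSLAPGATNEKLFF")
```

With the defaults, a sequence of the minimum length 8 keeps a core of two residues after trimming, which matches the shortest default motif length.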

motif_length

numeric vector. By default c(2, 3, 4). Lengths of the motifs that GLIPH should search for and evaluate. By default, it searches for motifs of 2, 3 and 4 amino acids.

discontinuous

logical. By default FALSE. Determines whether discontinuous motifs should be considered.

make_depth_fig

logical. By default FALSE. If TRUE, repeated random sampling is performed at the set sample depth to visualize the global convergence.

local_similarities

logical. By default TRUE. Determines whether the sequences should be analyzed for local similarity.

global_similarities

logical. By default TRUE. Determines whether the sequences should be analyzed for global similarity.

global_vgene

logical. By default FALSE. If TRUE, global similarities are restricted to TCRs with a shared V-gene. Requires V-gene information in cdr3_sequences.

positional_motifs

logical. By default FALSE. If TRUE, local similarity is restricted to TCRs with identical motif position relative to the N-terminus.

cdr3_len_stratify

logical. By default FALSE. Specifies whether the distribution of the cdr3 lengths in the sample should be retained during repeat random sampling.
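One way to retain the CDR3 length distribution during resampling is to draw from the reference separately for each length. This is a sketch under that assumption; the function name, the toy data and the fallback to sampling with replacement are not taken from the package.

```r
# Draw from the reference so that the length distribution matches the sample.
stratified_resample <- function(sample_seqs, ref_seqs) {
  len_tab <- table(nchar(sample_seqs))
  need <- setNames(as.integer(len_tab), names(len_tab))
  ref_by_len <- split(ref_seqs, nchar(ref_seqs))
  draws <- lapply(names(need), function(len) {
    pool <- ref_by_len[[len]]
    if (is.null(pool)) stop("no reference sequences of length ", len)
    # Fall back to sampling with replacement if the pool is too small.
    sample(pool, need[[len]], replace = length(pool) < need[[len]])
  })
  unlist(draws, use.names = FALSE)
}

set.seed(42)
toy_sample <- c("CASSLGF", "CASSPGF", "CASSLAPGF")
toy_ref    <- c("CASSQGF", "CASSRGF", "CASSTGF",
                "CASSLTTGF", "CASSLRRGF", "CASSLAPGTF")
drawn <- stratified_resample(toy_sample, toy_ref)
```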

vgene_stratify

logical. By default FALSE. Specifies whether the distribution of V-genes in the sample should be retained during repeat random sampling. Requires V-gene information in cdr3_sequences.

public_tcrs

logical. By default TRUE. Specifies whether a cluster may contain sequences from different donors (public TCRs). If FALSE, a cluster may only contain sequences of the same donor. Requires donor information in cdr3_sequences.

cluster_min_size

numeric. By default 2. Minimal size of a cluster required to be considered for scoring.

hla_cutoff

numeric. By default 0.1. Defines the threshold of HLA probability scores below which HLA alleles are considered significant.

n_cores

numeric. By default 1. Number of cores to use. If NULL, it is set to the number of cores of the machine minus 2.
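The NULL behaviour described above can be sketched with parallel::detectCores(). The helper name and the lower bound of one core are additions of this example, not documented behaviour.

```r
# Resolve the number of cores: NULL means "all detected cores minus 2".
resolve_n_cores <- function(n_cores = 1) {
  if (is.null(n_cores)) {
    detected <- parallel::detectCores()
    # Guard added in this sketch: never return fewer than one core.
    n_cores <- max(1L, detected - 2L, na.rm = TRUE)
  }
  n_cores
}

cores_default <- resolve_n_cores()
cores_auto    <- resolve_n_cores(NULL)
```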

Value

This function returns a list of six elements whose contents are explained below. If a folder path is specified under result_folder, the results are additionally stored there. The individual file names are also specified below; name parts in angle brackets stand for the given value of the corresponding parameter.

$sample_log: A data frame with 1 + sim_depth rows representing observations for all the possible k-mer motifs. The first row, Discovery, contains the actual observation counts in the input sample; each remaining row shows the observation counts in one subsample from the reference database.
File name: kmer_resample_<sim_depth>_log.txt

$motif_enrichment: A list of two data frames. selected_motifs contains only the motifs that pass the filtering criteria (ove and p-value), whereas all_motifs contains the p-value and ove of all motifs.
File name of selected_motifs: kmer_resample_<sim_depth>_minp<lcminp>_ove<lcminove>.txt
File name of all_motifs: kmer_resample_<sim_depth>_all_motifs.txt

$connections: Contains the edge list. Each row consists of two nodes (cdr3 sequences) and a third column which shows whether they are similar based on global or local similarity.
File name: clone_network.txt

$cluster_properties: A data frame consisting of cluster_size, leader_tag, all available cluster scores (total score, cluster size, cdr3 length enrichment, V-gene enrichment, enrichment of clonal expansion and enrichment of common HLA) and members for each cluster/component of the specificity network.
File name: convergence_groups.txt
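The relation between the edge list in $connections and the clusters (components of the specificity network) can be illustrated with a small connected-components computation in base R. The column names V1, V2 and type, as well as the sequences, are hypothetical; the package may derive its clusters differently.

```r
# Hypothetical edge list in the spirit of $connections.
edges <- data.frame(
  V1   = c("CASSLAPGATNEKLFF", "CASSLAPGTTNEKLFF", "CASSIRSSYEQYF"),
  V2   = c("CASSLAPGTTNEKLFF", "CASSLAPGSTNEKLFF", "CASSIRSAYEQYF"),
  type = c("global", "global", "local"),
  stringsAsFactors = FALSE
)

# Start with one label per node, then propagate the smaller label along
# every edge until nothing changes; equal labels mean the same component.
nodes <- unique(c(edges$V1, edges$V2))
component <- setNames(seq_along(nodes), nodes)
repeat {
  previous <- component
  for (i in seq_len(nrow(edges))) {
    label <- min(component[edges$V1[i]], component[edges$V2[i]])
    component[edges$V1[i]] <- label
    component[edges$V2[i]] <- label
  }
  if (identical(previous, component)) break
}

# Members of each cluster and the cluster sizes.
clusters      <- split(names(component), component)
cluster_sizes <- lengths(clusters)
```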

$cluster_list: A list containing the members of each cluster together with their additional information. The elements of the list are named according to the respective cluster tag.
File name: cluster_member_details.txt

$parameters: A data frame containing all given input parameter values.
File name: parameter.txt

References

Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.

https://github.com/immunoengineer/gliph

Examples

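# Load the example data set gliph_input_data
utils::data("gliph_input_data")
# Analyze the first 200 rows with a reduced resampling depth on a single core
res <- turbo_gliph(cdr3_sequences = gliph_input_data[base::seq_len(200),],
                   sim_depth = 100,
                   n_cores = 1)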


HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.