gliph2: Grouping of Lymphocyte Interactions by Paratope Hotspots...
In HetzDra/turboGliph: Find Specificity Groups with GLIPH and GLIPH2 Method

gliph2

R Documentation

Grouping of Lymphocyte Interactions by Paratope Hotspots version 2

Description

Identifies specificity groups in the T cell receptor repertoire. Implements GLIPH2 as described in the publication by Huang et al.

Usage

gliph2(
  cdr3_sequences,
  result_folder = "",
  refdb_beta = "gliph_reference",
  v_usage_freq = NULL,
  cdr3_length_freq = NULL,
  ref_cluster_size = "original",
  sim_depth = 1000,
  lcminp = 0.01,
  lcminove = c(1000, 100, 10),
  motif_distance_cutoff = 3,
  kmer_mindepth = 3,
  accept_sequences_with_C_F_start_end = TRUE,
  min_seq_length = 0,
  structboundaries = TRUE,
  boundary_size = 3,
  motif_length = base::c(2, 3, 4),
  discontinuous_motifs = FALSE,
  local_similarities = TRUE,
  global_similarities = TRUE,
  global_vgene = FALSE,
  all_aa_interchangeable = FALSE,
  boost_local_significance = TRUE,
  cluster_min_size = 2,
  hla_cutoff = 0.1,
  n_cores = 1
)

Arguments

`cdr3_sequences`	vector or data frame. The data frame must contain the CDR3 sequences and may contain additional information. The columns must be named as specified in the following list, in arbitrary order:
"CDR3b": CDR3 sequences of beta chains.
"TRBV": optional. V-genes of beta chains.
"patient": optional. Index of the donor from which the respective sequence was obtained. The value is composed of the index of the donor and an optional experimental condition, separated by a colon (example: 09/0410:MtbLys). For the calculation of the HLA scores, only the index before the colon is used.
"HLA": optional. HLA alleles of the respective donor. The HLA alleles of a patient are separated by commas. The standard notation of HLA alleles is expected (example: DPA1*01:03). For the calculation of HLA scores, information after the colon is ignored.
"counts": optional. Frequency of occurrence of the respective clone.
`result_folder`	character. By default `""`. Path to the folder in which the output files should be stored. If the value is `""` the results will not be saved in files.
`refdb_beta`	character or data frame. By default `"gliph_reference"`. Specifies the reference database to be used. For an individual reference database, a data frame is expected as input. In its first column, the CDR3b sequences must be specified and, if required, the V genes must be specified in the second column. Additional reference databases were provided for download by the developers of GLIPH2 in the web tool (http://50.255.35.37:8080/tools). To use the predefined database, the following keyword must be specified: "gliph_reference": Reference database of 162,165 CDR3b sequences of naive human CD4+ or CD8+ T cells of two individuals used for the GLIPH paper.
`v_usage_freq`	data frame. By default `NULL`. This data frame contains the frequency of V-genes in a naive T cell repertoire required for scoring. The first column provides the V-gene alleles and the second column the frequencies. If the value is `NULL`, default frequencies are used.
`cdr3_length_freq`	data frame. By default `NULL`. This data frame contains the frequency of CDR3 lengths in a naive T cell repertoire required for scoring. The first column provides the CDR3 lengths and the second column the frequencies. If the value is `NULL`, default frequencies are used.
`ref_cluster_size`	character. Either `"original"` or `"simulated"`, by default `"original"`. Defines the probabilities used for calculating the cluster size score. In the case of `"original"`, the standard probabilities of the original algorithm which are constant for all sample sizes are used. However, since the distribution of cluster sizes depends on the sample size, we estimated the probabilities for different sample sizes in a 500-step simulation using random sequences from the reference database. To use these probabilities, the keyword `"simulated"` must be specified.
`sim_depth`	numeric. By default 1000. Simulated resampling depth for assessing V gene and CDR3 length enrichment scores of clusters.
`lcminp`	numeric. By default 0.01. Local convergence maximum probability score cutoff. The score reports the probability that a random sample of the same size as the sample set, but drawn from the reference set (i.e. naive repertoire), would generate an enrichment of the given motif at least as high as has been observed in the sample set.
`lcminove`	numeric. By default `c(1000, 100, 10)`. Local convergence minimum observed vs expected fold change. This is the cutoff for the minimum fold enrichment over a reference distribution that a given motif must have in the sample set in order to be considered for further evaluation. By default, the minimum fold enrichment (1000, 100, 10) depends on the motif length (2, 3, 4 amino acids). `lcminove` has to be either a single numeric value or a numeric vector of the same length as `motif_length`, giving the minimum fold enrichment for the respective motif length.
`motif_distance_cutoff`	numeric. By default 3. Defines the number of positions between which motifs for a local connection are allowed to vary.
`kmer_mindepth`	numeric. By default 3. Minimum observations of a kmer for it to be evaluated. This is the minimum number of times a kmer must be observed in the sample set in order to be considered for further evaluation. The number can be set higher to provide fewer motif-based clusters with higher confidence, which may be advisable if the sample set is greater than 5000 reads. Lowering the value to 2 will identify more groups, but likely at the cost of an increased false discovery rate.
`accept_sequences_with_C_F_start_end`	logical. By default `TRUE`. If `TRUE`, only sequences with amino acid C at the start position and amino acid F at the end position are accepted. Set this flag to `FALSE` if you wish to analyze sequences of different origin, for example from B cells.
`min_seq_length`	numeric. By default 0. All sequences shorter than this value are filtered out of the input and the reference database. If structboundaries is `TRUE`, min_seq_length is set to the maximum of 2*boundary_size+2 and min_seq_length, which gives an effective minimum of 8 with the default boundary_size of 3; going below this value is not recommended.
`structboundaries`	logical. By default `TRUE`. By setting this flag to `TRUE` the first boundary_size and the last boundary_size amino acids of each sequence will not be considered in the analysis for computing the Hamming distance and motif enrichment in input and reference database.
`boundary_size`	numeric. By default 3. Specifies the boundary size if structboundaries is active.
`motif_length`	numeric vector. By default `c(2, 3, 4)`. Motif lengths (in amino acids) that GLIPH2 should search for and evaluate.
`discontinuous_motifs`	logical. By default `FALSE`. Determines whether discontinuous motifs are to be considered.
`local_similarities`	logical. By default `TRUE`. Determines whether the sequences should be analyzed for local similarity.
`global_similarities`	logical. By default `TRUE`. Determines whether the sequences should be analyzed for global similarity.
`global_vgene`	logical. By default `FALSE`. If `TRUE` global similarities are restricted to TCRs with shared V-gene. Requires V-gene information in `cdr3_sequences`.
`all_aa_interchangeable`	logical. By default `FALSE`. If `TRUE`, all sequences with a Hamming distance of 1 are evaluated as global similarities. If `FALSE`, only sequences with a Hamming distance of 1 whose differing amino acids have a BLOSUM62 score >= 0 are evaluated as global similarities.
`boost_local_significance`	logical. By default `TRUE`. If set to `TRUE`, fisher scores of local clusters are repeatedly divided by 2 for every unique CDR3 sequence in the cluster in which the motif overlaps with non-germline encoded N- or P-nucleotides.
`cluster_min_size`	numeric. By default 2. Minimal size of a cluster required to be considered for scoring.
`hla_cutoff`	numeric. By default 0.1. Defines the threshold of HLA probability scores below which HLA alleles are considered significant.
`n_cores`	numeric. By default 1. Number of cores to use. If `NULL`, it is set to the number of cores of your machine minus 2.
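
The interplay of min_seq_length, structboundaries and boundary_size described above, together with the Hamming-distance criterion that applies to global similarities when all_aa_interchangeable is `TRUE`, can be sketched in base R. This is an illustration under stated assumptions rather than the package's internal code: the helper names effective_min_length, trim_boundaries and hamming_distance are hypothetical, and the sequences are made up.

```r
# Effective minimum length as described for min_seq_length:
# with structboundaries = TRUE it is max(2 * boundary_size + 2, min_seq_length).
effective_min_length <- function(min_seq_length = 0, structboundaries = TRUE,
                                 boundary_size = 3) {
  if (structboundaries) max(2 * boundary_size + 2, min_seq_length) else min_seq_length
}

# Drop the first and last boundary_size residues of each sequence.
trim_boundaries <- function(seqs, boundary_size = 3) {
  substr(seqs, boundary_size + 1, nchar(seqs) - boundary_size)
}

# Number of mismatching positions between two strings of equal length.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seqs <- c("CASSLGQGAYEQYF", "CASSLGQGSYEQYF", "CASSF")  # made-up sequences

min_len <- effective_min_length()        # 8 with the default arguments
kept    <- seqs[nchar(seqs) >= min_len]  # "CASSF" is filtered out
cores   <- trim_boundaries(kept)         # "SLGQGAYE" "SLGQGSYE"

# all_aa_interchangeable = TRUE case: a distance of exactly 1 counts as globally similar.
is_global_pair <- hamming_distance(cores[1], cores[2]) == 1
```

With all_aa_interchangeable set to `FALSE`, the differing amino acid pair would additionally have to reach a BLOSUM62 score of at least 0; that lookup is not shown here.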

Value

This function returns a list of six elements whose contents are explained below. If a file path is specified under result_folder, the results are additionally stored there. The individual file names are also specified below; name parts that match a parameter name (lcminp, lcminove, kmer_mindepth) stand for the given value of that parameter.

$motif_enrichment: A list of two data frames. selected_motifs contains only the motifs that pass the filtering criterion (ove and p-value), whereas all_motifs contains p-value and ove of all motifs.
File name of selected_motifs: local_similarities_minp_lcminp_ove lcminove_kmer_mindepth kmer_mindepth.txt
File name of all_motifs: all_motifs.txt

$global_enrichment: A data frame containing all identified global structures and their corresponding information.
File name: global_similarities.txt

$connections: Contains the edge list. Each row consists of two nodes (CDR3 sequences) and a third column which shows whether they are similar based on global or local similarity. An additional fourth column contains the cluster tag (motif or sequence structure) by which the sequences are clustered.
File name: clone_network.txt
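
As a rough sketch of how such an edge list can be used, the following base R snippet collects the unique sequences attached to each cluster tag. The data frame is a toy stand-in: the column names node1, node2, type and tag, the tag values and the sequences are assumptions made for illustration, not the names used by the package.

```r
# Toy edge list in the shape described for $connections;
# column names and values are hypothetical.
connections <- data.frame(
  node1 = c("CASSLGQGAYEQYF", "CASSLGQGAYEQYF", "CASSPGTGGNEQFF"),
  node2 = c("CASSLGQGSYEQYF", "CASSLGQGTYEQYF", "CASSPGTGANEQFF"),
  type  = c("global", "global", "local"),
  tag   = c("tag_A", "tag_A", "tag_B"),
  stringsAsFactors = FALSE
)

# Collect the unique sequences that appear in the edges of each cluster tag.
members_by_tag <- lapply(split(connections, connections$tag),
                         function(edges) unique(c(edges$node1, edges$node2)))

# Number of distinct sequences per tag.
cluster_sizes <- lengths(members_by_tag)
```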

$cluster_properties: A data frame summarising the following information for each cluster:

"type": Indicates the type of similarity in the cluster (either global or local).
"tag": In the case of local similarities, the motif is indicated as well as the range of positions where the motif is located in the sequences. In the case of global similarities, the basic structure of the sequence is given, followed by all amino acids, separated by spaces, that occur in the sample at the position marked by the wildcard character.
"cluster_size": Number of all sample sequences in the cluster.
"unique_cdr3_sample": Number of all unique CDR3b sequences of the sample in the cluster.
"unique_cdr3_ref": Number of all unique CDR3b sequences of the reference database matching the cluster properties.
"OvE": Factor of enrichment of the local or global motif in the sample compared to the reference database.
"fisher.score": The p-value obtained by performing Fisher's exact test with a contingency table containing unique_cdr3_sample, unique_cdr3_ref, the number of remaining sample sequences and the number of remaining reference sequences. The score reports the probability that a random sample of the same size as the sample set, but drawn from the reference set (i.e. naive repertoire), would generate an enrichment of the given motif at least as high as has been observed in the sample set.
"members": All unique CDR3b sequences of the cluster separated by spaces.
"total.score": The product of all following scores.
"network.size.score": Probability of obtaining a cluster with this size in a naive repertoire.
"cdr3.length.score": Enrichment of CDR3b lengths within the cluster.
"vgene.score": Enrichment of V-genes within the cluster.
"clonal.expansion.score": Enrichment of clonal expansion within the cluster.
"hla.score": Enrichment of common HLA among donor TCR contributors in the cluster.
"lowest.hlas": Enriched HLA alleles within the cluster.

File name: convergence_groups.txt
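
The contingency table described for "fisher.score" can be sketched with `stats::fisher.test`. All counts below are hypothetical except the reference size, which is the one stated for "gliph_reference". The one-sided alternative and the formula used for the enrichment factor are assumptions; the package may compute both differently.

```r
# Hypothetical counts for a single cluster.
unique_cdr3_sample <- 5       # unique sample CDR3b sequences in the cluster
unique_cdr3_ref    <- 20      # unique reference CDR3b sequences matching the cluster
n_sample           <- 200     # all unique sample sequences
n_ref              <- 162165  # size stated for the "gliph_reference" database

# 2 x 2 table: cluster members versus remaining sequences, sample versus reference.
contingency <- matrix(
  c(unique_cdr3_sample, n_sample - unique_cdr3_sample,
    unique_cdr3_ref,    n_ref - unique_cdr3_ref),
  nrow = 2,
  dimnames = list(c("in_cluster", "remaining"), c("sample", "reference"))
)

# Assumption: one-sided test for enrichment in the sample.
fisher_score <- stats::fisher.test(contingency, alternative = "greater")$p.value

# Assumed definition of the enrichment factor: ratio of the two cluster fractions.
ove <- (unique_cdr3_sample / n_sample) / (unique_cdr3_ref / n_ref)
```

A small fisher_score together with a large ove indicates that the cluster pattern is far more frequent in the sample than in the reference.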

$cluster_list: A list containing the members and their additional information of each cluster. The elements of the list are named according to the appropriate cluster tag.
File name: cluster_member_details.txt

$parameters: A data frame containing all given input parameter values.
File name: parameter.txt

References

Huang, Huang, et al. "Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening." Nature Biotechnology 38.10 (2020): 1194-1202.

http://50.255.35.37:8080/

Examples

# Load the example data set
utils::data("gliph_input_data")

# Run GLIPH2 on the first 200 rows with a reduced simulation depth
res <- gliph2(cdr3_sequences = gliph_input_data[base::seq_len(200), ],
              sim_depth = 50,
              n_cores = 1)
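
For custom data, an input table with the column names listed under `cdr3_sequences` might be assembled as follows. This is a minimal sketch: all sequences, V-genes, donor indices (apart from the documented example 09/0410:MtbLys) and HLA alleles are made up, and the two parsing steps only mirror the documented conventions rather than reproduce the package's code.

```r
# Toy input table using the documented column names; the values are illustrative.
cdr3_input <- data.frame(
  CDR3b   = c("CASSLGQGAYEQYF", "CASSLGQGSYEQYF", "CASSPGTGGNEQFF"),
  TRBV    = c("TRBV7-9", "TRBV7-9", "TRBV5-1"),
  patient = c("09/0410:MtbLys", "09/0410:MtbLys", "02/0152:MtbLys"),
  HLA     = c("DPA1*01:03,DRB1*15:01", "DPA1*01:03,DRB1*15:01", "DPA1*02:01"),
  counts  = c(12, 3, 1),
  stringsAsFactors = FALSE
)

# Donor index: the part of "patient" before the colon.
donor_index <- sub(":.*$", "", cdr3_input$patient)

# HLA alleles: split on commas, then drop the information after the colon.
hla_alleles <- lapply(strsplit(cdr3_input$HLA, ",", fixed = TRUE),
                      function(alleles) sub(":.*$", "", alleles))
```

A table of this shape would then be passed as the `cdr3_sequences` argument of gliph2.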

HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.
