View source: R/gliph_combined.R
gliph_combined | R Documentation |
Identifying specificity groups in the T cell receptor repertoire. Detailed customizable implementation of GLIPH/GLIPH2.
gliph_combined(
  cdr3_sequences,
  result_folder = "",
  refdb_beta = "gliph_reference",
  v_usage_freq = NULL,
  cdr3_length_freq = NULL,
  ref_cluster_size = "original",
  min_seq_length = 8,
  accept_sequences_with_C_F_start_end = TRUE,
  structboundaries = TRUE,
  boundary_size = 3,
  local_similarities = TRUE,
  local_method = "rrs",
  lcsim_depth = 1000,
  motif_length = base::c(2, 3, 4),
  motif_distance_cutoff = 3,
  lcminp = 0.01,
  lcminove = c(1000, 100, 10),
  lckmer_mindepth = 3,
  discontinuous_motifs = FALSE,
  boost_local_significance = FALSE,
  cdr3_len_stratify = FALSE,
  vgene_stratify = FALSE,
  global_similarities = TRUE,
  global_method = "cutoff",
  gccutoff = NULL,
  gcminp = 1,
  gcminove = 0,
  gckmer_mindepth = 2,
  all_aa_interchangeable = FALSE,
  clustering_method = "GLIPH1.0",
  vgene_match = "none",
  public_tcrs = "all",
  cluster_min_size = 2,
  scoring_method = "GLIPH1.0",
  scoring_sim_depth = 1000,
  hla_cutoff = 0.1,
  n_cores = 1
)
cdr3_sequences |
vector or dataframe. This dataframe must contain the cdr3 sequences and optional additional information. The columns must be named as specified in the following list in arbitrary order.
|
result_folder |
character. By default |
refdb_beta |
character or data frame. By default
|
v_usage_freq |
data frame. By default |
cdr3_length_freq |
data frame. By default |
ref_cluster_size |
character. Either |
min_seq_length |
numeric. By default 8. All the sequences with a length less than this
value will be filtered out in input and reference database. If structboundaries
is |
accept_sequences_with_C_F_start_end |
logical. This logical flag
if |
structboundaries |
logical. By default |
boundary_size |
numeric. By default 3. Specifies the boundary size if structboundaries is active. |
local_similarities |
logical. By default |
local_method |
character. Either 'fisher' or 'rrs' (default). Determines the method for searching local similarities. In the case of 'rrs', repeated random sampling is performed as in the GLIPH algorithm. In the case of 'fisher', the long runtime of repeated random sampling is avoided by an approximation using Fisher's Exact Test, as established in GLIPH2. |
lcsim_depth |
numeric. By default 1000. Number of iterations for repeated random sampling, if local_method is set to 'rrs'. |
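The idea behind repeated random sampling can be sketched in base R. This is a minimal illustration, not the package's implementation: the sequences are toy data, `count_motif` is a hypothetical helper, and the sketch ignores options such as structboundaries. `sim_depth` plays the role of lcsim_depth.

```r
# Hypothetical helper: number of sequences that contain the motif.
count_motif <- function(seqs, motif) {
  sum(grepl(motif, seqs, fixed = TRUE))
}

set.seed(1)

# Toy data for illustration only, not real repertoires.
sample_seqs <- c("CASSLGQAYEQYF", "CASSPGQAYEQYF",
                 "CASSLAPGATNEKLFF", "CASRGQAYNEQFF")
ref_seqs <- c("CASSLGTDTQYF", "CASSPGQGNTEAFF", "CASSQETQYF",
              "CASSLAGGYEQYF", "CASSYSGNTIYF", "CASSGQAYGYTF",
              "CASSLDRNTEAFF", "CASSPTGELFF", "CASSLGGNQPQHF",
              "CASSFSTDTQYF")
motif <- "GQAY"
sim_depth <- 200  # stands in for lcsim_depth

# Observed count in the sample set.
observed <- count_motif(sample_seqs, motif)

# Draw reference subsamples of the same size as the sample set and
# count the motif in each of them.
resampled <- replicate(sim_depth, {
  draw <- sample(ref_seqs, length(sample_seqs), replace = FALSE)
  count_motif(draw, motif)
})

# Empirical probability that a reference subsample reaches the observed count.
p_value <- mean(resampled >= observed)

# Observed vs expected fold change (infinite if the motif never appears
# in any reference subsample).
ove <- observed / mean(resampled)
```

A larger `sim_depth` gives a finer-grained empirical probability at the cost of runtime, which is the trade-off the 'fisher' option is meant to avoid.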
motif_length |
numeric vector. The motif lengths that GLIPH2 should search for and evaluate. By default, motifs of 2, 3 and 4 amino acids are searched. |
motif_distance_cutoff |
numeric. By default 3. Defines the number of positions between which motifs for a local connection are allowed to vary. |
lcminp |
numeric. By default 0.01. Local convergence maximum probability score cutoff. The score reports the probability that a random sample of the same size as the sample set but of the reference set (i.e. naive repertoire) would generate an enrichment of the given motif at least as high as has been observed in the sample set. |
lcminove |
numeric vector. By default c(1000, 100, 10). Local convergence minimum observed vs expected fold change. This is a cutoff for the minimum fold enrichment over a reference distribution that a given motif should have in the sample set in order to be considered for further evaluation. By default, the minimum fold enrichment (1000, 100, 10) depends on the motif length (2, 3, 4 amino acids). |
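The length-dependent cutoff can be illustrated with a small base R sketch. It assumes that the i-th value of lcminove applies to the i-th value of motif_length. The enrichment table is hypothetical and is not output of the package.

```r
motif_length <- c(2, 3, 4)
lcminove <- c(1000, 100, 10)

# Assumed pairing: the i-th cutoff belongs to the i-th motif length.
cutoff_by_length <- setNames(lcminove, motif_length)

# Hypothetical motifs with made-up fold-change values.
motifs <- data.frame(
  motif = c("GQ", "QAY", "GQAY", "SLG"),
  ove = c(250, 180, 12, 40),
  stringsAsFactors = FALSE
)

# Look up the cutoff that matches each motif's length.
motifs$min_ove <- unname(cutoff_by_length[as.character(nchar(motifs$motif))])

# A motif passes if its fold change reaches the cutoff for its length.
motifs$passes <- motifs$ove >= motifs$min_ove
```

Shorter motifs occur by chance far more often, which is why they need a much higher fold change to pass.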
lckmer_mindepth |
numeric. By default 3. Minimum number of times a kmer must be observed in the sample set in order for it to be considered for further evaluation. The number can be set higher to obtain fewer motif-based clusters with higher confidence, which may be advisable if the sample set contains more than 5000 reads. Lowering the value to 2 will identify more groups, but likely at the cost of an increased false discovery rate. |
discontinuous_motifs |
logical. By default |
boost_local_significance |
logical. By default |
cdr3_len_stratify |
logical. By default |
vgene_stratify |
logical. By default |
global_similarities |
logical. By default |
global_method |
character. Either 'fisher' or 'cutoff' (default). Determines the method for searching global similarities. In the case of 'cutoff', global similarity is defined as falling below a cutoff (gccutoff) of the Hamming distance between two sequences, as in the GLIPH algorithm. This option requires clustering_method to be set to 'GLIPH1.0'. In the case of 'fisher', as in GLIPH2, Fisher's Exact Test is used to test for a significant enrichment of global structures in the sample set relative to the reference set. |
gccutoff |
numeric. Global convergence distance cutoff. Only considered if global_method is set to 'cutoff'. This is the maximum CDR3 Hamming mutation distance between two clones sharing the same CDR3 length in order for them to be considered likely to bind the same antigen. This number will change depending on sample depth, since with more reads the odds of finding a similar sequence increase even in a naive repertoire. It will also change depending on the species evaluated and even the choice of reference database (memory TCR repertoires are more likely to contain similar TCRs than naive TCR repertoires). Thus, by default this is calculated at runtime if not specified. If the sample depth is less than 125 it will be set to 2, otherwise it will be set to 1. |
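The cutoff rule and the distance it is applied to can be sketched in base R. This is an illustration under stated assumptions, not the package's code: `hamming` and `default_gccutoff` are hypothetical helper names, "sample depth" is taken to be the number of input sequences, and two sequences are treated as connected when their distance is at most gccutoff.

```r
# Hypothetical helper: Hamming distance between two equal-length sequences.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Default rule as described above: 2 below a depth of 125, otherwise 1.
default_gccutoff <- function(n_sample) {
  if (n_sample < 125) 2 else 1
}

# Toy sequences of identical length, for illustration only.
seqs <- c("CASSLGQAYEQYF", "CASSPGQAYEQYF", "CASSPGQGYEQYF")

# Assumption: sample depth is the number of input sequences.
gccutoff <- default_gccutoff(length(seqs))

d12 <- hamming(seqs[1], seqs[2])  # sequences differ at one position
d13 <- hamming(seqs[1], seqs[3])  # sequences differ at two positions

# Assumption: a pair is globally connected if its distance is <= gccutoff.
connected_12 <- d12 <= gccutoff
connected_13 <- d13 <= gccutoff
```

With a sample of 125 or more sequences the same rule would give a cutoff of 1, and only the first pair would remain connected.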
gcminp |
numeric. By default 1. Global convergence maximum probability score cutoff in case of global_method = 'fisher'. The score reports the probability that a random sample of the same size as the sample set but of the reference set (i.e. naive repertoire) would generate an enrichment of the given global structure at least as high as has been observed in the sample set. |
gcminove |
numeric. By default 0. Global convergence minimum observed vs expected fold change in case of global_method = 'fisher'. This is a cutoff for the minimum fold enrichment over a reference distribution that a given global enrichment should have in the sample set in order to be considered for further evaluation. |
gckmer_mindepth |
numeric. By default 2. Minimum number of times a global structure must be observed in the sample set in order for it to be considered for further evaluation in case of global_method = 'fisher'. The number can be set higher to obtain fewer motif-based clusters with higher confidence. Lowering the value will identify more groups, but likely at the cost of an increased false discovery rate. |
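How the three 'fisher' filters (gcminp, gcminove, gckmer_mindepth) could interact is sketched below with base R's `fisher.test`. The counts are invented, and the layout of the contingency table is an assumption for illustration. The package may construct its test differently.

```r
# Invented counts: sequences matching one global structure.
n_sample <- 200     # sequences in the sample set
k_sample <- 6       # of which match the structure
n_ref <- 100000     # sequences in the reference set
k_ref <- 30         # of which match the structure

# Assumed 2 x 2 layout: rows = match / no match, columns = sample / reference.
tab <- matrix(
  c(k_sample, n_sample - k_sample, k_ref, n_ref - k_ref),
  nrow = 2,
  dimnames = list(c("match", "no_match"), c("sample", "reference"))
)

# One-sided test for enrichment in the sample set.
ft <- fisher.test(tab, alternative = "greater")

# Observed vs expected fold change of the matching fraction.
ove <- (k_sample / n_sample) / (k_ref / n_ref)

# Filters with the documented default values.
gcminp <- 1
gcminove <- 0
gckmer_mindepth <- 2
keep <- ft$p.value <= gcminp && ove >= gcminove && k_sample >= gckmer_mindepth
```

With the defaults gcminp = 1 and gcminove = 0, only the minimum observation count filters anything. Tightening gcminp is the natural way to make the test restrictive.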
all_aa_interchangeable |
logical. Only used if global_method = 'fisher'. By default |
clustering_method |
character. Either 'GLIPH1.0' (default) or 'GLIPH2.0'. Determines the method for clustering. In the case of 'GLIPH1.0' clusters are generated by all locally and globally connected sequences. In the case of 'GLIPH2.0' clusters only contain sequences with one specific enriched local motif or one specific enriched global structure. |
vgene_match |
character. Specifies which connections are restricted by shared V genes. Can be set to the following values:
|
public_tcrs |
character. Specifies which connections are restricted by isolation from the same donor. Can be set to the following values:
|
cluster_min_size |
numeric. By default 2. Minimal size of a cluster required to be considered for scoring. |
scoring_method |
character. Either 'GLIPH1.0' (default) or 'GLIPH2.0'. Determines from which GLIPH version the scoring algorithm should be used for cluster scoring. The differences are mainly in the final multiplication of all subscores. |
scoring_sim_depth |
numeric. By default 1000. Simulated resampling depth for assessing V gene and CDR3 length enrichment scores of clusters. |
hla_cutoff |
numeric. By default 0.1. Defines the threshold of HLA probability scores below which HLA alleles are considered significant. |
n_cores |
numeric. Number of cores to use, by default 1. In case of |
This function returns a list of seven elements whose contents are explained below. If a file path is specified under result_folder, the results are additionally stored there. The individual file names are also specified below (italic name parts indicate the given value of the corresponding parameter).
$sample_log:
Only generated if local_method = 'rrs'. Contains a data frame with 1 + lcsim_depth rows representing observations for all possible k-mer motifs. The first row, Discovery, holds the actual observation counts in the input sample. Each remaining row holds the observation counts in one subsample of the reference database.
File name: kmer_resample_lcsim_depth_log.txt
$motif_enrichment:
A list of two data frames. selected_motifs contains only the motifs that pass the filtering criteria (ove and p-value), whereas all_motifs contains the p-value and ove of all motifs.
File name of selected_motifs: local_similarities_minp_lcminp_ove lcminove_kmer_mindepth lckmer_mindepth.txt
File name of all_motifs: all_motifs.txt
$global_enrichment:
Only generated if global_method = 'fisher'. Contains a list of two data frames. selected_structs contains only the sequence structures that pass the filtering criterion (p-value), whereas all_structs contains the p-value and ove of all sequence structures.
File name of selected_structs: global_similarities_minp_gcminp_ove gcminove_kmer_mindepth gckmer_mindepth.txt
File name of all_structs: all_global_similarities.txt
$connections:
Contains the edge list. Each row consists of two nodes (CDR3 sequences) and a third column that shows whether they are connected by global or local similarity. An additional fourth column contains the cluster tag (motif or sequence structure) by which the sequences are clustered.
File name: clone_network.txt
$cluster_properties: A data frame summarising the following information for each cluster:
"type": Indicates the type of similarity in the cluster (either global or local).
"tag": In the case of local similarities, the motif is indicated as well as the range of positions where the motif is positioned in the sequences. In the case of global similarities, the basic structure of the sequence is given as well as all amino acids separated by spaces that occur in the sample at the position marked by the
"cluster_size": Number of all sample sequences in the cluster.
"unique_cdr3_sample": Number of all unique CDR3b sequences of the sample in the cluster.
"unique_cdr3_ref": Number of all unique CDR3b sequences of the reference database matching the cluster properties.
"OvE": Factor of enrichment of the local or global motif in the sample compared to the reference database.
"p.value": The p-value of the motif or global structure, if clustering is performed as in GLIPH2.
"members": All unique CDR3b sequences of the cluster separated by spaces.
"total.score": The product of all following scores.
"network.size.score": Probability of obtaining a cluster with this size in a naive repertoire.
"cdr3.length.score": enrichment of CDR3b lengths within the cluster.
"vgene.score": enrichment of V-genes within the cluster.
"clonal.expansion.score": enrichment of clonal expansion within the cluster.
"hla.score": enrichment of common HLA among donor TCR contributors in cluster.
"lowest.hlas": Enriched HLA alleles within the cluster.
File name: convergence_groups.txt
$cluster_list:
A list containing the members and their additional information of each cluster. The elements of the list are named according to the appropriate cluster tag.
File name: cluster_member_details.txt
$parameters:
A data frame containing all given input parameter values.
File name: parameter.txt
Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.
https://github.com/immunoengineer/gliph
Huang, Huang, et al. "Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening." Nature Biotechnology 38.10 (2020): 1194-1202.
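# Load the example input data shipped with the package.
utils::data("gliph_input_data")

# Run on the first 200 sequences with reduced simulation depths.
res <- gliph_combined(
  cdr3_sequences = gliph_input_data[base::seq_len(200), ],
  lcsim_depth = 50,
  scoring_sim_depth = 50,
  n_cores = 1
)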