View source: R/turbo_gliph_function_foreach.R
turbo_gliph | R Documentation |
Identification of specificity groups in the T cell repertoire based on local and global similarities between sample sequences. This R implementation is based on the GLIPH algorithm described by Glanville et al. The R implementation of GLIPH presented here is ~ 100 times faster than the original perl script depending on the input sample size.
turbo_gliph( cdr3_sequences, result_folder = "", refdb_beta = "gliph_reference", v_usage_freq = NULL, cdr3_length_freq = NULL, ref_cluster_size = "original", sim_depth = 1000, lcminp = 0.01, lcminove = c(1000, 100, 10), kmer_mindepth = 3, accept_sequences_with_C_F_start_end = TRUE, min_seq_length = 8, gccutoff = NULL, structboundaries = TRUE, boundary_size = 3, motif_length = base::c(2, 3, 4), discontinuous = FALSE, make_depth_fig = FALSE, local_similarities = TRUE, global_similarities = TRUE, global_vgene = FALSE, positional_motifs = FALSE, cdr3_len_stratify = FALSE, vgene_stratify = FALSE, public_tcrs = TRUE, cluster_min_size = 2, hla_cutoff = 0.1, n_cores = 1 )
cdr3_sequences |
vector or dataframe. This dataframe must contain the cdr3 sequences and optional additional information. The columns must be named as specified in the following list in arbitrary order.
|
result_folder |
character. By default |
refdb_beta |
character or data frame. By default
|
v_usage_freq |
data frame. By default |
cdr3_length_freq |
data frame. By default |
ref_cluster_size |
character. Either |
sim_depth |
numeric. By default 1000. Simulated resampling depth for non-parametric convergence significance tests. This defines the number of random repeat samplings into the reference distribution that GLIPH performs. A higher number will take longer to run but will produce more reproducible and reliable results. |
lcminp |
numeric. By default 0.01. Local convergence maximum probability score cutoff. The score reports the probability that a random sample of the same size as the sample set but of the reference set (i.e. naive repertoire) would generate an enrichment of the given motif at least as high as has been observed in the sample set. |
lcminove |
numeric. Local convergence minimum observed vs expected fold change.
This is a cutoff for the minimum fold enrichment over a
reference distribution that a given motif should have in
the sample set in order to be considered for further evaluation. By default, the minimum fold enrichment (1000,100,10) is
dependent on the motif length (2,3,4 amino acids). |
kmer_mindepth |
numeric. By default 3. Minimum observations of kmer for it to be evaluated. This is the minimum number of times a kmer should be observed in the sample set in order for it to be considered for further evaluation. The number can be set higher to provide less motif-based clusters with higher confidence. This could be recommended if the sample set is greater than 5000 reads. Lowering the value to 2 will identify more groups but likely at a cost of an increased False Discovery Rate. |
accept_sequences_with_C_F_start_end |
logical. This logical flag
if |
min_seq_length |
numeric. By default 8. All the sequences with a length less than this
value will be filtered out in input and reference database. If structboundaries
is |
gccutoff |
numeric. Global convergence distance cutoff. This is the maximum CDR3 Hamming mutation distance between two clones sharing the same CDR3 length in order for them to be considered to be likely binding the same antigen. This number will change depending on sample depth, as with more reads, the odds of finding a similar sequence increases even in a naive repertoire. This number will also change depending on the species evaluated and even the choice of reference database (memory TCRs will be more likely to have similar TCRs than naive TCR repertoires). Thus, by default this is calculated at runtime if not specified. If the sample depth is less than 125 it will be set to 2, otherwise it will be set to 1. |
structboundaries |
logical. By default |
boundary_size |
numeric. By default 3. Specifies the boundary size if structboundaries is active. |
motif_length |
accepts a numeric vector of motif lengths you want GLIPH to find and study. By default it searches for motifs of size 2, 3 and 4 amino acids. |
discontinuous |
logical. By default |
make_depth_fig |
logical. By default |
local_similarities |
logical. By default |
global_similarities |
logical. By default |
global_vgene |
logical. By default |
positional_motifs |
logical. By default |
cdr3_len_stratify |
logical. By default |
vgene_stratify |
logical. By default |
public_tcrs |
logical. By default |
cluster_min_size |
numeric. By default 2. Minimal size of a cluster required to be considered for scoring. |
hla_cutoff |
numeric. By default 0.1. Defines the threshold of HLA probability scores below which HLA alleles are considered significant. |
n_cores |
numeric. Number of cores to use, by default 1. In case of |
This function returns a list of six elements whose contents are explained below. If a file path is specified under result_folder
,
the results are additionally stored there. The individual file names are also specified below (italic name parts indicate the given value of the
corresponding parameter).
$sample_log:
A data frame with 1 + sim_depth
rows representing observations for all the possible
k-mer motifs. The first observation, Discovery
is the actual
observation counts in the input sample and the rest shows the observation
counts in a subsample from reference database.
File name: kmer_resample_sim_depth
_log.txt
$motif_enrichment:
A list of two data frames. selected_motifs
contains only the motifs that pass the filtering criterion (ove and p-value),
whereas all_motifs
contains p-value and ove of all motifs.
File name of selected_motifs
: kmer_resample_sim_depth
_minp lcminp
_ove lcminove
.txt
File name of all_motifs
: kmer_resample_sim_depth
_all_motifs.txt
$connections:
Contains the edge list. Each row consists of two nodes (cdr3 sequences) and a
third column which shows whether they are similar based on global or local similarity.
File name: clone_network.txt
$cluster_properties:
A data frame consisting of cluster_size
, leader_tag
, all available cluster scores (total score,
cluster size, cdr3 length enrichment, V-gene enrichment, enrichment of clonal expansion and enrichment of common HLA)
and members
for each cluster/component of the specificity network.
File name: convergence_groups.txt
$cluster_list:
A list containing the members and their additional information of each cluster. The elements of the list are named according to the appropriate cluster tag.
File name: cluster_member_details.txt
$parameters:
A data frame containing all given input parameter values.
File name: parameter.txt
Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.
https://github.com/immunoengineer/gliph
utils::data("gliph_input_data") res <- turbo_gliph(cdr3_sequences = gliph_input_data[base::seq_len(200),], sim_depth = 100, n_cores = 1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.