cluster_scoring: Scoring cdr3 clusters as performed in GLIPH and GLIPH2...

View source: R/scoring.R

cluster_scoring    R Documentation

Scoring CDR3 clusters as performed in the GLIPH and GLIPH2 algorithms

Description

This function calculates the scores of CDR3 clusters as in the GLIPH and GLIPH2 algorithms. Depending on the information provided, a final score is calculated from up to five cluster properties: cluster size, enrichment of CDR3 lengths, enrichment of V genes, enrichment of clonal expansions and enrichment of common HLA alleles.

Usage

cluster_scoring(
  cluster_list,
  cdr3_sequences,
  refdb_beta = "gliph_reference",
  v_usage_freq = NULL,
  cdr3_length_freq = NULL,
  ref_cluster_size = "original",
  gliph_version = 1,
  sim_depth = 1000,
  hla_cutoff = 0.1,
  n_cores = 1
)

Arguments

cluster_list

list. Each element of this list contains a data frame in which the CDR3b sequences and additional information necessary for scoring are provided. Corresponds to the $cluster_list element of the output of the functions turbo_gliph and gliph2.

cdr3_sequences

vector or data frame. If a data frame is supplied, it must contain the CDR3 sequences and may contain optional additional information. The columns must be named as specified in the following list; their order is arbitrary.

  • "CDR3b": cdr3 sequences of beta chains

  • "TRBV": optional. V-genes of beta chains

  • "patient": optional. Index of the donor from which the respective sequence was obtained. The value is composed of the donor index and an optional experimental condition, separated by a colon (example: 09/0410:MtbLys). For the calculation of the HLA scores, only the index before the colon is used.

  • "HLA": optional. HLA alleles of the respective donor. The HLA alleles of a patient are separated by commas. The standard notation of HLA alleles is expected (example: DPA1*01:03). For the calculation of the HLA scores, the information after the colon is ignored.

  • "counts": optional. Frequency of occurrence of the respective clone.
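
The column layout above can be illustrated with a small toy input. This is only a sketch: the sequences, donor labels and HLA strings are made up, and the two parsing steps show one plausible way to read the "patient" and "HLA" conventions described in the list, not the package's internal code.

```r
# Hypothetical toy input; all values are invented for illustration.
cdr3_input <- data.frame(
  CDR3b   = c("CASSLAPGATNEKLFF", "CASSLGQAYEQYF", "CASSPGTGGNQPQHF"),
  TRBV    = c("TRBV7-9", "TRBV5-1", "TRBV18"),
  patient = c("09/0410:MtbLys", "09/0410:MtbLys", "03/0492"),
  HLA     = c("DPA1*01:03,DRB1*15:01", "DPA1*01:03,DRB1*15:01", "DPA1*02:01"),
  counts  = c(12, 1, 3),
  stringsAsFactors = FALSE
)

# Donor index: the part of "patient" before the colon (assumed parsing).
donor_id <- sub(":.*$", "", cdr3_input$patient)

# HLA alleles: split on commas, then drop the part after the colon.
hla_alleles <- lapply(
  strsplit(cdr3_input$HLA, ",", fixed = TRUE),
  function(x) sub(":.*$", "", trimws(x))
)
```

A data frame built this way could be passed as the cdr3_sequences argument; a plain character vector of CDR3b sequences is the minimal alternative.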

refdb_beta

character or data frame. By default "gliph_reference". Specifies the reference database to be used. For an individual reference database, a data frame is expected as input. In its first column, the CDR3b sequences must be specified and, if required, the V genes must be specified in the second column. Additional reference databases were provided for download by the developers of GLIPH2 in the web tool (http://50.255.35.37:8080/tools). To use the predefined database, the following keyword must be specified:

  • "gliph_reference": Reference database of 162,165 CDR3b sequences of naive human CD4+ or CD8+ T cells of two individuals used for the GLIPH paper.
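
For an individual reference database, the expected two-column layout might be built as follows. This is a minimal sketch with invented sequences; the column names are an assumption, since the description above only fixes the column order.

```r
# Hypothetical custom reference database: CDR3b sequences in the first
# column, V genes in the second. The rows are invented placeholders.
custom_refdb <- data.frame(
  CDR3b = c("CASSLGQGAEAFF", "CASSPTGNYGYTF", "CASSQDRGYEQYF"),
  TRBV  = c("TRBV7-2", "TRBV6-5", "TRBV4-1"),
  stringsAsFactors = FALSE
)

# Drop duplicated rows so that each sequence/V-gene pair appears once
# (an optional cleaning step, not something the documentation requires).
custom_refdb <- unique(custom_refdb)
```

Such a data frame would then be supplied as refdb_beta = custom_refdb in place of the "gliph_reference" keyword.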

v_usage_freq

data frame. By default NULL. This data frame contains the frequency of V-genes in a naive T cell repertoire required for scoring. The first column provides the V-gene alleles and the second column the frequencies. If the value is NULL, default frequencies are used.

cdr3_length_freq

data frame. By default NULL. This data frame contains the frequency of CDR3 lengths in a naive T cell repertoire required for scoring. The first column provides the CDR3 lengths and the second column the frequencies. If the value is NULL, default frequencies are used.
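
Both frequency tables share the same two-column shape. The sketch below derives them from a small invented "naive" sample; the column names and the choice of fractions that sum to 1 are assumptions, since the description above does not state the expected scale.

```r
# Hypothetical naive repertoire sample (invented values).
naive_cdr3 <- c("CASSLGQGAEAFF", "CASSPTGNYGYTF",
                "CASSLAPGATNEKLFF", "CASSQDRGYEQYF")
naive_trbv <- c("TRBV7-2", "TRBV6-5", "TRBV7-2", "TRBV4-1")

# CDR3 length frequencies: first column lengths, second column frequencies.
len_tab <- table(nchar(naive_cdr3))
cdr3_length_freq <- data.frame(
  length = as.integer(names(len_tab)),
  freq   = as.numeric(len_tab) / sum(len_tab)
)

# V-gene usage frequencies: first column V genes, second column frequencies.
v_tab <- table(naive_trbv)
v_usage_freq <- data.frame(
  v_gene = names(v_tab),
  freq   = as.numeric(v_tab) / sum(v_tab),
  stringsAsFactors = FALSE
)
```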

ref_cluster_size

character. Either "original" or "simulated", by default "original". Defines the probabilities used for calculating the cluster size score. In the case of "original", the standard probabilities of the original algorithm which are constant for all sample sizes are used. However, since the distribution of cluster sizes depends on the sample size, we estimated the probabilities for different sample sizes in a 500-step simulation using random sequences from the reference database. To use these probabilities, the keyword "simulated" must be specified.

gliph_version

numeric. Either 1 for the GLIPH or 2 for the GLIPH2 algorithm, by default 1. The scoring of the two algorithms differs only in the calculation of the total score: GLIPH2 calculates the product of all individual scores, while GLIPH additionally multiplies this product by 0.064.
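
The difference between the two versions can be sketched in a few lines. The helper name total_score is hypothetical and is not part of the package; skipping NA entries for scores that could not be calculated is likewise an assumption.

```r
# Hypothetical helper illustrating the total score described above.
total_score <- function(scores, gliph_version = 1) {
  # Assumption: unavailable individual scores are NA and are skipped.
  scores <- scores[!is.na(scores)]
  total <- prod(scores)
  if (gliph_version == 1) {
    # GLIPH additionally multiplies the product by 0.064.
    total <- total * 0.064
  }
  total
}

individual <- c(size = 0.1, length = 0.5, vgene = NA)
score_v1 <- total_score(individual, gliph_version = 1)
score_v2 <- total_score(individual, gliph_version = 2)
```

Under these assumptions, the GLIPH total is always 0.064 times the GLIPH2 total for the same cluster, so the ranking of clusters is identical in both versions.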

sim_depth

numeric. By default 1000. Simulated resampling depth for non-parametric convergence significance tests. A higher number will take longer to run but will produce more reproducible results.

hla_cutoff

numeric. By default 0.1. Defines the threshold of HLA probability scores below which HLA alleles are considered significant.
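
As an illustration of the threshold, the following sketch filters a named vector of invented HLA probability scores. Treating "below" as a strict less-than comparison is an assumption.

```r
# Hypothetical HLA probability scores for one cluster (invented values).
hla_scores <- c("DPA1*01" = 0.03, "DRB1*15" = 0.4, "DQB1*06" = 0.08)
hla_cutoff <- 0.1

# Alleles whose score lies below the cutoff are considered significant.
significant_hla <- names(hla_scores)[hla_scores < hla_cutoff]
```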

n_cores

numeric. Number of cores to use, by default 1. If NULL, it is set to the number of cores on your machine minus 2.
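
The NULL behaviour can be approximated as shown below. The helper resolve_n_cores is hypothetical; the fallback to a single core when detection fails or fewer than three cores are available is an added safeguard and is not stated in the documentation.

```r
# Hypothetical helper mirroring the documented n_cores = NULL behaviour.
resolve_n_cores <- function(n_cores = 1) {
  if (is.null(n_cores)) {
    detected <- parallel::detectCores()
    # detectCores() may return NA; fall back to one core in that case.
    if (is.na(detected)) {
      detected <- 1
    }
    # Number of cores minus 2, but never fewer than one (assumption).
    n_cores <- max(1, detected - 2)
  }
  n_cores
}

cores_default <- resolve_n_cores(NULL)
cores_fixed   <- resolve_n_cores(4)
```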

Value

This function writes one file named "GLIPH_scoring_results.txt" to the result_folder; it contains the same information as the returned data frame. The data frame holds the cluster scoring results: the first column provides the representative_seq of each evaluated cluster, the second column stores the total scores, and additional columns contain up to five individual scores (cluster size, CDR3 length enrichment, V-gene enrichment, enrichment of clonal expansion and enrichment of common HLA) from which the total score is calculated.

References

Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.

https://github.com/immunoengineer/gliph

Examples

utils::data("gliph_input_data")

res <- turbo_gliph(
  cdr3_sequences = gliph_input_data[base::seq_len(200), ],
  sim_depth = 100,
  n_cores = 1
)

scoring_results <- cluster_scoring(
  cluster_list = res$cluster_list,
  cdr3_sequences = gliph_input_data[base::seq_len(200), ],
  refdb_beta = "gliph_reference",
  gliph_version = 1,
  sim_depth = 100,
  n_cores = 1
)


HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.