de_novo_TCRs: De novo generation of cdr3 sequences based on GLIPH or GLIPH2

View source: R/de_novo_tcrs.R

de_novo_TCRs R Documentation

De novo generation of cdr3 sequences based on GLIPH or GLIPH2

Description

De novo generation of CDR3 sequences based on GLIPH or GLIPH2. Artificial sequences are simulated from the position-specific abundance of amino acids in the CDR3 region of the sequences in a GLIPH or GLIPH2 cluster, following the approach established in Glanville et al. (2017).
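The idea of sampling from position-specific amino acid frequencies can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy cluster, the helper names build_pwm and sample_cdr3, and the restriction to a single CDR3 length are all assumptions made for the example.

```r
# Toy cluster of equal-length CDR3 sequences (hypothetical data).
cluster_seqs <- c("CASSLGQF", "CASSLGTF", "CASSPGQF", "CASSLAQF")

# Build a frequency matrix: rows are positions, columns are amino acids.
build_pwm <- function(seqs) {
  stopifnot(length(unique(nchar(seqs))) == 1)
  aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  chars <- do.call(rbind, strsplit(seqs, ""))
  pwm <- t(apply(chars, 2, function(col) {
    as.numeric(table(factor(col, levels = aa))) / length(col)
  }))
  colnames(pwm) <- aa
  pwm
}

# Draw n artificial sequences, sampling every position independently.
sample_cdr3 <- function(pwm, n) {
  draws <- apply(pwm, 1, function(p) {
    sample(colnames(pwm), n, replace = TRUE, prob = p)
  })
  draws <- matrix(draws, nrow = n)
  apply(draws, 1, paste, collapse = "")
}

set.seed(1)
pwm <- build_pwm(cluster_seqs)
new_seqs <- sample_cdr3(pwm, 5)
```

Positions that are identical across the cluster (here the leading CASS and the trailing F) are reproduced in every sampled sequence, while variable positions follow the observed frequencies.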

Usage

de_novo_TCRs(
  convergence_group_tag,
  result_folder = "",
  clustering_output = NULL,
  refdb_beta = "gliph_reference",
  normalization = FALSE,
  accept_sequences_with_C_F_start_end = TRUE,
  sims = 1e+05,
  num_tops = 1000,
  min_length = 10,
  make_figure = FALSE,
  n_cores = 1
)

Arguments

convergence_group_tag

character. Tag of the convergence group that shall be used for prediction.

result_folder

character. By default "". Path to the folder that contains the output files of the clustering and in which the output of this function will be stored. If the value is "", the results are not saved and the output list of the function turbo_gliph or gliph2 must be passed via the parameter clustering_output.

clustering_output

list. By default NULL. If this parameter is specified, the clustering results are loaded directly from the list and not from the files in the result_folder. If the value of result_folder is "", the output list of the function turbo_gliph or gliph2 must be entered here.

refdb_beta

character or data frame. By default "gliph_reference". Specifies the reference database to be used. For an individual reference database, a data frame is expected as input. In its first column, the CDR3b sequences must be specified and, if required, the V genes must be specified in the second column. Additional reference databases were provided for download by the developers of GLIPH2 in the web tool (http://50.255.35.37:8080/tools). To use the predefined database, the following keyword must be specified:

  • "gliph_reference": Reference database of 162,165 CDR3b sequences of naive human CD4+ or CD8+ T cells of two individuals used for the GLIPH paper.

normalization

logical. By default FALSE. If TRUE the calculated scores are normalized to a reference database and the probability that a reference sequence has a score greater than or equal to the sample sequence is returned. If V gene information is available, only sequences with identical V gene are compared.
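The normalization described above amounts to an empirical tail probability. A minimal sketch, assuming plain numeric score vectors and ignoring the V gene matching; normalize_scores is a hypothetical helper, not a function of the package:

```r
# Hypothetical helper: for each sample score, the fraction of reference
# scores that are greater than or equal to it.
normalize_scores <- function(sample_scores, reference_scores) {
  vapply(sample_scores,
         function(s) mean(reference_scores >= s),
         numeric(1))
}

reference_scores <- c(0.1, 0.2, 0.4, 0.8)   # toy reference scores
normalized <- normalize_scores(c(0.4, 0.9), reference_scores)
```

A small normalized value means that few reference sequences score at least as high as the sample sequence.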

accept_sequences_with_C_F_start_end

logical. By default TRUE. If TRUE, only sequences with amino acid C at the start position and amino acid F at the end position are accepted.
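The effect of this filter can be illustrated in base R. The candidate sequences below are made up for the example, and the package may implement the check differently:

```r
# Hypothetical candidate sequences.
candidates <- c("CASSLGQF", "ASSLGQF", "CASSLGQY", "CASSF")

# Keep only sequences that start with C and end with F.
keep <- startsWith(candidates, "C") & endsWith(candidates, "F")
accepted <- candidates[keep]
```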

sims

numeric. By default 100,000 (1e+05). Number of de novo CDR3 sequences to generate.

num_tops

numeric. By default 1000. The num_tops best scoring de novo CDR3 sequences are returned.
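Selecting the best scoring sequences can be sketched as an ordering step followed by truncation. The data frame, its column names seq and score, and the helper top_scoring are assumptions for illustration only:

```r
# Hypothetical scored sequences; the column names are assumptions.
scored <- data.frame(
  seq = c("CASSLGQF", "CASSPGTF", "CASSLAQF", "CASSLGTF"),
  score = c(0.9, 0.2, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Order by decreasing score and keep the first num_tops rows.
top_scoring <- function(df, num_tops) {
  ranked <- df[order(df$score, decreasing = TRUE), , drop = FALSE]
  head(ranked, num_tops)
}

best <- top_scoring(scored, 2)
```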

min_length

numeric. By default 10. Number of N-terminal positions used for scoring.

make_figure

logical. By default FALSE. If TRUE, a graph of the num_tops best scoring de novo CDR3 sequences as a function of their rank is displayed.

n_cores

numeric. By default 1. Number of cores to use. If NULL, it is set to the number of cores of your machine minus 2.
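The NULL behaviour can be sketched as follows. resolve_n_cores is a hypothetical helper, and the lower bound of one core is an assumption of this sketch rather than documented behaviour:

```r
# Hypothetical helper mirroring the described handling of n_cores = NULL.
resolve_n_cores <- function(n_cores = 1,
                            available = parallel::detectCores()) {
  if (is.null(n_cores)) {
    # detectCores() can return NA; fall back to a single core in that case.
    if (is.na(available)) {
      return(1)
    }
    # Never go below one core (assumption of this sketch).
    return(max(1, available - 2))
  }
  n_cores
}
```

For example, resolve_n_cores(NULL, available = 8) uses six cores, while an explicit value such as resolve_n_cores(4) is passed through unchanged.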

Value

This function produces one file in the result_folder, if specified (named convergence_group_tag followed by _de_novo.txt), containing the num_tops best scoring generated sequences and their corresponding scores. A list containing the content of this file and additional information is also returned, with the following elements:

$de_novo_sequences A data frame containing the num_tops best scoring generated sequences and their corresponding scores.

$sample_sequences_scores A data frame containing the sequences of the used convergence group and their corresponding scores.

$cdr3_length_probability A data frame with each considered CDR3 length and its probability of occurrence in the convergence group. The distribution of the CDR3 lengths of all generated sequences resembles this distribution.
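How such a length distribution can be derived and then used to draw lengths for new sequences is sketched below. The toy cluster and the column names length and probability are assumptions; the layout of the data frame actually returned may differ:

```r
# Toy convergence group with two different CDR3 lengths (hypothetical data).
cluster_seqs <- c("CASSLGQF", "CASSLGTF", "CASSPGQYF", "CASSLAQF")

# Relative frequency of each observed CDR3 length.
len_tab <- table(nchar(cluster_seqs))
length_prob <- data.frame(
  length = as.integer(names(len_tab)),
  probability = as.numeric(len_tab) / length(cluster_seqs)
)

# Draw lengths for new sequences according to these probabilities.
# Indices are sampled so that a single observed length is handled correctly.
set.seed(42)
idx <- sample(seq_len(nrow(length_prob)), size = 1000, replace = TRUE,
              prob = length_prob$probability)
sampled_lengths <- length_prob$length[idx]
```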

$PWM_Scoring A data frame containing the positional weight matrix used for scoring. The columns represent the different amino acids and the rows represent the position relative to the N-terminus.

$PWM_Prediction A list of data frames containing, for each considered CDR3 length, the positional weight matrix used for the generation of new sequences. The columns represent the different amino acids and the rows represent the position relative to the N-terminus.

References

Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.

https://github.com/immunoengineer/gliph

Examples

utils::data("gliph_input_data")
res <- turbo_gliph(cdr3_sequences = gliph_input_data[base::seq_len(200), ],
                   sim_depth = 100,
                   n_cores = 1)

new_seqs <- de_novo_TCRs(convergence_group_tag = res$cluster_properties$tag[1],
                         clustering_output = res,
                         sims = 10000,
                         make_figure = TRUE,
                         n_cores = 1)


HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.