de_novo_TCRs: De novo generation of cdr3 sequences based on GLIPH or GLIPH2

de_novo_TCRs

R Documentation

De novo generation of cdr3 sequences based on GLIPH or GLIPH2

Description

Generates de novo CDR3 sequences from a GLIPH or GLIPH2 cluster. Artificial sequences are simulated from the position-specific abundance of amino acids in the CDR3 region of the cluster's sequences, following the approach established in Glanville et al.
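The idea of simulating sequences from position-specific amino acid abundances can be illustrated with a minimal sketch. The toy cluster, the helper `simulate_cdr3` and the restriction to sequences of a single length are assumptions made for illustration; this is not the package's internal implementation.

```r
# Hypothetical toy cluster: four CDR3b sequences of equal length.
cluster_seqs <- c("CASSLGQAYEQYF", "CASSLGQGYEQYF",
                  "CASSPGQAYEQYF", "CASSLGTAYEQYF")

# One row per sequence, one column per position.
aa_matrix <- do.call(rbind, strsplit(cluster_seqs, ""))

amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Position-specific relative frequencies
# (rows = positions from the N-terminus, columns = amino acids).
pwm <- t(apply(aa_matrix, 2, function(column) {
  as.numeric(table(factor(column, levels = amino_acids))) / length(column)
}))
colnames(pwm) <- amino_acids

# Hypothetical helper: draw one artificial sequence by sampling
# every position independently from its observed frequencies.
simulate_cdr3 <- function(pwm) {
  residues <- apply(pwm, 1, function(p) {
    sample(colnames(pwm), size = 1, prob = p)
  })
  paste(residues, collapse = "")
}

set.seed(1)
simulated <- replicate(5, simulate_cdr3(pwm))
```

Because each position is sampled independently, the simulated sequences can combine residues that never co-occur in any single cluster member, while positions that are invariant in the cluster stay invariant.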

Usage

de_novo_TCRs(
  convergence_group_tag,
  result_folder = "",
  clustering_output = NULL,
  refdb_beta = "gliph_reference",
  normalization = FALSE,
  accept_sequences_with_C_F_start_end = TRUE,
  sims = 1e+05,
  num_tops = 1000,
  min_length = 10,
  make_figure = FALSE,
  n_cores = 1
)

Arguments

`convergence_group_tag`	character. Tag of the convergence group that shall be used for prediction.
`result_folder`	character. By default `""`. Path to the folder in which the output files of the clustering are stored and in which the output of this method will be stored. If the value is `""`, the results are not saved and the output list of the function `turbo_gliph` or `gliph2` must be passed via the parameter `clustering_output`.
`clustering_output`	list. By default `NULL`. If this parameter is specified, the clustering results are loaded directly from the list and not from the files in the `result_folder`. If the value of `result_folder` is `""`, the output list of the function `turbo_gliph` or `gliph2` must be entered here.
`refdb_beta`	character or data frame. By default `"gliph_reference"`. Specifies the reference database to be used. To use a custom reference database, pass a data frame whose first column contains the CDR3b sequences and, if required, whose second column contains the V genes. Additional reference databases were provided for download by the developers of GLIPH2 in the web tool (http://50.255.35.37:8080/tools). The predefined database is selected with the keyword `"gliph_reference"`: a reference database of 162,165 CDR3b sequences of naive human CD4+ or CD8+ T cells from two individuals, used for the GLIPH paper.
`normalization`	logical. By default `FALSE`. If `TRUE` the calculated scores are normalized to a reference database and the probability that a reference sequence has a score greater than or equal to the sample sequence is returned. If V gene information is available, only sequences with identical V gene are compared.
`accept_sequences_with_C_F_start_end`	logical. By default `TRUE`. If `TRUE`, only sequences with amino acid C at the start position and amino acid F at the end position are accepted.
`sims`	numeric. By default 100,000 (`1e+05`). Number of de novo CDR3 sequences to generate.
`num_tops`	numeric. By default 1000. The `num_tops` best-scoring de novo CDR3 sequences are returned.
`min_length`	numeric. By default 10. Number of N-terminal positions used for scoring.
`make_figure`	logical. By default `FALSE`. If `TRUE`, a plot of the scores of the `num_tops` best-scoring de novo CDR3 sequences against their rank is displayed.
`n_cores`	numeric. By default 1. Number of cores to use. If `NULL`, it is set to the number of cores on your machine minus 2.
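The `normalization` option described above returns, for each sequence, the probability that a reference sequence scores at least as high. A minimal sketch of that calculation follows, assuming the scores are already available as numeric vectors. The score values and variable names are hypothetical, and the V gene matching mentioned above is left out.

```r
# Hypothetical scores; in practice these would come from the scoring step.
reference_scores <- c(-12.4, -10.1, -9.8, -7.5, -6.2, -3.0)
sample_scores <- c(-9.8, -2.5)

# For each sample score: fraction of reference scores that are
# greater than or equal to it (an empirical tail probability).
normalized <- vapply(
  sample_scores,
  function(s) mean(reference_scores >= s),
  numeric(1)
)
```

Under this reading, a small normalized value means that few reference sequences score as well as the sequence in question.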

Value

If `result_folder` is specified, this function writes one file to it, named `convergence_group_tag` followed by `_de_novo.txt`, containing the `num_tops` best-scoring generated sequences and their corresponding scores. A list containing these sequences and additional information is also returned, as follows:

$de_novo_sequences A data frame containing the num_tops best scoring generated sequences and their corresponding scores.

$sample_sequences_scores A data frame containing the sequences of the used convergence group and their corresponding scores.

$cdr3_length_probability A data frame with each considered CDR3 length and its probability of occurrence in the convergence group. The distribution of the CDR3 lengths of all generated sequences resembles this distribution.

$PWM_Scoring A data frame containing the positional weight matrix used for scoring. The columns represent the different amino acids and the rows represent the position relative to the N-terminus.

$PWM_Prediction A list of data frames containing the positional weight matrix for each considered CDR3 length used for the generation of new sequences. The columns represent the different amino acids and the rows represent the position relative to the N-terminus.
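The relationship between the CDR3 length probabilities and the lengths of the generated sequences can be sketched as follows. The toy cluster and the column names `cdr3_length` and `probability` are assumptions for illustration and may not match the columns of the returned data frame.

```r
# Hypothetical cluster with CDR3 sequences of different lengths.
cluster_seqs <- c("CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGQAYEQYF",
                  "CASSLGTGAYEQYF", "CASSQETQYF")

# Relative frequency of each CDR3 length in the cluster.
length_counts <- table(nchar(cluster_seqs))
length_probability <- data.frame(
  cdr3_length = as.integer(names(length_counts)),
  probability = as.numeric(length_counts) / length(cluster_seqs)
)

# Draw the lengths of the simulated sequences from this distribution.
set.seed(42)
sim_lengths <- sample(
  length_probability$cdr3_length,
  size = 10000,
  replace = TRUE,
  prob = length_probability$probability
)

# Observed length frequencies among the simulated sequences.
sim_freq <- table(sim_lengths) / length(sim_lengths)
```

With enough draws, `sim_freq` approaches the `probability` column, which is the sense in which the generated lengths resemble the cluster's length distribution.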

References

Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.

https://github.com/immunoengineer/gliph

Examples

utils::data("gliph_input_data")

# Cluster the first 200 input sequences
res <- turbo_gliph(
  cdr3_sequences = gliph_input_data[base::seq_len(200), ],
  sim_depth = 100,
  n_cores = 1
)

# Generate de novo sequences for the first convergence group
new_seqs <- de_novo_TCRs(
  convergence_group_tag = res$cluster_properties$tag[1],
  clustering_output = res,
  sims = 10000,
  make_figure = TRUE,
  n_cores = 1
)

HetzDra/turboGliph documentation built on Oct. 2, 2022, 2:22 a.m.