write_TCR_junction_fasta: Write out TCR junction sequences to a FASTA file for BLAST...


View source: R/write_TCR_junction_fasta.R

Description

Write out TCR junctions to a FASTA file, to enable submission for BLAST query against known protein sequences. This allows identification of known TCRs from published sequences.

Usage

write_TCR_junction_fasta(
  tcrs, filename="tcr_junctions.fasta",
  pos_control=">flu_1_TRBV_CAGAGSQGNLIF CAGAGSQGNLIF",
  sample_col="libid",
  cols_for_name=c(sample_col, "cln_count", "v_gene", "j_gene", "junction"),
  junction_col="junction",
  unique_only=TRUE)

Arguments

tcrs

data frame, the TCRs to be included. Must include all the columns specified in cols_for_name and junction_col.

filename

character string, the path and file name to output. Defaults to "tcr_junctions.fasta".

pos_control

character string containing a positive control TCR sequence that will be prepended to the FASTA file. Used to check that the BLAST search parameters are working properly. Can be set to NULL to omit the positive control. The default value should give a perfect match to protein accession 1OGA_D.

sample_col

character, the name of the column in tcrs to be included in the fasta query name. Separate from cols_for_name to enable easier editing.

cols_for_name

character vector, the columns in tcrs to be included in the fasta query name. Values will be concatenated with "_" as separator to generate the query name. To use alignment results with read_BLAST_TCR_align_hits with length thresholding, include the junction at the end of the query name.

junction_col

character string, the name of the column in tcrs to use for the CDR3 junction sequence. Must be included in tcrs.

unique_only

logical, whether to include only unique combinations of cols_for_name in the output.
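
The naming behavior described for cols_for_name and unique_only can be sketched in base R. This is an illustration under stated assumptions, not the package's actual implementation: the data frame and its values are hypothetical, and the two-line record layout (header line, then sequence line) is assumed from the FASTA convention.

```r
# Hypothetical input; column names follow the documented defaults.
tcrs <- data.frame(
  libid = c("lib1", "lib1", "lib2"),
  cln_count = c(3, 3, 1),
  v_gene = c("TRBV19", "TRBV19", "TRAV27"),
  j_gene = c("TRBJ2-7", "TRBJ2-7", "TRAJ42"),
  junction = c("CASSIRSSYEQYF", "CASSIRSSYEQYF", "CAGAGSQGNLIF"),
  stringsAsFactors = FALSE
)
cols_for_name <- c("libid", "cln_count", "v_gene", "j_gene", "junction")

# Keep only unique combinations of the naming columns
# (the effect described for unique_only=TRUE).
tcrs_unique <- unique(tcrs[, cols_for_name, drop = FALSE])

# Concatenate the naming columns with "_" to form each query name.
query_names <- do.call(paste, c(unname(as.list(tcrs_unique)), sep = "_"))

# Interleave header and sequence lines (assumed FASTA layout).
fasta_lines <- as.vector(rbind(paste0(">", query_names), tcrs_unique$junction))

# With the junction last in the name, it can be recovered from the query
# name alone, which is what length thresholding on BLAST hits would need.
recovered_junction <- sub("^.*_", "", query_names)
junction_length <- nchar(recovered_junction)
```

Placing the junction at the end of the name means everything after the final "_" is the junction, even when other fields (such as gene names) contain hyphens or digits.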

Details

This function outputs a file designed to be used for a BLAST query on https://blast.ncbi.nlm.nih.gov. On that site, select protein BLAST (blastp) and, under "Choose File", upload the file that is output by this function. Make sure that the non-redundant protein sequences database (nr) is selected, set organism to "Homo sapiens (taxid:9606)", and under Algorithm parameters set Expect threshold to 1000 (to account for the length of the sequences being queried). If you included the default pos_control, you should see it match to protein accession 1OGA_D.
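
A minimal call might look like the sketch below. It assumes the TCRtools package is installed and exports write_TCR_junction_fasta; the input data frame and its values are hypothetical, and the output is written to a temporary directory so it can be inspected before upload.

```r
# Hypothetical input; column names match the documented defaults.
tcrs <- data.frame(
  libid = c("lib1", "lib2"),
  cln_count = c(3, 1),
  v_gene = c("TRBV19", "TRAV27"),
  j_gene = c("TRBJ2-7", "TRAJ42"),
  junction = c("CASSIRSSYEQYF", "CAGAGSQGNLIF"),
  stringsAsFactors = FALSE
)

# Write to a temporary location rather than the working directory.
out_file <- file.path(tempdir(), "tcr_junctions.fasta")

# Only call the function if the package is available (assumption: it is
# exported under this name, as the Usage section suggests).
if (requireNamespace("TCRtools", quietly = TRUE)) {
  TCRtools::write_TCR_junction_fasta(tcrs, filename = out_file)
  # Inspect the first few lines before uploading the file to BLAST.
  print(head(readLines(out_file)))
}
```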


mjdufort/TCRtools documentation built on Sept. 12, 2021, 7:11 p.m.