View source: R/imgt_tcr_segment_prep.R
imgt_tcr_segment_prep | R Documentation |
Immunoglobulin (IG) reference data from IMGT do not come in a format that is convenient to process in R. For T cell receptor (TCR) gene segments, this function uses data from IMGT (fasta files and one manually prepared table) to create a data frame that can subsequently be used to align TCR sequences from scRNAseq or other sources. All necessary files (human or mouse) are included in this package (as of Oct-2021) but may be downloaded manually from IMGT in case of major updates. The included files can be retrieved with file.copy(list.files(system.file("extdata", "IMGT_ref", package = "igsc"), full.names = TRUE), 'path to your folder', recursive = TRUE). These files demonstrate the required file names and formats in case you want to provide updated data from IMGT.
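As a minimal sketch of retrieving the bundled files, the snippet below copies the reference folder to a working directory. The destination folder name is a hypothetical choice for illustration, and the guard on an empty path is an addition so that the code also runs when igsc is not installed.

```r
# Locate the bundled IMGT reference folder; system.file() returns ""
# when the package or the folder cannot be found.
ref_dir <- system.file("extdata", "IMGT_ref", package = "igsc")

# Hypothetical destination folder, chosen only for this illustration.
dest <- file.path(tempdir(), "IMGT_ref_copy")
dir.create(dest, showWarnings = FALSE, recursive = TRUE)

if (nzchar(ref_dir)) {
  # full.names = TRUE yields paths that file.copy() can resolve from any
  # working directory; recursive = TRUE also copies sub-folders.
  copied <- file.copy(list.files(ref_dir, full.names = TRUE), dest,
                      recursive = TRUE)
  print(list.files(dest, recursive = TRUE))
} else {
  message("igsc is not installed; nothing was copied")
}
```

Listing the copied files is a quick way to see which file names and formats an updated set from IMGT would have to match.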
imgt_tcr_segment_prep(path, organism = "human", mc = F)
path |
path to a folder with all necessary files from IMGT; if not provided, the human or mouse data bundled with this package (downloaded around Oct-2021) are used |
organism |
if no path is provided, data are taken from this package for either human or mouse |
mc |
use multicore processing (mclapply from the parallel package) for pairwise alignment of TCR segments |
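To illustrate what a switch like mc typically controls, here is a toy sketch that applies a per-pair function either serially or with mclapply. This is not the package's internal code: the sequences are made up, and utils::adist() (Levenshtein distance) is only a stand-in for a real pairwise alignment.

```r
library(parallel)

# Toy sequences standing in for TCR segments (made up for this illustration).
segs <- c(a = "ACGTACGT", b = "ACGTTCGT", c = "TTGTACGA")

# All unordered pairs of segment names.
pairs <- combn(names(segs), 2, simplify = FALSE)

# Stand-in score: edit distance from utils::adist(), not a real alignment.
score_pair <- function(p) as.integer(adist(segs[[p[1]]], segs[[p[2]]]))

mc <- FALSE  # mirrors the documented default

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
cores <- if (mc && .Platform$OS.type != "windows") 2L else 1L

scores <- if (mc) {
  mclapply(pairs, score_pair, mc.cores = cores)
} else {
  lapply(pairs, score_pair)
}
names(scores) <- vapply(pairs, paste, character(1), collapse = "_")
```

Both branches return the same list, so the flag only changes how the work is scheduled, not the result.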
To skip this function and obtain its output immediately, ready-made data frames are available with imgt_ref <- readRDS(system.file("extdata", "IMGT_ref/human/hs.rds", package = "igsc")) or imgt_ref <- readRDS(system.file("extdata", "IMGT_ref/mouse/mm.rds", package = "igsc")).
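A slightly more defensive way to load the ready-made human data frame might look like the following sketch. The check for an empty path and the NULL fallback are additions for illustration; the column layout of the data frame is not documented here, so the code only inspects it.

```r
# Path to the ready-made human reference; "" if igsc or the file is missing.
rds_path <- system.file("extdata", "IMGT_ref/human/hs.rds", package = "igsc")

if (nzchar(rds_path)) {
  imgt_ref <- readRDS(rds_path)
  # Inspect the columns before relying on any of them.
  str(imgt_ref, max.level = 1)
} else {
  imgt_ref <- NULL
  message("Ready-made reference not found; is igsc installed?")
}
```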
Sources and how to prepare the data yourself. Data for the xlsx files are from:
http://www.imgt.org/IMGTrepertoire/Proteins/proteinDisplays.php?species=human&latin=Homo%20sapiens&group=TRAV
http://www.imgt.org/IMGTrepertoire/Proteins/proteinDisplays.php?species=human&latin=Homo%20sapiens&group=TRBV
http://www.imgt.org/IMGTrepertoire/Proteins/proteinDisplays.php?species=house%20mouse&latin=Mus%20musculus&group=TRAV
http://www.imgt.org/IMGTrepertoire/Proteins/proteinDisplays.php?species=house%20mouse&latin=Mus%20musculus&group=TRBV
Fasta files are made from the data found here: http://www.imgt.org/vquest/refseqh.html. Leader sequences are from the "L-PART1+L-PART2" artificially spliced sets, nucleotides (F+ORF+all P). The others are from the "L-PART1+V-EXON" artificially spliced sets and the Constant gene artificially spliced exons sets. Fasta-formatted sequences from there have to be copied manually and saved as .fasta files in a folder. This folder then becomes the path argument.
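Because the fasta files are copied by hand, a quick sanity check of the folder before passing it as path can catch copy-and-paste mistakes. The sketch below is an assumption-laden illustration: the folder name, file name and toy sequence are hypothetical, and the file names the function actually requires should be taken from the files bundled with the package.

```r
# Hypothetical folder and file, created only to demonstrate the check.
my_path <- file.path(tempdir(), "imgt_manual")
dir.create(my_path, showWarnings = FALSE)
writeLines(c(">toy_sequence", "atgtggggagct"),
           file.path(my_path, "toy.fasta"))

# Collect all .fasta files in the folder.
fasta_files <- list.files(my_path, pattern = "\\.fasta$", full.names = TRUE)

# A fasta file should begin with a header line that starts with ">".
starts_ok <- vapply(
  fasta_files,
  function(f) startsWith(readLines(f, n = 1), ">"),
  logical(1)
)

if (length(fasta_files) == 0 || !all(starts_ok)) {
  warning("Some files in ", my_path, " do not look like fasta files")
}
```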
a data frame
## Not run:
imgt_df <- imgt_tcr_segment_prep()
openxlsx::write.xlsx(imgt_df, "imgt_ref_df.xlsx")
saveRDS(imgt_df, "imgt_ref_df.rds")
## End(Not run)