knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(eval = FALSE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
ogrdbstats is an R package that can be used to create an analysis of gene usage in a adaptive immune receptor sequencing repertoire. The analysis consists of usage statistics and plots. The package is intended to be used in conjunction with a tool that infers a 'personalised genotype'. Currently the following tools are supported:
Ogrdbstats requires a recent installation of Pandoc. If Rstudio is installed on your machine, pandoc will already be installed. Otherwise please follow the installation instructions on the Pandoc website.
Once Pandoc is installed, please install ogrdbstats from CRAN:
install.packages('ogrdbstats')
Alternatively, you can install the latest development version from Github:
devtools::install_github("https://github.com/airr-community/ogrdbstats")
The package requires R version 4.2.2 or above.
It's easiest to use the package from the command line. To do this, download ogrdbstats.R and copy to the directory in which you wish to conduct the analysis.
Command Syntax (see following section for detailed description of file formats)
Rscript ogrdbstats.R [--inf_file INF_FILE] [--hap_gene HAP_GENE] REF_FILE SPECIES READ_FILE CHAIN
Positional Arguments:
REF_FILE
- pathname of a FASTA file containing IMGT gap-aligned reference germline sequences. Usually this would be downloaded from IMGT.
SPECIES
- should contain the species name used in field 3 of the IMGT REF_FILE FASTA header, with spaces removed, e.g. Homosapiens for Human. If you are not using an IMGT REF_FILE, you can use any single word for the species here, and the reference file should only contain genes for that species.
READ_FILE
- pathname of a tab-separated file containing the annotated reads used to infer the genotype, in MiAIRR, CHANGEO or IgDiscover format
CHAIN
specifies the sequence type to be analysed. It must be one of VH, VK, VL, D, JH, JK, JL
.
Optional Arguments:
INF_FILE
- pathname of a FASTA file containing sequences of inferred novel alleles. This file must be provided if the read file contains assignments
to alleles that are not listed in the reference file.
HAP_GENE
- the gene to be used for haplotyping analysis (see haplotyping section below)
Detailed descriptions of the required input files are given in the next section, but for quick usage with a supported tool, please skip to the Usage Notes for that tool towards the end of this document.
The package provides functions to allow you to create the reports programmatically, or use the information it reads from input files for your own purposes. Please refer to the 'ogrdbstats API' section for a brief overview.
sequence_id, v_call_genotyped, d_call, j_call, sequence_alignment, cdr3
. For J or D inferences they must also contain
J_sequence_start
, J_sequence_end
, J_germline_start
, J_germline_end
, or the equivalent fields for D genes. IgBLAST's
--format airr
creates compatible MiAIRR format files.CHANGEO files must contain at least the following columns:
SEQUENCE_ID, V_CALL_GENOTYPED, D_CALL, J_CALL, SEQUENCE_IMGT, CDR3_IMGT
, V_MUT_NC
, D_MUT_NC
, J_MUT_NC
, SEQUENCE
, JUNCTION_START
, V_SEQ
, D_SEQ
, J_SEQ
.
If you would like to process files from IMGT V-Quest, please parse them with CHANGEO to convert them to CHANGEO format.
D- related fields are only required for heavy chain records.
In both the above file formats, v_call_genotyped/V_CALL_GENOTYPED
should contain the V calls made after the subject's V-gene genotype has been inferred
(including calls of the novel alleles). Sequences may be either unagpped or IMGT gap-aligned. Determining the personalised V-gene genotype is recommended when processing D or J
gene inferences, so that V-gene usage counts are accurate. However, this step can be omitted for D or J gene processing by providing a V_CALL field instead of V_CALL_GENOTYPED.
For IgDiscover, the file 'final/filtered.tab' should be used - see section on IgDiscover below.
<READ_FILE>_ogrdb_report.csv
- the Genotype File ready to be uploaded to OGRDB.<READ_FILE>_ogrdb_plots.csv
- plots (see next section for details). READ_FILE
is used as a prefix to the output file names. They will be written to the directory containing the read file.
If you are submitting inferences to OGRDB, you will be prompted to upload the genotype file. Please also upload the plots file as an attachment to the Notes section of your submission.
The script produces the following plots:
The nucleotide usage plots are not produced from IgDiscover output, as aligned V-sequences are not available.
The script should first be run without the optional HAP_GENE
parameter. If, having consulted the plots, you identify a suitable gene
for haplotyping, please run the script again, with this gene specified as HAP_GENE
. The haplotyping_gene and haplotyping_ratio columns of the genotype file
will be appropriately populated. A J-gene should be used with V- and D- gene inferences, and a V-gene with J-gene inferences.
To download the IMGT reference file and complete an analysis using example files, run the following commands:
wget -O IMGT_REF_GAPPED.fasta https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP Rscript ogrdbstats.R --inf_file TWO01A_naive_novel_ungapped.fasta --hap_gene IGHJ6 IMGT_REF_GAPPED.fasta Homosapiens TWO01A_naive_genotyped.tsv VH
Usage notes are indicative only and are not intended to discount other approaches. Notes for other tools will follow.
To conduct a V-gene analysis with TIgGER:
findNovelAlleles
to identify novel alleles in a Change-O-formatted data set. Write these to a FASTA file. inferGenotype
or inferGenotypeBayesian
to infer the genotype.reassignAlleles
to correct allele calls in the data set, based on the inferred genotypeogrdbstats
.Note that inferGenotype
will not necessarily include every inferred allele produced by findNovelAlleles
in the genotype that it produces. Only those
alleles included in the genotype will be considered by genotype_statistics.R
because, leaving other considerations aside, no sequences are assigned to other alleles.
TIgGER provides additonal information, including its own plots and statistics We encourage you to take these into consideration, and to upload them as attachments to your submission if they are informative.
Following an IgDiscover run, please copy ogrdbstats.R to IgDiscover's final
directory. The commands below can then be used to download the IMGT reference file and run a VH gene analysis.
All commands should be run in the final
directory.
$ wget -O IMGT_REF_GAPPED.fasta https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP $ unzip final.tab.gz $ Rscript ogrdbstats.R --inf_file database/V.fasta IMGT_REF_GAPPED.fasta Homosapiens filtered.tab VH
alternatively, to produce a JH gene analysis:
$ Rscript ogrdbstats.R --inf_file database/J.fasta IMGT_REF_GAPPED.fasta Homosapiens filtered.tab JH
The information required by generate_statstics.R is split between partis's normal yaml output and that provided by the 'presto-output' mode. A python script, convert_partis.py, is provided. This will combine output from partis's yaml and presto annotations, producing CHANGEO format annotations and a FASTA file of genotype V-sequences. These files can then passed to generate_statistics.R. convert_partis.py is written in python 2.7 for compatibility with partis, and can be run from the command line in the same virtual environment.
Usage of convert_partis.py:
python convert_partis.py [-h] partis_yaml partis_tsv ogrdb_recs ogrdb_vs positional arguments: partis_yaml .yaml file created by partis partis_tsv .tsv file created by partis presto-output mode ogrdb_recs annotation output file (.tsv) ogrdb_vs v_gene sequences (.fasta) optional arguments: -h, --help show this help message and exit
Although partis must be run twice - once without the presto-output option, and once with it - it will use cached information provided other parameters remain the same, so that the overall impact on run time is low. Typical processing steps are shown below. Note that --presto-output requires an IMGT-gapped V-gene germline file. This can be extracted from the full germline library downloaded from IMGT (see 'Prerequisites' above), but partis will report as an error any duplicated identical sequences in the library: duplicates must be removed from the file before processing will complete successfully. Note in the examples below the --extra-annotations option. The CDR3 is required by generate_statistics.R: the other fields are included for reference.
# Run partis to produce annotations in YAML format partis annotate --extra-annotation-columns cdr3_seqs:invalid:in_frames:stops --infname TW01A.fasta --outfname TW01A.yaml --n-procs 5 # Run partis again with additional --presto-output option. This will produce TSV-formatted output from cached data partis annotate --extra-annotation-columns cdr3_seqs:invalid:in_frames:stops --infname TW01A.fasta --outfname TW01A.tsv --presto-output \ --aligned-germline-fname IMGT_REF_GAPPED_DEDUPED.fasta --n-procs 5 # Extract and merge required information from YAML and TSV files python convert_partis.py TW01A.yaml TW01A.tsv TW01A_OGRDB.tsv TW01A_V_OGRDB.fasta # Process the resulting output to produce the genotye file and plots Rscript ogrdbstats.R --inf_file TW01A_V_OGRDB.fasta IMGT_REF_GAPPED.fasta Homosapiens TW01A_OGRDB.tsv VH
IMPre does not provide a set of sequences annotated with the novel allele calls. The sequences must be annotated by a separate tool in order to provide the information needed for the OGRDB genotype. One possible approach is as follows:
generate_ogrdb_report()
takes equivalent arguments to ogrdbstats.R and generates the genotype file and plots.
read_input_files()
reads and parses the input files. It returns a representation of the genotype file in memory, as
well as a number of other structures containing related information.
make_barplot_grobs()
, make_novel_base_grobs()
and make_haplo_grobs()
create the plots: each function returns a list of grobs that can be used as you wish, or passed to write_plot_file()
to create a the standard file of plots.
The use of the grob functions is demonstrated below with example data included in the package.
library(ogrdbstats) reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") rd = suppressMessages( read_input_files(reference_set, inferred_set, 'Homosapiens', repertoire, 'IGHV', NA, 'V', 'H', FALSE) ) barplot_grobs = make_barplot_grobs(rd$input_sequences, rd$genotype_db, rd$inferred_seqs, rd$genotype, 'V', rd$calculated_NC) base_grobs = make_novel_base_grobs(rd$inferred_seqs, rd$input_sequences, 'V', FALSE) gridExtra::grid.arrange(grobs=list(barplot_grobs[2][[1]], base_grobs$end[1][[1]], base_grobs$conc[1][[1]]),ncol=1)
Please use help(package="ogrdbstats") or ? to find function-level documentation within R.
Some functions are adapted from TIgGER with thanks to the authors.
The example annotated reads and inferences linked from this description are taken from the data of Rubelt et al and were downloaded from VDJServer. The genotype was inferred by TIgGER. A small number of light-chain records were removed from the data set.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.