knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
B cells are diversified in order to acquire the ability to recognise and respond against different immune challenges. The processes of diversification of the immunoglobulins B cells produced, both in the forms of B cell receptors and as secreted antibodies, are well studied in humans and a small handful of model organisms such as mice: processes like V(D)J gene recombination and somatic hypermutation introduces a great amount of sequence diversity into immunoglobulins. Such diversity is fine-tuned by Darwinian selection to remove those which express immunoglubilin target self-antigens, and expand those which are suitable to bind, recognise and respond against the immune challenge at hand. However, immunoglobulin sequence diversification is less characterised for species where a small handful of functional immunoglobulin gene segments exist, e.g. chickens where only 1 functional copy of V gene each for the heavy and light chain gene loci. In these systems we often observe long stretches of sequence in the transcribed immunoglobulins attributable to pseudogenes, a phenomenon known as gene conversion. While such observations have been previously reported, they are often small-scale samples of the immunoglobulin repertoire with manual annotation of gene conversion events, or studies without reported code base to reproduce such analysis.
BrepConvert
is an R package designed to automate the annotation of gene conversion events in immunoglobulin repertoire data. It uses a combination of pairwise global sequence alignment, BLAT (BLAST-like alignment tool) [@pmid11932250] and arithmetic functionalities provided in base R to identify candidate gene conversion events, annotate donor pseudogenes, report associated statistics and provide basic plotting capabilities to visualise gene conversion on your data.
To install the BrepConvert
package, in R (version $\geq 3.5.0$) do:
require(devtools) install_github('Fraternalilab/BrepConvert', ref = 'main')
The above should install BrepConvert
itself as well as its R package dependencies.
The only other prerequisite is an executable of the BLAT[@pmid11932250] program. This requires no installation; just downloading it from the utilities section of the UCSC genome browser website would suffice. You will indicate the filepath to the BLAT executable as an argument to the main function (see below).
Please follow the instructions above to install BrepConvert
and download the BLAT executable. For Linux users, you can alternatively use the BLAT executable that is shipped with the installation of this package (this does not work with MacOS - you should download the executable from UCSC [see above]):
# An executable of the BLAT program is also shipped with the package. blat <- system.file("exe/blat", package = "BrepConvert") # NOTE: This works for Linux OS only. For other OS, please download/compile # the executable and indicate the filepath like so: # blat <- "/home/abc/Documents/blat/blat"
The BrepConvert
package has been tested on Linux and MacOS systems.
The BrepConvert
package has not been tested on Windows system. In theory installation of the BrepConvert
package should work following the same instructions above. For BLAT, UCSC does not provide an executable BLAT software for Windows; others (e.g. here) have provided BLAT executables on Windows; note that these have not been tested and are simply offered here as alternatives that you might want to try.
In this vignette we use a small set of chicken immunoglobulin light chain sequences published as one of the earliest reports of gene conversion in chicken [@pmid3100050]. The dataset is shipped with this package:
library(BrepConvert) dataset <- system.file( "extdata/pmid3100050_vquest.tsv", package = "BrepConvert" ) dataset <- read.table( dataset, sep = "\t", stringsAsFactors = FALSE, header = TRUE ) dataset[, c("sequence_id", "locus", "v_call", "j_call")] # 'sequence_alignment' contain the IMGT-gapped DNA sequence of the V gene # for each observation print( head( dataset$v_sequence_alignment, 3 ) ) # assigning it to a new variable and name this vector with the sequence_id repertoire <- dataset$v_sequence_alignment names( repertoire ) <- dataset$sequence_id print( head( repertoire , 3) )
These are 12 light chain sequences collected from three-weeks-stage chicken bursal cells (those sequences marked '3W-') and bursal cells collected from the 18-days embryos ('18D-').
BrepConvert
is designed to be simple to use - users only need to use one function to perform automated annotation of gene conversion events for their dataset. The function batchConvertAnalysis
will perform all the following steps:
batchConvertAnalysis
requires the following input:
repertoire
).functional
).pseudogene
).blat_exec
).# FASTA files containing pseudogene and functional alleles of the Chicken IGLV # locus are shipped with the package. functional_IGLV <- system.file("extdata/IMGT_Chicken_IGLV_F.fasta", package = "BrepConvert") pseudogene_IGLV <- system.file("extdata/IMGT_Chicken_IGLV_P.fasta", package = "BrepConvert") annotation <- batchConvertAnalysis( functional = functional_IGLV, pseudogene = pseudogene_IGLV, repertoire = repertoire, # notice here the vector is passed, NOT the entire table! blat_exec = blat # this is the filepath to the BLAT executable ) # look at the annotation for sequence '3W-3' print( annotation[ annotation$SeqID == "3W-3", ] )
As you can see, the output is a table consisting of results from all sequences provided. Each line corresponds to one gene conversion event identified in one sequence, with the following annotation:
SeqID
: the identifier for the sequence provided in the names of the input repertoire
vector.event
/ possibility
: gene
: nominated pseudogene donors for each event. In cases where identical sequence stretch can be found across different pseudogenes, all such genes are listed and delimited by semicolon (';'). If the conversion event could not be mapped to any pseudogene, it would be noted as 'NA'.start
/ end
/ fiveprime_identical_length
/ threeprime_identical_length
: start
and end
denote the start and beginning relative to the numbering the observed sequence - i.e. IMGT-gapped numbering.fiveprime_identical_length
and threeprime_identical_length
indicates the whether identity extend beyond the given start
and end
. If e.g. 30 bp 5' to start
it is identical between the nominated pseudogene(s), the functional allele and the observed sequence, fiveprime_identical_length
will be 30. The same applies for 3' & the 'end' column. seq_event
contains the actual sequence stretch implicated in the gene conversion event, which would be useful for measuring the size of conversion events.edit_distance
: The number of "edits" (i.e. changes) necessary to convert the nucleotide sequence observed to the nominated pseudogene(s). This is a metric to rank the candidate donor pseudogenes in terms of how close they are to the observed sequence from the repertoire.AID_motif
: The closest DNA motif targetted by the AID enzyme is provided, together with its position (nearest_AID_motif
) and its distance (in terms of number of nucleotides; column distance_to_AID_motif
) from start
.seq_event
: The nucleotide sequence between start
and end
.seq_5prime
and seq_3prime
: The sequence stretch (10 nucleotides) 5' and 3' to the gene conversion event. This is useful to identify whether there are sequence motifs surrounding gene conversion events observed in a group of sequences.germline
/observed_AA
: amino acid sequence translated from the identified gene conversion events. Available for both observed
(i.e. the sequence stretch observed) and germline
(i.e. the same segment in the germline functional allele), and narrow
(the region between start
and end
) and broad
(narrow
plus the flanking sequence identicated by 5'/3'_identical_lengths
) definitions of the gene conversion event.The output from the batchConvertAnalysis
is very rich; it would be handy to have a basic visualisation summarising the conversion events, and, if possible, highlight a few pseudogenes of interest (Figure 1):
plotEvents( annotation = annotation, repertoire = repertoire, show.sequence.names = TRUE, # recommended FALSE if you are visualising a lot of sequences! thickness = c(4, 1), # this controls the thickness of lines drawn highlight.genes = c("IGLV1-2", "IGLV1-16", "IGLV1-22" )) # this is optional; these donors will be highlighted with different colours
As described in the original paper [@pmid3100050] we can observe much more prevasive gene conversion events in the sequences collected from three-weeks-stage chicken bursal cells (those sequences marked '3W-') compared to bursal cells collected from the 18-days embryos ('18D-'). We can compare this to the analysis presented in the original paper where the gene conversion events were identified manually:
Notwithstanding differences in the germline allele nomenclature (understandable given we use the up-to-date IMGT gene catalogue), BrepConvert
provides a fast, reasonable accurate method to annotate gene conversion events over large samples of immunoglobulin sequences.
We can further compare the 3W and 18D sequences using histograms. The following shows a histogram (Figure 3) of coverage of gene conversion events - i.e. how often is a given DNA position implicated in a gene conversion event?
# first add an extra column 'sequence_type' indicating whether # a sequence is from the 3W or 18D group annotation$sequence_type <- ifelse(grepl("^3W-", annotation$SeqID), "3W", "18D") plotCoverage(annotation, repertoire, 'sequence_type', type = 'coverage', # alternatively 'start' considers only the start sites definition = 'both') # or 'narrow' / 'broad' if you only want either definition of boundaries
Here these plots visualise the proportion of gene conversion events (y-axis) which covers any given DNA position (x-axis) along the V gene. You can see clearly the wider boundaries of gene conversion events implicated in the 'broad' definition (i.e. the start/end points of sequence differences plus the 5'/3' padded nucleotide stretches identical between the pseudogene and the observed sequence) - this is the maximum size of gene conversion events possibly explainable by the given donor pseudogene. This plot is useful for comparing a large number of sequences: instead of plotting each sequence independently, the histograms summarise the locations of these events better for larger amount of data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.