In Fraternalilab/BrepConvert: BrepConvert: R package for gene conversion analysis of B cell repertoire

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Background

B cells are diversified in order to acquire the ability to recognise and respond against different immune challenges. The processes of diversification of the immunoglobulins B cells produced, both in the forms of B cell receptors and as secreted antibodies, are well studied in humans and a small handful of model organisms such as mice: processes like V(D)J gene recombination and somatic hypermutation introduces a great amount of sequence diversity into immunoglobulins. Such diversity is fine-tuned by Darwinian selection to remove those which express immunoglubilin target self-antigens, and expand those which are suitable to bind, recognise and respond against the immune challenge at hand. However, immunoglobulin sequence diversification is less characterised for species where a small handful of functional immunoglobulin gene segments exist, e.g. chickens where only 1 functional copy of V gene each for the heavy and light chain gene loci. In these systems we often observe long stretches of sequence in the transcribed immunoglobulins attributable to pseudogenes, a phenomenon known as gene conversion. While such observations have been previously reported, they are often small-scale samples of the immunoglobulin repertoire with manual annotation of gene conversion events, or studies without reported code base to reproduce such analysis.

BrepConvert is an R package designed to automate the annotation of gene conversion events in immunoglobulin repertoire data. It uses a combination of pairwise global sequence alignment, BLAT (BLAST-like alignment tool) [@pmid11932250] and arithmetic functionalities provided in base R to identify candidate gene conversion events, annotate donor pseudogenes, report associated statistics and provide basic plotting capabilities to visualise gene conversion on your data.

Installation

To install the BrepConvert package, in R (version $\geq 3.5.0$) do:

require(devtools)
install_github('Fraternalilab/BrepConvert', ref = 'main')

The above should install BrepConvert itself as well as its R package dependencies.

The only other prerequisite is an executable of the BLAT[@pmid11932250] program. This requires no installation; just downloading it from the utilities section of the UCSC genome browser website would suffice. You will indicate the filepath to the BLAT executable as an argument to the main function (see below).

Linux and MacOS

Please follow the instructions above to install BrepConvert and download the BLAT executable. For Linux users, you can alternatively use the BLAT executable that is shipped with the installation of this package (this does not work with MacOS - you should download the executable from UCSC [see above]):

# An executable of the BLAT program is also shipped with the package.
blat <- system.file("exe/blat", package = "BrepConvert")
# NOTE: This works for Linux OS only. For other OS, please download/compile
# the executable and indicate the filepath like so:
# blat <- "/home/abc/Documents/blat/blat"

The BrepConvert package has been tested on Linux and MacOS systems.

Windows

The BrepConvert package has not been tested on Windows system. In theory installation of the BrepConvert package should work following the same instructions above. For BLAT, UCSC does not provide an executable BLAT software for Windows; others (e.g. here) have provided BLAT executables on Windows; note that these have not been tested and are simply offered here as alternatives that you might want to try.

Example

In this vignette we use a small set of chicken immunoglobulin light chain sequences published as one of the earliest reports of gene conversion in chicken [@pmid3100050]. The dataset is shipped with this package:

library(BrepConvert)

dataset <- system.file( "extdata/pmid3100050_vquest.tsv", 
                        package = "BrepConvert" )
dataset <- read.table( dataset, sep = "\t", stringsAsFactors = FALSE, header = TRUE )
dataset[, c("sequence_id", "locus", "v_call", "j_call")]

# 'sequence_alignment' contain the IMGT-gapped DNA sequence of the V gene 
# for each observation
print( head( dataset$v_sequence_alignment, 3 ) )

# assigning it to a new variable and name this vector with the sequence_id
repertoire <- dataset$v_sequence_alignment
names( repertoire ) <- dataset$sequence_id
print( head( repertoire , 3) )

These are 12 light chain sequences collected from three-weeks-stage chicken bursal cells (those sequences marked '3W-') and bursal cells collected from the 18-days embryos ('18D-').

One-step annotation of gene conversion events

BrepConvert is designed to be simple to use - users only need to use one function to perform automated annotation of gene conversion events for their dataset. The function batchConvertAnalysis will perform all the following steps:

Pairwise sequence alignment of each sequence in the repertoire to the functional (alleles) to identify sequence stretches as candidate gene conversion events.
Run BLAT of these sequence stretches against pseudogene sequences to identify donor pseudogenes
Optimise annotation to rank candidate pseudogene donors by comparing their identity to the observation. It also merges adjacent gene conversion events if they could be attributable to the same pseudogene donor, and the intervening segment is identical to this pseudogene. This narrows down candidates to find the closest pseudogene donor for each gene conversion event.
Provide annotation such as the nearest AID mutational hotspot, and sequence identity 5' and 3' of the nominated gene conversion event.

batchConvertAnalysis requires the following input:

Sequences from repertoire: a named vector of characters corresponding to IMGT-gapped DNA sequence from the BCR repertoire. The names are taken as the identifiers of the sequences. This vector is passed as an argument (repertoire).
Functional allele sequences: a FASTA file containing IMGT-gapped DNA sequences of functional allele(s). The filepath is passed as an argument (functional).
Pseudogene allele sequences: a FASTA file containing IMGT-gapped DNA sequences of pseudogene allele(s). The filepath is passed as an argument (pseudogene).
A copy of the executable of the BLAT program: The filepath to such executable is passed as an argument (blat_exec).

# FASTA files containing pseudogene and functional alleles of the Chicken IGLV
# locus are shipped with the package.
functional_IGLV <- system.file("extdata/IMGT_Chicken_IGLV_F.fasta",
                               package = "BrepConvert")
pseudogene_IGLV <- system.file("extdata/IMGT_Chicken_IGLV_P.fasta",
                               package = "BrepConvert")

annotation <- batchConvertAnalysis(
  functional = functional_IGLV,
  pseudogene = pseudogene_IGLV,
  repertoire = repertoire, # notice here the vector is passed, NOT the entire table!
  blat_exec = blat # this is the filepath to the BLAT executable
)

# look at the annotation for sequence '3W-3'
print( annotation[ annotation$SeqID == "3W-3", ] )

As you can see, the output is a table consisting of results from all sequences provided. Each line corresponds to one gene conversion event identified in one sequence, with the following annotation:

SeqID: the identifier for the sequence provided in the names of the input repertoire vector.
event / possibility:
- Events are numbered 1, 2, 3 ... for each sequence counting from the beginning of the IMGT numbering.
- In cases where different pseudogenes could explain the same gene conversion event (but with, e.g. different levels of identity evidenced by the edit distance, see below), these possibilities are labelled a, b, c ... and ordered by their similarity to the observed sequence in the repertoire.
gene: nominated pseudogene donors for each event. In cases where identical sequence stretch can be found across different pseudogenes, all such genes are listed and delimited by semicolon (';'). If the conversion event could not be mapped to any pseudogene, it would be noted as 'NA'.
start / end / fiveprime_identical_length / threeprime_identical_length:
- start and end denote the start and beginning relative to the numbering the observed sequence - i.e. IMGT-gapped numbering.
- fiveprime_identical_length and threeprime_identical_length indicates the whether identity extend beyond the given start and end. If e.g. 30 bp 5' to start it is identical between the nominated pseudogene(s), the functional allele and the observed sequence, fiveprime_identical_length will be 30. The same applies for 3' & the 'end' column.
- This allows exploration of the longest-possible definition of the gene conversion event, as the method identifies events only based on sequence mismatches and could underestimate the size of gene conversion events as such.
- Note: since the numbering used is IMGT-gapped, the size of gene conversion events inferred (e.g. end - start) may not be corresponding to the actual length of the sequence stretch implicated - it may contain gap(s). The column seq_event contains the actual sequence stretch implicated in the gene conversion event, which would be useful for measuring the size of conversion events.
edit_distance: The number of "edits" (i.e. changes) necessary to convert the nucleotide sequence observed to the nominated pseudogene(s). This is a metric to rank the candidate donor pseudogenes in terms of how close they are to the observed sequence from the repertoire.
AID_motif: The closest DNA motif targetted by the AID enzyme is provided, together with its position (nearest_AID_motif) and its distance (in terms of number of nucleotides; column distance_to_AID_motif) from start.
seq_event: The nucleotide sequence between start and end.
seq_5prime and seq_3prime: The sequence stretch (10 nucleotides) 5' and 3' to the gene conversion event. This is useful to identify whether there are sequence motifs surrounding gene conversion events observed in a group of sequences.
germline/observed_AA: amino acid sequence translated from the identified gene conversion events. Available for both observed (i.e. the sequence stretch observed) and germline (i.e. the same segment in the germline functional allele), and narrow (the region between start and end) and broad (narrow plus the flanking sequence identicated by 5'/3'_identical_lengths) definitions of the gene conversion event.

Visualisation

The output from the batchConvertAnalysis is very rich; it would be handy to have a basic visualisation summarising the conversion events, and, if possible, highlight a few pseudogenes of interest (Figure 1):

plotEvents( annotation = annotation, repertoire = repertoire,
            show.sequence.names = TRUE, # recommended FALSE if you are visualising a lot of sequences!
            thickness = c(4, 1), # this controls the thickness of lines drawn
            highlight.genes = c("IGLV1-2", "IGLV1-16", "IGLV1-22" )) # this is optional; these donors will be highlighted with different colours

As described in the original paper [@pmid3100050] we can observe much more prevasive gene conversion events in the sequences collected from three-weeks-stage chicken bursal cells (those sequences marked '3W-') compared to bursal cells collected from the 18-days embryos ('18D-'). We can compare this to the analysis presented in the original paper where the gene conversion events were identified manually:

Figure 2. Reproduced from ref. 2 Figure 9, listing gene conversion events of a selection of the sequences included in our example. Vertical bars correspond to the boundaries of the gene conversion events; Dark dots denote DNA substitutions with no identified donors.

Notwithstanding differences in the germline allele nomenclature (understandable given we use the up-to-date IMGT gene catalogue), BrepConvert provides a fast, reasonable accurate method to annotate gene conversion events over large samples of immunoglobulin sequences.

We can further compare the 3W and 18D sequences using histograms. The following shows a histogram (Figure 3) of coverage of gene conversion events - i.e. how often is a given DNA position implicated in a gene conversion event?

# first add an extra column 'sequence_type' indicating whether
# a sequence is from the 3W or 18D group
annotation$sequence_type <- ifelse(grepl("^3W-", annotation$SeqID), "3W", "18D")
plotCoverage(annotation, repertoire, 'sequence_type', 
             type = 'coverage', # alternatively 'start' considers only the start sites
             definition = 'both') # or 'narrow' / 'broad' if you only want either definition of boundaries

Here these plots visualise the proportion of gene conversion events (y-axis) which covers any given DNA position (x-axis) along the V gene. You can see clearly the wider boundaries of gene conversion events implicated in the 'broad' definition (i.e. the start/end points of sequence differences plus the 5'/3' padded nucleotide stretches identical between the pseudogene and the observed sequence) - this is the maximum size of gene conversion events possibly explainable by the given donor pseudogene. This plot is useful for comparing a large number of sequences: instead of plotting each sequence independently, the histograms summarise the locations of these events better for larger amount of data.