Analysis of antibody repertoires by high throughput sequencing is of major importance in understanding adaptive immune responses. Our knowledge of variations in the genomic loci encoding antibody genes is incomplete, mostly due to technical difficulties in aligning short reads to these highly repetitive loci. The partial knowledge results in conflicting V-D-J gene assignments between different algorithms, and biased genotype and haplotype inference. Previous studies have shown that haplotypes can be inferred by taking advantage of IGHJ6 heterozygosity, observed in approximately one third of the population [@kidd_inference_2012;@kirik2017parallel].
Here we provide a robust novel method for determining V-D-J haplotypes by adapting a Bayesian framework, RAbHIT. Our method extends haplotype inference to IGHD and IGHV based analysis, thereby enabling inference of complex genetic events like deletions and copy number variations in the entire population. It calculates a Bayes factor, a number that indicates the certainty level of the inference, for each haplotyped gene.
More details can be found here:
RAbHIT requires two main inputs:
Antibody repertoire sequencing data is in a data frame format. Each row represents a unique observation and columns represent data about that observation. The names of the required columns are provided below along with a short description.
Column Name | Description
----------------------|---------------------------------------------------------
subject
| subject name
v_call
| (Comma separated) name(s) of the nearest V allele(s) (IMGT format)
d_call
| (Comma separated) name(s) of the nearest D allele(s)
j_call
| (Comma separated) name(s) of the nearest J allele(s)
An example dataset is provided with RAbHIT. It contains unique naive b-cell sequences, from multiple individuals. One individual in the example data set appears twice, once with full V coverage and once partial V coverage (I5 and I5_FR2 respectively).
The database of germline sequences should be provided in a FASTA format with sequences gaped according to the IMGT numbering scheme[@lefranc2003imgt]. IGHV, IGHD, and IGHJ alleles in the IMGT database (release 2018-12-4) are provided with this package (HVGERM, HDGERM, and HJGERM). We removed the following IGHV and IGHD alleles, as they are duplicated from other allele assignments (All duplicates are mentioned in IMGT).
V/D allele | V/D allele duplicated (removed) --------------|------------------------------ IGHV1-6901 | IGHV1-69D01 IGHV2-7004 | IGHV2-70D04 IGHV3-2301 | IGHV3-23D01 IGHV3-3004 | IGHV3-30-303 IGHV3-2901 | IGHV3-30-4201 IGHV3-3018 | IGHV3-30-501 IGHV3-3002 | IGHV3-30-502 IGHD4-1101 | IGHD4-401 IGHD5-1801 | IGHD5-501
To obtain the most reliable result we suggest following the data pre-processing steps below.
To load RAbHIT we'll run the following:
library(rabhit)
The functions provided by this package can be used to perform any combination of the following:
An individual's haplotype can be inferred using the function createFullHaplotype
.
The function infers the haplotype based on the provided anchor gene. Using this function, a contingency table is created for each gene, from which a strand is inferred for each allele. The user can set the anchor gene for haplotyping as well as the column for which a haplotype should be inferred.
# Load example sequence data and example germline database data(samples_db, HVGERM, HDGERM) # Selecting a single individual clip_db <- samples_db[samples_db$subject=='I5', ] # Inferring haplotype using J6 as anchor haplo_db_J6 <- createFullHaplotype(clip_db, toHap_col=c("v_call","d_call"), hapBy_col="j_call", hapBy="IGHJ6", toHap_GERM=c(HVGERM, HDGERM))
head(haplo_db_J6,3)
We can plot the haplotype map using the plotHaplotype
function. Each row represents a different gene and the genes are ordered by the chromosome location. Each color represents the different alleles, light gray represents an unknown, dark gray a deletion and off-white a non-reliable allele annotation. The blues represent the confidence level of the inference ($lK$). The most left panels show the count for each allele on each chromosome. The middle panels show the allele inference for each gene on each chromosome, The most right panels show the confidence level of the inference. non reliable alleles are annotated with [*]
and a number and are found below the graph.
# Plotting the haplotype map plotHaplotype(haplo_db_J6)
One of the advantages of RAbHIT, is the ability to infer haplotype by other anchor genes than J6. RAbHIT offers utilizing any gene as anchor, however, we recommend that the fraction of the least frequent allele of the anchor gene will be above $0.3$. So far only J6, D2-8, and D2-21 has been proven to infer the heavy chain loci correctly. Here, we chose a single individual from our example dataset with heterozygousity in J6 and D2-21, inferred haplotypes according to both anchors, compared them using the hapDendo
function, and calculated the Jaccard distance between the inferred haplotypes.
# Inferring haplotype using D2-21 as anchor haplo_db_D2_21 <- createFullHaplotype(clip_db, toHap_col="v_call", hapBy_col="d_call", hapBy="IGHD2-21", toHap_GERM=HVGERM)
To combine the two data frames we need to associate each of the D2-21 alleles with J6 alleles. We can do that from the haplotype inference with the J6 as anchor
haplo_db_J6[haplo_db_J6$gene == "IGHD2-21", ]
D2-21*02 goes with J6*02 and that D2-21*01 goes with J6*03. Now to bind the data frames together, will change the sample name to match the anchor gene used in each haplotype, and change the anchor gene columns to a generic name.
# rename the subject haplo_db_J6$subject <- 'J6' haplo_db_D2_21$subject <- 'D2-21' # change the anchor gene columns # For D2-21*01 and J6*03 we will change the column to Anchor_J03_D01 names(haplo_db_J6)[which(names(haplo_db_J6)=='IGHJ6_03')] <- "AnchorJ03D01" names(haplo_db_D2_21)[which(names(haplo_db_D2_21)=='IGHD2-21_01')] <- "AnchorJ03D01" # For D2-21*02 and J6*02 we will change the column to Anchor_J02_D02 names(haplo_db_J6)[which(names(haplo_db_J6)=='IGHJ6_02')] <- "AnchorJ02D02" names(haplo_db_D2_21)[which(names(haplo_db_D2_21)=='IGHD2-21_02')] <- "AnchorJ02D02" # subsetting the haplo_db_J6 dataset to include only the V genes haplo_db_J6 <- haplo_db_J6[grep('IGHV',haplo_db_J6$gene),] # Combining the datasets row wise haplo_comb <- rbind(haplo_db_J6,haplo_db_D2_21)
Now, we can compare the haplotypes using the hapDendo
function.
# Plot the haplotype inferred deprogram hapDendo(haplo_comb)
The createFullHaplotype
function infers haplotypes for multiple individuals at once, using the subject
column to distinguish between the different individuals. We can then compare the individual haplotypes using the hapHeatmap
function. This function generates a heatmap of the alleles inferred for each chromosome. The novel alleles are annotated with ^
and a number, while the non reliable alleles are annotated with [*]
and a number and are found below the graph. This new annotation allows to distinguish better between different alleles.
# Removing the individual I5_FR1 with the partial V coverage sequence. clip_dbs <- samples_db[samples_db$subject!='I5_FR2', ] # Inferred haplotype summary table haplo_db <- createFullHaplotype(clip_dbs, toHap_col=c("v_call","d_call"), hapBy_col="j_call", hapBy="IGHJ6", toHap_GERM=c(HVGERM, HDGERM)) # Plot the haplotype inferred heatmap p.list <- hapHeatmap(haplo_db) # The function return a list with the plot and the optimal width and height # we can use both parameters to render the plot to the desired size. width <- p.list$width height <- p.list$height
# Plotting the heatmap p.list$p
Gene usage tends to vary between individuals, and in some cases the relative gene usage of certain individuals are much lower than the rest of the population. To asses whether the low frequency arises from a deleted gene, a binomial test described in Gidoni et al. (2018)[@gidoni2018mosaic] was implemented. There it was checked whether the relative gene usage of a specific gene or set of genes in an individual is lower than a chosen cutoff. For example for the IGHV genes, the chosen cutoff was $0.001$. The deletionsByBinom
function implements the binomial test and returns the detected gene deletion for a certain individual.
# Inferring double chromosome deletions del_binom_db <- deletionsByBinom(clip_dbs) head(del_binom_db)
For visualizing the deletion detected by deletionsByBinom
, the plotDeletionsByBinom
can be used. It is recommended to use this function for multiple individuals.
# Don't plot IGHJ del_binom_db <- del_binom_db[grep('IGHJ', del_binom_db$gene, invert = T),] # Inferred deletion summary table plotDeletionsByBinom(del_binom_db)
The detection of a double chromosome deletion from the function can be then used for haplotype inference. This prior knowledge of deletion can raise the certainty level in the inference of the genes for where a deletion was detected. The createFullHaplotype
receives a vector of the deleted genes detected. V gene labels marked in red represent low expressed genes for which deletions are inferred with lowly certainty. D gene labels marked in purple represent indistinguishable genes due to high sequence similarity, therefore the alignment call is less reliable.
Individuals with partial V coverage tend to have multiple gene or allele assignments. This multiplicity requires special attention, as it can hinder the haplotype inference results by introducing biases. To infer the haplotypes correctly from those datasets we recommend following the initial pre-processing steps and detecting non reliable V genes using the nonReliableVGenes
function.
The individual in the example dataset with the partial V coverage is I5_FR2. We shortened the sequences of individual I5 using pRESTO[@presto] MaskPrimers
function with BIOMED-2 framework 2 (FR2) primers.
# Selecting a single individual with partial V coverage clip_db <- samples_db[samples_db$subject=='I5_FR2', ] # Detecting non reliable genes nonReliable_Vgenes <- nonReliableVGenes(clip_db)
As mentioned above the double chromosome deletions can be used within the haplotype inference. For the partial V coverage dataset it is important to input the deletionsByBinom
with the detect non reliable V gene list.
# Inferred deletion summary table del_binom_db <- deletionsByBinom(clip_db, chain = "IGH", nonReliable_Vgenes = nonReliable_Vgenes)
Next, the haplotype can be inferred inputting the deletion summary table to the deleted_genes
flag and the non reliable V genes list to the nonRelaible_Vgenes
flag in the createFullHaplotype
function.
# Using the deleted_genes and nonRelaible_Vgenes flags # to infer haplotype for a partial V coverage sequence dataset haplo_db <- createFullHaplotype(clip_db, toHap_col=c("v_call","d_call"), hapBy_col="j_call", hapBy="IGHJ6", toHap_GERM=c(HVGERM, HDGERM), deleted_genes = del_binom_db, nonReliable_Vgenes = nonReliable_Vgenes)
Another way to visualize the inferred haplotype is using the interactive HTML5 plot. To plot the HTML5 visualization, the plotHaplotype
function should be used with the html_output
flag marked TRUE
. The function saveWidget
from the htmlwidgets
package can be used to save the interactive plot into an html file.
# Generate interactive haplotype plot p <- plotHaplotype(hap_table = haplo_db, html_output = TRUE)
# Saving the plot to html output htmlwidgets::saveWidget(p, "haplotype.html", selfcontained = T)
# Plotting the interactive haplotype inference
p
createFullHaplotype
can infer single chromosome deletion events by the threshold of the certainty level of the inference. The function utilizes the Bayes factor ($K$) obtained from the posterior probability of the inference. A gene deletion event is defined when the log 10 of the $K$ value ($lK = log_{10}(K)$) for an "unknown" gene is larger than the threshold of $3$ (The default value of the kThreshDel
flag).
For a group of individuals, the single chromosome deletions coupled with the double chromosome deletions can be visualized with the deletionHeatmap
function. The function creates a heatmap of the deletions inferred and colors them by the certainty level ($lK$).
# Detecting non reliable genes nonReliable_Vgenes <- nonReliableVGenes(samples_db) # Inferring double chromosome deletion del_binom_db <- deletionsByBinom(samples_db, nonReliable_Vgenes = nonReliable_Vgenes) # Inferred haplotype summary table for multiple subjects haplo_db <- createFullHaplotype(samples_db, toHap_col=c("v_call","d_call"), hapBy_col="j_call", hapBy="IGHJ6", toHap_GERM=c(HVGERM, HDGERM), deleted_genes = del_binom_db, nonReliable_Vgenes = nonReliable_Vgenes) # plot deletion heatmap deletionHeatmap(haplo_db)
Since V gene heterozygosity is extremely common, using V genes as anchors for haplotype inference could increase the number of people for which D and J haplotype can be inferred. However, since there are far more V genes than J genes, the relative frequencies of them are lower. Therefore, to obtain a reliable haplotype inference using V genes as anchors, the sequence depth of the dataset has to be much greater than when using J6.
The RAbHIT package offers a solution to overcome the low number of sequences that connect a given V-D allele pair. RAbHIT applies an aggregation approach, in which connects information from several heterozygous V genes can be combined to infer D gene deletions. The deletionsByVpooled
function uses the V pooled approach to detect single chromosomal deletions for D and J.
# Inferred deletion summary table del_db <- deletionsByVpooled(samples_db, nonReliable_Vgenes = nonReliable_Vgenes) head(del_db)
For visualizing the deletions detected by deletionsByVpooled
, the plotDeletionsByVpooled
can be used.
However, it is recommended to use this function for multiple individuals.
# Plot the deletion heatmap plotDeletionsByVpooled(del_db)
For help, questions, or suggestions, please contact:
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.