In bahlolab/isoRelate: Identity-by-Descent Inference of Haploid Recombining Organisms

isoRelate performs pairwise relatedness mapping on haploid isolates using a first order continuous time hidden Markov model (HMM). Some isolates may have multiple infections, whereby the infected individual is carrying multiple genetically distinct strains of the species. Such isolates are said to have multiplicity of infection (MOI) greater than 1, and isoRelate allows for these isolates by analyzing them using a diploid model, rather than a haploid model.

This document outlines the data format and key functions used by isoRelate for relatedness mapping.

Caution should be taken when using isoRelate with small reference datasets, reference populations that do not match the input population or cohorts of mixed populations.

Data Format

PED and MAP Files

The input for isoRelate is unphased genotype data for SNPs in PLINK PED and MAP format (http://www.cog-genomics.org/plink2).

A PED file is a white-space (space or tab) delimited file with the first six columns

Family ID
Isolate ID
Paternal ID
Maternal ID
Multiplicity of infection (MOI) (1 = single infection or haploid, 2 = multiple infections or diploid)
Phenotype (1=unaffected, 2=affected, 0=unknown)

The IDs are alphanumeric: the combination of family and isolate ID should uniquely identify a sample. Columns 7 onwards (white-space delimited) are the isolate genotypes for biallelic SNPs where the A and B alleles are coded as 1 and 2 respectively and missing genotypes are coded as 0. All SNPs must have two alleles specified and each allele should be in a separate column. For single infections, genotypes should be specified as homozygous. Either Both alleles should be missing (i.e. 0) or neither. Column labels are not required. The paternal ID, maternal ID and phenotype columns are not used by isoRelate, however are required for completeness of the pedigree. Examples of informative family IDs are the sample collection site or country, however family IDs can be the same as the isolate IDs.

The MAP file contains exactly 4 columns of information

Chromosome
SNP identifier
Genetic map distance (centi-Morgans or Morgans)
Base-pair position

where each row describes a single marker. Genetic map distances and base-pair positions are expected to be positive values. The MAP file must be ordered by increasing chromosome genetic map distance. SNP identifiers can contain any characters except spaces or tabs; also, you should avoid * symbols in names. The MAP file must contain as many markers as are in the PED file. Column labels are not required.

Reference Data

isoRelate calculates SNP allele frequencies from either the input dataset or a reference dataset, where the reference dataset is in PED/MAP format. If a reference dataset is not specified then allele frequencies will be calculated from the input dataset.

Caution should be taken when using isoRelate with small reference datasets, reference populations that do not match the input population or cohorts of mixed populations.

Key Functions

# load isoRelate library
library(isoRelate)

Function | Description ------------------ | -------------------------------------------------- getGenotypes | reformat the genotypes in the PED/MAP file and perform data filtering getIBDparameters | estimate IBD parameters between pairs of isolates getIBDsegments | detect IBD segments between pairs of isolates getIBDposterior | calculate the posterior probability of IBD sharing per isolate pair getIBDsummary | print a summary of the IBD segments detected getIBDmatrix | create a binary matrix of IBD/non-IBD for all SNP and pairs getIBDproportions | calculate the proportion of pairs IBD at each SNP getIBDiR | calculate the significance of excess IBD sharing at each SNP getIBDiclusters | create a network of clusters of isolates who are IBD over an interval getIBDpclusters | create a network of clusters of isolates who share a certain proportion of genome IBD plotIBDsegments | plot the detected IBD segments across the genome plotIBDproportions | plot the proportion of pairs IBD across the genome plotIBDiR | plot the significance of excess IBD sharing across the genome plotIBDclusters | plot the network of IBD isolates

Example

Below is an IBD analysis performed on a Plasmodium falciparum dataset from Papua New Guinea (PNG), generated by the MalariaGEN Consortium (https://www.malariagen.net/projects/pf3k). Change the parameters for each function to see how it affects the results.

NOTE: R has a memory limit and if you exceed this limit then an error message will appear stating cannot allocate vector of length. The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.

Format and Filter PED/MAP

We begin by reformatting and filtering the input PED and MAP data.

# load isoRelate library
library(isoRelate)

# lets look at the data
str(png_pedmap)

# reformat and filter genotypes
my_genotypes <- getGenotypes(ped.map = png_pedmap,
                             reference.ped.map = NULL,
                             maf = 0.01,
                             isolate.max.missing = 0.1,
                             snp.max.missing = 0.1,
                             chromosomes = NULL,
                             input.map.distance = "cM",
                             reference.map.distance = "cM")

One isolate has been removed because more than 10% of SNPs have missing genotype calls.

Estimate Parameters

Next we estimate the model parameters. These parameters can be useful as an initial measure of relatedness.

# estimate parameters
my_parameters <- getIBDparameters(ped.genotypes = my_genotypes, 
                                   number.cores = 1)

head(my_parameters)

head(my_parameters)

Detect IBD Segments

Following parameter estimation we can detect IBD segments.

# infer IBD
my_ibd <- getIBDsegments(ped.genotypes = my_genotypes,
                         parameters = my_parameters, 
                         number.cores = 1, 
                         minimum.snps = 20, 
                         minimum.length.bp = 50000,
                         error = 0.001)

head(my_ibd)

head(my_ibd)

We have set thresholds on the minimum number of SNPs and length of IBD segments reported in order to reduce false positive IBD calls that are most likely due to population linkage disequilibrium from extremely distant relatedness.

We can get a brief summary of the IBD segments as follows.

# get a summary of IBD segments
getIBDsummary(ped.genotypes = my_genotypes, 
              ibd.segments = my_ibd)

IBD segments are found using the Viterbi algorithm, which finds the single most likely sequence of IBD states that could have generated the observed genotypic data. An alternative method to this is to calculate the posterior probability of IBD sharing, which calculates the probability of sharing 0, 1 or 2 alleles IBD at each SNP, given the genotypic data. Thus, in addition to the Viterbi algorithm, we provide a function to generate the average posterior probability of IBD sharing for each isolate pair, which is calculated as;

$$avePostPr = \frac{PostPr(IBD = 1)}{2} + PostPr(IBD = 2).$$ Please note, the output from the following function has number of rows equal to the number of SNPs and number of columns equal to the number of pairwise comparisons. As such, it will be very large when there are many isolates in the analysis and is likely to take some time to run. Additionally, the output may exceed R's memory limit and, in some instances, can cause R to unexpectedly crash.

# calculate the posterior probability of IBD sharing
my_posterior <- getIBDposterior(ped.genotypes = my_genotypes,
                         parameters = my_parameters, 
                         number.cores = 1, 
                         error = 0.001)

head(my_posterior[,1:10])

Exploring IBD Results

isoRelate provides a number of graphical functions to help with exploring the vast number of IBD segments detected. We begin by plotting the IBD segments as colored blocks across the genome to get an idea of their size and distribution.

# plot IBD segments
plotIBDsegments(ped.genotypes = my_genotypes, 
                ibd.segments = my_ibd, 
                interval = NULL,
                annotation.genes = NULL,
                annotation.genes.color = NULL,
                highlight.genes = NULL, 
                highlight.genes.labels = FALSE,
                highlight.genes.color = NULL,
                highlight.genes.alpha = 0.1,
                segment.height = 0.6,
                number.per.page = NULL, 
                fid.label = FALSE, 
                iid.label = FALSE, 
                ylabel.size = 9, 
                add.rug = FALSE,
                plot.title = "Distribution of IBD segments in PNG", 
                add.legend = TRUE, 
                segment.color = NULL)

If there are many IBD pairs then it may be worthwhile specifying the maximum number of pairs to plot in a figure (number.per.page). This will split all IBD pairs into subgroups of IBD pairs of size number.per.page, then plot the IBD segments for each subgroup in a new figure. It is highly recommended to save the figures directly to a document (my_ibd_segments.pdf) as opposed to the R plot window if there are many IBD pairs.

We have reframed from including isolate labels (y-axis) in this plot. Isolate pairs who are highly related will have many IBD segments spanning a large amount of the genome. Identical isolates will have IBD blocks spanning the entire genome.

We detected many IBD segments on chromosome 7 (Pf3D7_07_v3), suggesting a loci on chromosome 7 may be under selection. We can take a closer look at this region by specifying an interval to plot the segments over as well as including annotation genes or genes that may be of interest over this region. We alter some of the parameters for fun.

# plot IBD segments over an interval: chromosome 7: 350000 - 550000
plotIBDsegments(ped.genotypes = my_genotypes,
                ibd.segments = my_ibd,
                interval = c("Pf3D7_07_v3",350000,550000),
                annotation.genes = annotation_genes,
                annotation.genes.color = NULL,
                highlight.genes = highlight_genes,
                highlight.genes.labels = FALSE,
                highlight.genes.color = NULL,
                highlight.genes.alpha = 0.1,
                segment.height = 0.8,
                number.per.page = NULL,
                fid.label = FALSE,
                iid.label = FALSE,
                ylabel.size = 9,
                add.rug = TRUE,
                plot.title = "Distribution of IBD segments in PNG",
                add.legend = TRUE,
                segment.color = c("purple","green"))

The gene CRT (Plasmodium falciparum chloroquine resistance transporter) overlaps many of the segments. Mutations in this gene have been associated with antimalarial drug resistance, resulting in positive selection of this gene and hence many IBD segments. Ticks on the x-axis inside the plotting area indicate SNP positions (add.rug=TRUE) while yellow and red colored segments represent annotation genes (positive strand and negative strand respectively).

As an alternative to plotting the IBD segments as above, which may become unfeasible when there are many IBD pairs, we can calculate the proportion of pairs who are IBD at each SNP and plot this distribution across the genome as a quick way of identifying genomic loci with many IBD pairs.

# generate a binary IBD matrix
my_matrix <- getIBDmatrix(ped.genotypes = my_genotypes, 
                          ibd.segments = my_ibd)

# calculate the proportion of pairs IBD at each SNP
my_proportion <- getIBDproportion(ped.genotypes = my_genotypes, 
                                  ibd.matrix = my_matrix, 
                                  groups = NULL)

# plot the proportion of pairs IBD
plotIBDproportions(ibd.proportions = my_proportion, 
                   interval = NULL, 
                   annotation.genes = NULL,
                   annotation.genes.color = NULL,
                   highlight.genes = NULL,
                   highlight.genes.labels = TRUE,
                   highlight.genes.color = NULL,
                   highlight.genes.alpha = 0.1,
                   add.rug = FALSE, 
                   plot.title = "Proportion of pairs IBD in PNG", 
                   add.legend = FALSE,
                   line.color = NULL, 
                   facet.label = TRUE, 
                   facet.scales = "fixed", 
                   subpop.facet = FALSE)

The region on chromosome 7 has a high proportion of pairs that are IBD, potentially reflective of positive selection. This figure can be further explored if additional information is available. For example, we could stratify the results by collection sites to see if different sites have different loci potentially under selection. All of the PNG isolates were collected from Madang and no other location, therefore we create an example grouping of all the isolates for stratification purposes.

# creating a stratification dataset 
my_groups <- my_genotypes[[1]][,1:3]
my_groups[1:10,"pid"] <- "a"
my_groups[11:25,"pid"] <- "b"
my_groups[26:38,"pid"] <- "c"

my_proportion <- getIBDproportion(ped.genotypes = my_genotypes,
                                  ibd.matrix = my_matrix,
                                  groups = my_groups)

# plot the proportion of pairs IBD
plotIBDproportions(ibd.proportions = my_proportion,
                   interval = NULL,
                   annotation.genes = NULL,
                   annotation.genes.color = NULL,
                   highlight.genes = NULL,
                   highlight.genes.labels = FALSE,
                   highlight.genes.color = NULL,
                   highlight.genes.alpha = 0.1,
                   line.color = NULL,
                   add.rug = FALSE,
                   plot.title = "Proportion of pairs IBD in PNG - with stratification",
                   add.legend = FALSE,
                   facet.label = TRUE,
                   facet.scales = "fixed",
                   subpop.facet = TRUE)

To identify genomic loci with significant amounts of excess IBD we can apply a transformation to the binary IBD matrix to account for variations in isolate relatedness as well as SNP allele frequencies, then calculate a summary statistic that can be used to assess significance.

# calculate the significance of IBD sharing
my_iR <- getIBDiR(ped.genotypes = my_genotypes, 
                  ibd.matrix = my_matrix, 
                  groups = NULL)

# plot the iR statistics
plotIBDiR(ibd.iR = my_iR, 
          interval = NULL, 
          annotation.genes = NULL,
          annotation.genes.color = NULL,
          highlight.genes = NULL,
          highlight.genes.labels = FALSE,
          highlight.genes.color = NULL,
          highlight.genes.alpha = 0.1,
          point.size = 1,
          point.color = NULL,
          add.rug = FALSE, 
          plot.title = "Significance of IBD sharing", 
          add.legend = FALSE,
          facet.label = TRUE, 
          facet.scales = "fixed")

A 5% significance threshold would identify regions that have SNPs with -log10 (P-values) > -log10(0.05) as having significant IBD sharing, indicative of positive selection.

We note that the function getIBDiR can return NA values which may occur when there are no IBD pairs or when all pairs are IBD, or when only several isolates are analyzed, among other reasons.

We can investigate the specific isolate pairs that are contributing to the excess IBD sharing on chromosome 7 by creating an IBD network. Such a network will identify clusters of isolates sharing a common haplotype over this region. We include our example grouping of isolates when plotting the network.

# generate the isolates who are IBD over the Plasmodium falciparum CRT gene
my_i_clusters <- getIBDiclusters(ped.genotypes = my_genotypes, 
                                 ibd.segments = my_ibd, 
                                 interval = c("Pf3D7_07_v3", 403222, 406317), 
                                 prop=0, 
                                 hi.clust = FALSE)

str(my_i_clusters)

# plot the network of clusters
plotIBDclusters(ped.genotypes = my_genotypes, 
                clusters = my_i_clusters, 
                groups = my_groups, 
                vertex.color = NULL, 
                vertex.frame.color = "white",
                vertex.size = 4, 
                vertex.name = FALSE, 
                edge.color = "gray60", 
                edge.width = 0.8, 
                mark.border = "white",
                mark.col = "gray94", 
                add.legend = TRUE, 
                legend.x = -1.5, 
                legend.y = -0.25, 
                layout = NULL, 
                return.layout = FALSE)

Each time the function plotIBDclusters() is run the vertex layout will differ. This is an artifact of the R package used to generate the network, igraph. If the user wishes to keep the network layout the same in multiple subsequent figures then set return.layout = TRUE and use the returned layout in subsequent plotIBDclusters() runs with layout = my_layout.

We note that the i.network is output as the second object in the list my_i_clusters. If the user wishes to perform quantitative analyses using this network, there are many functions in the i.graph package to do so. Please see the i.graph documentation for more details on this.

We can also explore networks of isolates sharing a large proportion of their genome IBD. Such isolates may represent duplicate samples or individuals who have identical infections.

# generate the isolates who share at least than 90% of their genome IBD
my_p_clusters <- getIBDpclusters(ped.genotypes = my_genotypes, 
                                 ibd.segments = my_ibd, 
                                 prop=0.9, 
                                 hi.clust = FALSE)

# plot the network of clusters
plotIBDclusters(ped.genotypes = my_genotypes, 
                clusters = my_p_clusters, 
                groups = my_groups, 
                vertex.color = NULL, 
                vertex.frame.color = "white",
                vertex.size = 4, 
                vertex.name = FALSE, 
                edge.color = "gray60", 
                edge.width = 0.8, 
                mark.border = "white",
                mark.col = "gray94", 
                add.legend = TRUE, 
                legend.x = -1.5, 
                legend.y = -0.25, 
                layout = NULL, 
                return.layout = FALSE)