library(polyIBD)

Overview

polyIBD is an R package that can be used to infer regions of identity by descent (IBD) from samples that are unphased and potentially polyclonal.

Recently, several programs have been written to address this problem, including Henden/Bahlo's isorelate and Schaffner/Neafsey's hmmIBD. Here we extend upon these approaches and assume that the multiplicity of infection (MOI) is unkown and is not limited to 2.

Using a bayesian approach, the main strength of polyIBD is that MOI estimates are unbounded and can be inferred directly along with other model parameters. As such, polyIBD has the ability to infer:

  1. Regions of IBD
  2. MOI estimates for both samples
  3. Genetic Relatedness
  4. Rate of Recombination

Description

polyIBD is a first-order Hidden Markov Model (HMM) that uses the Metropolis-Hasting Algorithm to infer model parameters. Within the HMM framework, polyIBD assumes that the transition probabilities and emission probabilites can be described and uses the forward-backward algorithm to calculate the likelihood of the hidden state. For an extended description of the model, please view the supplementary information.

Getting Started with polyIBD

First you will need to have successfully installed polyIBD as described previously in the README.

Input

The input for polyIBD is a filtered Variant Call File (VCF) that follows the VCF specifciations for the Genotype field and has a genotype (GT) value for each sample. In additon, VCFs must be unphased and called as a haploid or diploid. As such, acceptable GT values are: 0, 1, 0/0, 0/1, 1/1, and ".", for homozygous referent (0, 0/0), heterozygous (0/1), homozygous alternative (1, 1/1), and missing (.), respectively.

Simulated Data

For the purpose of this vignette, we will simulate a VCF with two samples and use it to run polyIBD. First let's input some specifications for the number of loci, the genetic relatedness, and the recombination rate for these samples.

# setup
n <- 1e3 # this is the number of loci we would like to simulate
rho <- 7.4e-7 # this is the recombination rate that is assumed to be known
k_true <- 2 # this is the number of generations that we are trying to infer
f_true <- 0.4 # this is the proportion of genetic relatedness we are trying to infer 
m1 <- 1 # this is the multiplicity of infection for sample1 that we are trying to infer
m2 <- 1 # this is the multiplicity of infection for sample2 that we are trying to infer
m_true <- c(m1, m2)
pos <- sort(sample(1.4e6, n)) # simulate some positions/genomic coordinates

We will now call the built-in simulation function from polyIBD to take our inputs a simulate a two-sampled VCF. The simulation is using a beta distribution with the shape parameters alpha=1 and beta=1 to simiulate population allele frequencies. This distribution can be adjusted by the user but the 1,1 shape parameters provide a population with a high proportion of fixated alleles (to see the distribution, input the following command to your R Console: #> hist(rbeta(1000, 0.5, 0.5))).

# run the simulation 
sim <- polyIBD::simData(pos=list(contig1=pos), 
                        m1=m_true[1], m2=m_true[2], k=k_true,
                        f=f_true, rho=rho, p=NULL, 
                        p_shape1=0.1, p_shape2=0.1, 
                        propMissing = 0)

f_ind_true <- mean(unlist(sim$IBD[,1:ncol(sim$IBD)]))
f_ind_true
trueIBD <- data.frame(CHROM = sim$CHROMPOS$CHROM, POS=sim$CHROMPOS$POS,
                      z_true = rowSums(sim$IBD[,1:ncol(sim$IBD),drop=FALSE]))

The ouptut of the simData function is a list of two lists and two dataframes:
1. Simulated population major allele frequencies for both samples (p) 2. The haploid genotypes (0 for referrent allele, 2 for alternate allele) for both samples (haploid) 3. The haploid genotypes (0 for referrent allele, 2 for alternate allele) for both samples (haploid) 4. The IBD genotypes at each positions genotypes (0 for referrent allele, 2 for alternate allele) for both samples (haploid)

Running the MCMC

# run MCMC
ret <- runMCMC(input = sim, rho=rho, k_max = 25, 
               burnin=5e1, samples=1e4, reportIteration=1e2) 

From the MCMC, a polyIBD class object is returned that contians both summary statistics and raw data. The summary statistics and raw data can be interrogated directly.



nickbrazeau/polyIBD documentation built on Dec. 1, 2019, 11:53 p.m.