library(polyIBD)
polyIBD
is an R package that can be used to infer regions of identity by descent (IBD) from samples that are unphased and potentially polyclonal.
Recently, several programs have been written to address this problem, including Henden/Bahlo's isorelate and Schaffner/Neafsey's hmmIBD. Here we extend upon these approaches and assume that the multiplicity of infection (MOI) is unkown and is not limited to 2.
Using a bayesian approach, the main strength of polyIBD
is that MOI estimates are unbounded and can be inferred directly along with other model parameters. As such, polyIBD
has the ability to infer:
polyIBD
is a first-order Hidden Markov Model (HMM) that uses the Metropolis-Hasting Algorithm to infer model parameters. Within the HMM framework, polyIBD
assumes that the transition probabilities and emission probabilites can be described and uses the forward-backward algorithm to calculate the likelihood of the hidden state. For an extended description of the model, please view the supplementary information.
polyIBD
First you will need to have successfully installed polyIBD
as described previously in the README.
The input for polyIBD
is a filtered Variant Call File (VCF) that follows the VCF specifciations for the Genotype field and has a genotype (GT) value for each sample. In additon, VCFs must be unphased and called as a haploid or diploid. As such, acceptable GT values are: 0, 1, 0/0, 0/1, 1/1, and ".", for homozygous referent (0, 0/0), heterozygous (0/1), homozygous alternative (1, 1/1), and missing (.), respectively.
For the purpose of this vignette, we will simulate a VCF with two samples and use it to run polyIBD
. First let's input some specifications for the number of loci, the genetic relatedness, and the recombination rate for these samples.
# setup n <- 1e3 # this is the number of loci we would like to simulate rho <- 7.4e-7 # this is the recombination rate that is assumed to be known k_true <- 2 # this is the number of generations that we are trying to infer f_true <- 0.4 # this is the proportion of genetic relatedness we are trying to infer m1 <- 1 # this is the multiplicity of infection for sample1 that we are trying to infer m2 <- 1 # this is the multiplicity of infection for sample2 that we are trying to infer m_true <- c(m1, m2) pos <- sort(sample(1.4e6, n)) # simulate some positions/genomic coordinates
We will now call the built-in simulation function from polyIBD
to take our inputs a simulate a two-sampled VCF. The simulation is using a beta distribution with the shape parameters alpha=1 and beta=1 to simiulate population allele frequencies. This distribution can be adjusted by the user but the 1,1 shape parameters provide a population with a high proportion of fixated alleles (to see the distribution, input the following command to your R Console: #> hist(rbeta(1000, 0.5, 0.5))
).
# run the simulation sim <- polyIBD::simData(pos=list(contig1=pos), m1=m_true[1], m2=m_true[2], k=k_true, f=f_true, rho=rho, p=NULL, p_shape1=0.1, p_shape2=0.1, propMissing = 0) f_ind_true <- mean(unlist(sim$IBD[,1:ncol(sim$IBD)])) f_ind_true trueIBD <- data.frame(CHROM = sim$CHROMPOS$CHROM, POS=sim$CHROMPOS$POS, z_true = rowSums(sim$IBD[,1:ncol(sim$IBD),drop=FALSE]))
The ouptut of the simData
function is a list of two lists and two dataframes:
1. Simulated population major allele frequencies for both samples (p
)
2. The haploid genotypes (0 for referrent allele, 2 for alternate allele) for both samples (haploid
)
3. The haploid genotypes (0 for referrent allele, 2 for alternate allele) for both samples (haploid
)
4. The IBD genotypes at each positions genotypes (0 for referrent allele, 2 for alternate allele) for both samples (haploid
)
# run MCMC ret <- runMCMC(input = sim, rho=rho, k_max = 25, burnin=5e1, samples=1e4, reportIteration=1e2)
From the MCMC, a polyIBD class object is returned that contians both summary statistics and raw data. The summary statistics and raw data can be interrogated directly.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.