polySegratioMM: An R library for Bayesian mixture models for marker dosage in autopolyploids
In polySegratioMM: Bayesian Mixture Models for Marker Dosage in Autopolyploids

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#:",
  fig.path = "man/figures/"
)
## output: BiocStyle::html_document
##version <- as.vector(read.dcf('DESCRIPTION')[, 'Version'])
##version <- gsub('-', '.', version)
version <- "0.6-5"

Version: r version

It is well known that the dosage level of markers in autopolyploids and allopolyploids can be characterised by their observed segregation ratios. The related package \texttt{polySegratio} provides functions to allocate dosage by standard approaches and to simulate marker data sets for differing ploidies and levels of overdispersion. Note that these methods could equally well be applied to allopolyploids with specified expected segregation ratios. For details see \texttt{polySegratio}.

A Bayesian approach was proposed by @baker2010 where marker dosage estimation was obtained by fitting a finite mixture distribution.

This library calls the JAGS software for Bayesian calculation. JAGS 1.0 or higher must be installed following instructions from https://mcmc-jags.sourceforge.io. Note that only the most recent version is used for testing with R. The JAGS executable must be in your path. Currently, no checking is carried out to ascertain whether or not JAGS is set up appropriately.

To use the library, you need to attach it with

library(polySegratioMM)

##library(cacheSweave)  #  comment this out after development phase
## to use:    Sweave("polySegratioMM-overview.Rnw", driver=cacheSweaveDriver)
## or without cache: Sweave("polySegratioMM-overview.Rnw")
op <- options()
options(width=70, digits=4)

Simulated data

Library functions are demonstrated on a simulated data set generated using the sim.autoMarkers function from the polySegratio package.

##<<simData, cache=true}
## simulate small autohexaploid data set of 500 markers for 200 individuals
## with %70 Single, %20 Double and %10 Triple Dose markers
## created with 
## hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200)
## save(hexmarkers, file="../../data/hexmarkers.RData")
data(hexmarkers)

The following R code can be used to generate 500 markers for 200 autohexaploid individuals exhibiting overdispersion with the parameter shape1 = 25. The underlying percentages of single double and triple dose markers are 70%, 20% and 10%, respectively.

  hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200)

print(hexmarkers)

Note that the segregation ratios for simulated or real data may be extracted by using segregationRatios which sets up the appropriate objects for testing marker dosage and plotting or summarising the marker data.

sr <-  segregationRatios(hexmarkers$markers)

plotTheoretical(ploidy.level=6, seg.ratios=sr,
  expected.segratio=NULL, proportions=c(0.7,0.2,0.1),
  n.individuals=200)

For instance, as seen in Figure \@ref(fig:sim1), segregation ratios may be plotted with

plotTheoretical(ploidy.level=6, seg.ratios=sr,
  expected.segratio=NULL, proportions=c(0.7,0.2,0.1),
  n.individuals=200)

On the other hand, consider a similar data set that exhibits overdispersion. This may be simulated as follows

hexmarkers.overdisp <- sim.autoMarkers(6, c(0.7,0.2,0.1),
                                       n.markers=500,n.individuals=200,
                                       overdispersion=TRUE, shape1=30)

Markers

sr.overdisp <- segregationRatios(hexmarkers.overdisp$markers)

print(plotTheoretical(ploidy.level=6, seg.ratios=sr.overdisp, main="",
  expected.segratio=NULL, proportions=c(0.7,0.2,0.1),
  n.individuals=200))

The histogram of marker segregation ratios, which is a useful graphical method for identifying overdispersion or outliers, is seen in Figure~\@ref(fig:sim2). Note that, due to overdispersion the theoretical distribution is narrower than the observed data.

A Bayesian mixture model approach

For the j^th^ marker $j=1 \dots n$, we assume the observed number $r_j$ of dominant markers out of $N_j$ lines follows a binomial distribution denoted $Bin(N_j, P_k)$.If we knew the dosage $k$ then, following @ripol99, the expected value of $P_k$ may be written as

$$ P_k(k| m, x) = 1 - {{{m-k} \choose {mx}} \over {{m} \choose {mx}}}, k=0 \ldots m/2(#eq:ripol1) $$

here $m$ is the ploidy level or number of homologous chromosomes and the monoploid number $x$ is the number of chromosomes in a basic set. Note that for diploids $m=2$, tetraploids $m=4$ , octaploids then $m=8$ and so on and also that if there are no marker data missing then $N_j$ is simply the number of progeny.

Since the dosage of each marker is unknown, we rely on the missing data representation of @dempster77 and @tannerandwong87 which is commonly adopted for MCMC computation in finite mixture models. An indicator variable $z_j$ corresponding to unknown marker dosage class $k$ is introduced where $z_j = k$ if the marker has dose $k$. For the $K$ components with $K \leq m/2$, consider the logit transformation of the true segregation proportions $P_k$ for dose $k, k=1 \dots K$. The the logit transformed segregation ratio $ \omega_k$ is then

$$ \omega_k = \log({ {P_k}\over{1-P_k} })(#eq:logit) $$

Let $z=(z_1 \ldots z_n)^T$ be a vector of unknown dosages (labelled $1,2 \dots K$ corresponding to simplex, duplex, triplex markers and so on), then $r_j$ is binomially distributed with known size parameter $N_j$ and unknown proportion parameter $\omega_{Z_j}$ which is the segregation ratio for marker dosage $z_j$. Hence, given marker dosage $z_j$ then

$$ \begin{eqnarray} r_j|z_j & \sim & Bin \left( N_j, \omega_{Z_j} \right), \ \mbox{ where } && \nonumber\ \mbox{logit}( \omega_{Z_j} ) & = & \log ( { \omega_{Z_j} \over 1 - \omega_{Z_ j} } ) \sim N( \mu_{Z_j}, \tau_{Z_j}^{-1}) \nonumber \end{eqnarray} (#eq:bin-mod1) $$

where $\mu_k$ and $\tau_k$ are the mean and precision $(\tau_k = 1/\sigma^2_k)$ of marker dosage class $k$ on the logit scale.

Since the dosage is unknown, for the autohexaploid data generated here then for the logit($\omega_{z_k}$) can be modelled as a finite mixture of 3 normals

$$\mbox{logit}( \omega_{Z_j} ) \sim \pi_1 N(\mu_1,\tau_1^{-1}) + \pi_2 N(\mu_2,\tau_2^{-1}) + \dots + \pi_K N(\mu_K,\tau_K^{-1})(#eq:3normals)$$

where $\mu_k$ is the mean and $\tau_k$ is the precision of component $k$ on the logit scale, and $\pi_k$ are the mixing proportions of the three components with $\sum_{k=1}^K\pi_k = 1$. The probability density function $f(x)$ of logit($\omega_k$) is

$$f(x) = \sum_{k=1}^K \pi_k \phi(x | \mu_k, \tau^{-1}_k)$$

where $\phi$ is the normal cumulative distribution function with parameters mean $\mu_k$ and variance $\sigma^2_k =\tau_k^{-1}$.

Simulation studies suggested that incorporating strong prior information, such as the expected distributions of @haldane30 provided the best method of allocating dosage. Further details may be found in @baker2010.

Specifying a model

A mixture model may be set up with setModel. By default, only two parameters are required, namely the ploidy.level or the number of homologous chromosomes set either as a numeric or as a character string and also n.components or the number of components for mixture model (less than or equal to maximum number of possible dosages). By default, strong priors are set by using the formulae of @haldane30 for the expected numbers and ratios of offspring for various parental configurations of autopolyploids.

For the autohexaploid data generated above, the models are set with

x.mod1 <- setModel(3,6)  # autohexaploid model with 3 components

The R object x.mod1 contains components describing aspects of the model such as the number of components, ploidy, expected segregation ratios and so on. Note that the str command is useful for displaying the internal structure of any R object.

Fitting a mixture model

While various options are available for fine tuning the MCMC process, the simplest way to fit a mixture model to allocate marker dosages is with the wrapper function `runSegratioMM} as follows:

mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1)

which automatically determines starting values, priors, length of burn in, number of iterations, and other parameters as well as producing summary statistics and diagnostic plots.

## produced using the following but loaded as data to avoid the run time on slow machines
##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, burn.in=200, sample=500, plots=FALSE)
##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, plots=FALSE)
## save(mcmcHexRun, file="../../data/mcmcHexRun.RData")
data(mcmcHexRun)

To run JAGS without producing plots then set the plots option to FALSE. For the overdispersed data running this command produced the following selected output. While selected output is printed here the simple command print(mcmcHexRun) whould produce the following output and more.

The summary of processing times:

print(mcmcHexRun$run.jags)

And summary statistics for the posterior distributions of selected parameters:

print(mcmcHexRun$summary)

Note that MCMC convergence diagnostic output is produced automatically. Assessing convergence is crucial in MCMC and poor convergence may result in mis-allocated marker dosages. The diagnostic statistics indicate that convergence was achieved.

print(mcmcHexRun$diagnostics)

And finally, summaries of marker dosage allocations are produced:

print(mcmcHexRun$doses)

Note that simply plotting mcmcHexRun will produce a histogram of segregation proportions and the fitted model but that other plots are easily produced.

plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])

When the plots option of runSegratioMM is set to the default value of TRUE, numerous plots are produced including trace and density plots from the CODA package. These may also be extracted manually but the process is somewhat more complicated. For instance, to obtain trace and density plots for the parameters $p_1$, $\mu_1$, $\sigma_1$ and for the 140^th^ marker, as shown in Figure \@ref(fig:trace1), then CODA may be used directly by following command:

plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])

plot(mcmcHexRun, theoretical=TRUE, main="")

The histogram of segregation proportions with fitted and theoretical values shown in Figure \@ref(fig:fitted1) may be obtained by setting the theoretical option to TRUE as follows.

plot(mcmcHexRun, theoretical=TRUE)

Assigning marker dosage

Marker dosages allocations may be obtained directly from the object mcmcHexRun. The dosage with maximum posterior probability is simply mcmcHexRun\$doses\$max.post.dosage. A more conservative allocation is obtained by using mcmcHexRun\$doses\$dosage[,"0.8"] whereby the dosage with posterior probability over 0.8 is employed. For instance, to tabulate the number of markers (including those not allocated a dosage which are labelled NA) the table command can be employed.

cat("Employing maximum posterior probability\n")
table(Dose=mcmcHexRun$doses$max.post.dosage, exclude=NULL)
cat("Employing posterior probability > 0.8\n")
table(Dose=mcmcHexRun$doses$dosage[,"0.8"], exclude=NULL)

And of course since the data were simulated we can compare the estimated and true dosages obtained as hexmarkers.overdisp\$true.doses\$dosage via cross tabulation. Doses can also be obtained for the standard$\chi^2$ test by using the test.segRatio command from the polySegratio library.

cat("Employing theChi squared test\n")
dose.chi <- test.segRatio(sr.overdisp, ploidy.level = 6)
table(Chi2Dose=dose.chi$dosage,
  True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL)
cat("Employing maximum posterior probability\n")
table(MixtureDose=mcmcHexRun$doses$max.post.dosage,
  True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL)
cat("Employing posterior probability > 0.8\n")
table(MixtureDose=mcmcHexRun$doses$dosage[, "0.8"],
  True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL)

These tables show that far fewer markers are allocated a dosage using the standard $\chi^2$ test than by the mixture model. Fewer markers were misclassified using a posterior probability threshold of 0.8 rather than the maximum posterior probability as a basis for allocating dosage.

\bibliographystyle{apalike}

Acknowledgments

Karen Aitken, given her experience in tetraploids and sugarcane marker maps, has provided many valuable insights into marker dosage in autopolyploids. Additionally, Ross Darnell, Andrew George and Kerrie Mengersen provided useful comments and discussions.