knitr::opts_chunk$set( collapse = TRUE, comment = "#:", fig.path = "man/figures/" ) ## output: BiocStyle::html_document ##version <- as.vector(read.dcf('DESCRIPTION')[, 'Version']) ##version <- gsub('-', '.', version) version <- "0.6-5"
Version: r version
It is well known that the dosage level of markers in autopolyploids and allopolyploids can be characterised by their observed segregation ratios. The related package \texttt{polySegratio} provides functions to allocate dosage by standard approaches and to simulate marker data sets for differing ploidies and levels of overdispersion. Note that these methods could equally well be applied to allopolyploids with specified expected segregation ratios. For details see \texttt{polySegratio}.
A Bayesian approach was proposed by @baker2010 where marker dosage estimation was obtained by fitting a finite mixture distribution.
This library calls the JAGS software for Bayesian
calculation. JAGS 1.0 or higher must be installed following
instructions from https://mcmc-jags.sourceforge.io. Note that
only the most recent version is used for testing with R. The
JAGS executable must be in your path. Currently, no checking
is carried out to ascertain whether or not JAGS is set up
appropriately.
To use the library, you need to attach it with
library(polySegratioMM)
##library(cacheSweave) # comment this out after development phase ## to use: Sweave("polySegratioMM-overview.Rnw", driver=cacheSweaveDriver) ## or without cache: Sweave("polySegratioMM-overview.Rnw") op <- options() options(width=70, digits=4)
Library functions are demonstrated on a simulated data set generated
using the sim.autoMarkers function from the
polySegratio package.
##<<simData, cache=true} ## simulate small autohexaploid data set of 500 markers for 200 individuals ## with %70 Single, %20 Double and %10 Triple Dose markers ## created with ## hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200) ## save(hexmarkers, file="../../data/hexmarkers.RData") data(hexmarkers)
The following R code can be used to generate 500 markers for
200 autohexaploid individuals exhibiting overdispersion with the
parameter shape1 = 25. The underlying percentages of single
double and triple dose markers are 70%, 20% and 10%, respectively.
hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200)
print(hexmarkers)
Note that the segregation ratios for simulated or real data may be
extracted by using segregationRatios which sets up the
appropriate objects for testing marker dosage and plotting or
summarising the marker data.
sr <- segregationRatios(hexmarkers$markers)
plotTheoretical(ploidy.level=6, seg.ratios=sr, expected.segratio=NULL, proportions=c(0.7,0.2,0.1), n.individuals=200)
For instance, as seen in Figure \@ref(fig:sim1), segregation ratios may be plotted with
plotTheoretical(ploidy.level=6, seg.ratios=sr, expected.segratio=NULL, proportions=c(0.7,0.2,0.1), n.individuals=200)
On the other hand, consider a similar data set that exhibits overdispersion. This may be simulated as follows
hexmarkers.overdisp <- sim.autoMarkers(6, c(0.7,0.2,0.1), n.markers=500,n.individuals=200, overdispersion=TRUE, shape1=30)
Markers
sr.overdisp <- segregationRatios(hexmarkers.overdisp$markers)
print(plotTheoretical(ploidy.level=6, seg.ratios=sr.overdisp, main="", expected.segratio=NULL, proportions=c(0.7,0.2,0.1), n.individuals=200))
The histogram of marker segregation ratios, which is a useful graphical method for identifying overdispersion or outliers, is seen in Figure~\@ref(fig:sim2). Note that, due to overdispersion the theoretical distribution is narrower than the observed data.
For the j^th^ marker $j=1 \dots n$, we assume the observed number $r_j$ of dominant markers out of $N_j$ lines follows a binomial distribution denoted $Bin(N_j, P_k)$.If we knew the dosage $k$ then, following @ripol99, the expected value of $P_k$ may be written as
$$ P_k(k| m, x) = 1 - {{{m-k} \choose {mx}} \over {{m} \choose {mx}}}, k=0 \ldots m/2(#eq:ripol1) $$
here $m$ is the ploidy level or number of homologous chromosomes and the monoploid number $x$ is the number of chromosomes in a basic set. Note that for diploids $m=2$, tetraploids $m=4$ , octaploids then $m=8$ and so on and also that if there are no marker data missing then $N_j$ is simply the number of progeny.
Since the dosage of each marker is unknown, we rely on the missing data representation of @dempster77 and @tannerandwong87 which is commonly adopted for MCMC computation in finite mixture models. An indicator variable $z_j$ corresponding to unknown marker dosage class $k$ is introduced where $z_j = k$ if the marker has dose $k$. For the $K$ components with $K \leq m/2$, consider the logit transformation of the true segregation proportions $P_k$ for dose $k, k=1 \dots K$. The the logit transformed segregation ratio $ \omega_k$ is then
$$ \omega_k = \log({ {P_k}\over{1-P_k} })(#eq:logit) $$
Let $z=(z_1 \ldots z_n)^T$ be a vector of unknown dosages (labelled $1,2 \dots K$ corresponding to simplex, duplex, triplex markers and so on), then $r_j$ is binomially distributed with known size parameter $N_j$ and unknown proportion parameter $\omega_{Z_j}$ which is the segregation ratio for marker dosage $z_j$. Hence, given marker dosage $z_j$ then
$$ \begin{eqnarray} r_j|z_j & \sim & Bin \left( N_j, \omega_{Z_j} \right), \ \mbox{ where } && \nonumber\ \mbox{logit}( \omega_{Z_j} ) & = & \log ( { \omega_{Z_j} \over 1 - \omega_{Z_ j} } ) \sim N( \mu_{Z_j}, \tau_{Z_j}^{-1}) \nonumber \end{eqnarray} (#eq:bin-mod1) $$
where $\mu_k$ and $\tau_k$ are the mean and precision $(\tau_k = 1/\sigma^2_k)$ of marker dosage class $k$ on the logit scale.
Since the dosage is unknown, for the autohexaploid data generated here then for the logit($\omega_{z_k}$) can be modelled as a finite mixture of 3 normals
$$\mbox{logit}( \omega_{Z_j} ) \sim \pi_1 N(\mu_1,\tau_1^{-1}) + \pi_2 N(\mu_2,\tau_2^{-1}) + \dots + \pi_K N(\mu_K,\tau_K^{-1})(#eq:3normals)$$
where $\mu_k$ is the mean and $\tau_k$ is the precision of component $k$ on the logit scale, and $\pi_k$ are the mixing proportions of the three components with $\sum_{k=1}^K\pi_k = 1$. The probability density function $f(x)$ of logit($\omega_k$) is
$$f(x) = \sum_{k=1}^K \pi_k \phi(x | \mu_k, \tau^{-1}_k)$$
where $\phi$ is the normal cumulative distribution function with parameters mean $\mu_k$ and variance $\sigma^2_k =\tau_k^{-1}$.
Simulation studies suggested that incorporating strong prior information, such as the expected distributions of @haldane30 provided the best method of allocating dosage. Further details may be found in @baker2010.
A mixture model may be set up with setModel. By default, only
two parameters are required, namely the ploidy.level or the
number of homologous chromosomes set either as a numeric or as a
character string and also n.components or the number of
components for mixture model (less than or equal to maximum number of
possible dosages). By default, strong priors are set by using the
formulae of @haldane30 for the expected numbers and ratios of
offspring for various parental configurations of autopolyploids.
For the autohexaploid data generated above, the models are set with
x.mod1 <- setModel(3,6) # autohexaploid model with 3 components
The R object x.mod1 contains components describing
aspects of the model such as the number of components, ploidy,
expected segregation ratios and so on. Note that the str
command is useful for displaying the internal structure of any
R object.
While various options are available for fine tuning the MCMC process, the simplest way to fit a mixture model to allocate marker dosages is with the wrapper function `runSegratioMM} as follows:
mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1)
which automatically determines starting values, priors, length of burn in, number of iterations, and other parameters as well as producing summary statistics and diagnostic plots.
## produced using the following but loaded as data to avoid the run time on slow machines ##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, burn.in=200, sample=500, plots=FALSE) ##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, plots=FALSE) ## save(mcmcHexRun, file="../../data/mcmcHexRun.RData") data(mcmcHexRun)
To run JAGS without producing plots then set the
plots option to FALSE. For the overdispersed data
running this command produced the following selected output. While
selected output is printed here the simple command
print(mcmcHexRun) whould produce the following output and
more.
The summary of processing times:
print(mcmcHexRun$run.jags)
And summary statistics for the posterior distributions of selected parameters:
print(mcmcHexRun$summary)
Note that MCMC convergence diagnostic output is produced automatically. Assessing convergence is crucial in MCMC and poor convergence may result in mis-allocated marker dosages. The diagnostic statistics indicate that convergence was achieved.
print(mcmcHexRun$diagnostics)
And finally, summaries of marker dosage allocations are produced:
print(mcmcHexRun$doses)
Note that simply plotting mcmcHexRun will produce a histogram
of segregation proportions and the fitted model but that other plots
are easily produced.
plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])
When the plots option of runSegratioMM is set to the
default value of TRUE, numerous plots are produced including
trace and density plots from the CODA package. These may also
be extracted manually but the process is somewhat more
complicated. For instance, to obtain trace and density plots for the
parameters $p_1$, $\mu_1$, $\sigma_1$ and for the 140^th^ marker,
as shown in Figure \@ref(fig:trace1), then CODA may be used
directly by following command:
plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])
plot(mcmcHexRun, theoretical=TRUE, main="")
The histogram of segregation proportions with fitted and theoretical
values shown in Figure \@ref(fig:fitted1) may be obtained by setting
the theoretical option to TRUE as follows.
plot(mcmcHexRun, theoretical=TRUE)
Marker dosages allocations may be obtained directly from the object
mcmcHexRun. The dosage with maximum posterior probability is
simply mcmcHexRun\$doses\$max.post.dosage. A more conservative
allocation is obtained by using mcmcHexRun\$doses\$dosage[,"0.8"]
whereby the dosage with posterior probability over 0.8 is
employed. For instance, to tabulate the number of markers (including
those not allocated a dosage which are labelled NA) the table
command can be employed.
cat("Employing maximum posterior probability\n") table(Dose=mcmcHexRun$doses$max.post.dosage, exclude=NULL) cat("Employing posterior probability > 0.8\n") table(Dose=mcmcHexRun$doses$dosage[,"0.8"], exclude=NULL)
And of course since the data were simulated we can compare the
estimated and true dosages obtained as
hexmarkers.overdisp\$true.doses\$dosage via cross tabulation. Doses can
also be obtained for the standard$\chi^2$ test by using the
test.segRatio command from the polySegratio library.
cat("Employing theChi squared test\n") dose.chi <- test.segRatio(sr.overdisp, ploidy.level = 6) table(Chi2Dose=dose.chi$dosage, True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL) cat("Employing maximum posterior probability\n") table(MixtureDose=mcmcHexRun$doses$max.post.dosage, True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL) cat("Employing posterior probability > 0.8\n") table(MixtureDose=mcmcHexRun$doses$dosage[, "0.8"], True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL)
These tables show that far fewer markers are allocated a dosage using the standard $\chi^2$ test than by the mixture model. Fewer markers were misclassified using a posterior probability threshold of 0.8 rather than the maximum posterior probability as a basis for allocating dosage.
\bibliographystyle{apalike}
Karen Aitken, given her experience in tetraploids and sugarcane marker maps, has provided many valuable insights into marker dosage in autopolyploids. Additionally, Ross Darnell, Andrew George and Kerrie Mengersen provided useful comments and discussions.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.