knitr::opts_chunk$set(fig.width = 5, fig.height = 5)
This vignette shows how to use the R package disclapmix
that implements the method described in [@AndersenDisclap2013].
The key function is disclapmix_adaptive()
.
For a more gentle introduction to the method, refer to [@AndersenDisclapIntroduction2013].
A Danish reference database [@Hallenberg2005YchromosomeSH] with $n = 185$ observations (male Y-STR haplotypes) at $r=10$ loci is available in the danes
dataset.
Let us load the package as well as the data:
library(ggplot2) library(disclapmix) data(danes)
The database is in compact format, i.e. one unique haplotype per row. To fit the model, we need one observation per row. This is done for example like this:
db <- as.matrix(danes[rep(seq_len(nrow(danes)), danes$n), seq_len(ncol(danes)-1)]) str(db)
Also, note that the database is now an integer matrix.
We fit models from 1 cluster up to as many necessary (see ?disclapmix_adaptive
for how this is determined):
fits <- disclapmix_adaptive(x = db)
We can extract how well each model fits the data (judged by the marginal BIC [@BIC]):
BICs <- sapply(fits, function(x) x$BIC_marginal) bestfit <- fits[[which.min(BICs)]] bestfit
We can plot how well each model fits the data:
clusters <- seq_along(fits) # or: sapply(fits, function(x) nrow(x$y)) plot(clusters, BICs, type = "b") points(nrow(bestfit$y), bestfit$BIC_marginal, pch = 4, cex = 4, col = "red")
And use the model to predict haplotype frequencies:
db_haplotypes <- as.matrix(danes[, seq_len(ncol(db))]) p <- predict(bestfit, db_haplotypes) plot(log10(danes$n / sum(danes$n)), log10(p), xlab = "Observed relative frequency", ylab = "Discrete Laplace estimated probability") abline(0, 1)
To fit a model using 2 clusters, the following command can be used (note the L
postfix to emphasize that the number is an integer):
fit <- disclapmix(x = db, clusters = 2L)
The number of clusters is not known beforehand. Here, the numbers 1 through 5 are tried and the best one according to the BIC [@BIC] is taken:
clusters <- 1L:5L fits <- lapply(clusters, function(clusters) { fit <- disclapmix(x = db, clusters = clusters) return(fit) }) marginalBICs <- sapply(fits, function(fit) { fit$BIC_marginal }) bestfit <- fits[[which.min(marginalBICs)]]
The best fit is now in the bestfit
that can be inspected by print
(default method called when the variable is just written) or summary
which (currently) give the same output:
bestfit
summary(bestfit)
We can also plot the fitted model:
plot(bestfit)
There are important information returned by disclapmix
, e.g. the central haplotypes and the dispersion parameters for the discrete Laplace distributions:
bestfit$y bestfit$disclap_parameters
The returned object is described in ?disclapmix
and objects can be inspected using e.g. str()
.
Haplotype frequencies can be obtained using the predict
function. Note, that this is done per haplotype (danes
) and not per observation (db
):
disclap_estimates <- predict(bestfit, newdata = as.matrix(danes[, 1:(ncol(danes) - 1)]))
These can be compared to the database frequencies:
ggplot() + geom_abline(intercept = 0, slope = 1) + geom_point(aes(x = danes$n/sum(danes$n), y = disclap_estimates)) + labs(x = "Observed frequency", y = "Predicted frequency (discrete Laplace)") + theme_bw() + scale_x_log10() + scale_y_log10()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.