compareMixtureToData: Compare a mixture solution to some data.
In danjlawson/badMIXTURE: Validating Structure With Chromosome Painting

This function takes a mixture solution, a data matrix, and a mapping of data observations into clusters. It then predicts what it expects the mixture solutions to look like.

It is essential to understand that there are two key structures being explored simultaneously in these plots. The first is the P-dimensional clustering of the data, which defines the similarities; each of the N data points has a similarity to each of these P clusters. A representation in this space is called a palette. The second is the K-dimensional set of admixture weights. Each of the K "ancestral" or latent variables also has a P-dimensional palette. Further, each of the N data points has a K-dimensional "admixture" breakdown, so that its palette is seen as a mixture of the K ancestral palettes.
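As a rough illustration of how the two structures relate, the sketch below treats the predicted palette of each individual as its admixture weights applied to the ancestral palettes, i.e. an N by K matrix multiplied by a K by P matrix. All values and the names `palettes` and `pred` are made up for illustration; this is not the package's internal code.

```r
# Toy dimensions (hypothetical): N individuals, K ancestral populations, P clusters
N <- 4; K <- 2; P <- 3

# N by K admixture weights; each row sums to 1
mix <- matrix(c(1.0, 0.0,
                0.5, 0.5,
                0.2, 0.8,
                0.0, 1.0), nrow = N, byrow = TRUE)

# K by P ancestral palettes (made-up values); each row sums to 1
palettes <- matrix(c(0.7, 0.2, 0.1,
                     0.1, 0.3, 0.6), nrow = K, byrow = TRUE)

# Predicted N by P palettes: each individual is a weighted mix of the K palettes
pred <- mix %*% palettes

# Rows of the prediction keep the same normalisation as the palettes
rowSums(pred)
```

Because both the weights and the palettes have rows summing to 1 here, the predicted palettes do as well, which is why the row normalisation of the input data matters.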

We must be able to match each individual to the palette that it represents, so that we can order the individuals according to the palette. As such, this implementation reorders the individuals, and may reorder the clusters themselves, in order to provide a clean representation of the data. If you don't want this, you may try using compareMixtureToDataDirect.

compareMixtureToData(mix, dataraw, fam = NULL, ids = NULL, remself = 0,
  ancestral = "solve", relabel = I, tdend = NULL, popdend = NULL,
  poplist = NULL, poporder = NULL, mycols = NULL, mycols2 = NULL,
  gap = 3)

`mix`	A proposed mixture solution of dimension N by K; for example, as generated by STRUCTURE or ADMIXTURE.
`dataraw`	A matrix containing the data of dimension N by P, for example, as generated by ChromoPainter. The data must reflect similarity to P clusterings of the data. Note that we assume that this matrix is correctly normalised (all the rows sum to the same value); if this is not true then the results may be strange. If in doubt, provide dataraw/rowSums(dataraw) to normalise the rows to sum to 1.
`fam`	An N by 2 or more data frame consisting of the fam file that generated the data. The used parts are: column 1: the cluster membership that created the groups in dataraw (with the column names in dataraw as the values). column 2: row names (for both the data and the mix, and in the same order). One of fam or ids must be present.
`ids`	An N by 3 data frame consisting of: column 1: row names (for both the data and the mix). column 2: the cluster membership that created the groups in dataraw (with the column names in dataraw as the values). column 3: inclusion (0 for absent, 1 for present; NB only all present will currently work!). One of fam or ids must be present.
`remself`	Number of iterations for which "self-copying" (cluster-specific sharing of drift) is removed. Set to 10 to essentially remove all self-copying.
`ancestral`	Ancestral model, i.e. how the latent clusters are defined. There are two options. "mixture": the populations are defined as a mixture of the individuals that comprise them according to the admixture model. This is not advisable because it allows ancestry from different true latent populations to affect inference of others. "solve": find the definition of the clusters that best explains the data by root-mean-square distance. Default: "solve"
`relabel`	Mapping from dendrogram labels to individual names. Should not be needed. Default: The identity function
`tdend`	Dendrogram of the individuals to determine plot order. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create this from the data
`popdend`	Dendrogram of the populations to determine plot order. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create this from the data
`poplist`	Specify the full ordering of individuals and populations. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create from the data
`poporder`	Specify a population ordering. Default: NULL, meaning determine from the dendrogram
`mycols`	The colour for each of the K ancestral populations; Default: NULL, meaning use rainbow(K). Can be modified in the returned object instead.
`mycols2`	The colour for each of the P clusters; Default: NULL, meaning determined using `rgbDistCols` so that similar clusters have similar colours. Can be modified in the returned object instead.
`gap`	The spacing between populations, relative to the spacing between individuals which is 1. Default: 3
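The input requirements above can be sketched as follows: normalise the rows of the data matrix as suggested for `dataraw`, and build an `ids` data frame with the three columns described. The data, individual names, and the column names `id`, `cluster` and `include` are hypothetical; only the column order is taken from the argument description.

```r
set.seed(1)
N <- 6; P <- 3

# Hypothetical raw N by P data with named individuals (rows) and clusters (columns)
dataraw <- matrix(runif(N * P), nrow = N,
                  dimnames = list(paste0("ind", 1:N), paste0("Pop", 1:P)))

# Normalise each row to sum to 1, as suggested for `dataraw`
datanorm <- dataraw / rowSums(dataraw)

# ids: column 1 = row names, column 2 = cluster membership, column 3 = inclusion
ids <- data.frame(id = rownames(dataraw),
                  cluster = rep(colnames(dataraw), each = 2),
                  include = 1,
                  stringsAsFactors = FALSE)

# Cluster labels must be drawn from the column names of the data
stopifnot(all(ids$cluster %in% colnames(datanorm)))
```

Dividing a matrix by `rowSums()` works here because R recycles the length-N vector down the columns, so every entry in row i is divided by the sum of row i.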

An object of class admixfs, which is a list containing the following:

mix The N by K admixture matrix, reordered by clusters.
selfmatrix An N by P matrix containing 0, except where individual i is in cluster j, in which case the entry is 1.
data.NbyP The N by P data matrix, reordered by clusters.
data.PbyP A P by P matrix describing the similarity of each cluster to each other.
K The number of ancestral or latent populations.
P The number of clusters, defining the size of the palette.
poplist A representation of the membership of each cluster. A list of clusters (reordered) each containing a character vector of the individuals in that cluster.
tdend A dendrogram relating the clusters, used to define their order.
coancestry.KbyP The palettes of the K ancestral populations.
pred.NbyPatK The predicted palette of the data.
meanpainting.KbyP The average palette.
meandiff.KbyP The prediction - meanpainting
meandiff.KbyP.over2A The amount that the mean prediction is over - The amount that the full prediction is over, with negative values set to zero
meandiff.KbyP.over The amount that the mean prediction is over - The amount that the full prediction is over
meandiff.KbyP.under2A The amount that the mean prediction is under - The amount that the full prediction is under, with negative values set to zero
meandiff.KbyP.under The amount that the mean prediction is under - The amount that the full prediction is under
same.KbyP The palettes that are the same in the data and the prediction
diff.KbyP The prediction - data
diff.KbyP.under The underprediction
diff.KbyP.over The overprediction
dist.KbyP.predfail The absolute error for each individual
dist.KbyP The sum of the absolute errors
tspace The distance of each individual in the plot to its left neighbour, allowing for spaces between populations. (Locations are given by cumsum of this)
popxcentres The centre of each population in the plot
mycols The colour for each of the K ancestral populations; Default: rainbow(K)
mycols2 The colour for each of the P clusters; Default: determined using rgbDistCols so that similar clusters have similar colours.
mycols2A mycols2 with white at the start, for when we plot the mean of all the data
xrange The range of the x axis for the plot
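To make the over/under terminology concrete, here is a minimal sketch of how a prediction error can be split into overprediction and underprediction parts, using "prediction - data" as in the description of `diff.KbyP`. The matrices are made up, and representing the underprediction as a positive magnitude is an assumption; the package may use a different sign convention or shape.

```r
# Hypothetical observed and predicted palettes for two individuals over P = 3 clusters
data <- matrix(c(0.6, 0.3, 0.1,
                 0.2, 0.3, 0.5), nrow = 2, byrow = TRUE)
pred <- matrix(c(0.5, 0.3, 0.2,
                 0.3, 0.3, 0.4), nrow = 2, byrow = TRUE)

# Signed error: prediction minus data
d <- pred - data

# Overprediction: where the prediction exceeds the data (zero elsewhere)
over <- pmax(d, 0)

# Underprediction: where the prediction falls short, stored as a positive
# magnitude (assumed convention)
under <- pmax(-d, 0)

# Absolute error per individual, and the total over all individuals
abs_err <- rowSums(abs(d))
total_err <- sum(abs_err)
```

With this convention `over - under` recovers the signed error and `over + under` gives the absolute error.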

data(arisim_remnants)

## The vanilla analysis highlights the existence of genetic drift
## specific to Pop5 that isn't captured by the mixture

adm<-compareMixtureToData(arisim_remnants$mixture,
                          arisim_remnants$data,
                          ids=arisim_remnants$ids)
admplot=mixturePlot(adm)

## The same plot but with excess similarity within populations
## removed highlights that the mixture solution overpredicts the
## amount of Pop13 admixture in Pop5, and underpredicts the amount
## of Pop6 admixture in the same population. The same is true for
## Pop13 to a lesser extent. Each other population is well fit; the
## variation seen is all within the clusters.
adm_remself<-compareMixtureToData(arisim_remnants$mixture,
                                  arisim_remnants$data,
                                  ids=arisim_remnants$ids,remself=10)
admplot_remself=mixturePlot(adm_remself)