compareMixtureToData: Compare a mixture solution to some data.

View source: R/compareMixtureToData.R

Description

This function takes a mixture solution, a data matrix, and a mapping of data observations into clusters. It then predicts what it expects the mixture solutions to look like.

It is essential to understand that there are two key structures being explored simultaneously in these plots. The first is the P-dimensional clustering of the data, which defines the similarities; each of the N data points has a similarity to these P clusters. A representation in this space is called a palette. The second is the K-dimensional set of admixture weights. Each of the K "ancestral" or latent variables also has a P dimensional palette. Further, each of the N data points has a K dimensional "admixture" breakdown, seen as a mixture of the K ancestral palettes.
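The relationship between the two structures can be sketched in base R. This is only an illustration of the dimensions involved, not the package's internal code; the object names (`palettes`, `mix`, `predicted`) and the toy values are assumptions made for the example.

```r
N <- 6; P <- 4; K <- 2

# K x P matrix: one P-dimensional palette per ancestral (latent) population.
# Each row sums to 1. (Toy values, chosen for illustration only.)
palettes <- matrix(c(0.7, 0.1, 0.1, 0.1,
                     0.1, 0.1, 0.2, 0.6),
                   nrow = K, byrow = TRUE)

# N x K matrix of admixture weights: one row per individual, rows sum to 1.
w <- c(1, 0.8, 0.5, 0.5, 0.2, 0)
mix <- matrix(c(w, 1 - w), ncol = K)

# N x P matrix: each individual's palette as a mixture of the K
# ancestral palettes, weighted by its admixture proportions.
predicted <- mix %*% palettes

dim(predicted)
rowSums(predicted)
```

Because both the admixture weights and the ancestral palettes have rows summing to 1, each predicted palette also sums to 1, which is why the observed data need to be normalised in the same way before the two are compared.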

We must be able to match up each individual to the palette which it represents, so that we can order the individuals according to the palette. As such, this implementation reorders the individuals, and may reorder the clusters themselves, in order to provide a clean representation of the data. If you don't want this, you may try using compareMixtureToDataDirect.

Usage

compareMixtureToData(mix, dataraw, fam = NULL, ids = NULL, remself = 0,
  ancestral = "solve", relabel = I, tdend = NULL, popdend = NULL,
  poplist = NULL, poporder = NULL, mycols = NULL, mycols2 = NULL,
  gap = 3)

Arguments

mix

A proposed mixture solution of dimension N by K; for example, as generated by STRUCTURE or ADMIXTURE.

dataraw

A matrix containing the data of dimension N by P, for example, as generated by ChromoPainter. The data must reflect similarity to P clusterings of the data. Note that we assume that this matrix is correctly normalised (all the rows sum to the same value); if this is not true then the results may be strange. If in doubt, provide dataraw/rowSums(dataraw) to normalise the rows to sum 1.
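As a small sketch of the normalisation suggested above, using a made-up matrix (the individual and cluster names are hypothetical):

```r
# Toy N x P matrix whose rows do not sum to the same value.
dataraw <- matrix(c(2, 1, 1,
                    4, 4, 2,
                    1, 0, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("ind1", "ind2", "ind3"),
                                  c("PopA", "PopB", "PopC")))

# Check whether the rows are already normalised.
range(rowSums(dataraw))

# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so every entry in row i is divided by the sum of row i.
datanorm <- dataraw / rowSums(dataraw)
rowSums(datanorm)
```

The normalised matrix keeps its row and column names, so it can be passed in place of the raw matrix.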

fam

An N by 2 or more data frame consisting of the fam file that generated the data. The columns used are: column 1, the cluster membership that created the groups in dataraw (with the column names in dataraw as the values); and column 2, the row names (for both the data and the mix, and in the same order). One of fam or ids must be present.

ids

An N by 3 data frame consisting of: column 1, the row names (for both the data and the mix); column 2, the cluster membership that created the groups in dataraw (with the column names in dataraw as the values); and column 3, inclusion (0 for absent, 1 for present; NB only all present will currently work!). One of fam or ids must be present.
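A minimal sketch of building such tables by hand is given below. The column names (`id`, `pop`, `included`) and the toy data are assumptions for illustration; the descriptions above refer to columns by position only.

```r
# Toy N x P data matrix with named individuals (rows) and clusters (columns).
dataraw <- matrix(0.5, nrow = 4, ncol = 2,
                  dimnames = list(paste0("ind", 1:4), c("PopA", "PopB")))

# ids layout: row names, cluster membership, inclusion flag.
ids <- data.frame(id       = rownames(dataraw),
                  pop      = c("PopA", "PopA", "PopB", "PopB"),
                  included = 1,
                  stringsAsFactors = FALSE)

# fam layout: cluster membership first, then row names.
fam <- ids[, c("pop", "id")]

# Sanity checks implied by the argument descriptions.
stopifnot(identical(ids$id, rownames(dataraw)),  # same individuals, same order
          all(ids$pop %in% colnames(dataraw)),   # memberships name columns of dataraw
          all(ids$included == 1))                # only "all present" currently works
```

Only one of the two tables needs to be supplied.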

remself

Number of iterations for which "self-copying" (cluster-specific sharing of drift) is removed. Set to 10 to essentially remove all self-copying. Default: 0

ancestral

Ancestral model, i.e. how the latent clusters are defined. There are two options. "mixture": the populations are defined as a mixture of the individuals that comprise them according to the admixture model. This is not advisable because it allows ancestry from different true latent populations to affect inference of others. "solve": find the definition of the clusters that best explains the data by root-mean-square distance. Default: "solve"
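The difference between the two options can be illustrated with a base R sketch. This is one reading of the descriptions above, not the package's actual implementation: here "solve" is taken to be an unconstrained least-squares fit and "mixture" an admixture-weighted average of the individuals' palettes. All object names and values are hypothetical.

```r
# True K x P ancestral palettes (toy values) and N x K admixture weights.
palettes <- matrix(c(0.7, 0.1, 0.1, 0.1,
                     0.1, 0.1, 0.2, 0.6),
                   nrow = 2, byrow = TRUE)
w <- c(1, 0.8, 0.5, 0.5, 0.2, 0)
mix <- matrix(c(w, 1 - w), ncol = 2)

# Noise-free N x P data generated under the admixture model.
dataraw <- mix %*% palettes

# "solve"-style estimate: the K x P matrix A minimising the squared distance
# between dataraw and mix %*% A (normal equations of least squares).
A_solve <- solve(crossprod(mix), crossprod(mix, dataraw))

# "mixture"-style estimate: each ancestral palette is the average of the
# individuals' palettes, weighted by their admixture proportions.
A_mixture <- crossprod(mix, dataraw) / colSums(mix)

max(abs(A_solve - palettes))    # numerically zero for noise-free data
max(abs(A_mixture - palettes))  # clearly non-zero: palettes are blended
```

In this toy case the weighted average pulls each ancestral palette towards the other, because admixed individuals contribute to both. This is the contamination between latent populations that the description warns about.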

relabel

Mapping from dendrogram labels to individual names. Should not be needed. Default: The identity function

tdend

Dendrogram of the individuals to determine plot order. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create this from the data

popdend

Dendrogram of the populations to determine plot order. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create this from the data

poplist

Specify the full ordering of individuals and populations. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create from the data

poporder

Specify a population ordering. Default: NULL, meaning determine from the dendrogram

mycols

The colour for each of the K ancestral populations. Default: NULL, meaning use rainbow(K). Can be modified in the returned object instead.

mycols2

The colour for each of the P clusters. Default: NULL, meaning determined using rgbDistCols so that similar clusters have similar colours. Can be modified in the returned object instead.

gap

The spacing between populations, relative to the spacing between individuals which is 1. Default: 3

Value

An object of class admixfs, which is a list containing the following:

Examples

data(arisim_remnants)

## The vanilla analysis highlights the existence of genetic drift
## specific to Pop5 that isn't captured by the mixture

adm<-compareMixtureToData(arisim_remnants$mixture,
                          arisim_remnants$data,
                          ids=arisim_remnants$ids)
admplot=mixturePlot(adm)

## The same plot but with excess similarity within populations
## removed highlights that the mixture solution overpredicts the
## amount of Pop13 admixture in Pop5, and underpredicts the amount
## of Pop6 admixture in the same population. The same is true for
## Pop13 to a lesser extent. Each other population is well fit; the
## variation seen is all within the clusters.
adm_remself<-compareMixtureToData(arisim_remnants$mixture,
                                  arisim_remnants$data,
                                  ids=arisim_remnants$ids,remself=10)
admplot_remself=mixturePlot(adm_remself)

danjlawson/badMIXTURE documentation built on Sept. 27, 2019, 9:11 p.m.