The input SnpMatrix is first standardized by subtracting the mean (or stratum mean) from each call and dividing by the expected standard deviation under Hardy-Weinberg equilibrium. It is then post-multiplied by its transpose. This is a preliminary step in the computation of principal components.


`snps` |
The input matrix, of type |

`strata` |
A |

`correct.for.missing` |
If |

`lower.only` |
If |

`uncertain` |
If |

This computation forms the first step in the calculation of principal
components for genome-wide SNP data. As pointed out by Price et al.
(2006), when there are more SNPs than subjects it is
most efficient to calculate the eigenvectors of
`X`.`X`-transpose, where `X` is a `SnpMatrix` (rows are subjects, columns are SNPs)
whose columns have been standardized to zero mean and
unit variance. For autosomes, the genotypes are coded 0, 1 or 2 and,
after subtraction of the mean, 2`p`, are divided by the standard
deviation sqrt(2`p`(1-`p`)), where `p` is the estimated allele
frequency. For SNPs on the X chromosome in male subjects,
genotypes are coded 0 or 2. The mean is then still 2`p`, but the
standard deviation is 2sqrt(`p`(1-`p`)). If the `strata` argument
is supplied, a stratum-specific estimate of `p` is used for
standardization.
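The autosomal standardization described above can be sketched in base R on an ordinary numeric matrix. This is an illustration only, not the package's implementation: the toy matrix `g`, the use of `colMeans` to estimate `p`, and the absence of missing calls and strata are all assumptions made for the example.

```r
# Toy genotype matrix (assumed data): 4 subjects (rows) x 3 autosomal SNPs
# (columns), coded 0/1/2. matrix() fills column by column.
g <- matrix(c(0, 1, 2, 1,
              2, 2, 1, 0,
              0, 0, 1, 1), nrow = 4)

p <- colMeans(g) / 2                           # estimated allele frequency per SNP
x <- sweep(g, 2, 2 * p, "-")                   # subtract the mean, 2p
x <- sweep(x, 2, sqrt(2 * p * (1 - p)), "/")   # divide by sqrt(2p(1-p))

xx <- tcrossprod(x)                            # X.X-transpose: subjects x subjects

# Eigenvectors of the symmetric matrix give the principal components
pc <- eigen(xx, symmetric = TRUE)$vectors
```

The result `xx` has one row and one column per subject, so its size does not grow with the number of SNPs.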

Missing observations present some difficulty. Price et al. (2006)
recommended replacing missing observations by their means, this being
equivalent to replacement by zeros in the standardized matrix. However
this results in a biased estimate of the complete data
result. Optionally this bias can be corrected by inverse probability
weighting. We assume that the probability that any one call is missing
is small, and can be predicted by a multiplicative model with row
(subject) and column (locus) effects. The estimated probability of a
missing value in a given row and column is then given by
*m = RC/T*, where `R` is the row total number of
no-calls, `C` is the column total of no-calls, and `T` is the
overall total number of no-calls. Non-missing contributions to
`X`.`X`-transpose are then weighted by *w=1/(1-m)* for
contributions to the diagonal elements, and products of the relevant
pairs of weights for contributions to off-diagonal elements.
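A minimal base R sketch of this weighting scheme follows. It is one reading of the description above, not the package's code: the toy matrix `g`, the variable names, and the estimation of `p` from the non-missing calls are assumptions made for illustration.

```r
# Toy genotype matrix (assumed data) with two no-calls (NA)
g <- matrix(c(0, NA, 2, 1,
              2, 2, NA, 0,
              0, 0, 1, 1), nrow = 4)

miss  <- is.na(g)
n_row <- rowSums(miss)            # R: no-calls per subject
n_col <- colSums(miss)            # C: no-calls per locus
n_tot <- sum(miss)                # T: overall number of no-calls

m <- outer(n_row, n_col) / n_tot  # estimated missingness probability, RC/T
w <- 1 / (1 - m)                  # inverse probability weights

# Standardize using the non-missing calls; missing entries score zero
p <- colMeans(g, na.rm = TRUE) / 2
x <- sweep(g, 2, 2 * p, "-")
x <- sweep(x, 2, sqrt(2 * p * (1 - p)), "/")
x[miss] <- 0

# Off-diagonal elements use products of pairs of weights;
# diagonal elements use a single weight per contribution
xx <- tcrossprod(w * x)
diag(xx) <- rowSums(w * x^2)
```

Under the multiplicative model the estimated probabilities sum to the observed number of no-calls, which gives a quick sanity check on `m`.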

A square matrix containing either the complete `X`.`X`-transpose matrix or just its lower triangle.

The correction for missing observations can result in an output matrix that is not positive semi-definite. This should not matter in the application for which it is intended.

In genome-wide studies, the SNP data will usually be held as a series of
objects (of class `"SnpMatrix"` or `"XSnpMatrix"`), one per chromosome.
Note that the `X`.`X`-transpose matrices produced by applying the `xxt`
function to each object in turn can be added to yield the genome-wide result.
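The additivity across chromosomes follows from the fact that the cross-product of a column-partitioned matrix is the sum of the cross-products of its blocks. The base R sketch below checks this on random numbers standing in for already standardized data; the matrix `x` and the split into two blocks are assumptions for illustration, and `tcrossprod` is used in place of `xxt`.

```r
set.seed(1)
# Stand-in for standardized genotypes: 5 subjects x 8 SNPs (assumed data)
x <- matrix(rnorm(5 * 8), nrow = 5)

# Pretend the first 3 SNPs are on one chromosome and the rest on another
chromosomes <- list(x[, 1:3], x[, 4:8])

# Compute X.X-transpose per chromosome, then add the results
total <- Reduce(`+`, lapply(chromosomes, tcrossprod))

# Same answer as treating all SNPs at once
same <- isTRUE(all.equal(total, tcrossprod(x)))
```

The same `Reduce` pattern would apply to a list of per-chromosome matrices, provided the subjects appear in the same row order in every object.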

If the matrix is converted to a correlation matrix by pre- and post-multiplying by the diagonal matrix holding the inverse square roots of its diagonal elements, then the result is an unbiased estimate of twice the kinship matrix.
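The conversion to a correlation matrix can be sketched in base R as follows. The matrix `xx` here is built from random numbers as a stand-in for real output, which is an assumption for illustration; `cov2cor` from the `stats` package performs the same scaling.

```r
set.seed(42)
# Stand-in for an X.X-transpose matrix on 4 subjects (assumed data)
x  <- matrix(rnorm(4 * 10), nrow = 4)
xx <- tcrossprod(x)

# Pre- and post-multiply by the inverse square root of the diagonal
d <- diag(1 / sqrt(diag(xx)))
k <- d %*% xx %*% d

# cov2cor() applies the same scaling in one call
k2 <- cov2cor(xx)
```

After scaling, every diagonal element equals 1 and off-diagonal elements lie between -1 and 1.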

David Clayton dc208@cam.ac.uk

Price et al. (2006) Principal components analysis corrects for
stratification in genome-wide association studies. *Nature Genetics*,
**38**:904-9
