imputation_accuracy: Imputation accuracy, a.k.a. correlations
In stefanedwards/Siccuracy: Pipeline Package for AlphaImpute


Calculation of column-wise, row-wise, and matrix-wise correlations between two matrices, the "true" genotypes and the imputed genotypes.

imputation_accuracy(true, impute, standardized = TRUE, center = NULL,
  scale = NULL, p = NULL, tol = 0.1, ...)

## S3 method for class 'character'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, ..., ncol = NULL,
  nlines = NULL, na = 9, adaptive = TRUE, excludeIDs = NULL,
  excludeSNPs = NULL)

## S3 method for class 'matrix'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, ...,
  excludeIDs = NULL, excludeSNPs = NULL, transpose = FALSE)

## S3 method for class 'haps'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, ...,
  excludeIDs = NULL, excludeSNPs = NULL)

## S3 method for class 'vcfR'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, excludeIDs = NULL,
  excludeSNPs = NULL, ...)

`true`	True genotype matrix, or filename (AlphaImpute format only).
`impute`	Imputed genotype matrix, or filename (AlphaImpute format only).
`standardized`	Logical, whether to center and scale genotypes using statistics from the `true` matrix. Currently done by subtracting the column mean and dividing by the column standard deviation.
`center`	Numeric vector of `ncol`-length to subtract for standardization.
`scale`	Numeric vector of `ncol`-length to divide by for standardization.
`p`	Shortcut for `center` and `scale` when using allele frequencies; sets `center=2p` and `scale=sqrt(2p(1-p))`.
`tol`	Numeric, tolerance for imputation error when counting correctly imputed genotypes.
`...`	Arguments passed between different methods (mostly `extract.snps` and `extract.gt`).
`ncol`	Integer, number of SNP columns in files. When `NULL`, automagically detected with `get_ncols(true)-1`.
`nlines`	Integer, number of lines in `true`. When `NULL`, automagically detected with `get_nlines(true)`.
`na`	Value of missing genotypes.
`adaptive`	Logical, use the adaptive method (default), which stores `true` in memory and matches rows by the ID in the first column.
`excludeIDs`	Integer vector, exclude these individuals from correlations. Does not affect calculation of column means and standard deviations.
`excludeSNPs`	Integer or logical vector, exclude these columns from correlations. Does not affect calculation of column means and standard deviations.
`transpose`	Logical, if SNPs are per row, set to `TRUE`.

The character class method works on files only; the arguments true and impute are filenames. The method assumes that the first column in both files is an integer ID column, which is excluded from the calculations. Genotypes equal to na are treated as missing (i.e. NA) and are not included in the calculations.

The matrix class method performs the same calculations, but on matrices stored in memory. Class methods for format-specific objects ('haps', 'oxford', or 'vcfR') extract SNP genotype matrices using extract.snps.

Correlations are calculated only for those rows that are found in both matrices / files, based on the first column (ID column).
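The ID matching and the three kinds of correlation can be illustrated with a minimal base R sketch. It does not call the package; the toy matrices and all variable names (`tm`, `im`, `shared`, and so on) are made up for illustration, and the genotypes are left unstandardized.

```r
# Toy data: first column is an integer ID, remaining columns are SNP genotypes.
true <- cbind(id = c(1, 2, 3, 4),
              matrix(c(0, 1, 2, 1,
                       2, 2, 0, 1,
                       1, 0, 1, 2), nrow = 4))
# Imputed data in a different row order; ID 5 is absent from `true`,
# and ID 4 is absent from `impute`.
impute <- rbind(c(3, 2, 0, 2),
                c(1, 0, 1, 1),
                c(2, 1, 2, 0),
                c(5, 1, 1, 1))

# Keep only the IDs found in both matrices, then align the rows by ID.
shared <- intersect(true[, 1], impute[, 1])
tm <- true[match(shared, true[, 1]), -1]
im <- impute[match(shared, impute[, 1]), -1]

# Matrix-wise: one correlation over all entries.
matcor <- cor(as.vector(tm), as.vector(im))
# Column-wise: one correlation per SNP.
snp_cors <- sapply(seq_len(ncol(tm)), function(j) cor(tm[, j], im[, j]))
# Row-wise: one correlation per animal.
animal_cors <- sapply(seq_len(nrow(tm)), function(k) cor(tm[k, ], im[k, ]))
```

Column-wise Pearson correlations are unaffected by per-column centering and scaling, whereas the row-wise and matrix-wise correlations generally change when the columns are standardized.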

List with the following elements:

matcor: Matrix-wise correlation between true and imputed matrix.
snps: Data frame with all snp-wise statistics; has $m$ or $m - |excludeSNPs|$ rows.
animals: Data frame with all animal-wise statistics; has $n$ or $n - |excludeIDs|$ rows.

The data frames keep all rows when used on files; when used on matrices, the rows corresponding to excluded IDs or SNPs are dropped.

The data frames with statistics, snps and animals, consist of the following columns:

rowID: Row ID ($animals only!).
means: Value subtracted from each column ($snps only!).
sds: Value used to scale each column (i.e. standard deviations) ($snps only!).
cors: Pearson correlation between true and imputed genotype.
correct: Number of entries of equal value (within tol)
true.na: Number of entries that were missing in true but not in impute.
imp.na: As true.na, but vice versa.
both.na: Number of entries that were missing in both files.
correct.pct: correct divided by the total number of entries, excluding entries missing in true.
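As a rough illustration of the count-based columns, the sketch below computes them for a single SNP in base R. It is an interpretation of the descriptions above, not the package's code: whether "within tol" is inclusive, and whether correct.pct is reported as a fraction or multiplied by 100, are assumptions here, and the toy vectors are invented.

```r
na  <- 9    # code for missing genotypes, as in the `na` argument
tol <- 0.1  # tolerance, as in the `tol` argument

# One SNP across six animals (toy values).
true_snp   <- c(0, 1,    2, 9, 9, 1)
impute_snp <- c(0, 1.05, 1, 2, 9, 9)

t_na <- true_snp == na
i_na <- impute_snp == na

true.na <- sum(t_na & !i_na)  # missing in true only
imp.na  <- sum(!t_na & i_na)  # missing in impute only
both.na <- sum(t_na & i_na)   # missing in both

# Entries observed in both vectors that agree within the tolerance
# (inclusive comparison is an assumption).
observed <- !t_na & !i_na
correct  <- sum(abs(true_snp[observed] - impute_snp[observed]) <= tol)

# Denominator: all entries except those missing in true (fraction, by assumption).
correct.pct <- correct / sum(!t_na)
```

With these toy values, two of the four entries observed in true are counted as correct, so the sketch gives a correct.pct of 0.5.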

Standardization is performed by subtracting the mean and then dividing by the standard deviation; conceptually the same as in scale. The mean and standard deviation are calculated from the true matrix, before removing samples (excludeIDs) or SNPs (excludeSNPs). Alternative centers and scales may be provided through the arguments center and scale, or p.
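The two ways of standardizing can be sketched in base R as follows. This is an illustration only, not the package's internals; the toy matrix is invented, and estimating `p` from the same matrix as half the column mean is an assumption made to keep the example self-contained.

```r
# Toy genotype matrix: 4 animals (rows) by 3 SNPs (columns), coded 0/1/2.
true <- matrix(c(0, 1, 2, 1,
                 2, 2, 0, 1,
                 1, 0, 1, 2), nrow = 4)

# Default: center by column means and scale by column standard deviations.
center <- colMeans(true)
scale_ <- apply(true, 2, sd)
z <- sweep(sweep(true, 2, center, "-"), 2, scale_, "/")

# Conceptually the same as base R's scale().
same_as_scale <- isTRUE(all.equal(z, scale(true), check.attributes = FALSE))

# Allele-frequency shortcut: center = 2p, scale = sqrt(2p(1-p)).
p        <- colMeans(true) / 2  # assumed estimate of the allele frequency
center_p <- 2 * p
scale_p  <- sqrt(2 * p * (1 - p))
z_p <- sweep(sweep(true, 2, center_p, "-"), 2, scale_p, "/")
```

The two variants share the same centers in this example but differ in scale, since `sqrt(2p(1-p))` is the standard deviation expected under Hardy-Weinberg proportions rather than the sample standard deviation.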

Note: If scale or p is 0 or NA for a SNP, that SNP will not contribute to the correlations, but it will still count towards correct.pct. To exclude it entirely, use excludeSNPs.

By default, the file-based method stores the "true" matrix in memory with a low-precision real type, and rows in the "imputed" matrix are read and matched by ID. If there are no extra rows in either matrix and the order of IDs is the same, consider setting adaptive=FALSE, as this has a memory usage of O(m), compared to O(nm) for the adaptive method, where 'm' is the number of SNPs and 'n' the number of animals. The non-adaptive method is, however, and very surprisingly, slightly slower.
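To see why a row-by-row comparison needs only O(m) memory, consider the following base R sketch. It is not the package's implementation: it reads two in-memory text connections (stand-ins for the two files) one line at a time, assumes identical ID order and no missing values, and keeps only five running-sum vectors of length m, from which the SNP-wise Pearson correlations are computed at the end. All names are hypothetical.

```r
# Stand-ins for the two files: ID in the first column, then m = 3 SNPs.
true_lines   <- c("1 0 2 1", "2 1 2 0", "3 2 0 1", "4 1 1 2")
impute_lines <- c("1 0 1 1", "2 1 2 0", "3 2 0 2", "4 1 1 2")
true_con   <- textConnection(true_lines)
impute_con <- textConnection(impute_lines)

m <- 3
n <- 0
# Running sums per SNP; memory use does not grow with the number of rows.
sx <- sy <- sxx <- syy <- sxy <- numeric(m)

repeat {
  t_line <- readLines(true_con, n = 1)
  i_line <- readLines(impute_con, n = 1)
  if (length(t_line) == 0 || length(i_line) == 0) break
  t_row <- scan(text = t_line, quiet = TRUE)
  i_row <- scan(text = i_line, quiet = TRUE)
  stopifnot(t_row[1] == i_row[1])  # rows must be in the same ID order
  x <- t_row[-1]
  y <- i_row[-1]
  n   <- n + 1
  sx  <- sx + x
  sy  <- sy + y
  sxx <- sxx + x^2
  syy <- syy + y^2
  sxy <- sxy + x * y
}
close(true_con)
close(impute_con)

# Pearson correlation per SNP from the accumulated sums.
snp_cors <- (n * sxy - sx * sy) /
  sqrt((n * sxx - sx^2) * (n * syy - sy^2))
```

The adaptive approach cannot work this way, because matching rows by ID in arbitrary order requires one of the two matrices to be held in memory in full.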

write.snps for writing SNPs to a file.

stefanedwards/Siccuracy documentation built on May 30, 2019, 10:44 a.m.