imputation_accuracy: Imputation accuracy, a.k.a. correlations


Description

Calculation of column-wise, row-wise, and matrix-wise correlations between two matrices, the "true" genotypes and the imputed genotypes.
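The three kinds of correlation can be illustrated with a minimal base R sketch. It does not call the package and ignores standardization and missing values; the toy matrices and variable names (`true_gt`, `imputed_gt`, `snp_cors`, `animal_cors`, `matcor_sketch`) are assumptions made for illustration only.

```r
# Toy data: 4 animals (rows) x 3 SNPs (columns), genotypes coded 0/1/2.
# matrix() fills column by column.
true_gt <- matrix(c(0, 1, 2, 1,
                    2, 2, 0, 1,
                    1, 0, 1, 2), nrow = 4)
imputed_gt <- true_gt
imputed_gt[1, 2] <- 1  # imputation error: true value is 2
imputed_gt[4, 1] <- 2  # imputation error: true value is 1

# Column-wise: one Pearson correlation per SNP.
snp_cors <- sapply(seq_len(ncol(true_gt)),
                   function(j) cor(true_gt[, j], imputed_gt[, j]))

# Row-wise: one Pearson correlation per animal.
animal_cors <- sapply(seq_len(nrow(true_gt)),
                      function(i) cor(true_gt[i, ], imputed_gt[i, ]))

# Matrix-wise: a single correlation over all entries.
matcor_sketch <- cor(as.vector(true_gt), as.vector(imputed_gt))
```

Per-SNP correlations are unaffected by column-wise centering and scaling, whereas the row-wise and matrix-wise values generally change once genotypes are standardized.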

Usage

imputation_accuracy(true, impute, standardized = TRUE, center = NULL,
  scale = NULL, p = NULL, tol = 0.1, ...)

## S3 method for class 'character'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, ..., ncol = NULL,
  nlines = NULL, na = 9, adaptive = TRUE, excludeIDs = NULL,
  excludeSNPs = NULL)

## S3 method for class 'matrix'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, ...,
  excludeIDs = NULL, excludeSNPs = NULL, transpose = FALSE)

## S3 method for class 'haps'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, ...,
  excludeIDs = NULL, excludeSNPs = NULL)

## S3 method for class 'vcfR'
imputation_accuracy(true, impute, standardized = TRUE,
  center = NULL, scale = NULL, p = NULL, tol = 0.1, excludeIDs = NULL,
  excludeSNPs = NULL, ...)

Arguments

true

True genotype matrix, or filename (AlphaImpute format only).

impute

Imputed genotype matrix, or filename (AlphaImpute format only).

standardized

Logical, whether to center and scale genotypes based on the true matrix. Currently done by subtracting the column mean and dividing by the column standard deviation.

center

Numeric vector of length ncol to subtract for standardization.

scale

Numeric vector of length ncol to divide by for standardization.

p

Shortcut for center and scale when using allele frequencies. center=2p and scale=sqrt(2p(1-p)).

tol

Numeric, tolerance for imputation error when counting correctly imputed genotypes.

...

Arguments passed between different methods (mostly extract.snps and extract.gt).

ncol

Integer, number of SNP columns in files. When NULL, automagically detected with get_ncols(true)-1.

nlines

Integer, number of lines in true. When NULL, automagically detected with get_nlines(true).

na

Value of missing genotypes.

adaptive

Use adaptive method (default) that stores true in memory and compares rows by ID in first column.

excludeIDs

Integer vector, exclude these individuals from correlations. Does not affect calculation of column means and standard deviations.

excludeSNPs

Integer or logical vector, exclude these columns from correlations. Does not affect calculation of column means and standard deviations.

transpose

Logical, if SNPs are per row, set to TRUE.

Details

The character class method uses files only, and the arguments true and impute refer to filenames. The method assumes that the first column in both files is an integer ID column, which is therefore excluded from calculations. Genotypes equal to na are considered missing (i.e. NA) and are not included in the calculations.

The matrix class method performs the same calculations, but on matrices stored in memory. Class methods for format-specific objects ('haps', 'oxford', or 'vcfR') extract SNP genotype matrices using extract.snps.

Correlations are only calculated for those rows that are found in both matrices / files, based on the first column (ID column).
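The ID matching described above can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: the toy matrices carry the ID in their first column, and all variable names are hypothetical.

```r
# Toy inputs: first column is the integer ID, remaining columns are SNPs.
true_mat <- cbind(id = c(101, 102, 103, 104),
                  snp1 = c(0, 1, 2, 1),
                  snp2 = c(2, 2, 0, 1))
imputed_mat <- cbind(id = c(104, 102, 105, 101),
                     snp1 = c(1, 1, 0, 0),
                     snp2 = c(1, 2, 2, 1))

# IDs present in both inputs; 103 and 105 are dropped.
common_ids <- intersect(true_mat[, "id"], imputed_mat[, "id"])

# Align rows by ID and drop the ID column before comparing genotypes.
true_rows <- true_mat[match(common_ids, true_mat[, "id"]), -1, drop = FALSE]
imputed_rows <- imputed_mat[match(common_ids, imputed_mat[, "id"]), -1, drop = FALSE]

# Number of identical genotypes among the matched rows.
n_equal <- sum(true_rows == imputed_rows)
```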

Value

List with the following elements:

matcor

Matrix-wise correlation between true and imputed matrix.

snps

Data frame with all SNP-wise statistics; has $m$ or $m - |excludeSNPs|$ rows.

animals

Data frame with all animal-wise statistics; has $n$ or $n - |excludeIDs|$ rows.

The data frames keep all rows when used on files; when used on matrices, the rows corresponding to excluded IDs or SNPs are dropped.

The data frames with statistics, snps and animals, consist of the following columns:

rowID

Row ID ($animals only!).

means

Value subtracted from each column ($snps only!).

sds

Value used to scale each column (i.e. standard deviations) ($snps only!).

cors

Pearson correlation between true and imputed genotype.

correct

Number of entries of equal value (within tol).

true.na

Number of entries that were missing in true but not in impute.

imp.na

As true.na, but vice versa.

both.na

Number of entries that were missing in both files.

correct.pct

correct divided by the total number of entries, excluding entries missing in true.
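The counting statistics can be sketched in base R for a single row (or column) of genotypes. This reflects one reading of the definitions above and is not the package's code; in particular, treating "within tol" as an absolute difference of at most tol, and using the non-missing entries of true as the denominator of the percentage, are assumptions.

```r
# One animal's genotypes; NA marks a missing genotype.
true_row    <- c(0,    1, 2, NA, 1,  NA)
imputed_row <- c(0.05, 1, 1, 2,  NA, NA)
tol <- 0.1

# Entries observed in both vectors.
both_obs <- !is.na(true_row) & !is.na(imputed_row)

# Correctly imputed: absolute difference no larger than tol (assumed).
n_correct <- sum(abs(true_row[both_obs] - imputed_row[both_obs]) <= tol)

# Missingness counts.
n_true_na <- sum(is.na(true_row) & !is.na(imputed_row))
n_imp_na  <- sum(!is.na(true_row) & is.na(imputed_row))
n_both_na <- sum(is.na(true_row) & is.na(imputed_row))

# Proportion correct among entries that are not missing in true (assumed).
pct_correct <- n_correct / sum(!is.na(true_row))
```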

Standardization

Standardization is performed by subtracting the mean and then dividing by the standard deviation; conceptually the same as in scale. Mean and standard deviation are calculated from the true matrix, before removing samples (excludeIDs) or SNPs (excludeSNPs). Alternative centers and scales may be provided with the arguments center and scale, or p.

Note: If an element of scale or p is 0 or NA, the corresponding SNP will not contribute to the correlations, but it will still count towards correct.pct. To exclude a SNP entirely, use excludeSNPs.
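Both standardization variants can be sketched in base R. The example is illustrative only: the toy matrix is hypothetical, and estimating p from the same matrix (as half the column mean) is an assumption; in practice p would typically come from a reference population.

```r
# Toy "true" matrix: 4 animals x 3 SNPs, genotypes coded 0/1/2.
true_gt <- matrix(c(0, 1, 2, 1,
                    2, 2, 0, 1,
                    1, 0, 1, 2), nrow = 4)

# Default: subtract the column mean, divide by the column standard deviation.
col_center <- colMeans(true_gt)
col_scale  <- apply(true_gt, 2, sd)
std_default <- sweep(sweep(true_gt, 2, col_center, "-"), 2, col_scale, "/")

# Allele-frequency shortcut: center = 2p, scale = sqrt(2p(1-p)).
p <- colMeans(true_gt) / 2            # assumed estimate of allele frequency
center_p <- 2 * p
scale_p  <- sqrt(2 * p * (1 - p))
std_p <- sweep(sweep(true_gt, 2, center_p, "-"), 2, scale_p, "/")

# A monomorphic SNP (p of 0 or 1) would give a scale of 0, which is the
# situation the note above refers to.
```

The default variant gives the same values as base R's `scale(true_gt)`.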

File-based method

This method stores the "true" matrix in memory with a low-precision real type, and rows in the "imputed" matrix are read and matched by ID. If there are no extra rows in either matrix and the order of IDs is the same, consider setting adaptive=FALSE, as this has a memory usage of O(m), compared to O(nm) for the adaptive method, where 'm' is the number of SNPs and 'n' the number of animals. Surprisingly, however, the non-adaptive method is slightly slower.
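To illustrate why a row-by-row comparison needs only O(m) memory, the following base R sketch reads two files in lockstep and accumulates running sums per SNP. It is not the package's implementation: the whitespace-separated layout (ID followed by genotypes), the temporary files and all variable names are assumptions, and handling of missing genotypes (na) is omitted.

```r
# Two tiny files with identical ID order: ID column, then 3 SNP columns.
true_file <- tempfile()
imp_file  <- tempfile()
writeLines(c("1 0 1 2", "2 1 1 0", "3 2 0 1", "4 1 2 2"), true_file)
writeLines(c("1 0 1 2", "2 1 2 0", "3 2 0 1", "4 1 2 1"), imp_file)

m <- 3  # number of SNP columns
n <- 0  # number of rows read so far
# Running sums per SNP; memory use does not grow with the number of rows.
sx <- sy <- sxx <- syy <- sxy <- numeric(m)

true_con <- file(true_file, "r")
imp_con  <- file(imp_file, "r")
repeat {
  true_line <- readLines(true_con, n = 1)
  imp_line  <- readLines(imp_con, n = 1)
  if (length(true_line) == 0 || length(imp_line) == 0) break
  x <- as.numeric(strsplit(trimws(true_line), "\\s+")[[1]])
  y <- as.numeric(strsplit(trimws(imp_line), "\\s+")[[1]])
  stopifnot(x[1] == y[1])  # rows must appear in the same ID order
  x <- x[-1]
  y <- y[-1]
  n <- n + 1
  sx  <- sx + x
  sy  <- sy + y
  sxx <- sxx + x^2
  syy <- syy + y^2
  sxy <- sxy + x * y
}
close(true_con)
close(imp_con)

# Pearson correlation per SNP from the accumulated sums.
stream_cors <- (n * sxy - sx * sy) /
  sqrt((n * sxx - sx^2) * (n * syy - sy^2))
```

An adaptive variant would instead hold the whole "true" matrix in memory and look up each imputed row by its ID, which is where the O(nm) memory cost comes from.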

See Also

write.snps for writing SNPs to a file.


stefanedwards/Siccuracy documentation built on May 30, 2019, 10:44 a.m.