popkin_A: Compute popkin's 'A' and 'M' matrices from genotypes


popkin_A    R Documentation

Compute popkin's A and M matrices from genotypes

Description

This function returns lower-level, intermediate calculations for the main popkin function. These are not intended for most users, but rather for researchers studying the estimator.

Usage

popkin_A(
  X,
  n = NA,
  loci_on_cols = FALSE,
  mean_of_ratios = FALSE,
  mem_factor = 0.7,
  mem_lim = NA,
  m_chunk_max = 1000
)

Arguments

X

Genotype matrix, BEDMatrix object, or a function X(m) that returns the genotypes of all individuals at the next block of m successive loci each time it is called, and NULL when no loci are left. If a regular matrix, X must have values only in c(0, 1, 2, NA), encoded to count the number of reference alleles at the locus, or NA for missing data.
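
The function form of X can be sketched as a closure that hands out successive blocks of loci. This is only an illustration: make_genotype_reader is a hypothetical helper, not part of popkin, and it assumes that each block is returned with loci on rows and individuals on columns (the default orientation described under loci_on_cols).

```r
# Hypothetical helper (not part of popkin): wraps an in-memory genotype
# matrix (assumed: loci on rows, individuals on columns) in a function
# that returns up to m loci per call and NULL once all loci were read.
make_genotype_reader <- function(X_full) {
  next_locus <- 1L            # index of the first locus not yet returned
  m_total <- nrow(X_full)     # total number of loci
  function(m) {
    if (next_locus > m_total) return(NULL)    # no loci left
    last <- min(next_locus + m - 1L, m_total) # last locus in this block
    block <- X_full[next_locus:last, , drop = FALSE]
    next_locus <<- last + 1L                  # advance the shared cursor
    block
  }
}

X_full <- matrix(c(0,1,2, 1,0,1, 1,0,2), nrow = 3, byrow = TRUE)
read_block <- make_genotype_reader(X_full)
first  <- read_block(2)  # loci 1 and 2
second <- read_block(2)  # locus 3 only (fewer than m loci remain)
done   <- read_block(2)  # NULL, signals the end of the data

# With popkin loaded, a fresh reader could then be passed together with n:
# obj <- popkin_A(make_genotype_reader(X_full), n = ncol(X_full))
```

Because the reader keeps its position between calls, a new reader must be created for every pass over the data.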

n

Number of individuals (required only when X is a function, ignored otherwise). If n is missing but subpops is not, n is taken to be the length of subpops.

loci_on_cols

If TRUE, X has loci on columns and individuals on rows; if FALSE (default), loci are on rows and individuals on columns. Has no effect if X is a function. If X is a BEDMatrix object, loci_on_cols is ignored (set automatically to TRUE internally).

mean_of_ratios

Choose how to weight loci. If FALSE (default), loci have equal weights (in terms of variance, rare variants contribute less than common variants; also called the "ratio-of-means" version, this has known asymptotic behavior). If TRUE, rare variant loci are upweighted (in terms of variance, contributions are approximately the same across variant frequencies; also called the "mean-of-ratios" version, its asymptotic behavior is less well understood but it performs better for association testing).
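
As a rough illustration of the two weighting schemes, the per-locus weights w_i given in the Value section can be computed by hand for a toy matrix. The sketch assumes that the reference allele frequency is estimated as half the mean non-missing genotype at each locus; the package's internal estimator may differ in details.

```r
# Toy genotypes: 3 loci (rows) by 3 individuals (columns)
X <- matrix(c(0,1,2, 1,0,1, 1,0,2), nrow = 3, byrow = TRUE)

# Assumption: estimate the reference allele frequency per locus as
# half the mean genotype, ignoring missing values
p_est <- rowMeans(X, na.rm = TRUE) / 2

# mean_of_ratios = FALSE: every locus has weight 1
w_equal <- rep(1, nrow(X))

# mean_of_ratios = TRUE: weight 1 / (p * (1 - p)), which grows as the
# frequency moves away from 0.5, so rarer variants count more
w_ratio <- 1 / (p_est * (1 - p_est))
```

Note that a locus fixed for one allele (p_est of 0 or 1) would get an infinite weight under this formula; how such loci are handled inside the package is not covered by this sketch.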

mem_factor

Proportion of available memory to use loading and processing data. Ignored if mem_lim is not NA.

mem_lim

Memory limit in GB, used to break up data into chunks for very large datasets. Note memory usage is somewhat underestimated and is not controlled strictly. Default in Linux is mem_factor times the free system memory, otherwise it is 1GB (Windows, OSX and other systems).

m_chunk_max

Sets the maximum number of loci to process at a time. The actual number of loci loaded may be lower if memory is limiting.

Value

A named list containing:

  • A: n-by-n matrix, for individuals j and k, of average w_i * ( ( x_ij - 1 ) * ( x_ik - 1 ) - 1) values across all loci i in X; if mean_of_ratios = FALSE, w_i = 1, otherwise w_i = 1 / (p_est_i * (1 - p_est_i) ) where p_est_i is the reference allele frequency.

  • M: n-by-n matrix of sample sizes (number of loci with non-missing individual j and k pairs, used to normalize A)
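
The definitions of A and M above can be reproduced in base R for the equal-weight case (mean_of_ratios = FALSE, so w_i = 1). This is a minimal sketch that follows the formulas as stated here, assuming loci on rows; it is not the package's implementation and has not been checked against its output. The names A_sketch and M_sketch are illustrative.

```r
# Toy genotypes: 3 loci (rows) by 3 individuals (columns)
X <- matrix(c(0,1,2, 1,0,1, 1,0,2), nrow = 3, byrow = TRUE)

not_missing <- !is.na(X)     # TRUE where a genotype was observed
Xc <- X - 1                  # the (x_ij - 1) terms
Xc[!not_missing] <- 0        # missing genotypes add nothing to the sums

# M[j, k]: number of loci where both individuals j and k are non-missing
M_sketch <- crossprod(not_missing + 0)

# A[j, k]: average over those loci of (x_ij - 1) * (x_ik - 1) - 1;
# the constant -1 averages to -1, so it can be subtracted at the end
A_sketch <- crossprod(Xc) / M_sketch - 1
```

Dividing by M rather than by the total number of loci is what makes the average per-pair when some genotypes are missing.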

See Also

The main popkin() function (a wrapper of this popkin_A function and popkin_A_min_subpops() to estimate the minimum A value).

Examples

# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow = 3, byrow = TRUE) # genotype matrix

# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
# library(BEDMatrix)
# X <- BEDMatrix(file) # load genotype matrix object

obj <- popkin_A(X) # calculate A and M from genotypes
A <- obj$A
M <- obj$M


popkin documentation built on Jan. 7, 2023, 1:26 a.m.