fst_hudson_k: The Generalized "Hudson" FST estimator
In OchoaLab/popkinsuppl: Supplement to popkin package

View source: R/fst_hudson_k.R

fst_hudson_k

R Documentation

The Generalized "Hudson" FST estimator

Description

This function implements the "Hudson" FST estimator (Bhatia, Patterson, Sankararaman, and Price 2013) generalized to K subpopulations (Ochoa and Storey 2016). Handles very large data, passed as BEDMatrix or as a regular R matrix. Handles missing values correctly.

Usage

fst_hudson_k(
  X,
  labs,
  m = NA,
  ind_keep = NULL,
  loci_on_cols = FALSE,
  mem_factor = 0.7,
  mem_lim = NA,
  m_chunk_max = 1000
)

Arguments

`X`	The genotype matrix (BEDMatrix, regular R matrix, or function, same as `popkin`).
`labs`	A vector of subpopulation assignments for every individual. At least two subpoplations must be present.
`m`	The number of loci, required if `X` is a function (ignored otherwise). In particular, `m` is obtained from `X` when it is a BEDMatrix or a regular R matrix.
`ind_keep`	An optional vector of individuals to keep (as booleans or indexes, used to subset an R matrix).
`loci_on_cols`	Determines the orientation of the genotype matrix (by default, `FALSE`, loci are along the rows). If `X` is a BEDMatrix object, the input value is ignored (set automatically to `TRUE` internally).
`mem_factor`	Proportion of available memory to use loading and processing genotypes. Ignored if `mem_lim` is not `NA`.
`mem_lim`	Memory limit in GB, used to break up genotype data into chunks for very large datasets. Note memory usage is somewhat underestimated and is not controlled strictly. Default in Linux and Windows is `mem_factor` times the free system memory, otherwise it is 1GB (OSX and other systems).
`m_chunk_max`	Sets the maximum number of loci to process at the time. Actual number of loci loaded may be lower if memory is limiting.

Value

A list with the following named elements, in this order:

fst: The genome-wide Fst estimate (scalar).
fst_loci: A vecctor of per-locus Fst estimates.
data: a 2-by-m matrix of statistics used in estimating Fst. Useful to obtain a bootstrap distribution for the genome-wide Fst.

Examples

# dimensions of simulated data
n_ind <- 100
m_loci <- 1000
k_subpops <- 10
n_data <- n_ind * m_loci

# missingness rate
miss <- 0.1

# simulate ancestral allele frequencies
# uniform (0,1)
# it'll be ok if some of these are zero
p_anc <- runif(m_loci)

# simulate some binomial data
X <- rbinom(n_data, 2, p_anc)

# sprinkle random missingness
X[ sample(X, n_data * miss) ] <- NA

# turn into a matrix
X <- matrix(X, nrow = m_loci, ncol = n_ind)

# create subpopulation labels
# k_subpops groups of equal size
labs <- ceiling( (1 : n_ind) / k_subpops )

# estimate FST using the "Hudson" formula
fst_hudson_k_obj <- fst_hudson_k(X, labs)

# the genome-wide FST estimate
fst_hudson_k_obj$fst

# vector of per-locus FST estimates
fst_hudson_k_obj$fst_loci

OchoaLab/popkinsuppl documentation built on May 17, 2022, 9:50 a.m.