fst_hudson_pairwise: Matrix of between-subpopulation Hudson FST estimates

fst_hudson_pairwise    R Documentation

Matrix of between-subpopulation Hudson FST estimates

Description

This function applies the fst_hudson_k formula to every pair of subpopulations in the dataset. Since the formula is applied to only two subpopulations at a time, the values equal the original (non-generalized) "Hudson" FST estimator of Bhatia, Patterson, Sankararaman, and Price (2013).
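For orientation, the two-subpopulation estimator can be sketched in base R as below. This is a minimal illustration of the ratio-of-averages Hudson formula as given by Bhatia et al. (2013), not the package's own code: hudson_fst_pair is a hypothetical helper, loci are assumed to be along the rows, and sample sizes are assumed to count non-missing alleles (two per diploid genotype), which may differ from the convention used inside fst_hudson_k.

```r
# Sketch of the two-population Hudson FST estimator (ratio of averages).
# hudson_fst_pair is a hypothetical helper, not part of popkinsuppl.
hudson_fst_pair <- function(X1, X2) {
  # X1, X2: genotype matrices (loci on rows, values 0/1/2/NA),
  # one matrix per subpopulation

  # number of observed alleles per locus (assumption: 2 per non-missing genotype)
  n1 <- 2 * rowSums(!is.na(X1))
  n2 <- 2 * rowSums(!is.na(X2))

  # sample allele frequencies per locus
  p1 <- rowSums(X1, na.rm = TRUE) / n1
  p2 <- rowSums(X2, na.rm = TRUE) / n2

  # per-locus numerator (with finite-sample correction) and denominator
  num <- (p1 - p2)^2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
  den <- p1 * (1 - p2) + p2 * (1 - p1)

  # ratio of sums across loci; loci with undefined terms are dropped
  ok <- is.finite(num) & is.finite(den)
  sum(num[ok]) / sum(den[ok])
}

# toy data: two samples drawn from the same allele frequencies
set.seed(1)
m_toy <- 500
p_toy <- runif(m_toy, 0.1, 0.9)
X1 <- matrix(rbinom(m_toy * 20, 2, p_toy), nrow = m_toy)
X2 <- matrix(rbinom(m_toy * 20, 2, p_toy), nrow = m_toy)
fst_toy <- hudson_fst_pair(X1, X2)
```

Because both toy samples share the same allele frequencies, the estimate should be close to zero, and it can be slightly negative because of the finite-sample correction terms.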

Usage

fst_hudson_pairwise(
  X,
  labs,
  pops = NULL,
  m = NA,
  loci_on_cols = FALSE,
  mem_factor = 0.7,
  mem_lim = NA,
  m_chunk_max = 1000
)

Arguments

X

The genotype matrix (BEDMatrix, regular R matrix, or function, same as popkin).

labs

A vector of subpopulation assignments for every individual.

pops

An optional vector of unique subpopulation labels, in the desired order for the rows and columns of the output matrix. By default, the unique labels in labs, sorted alphabetically, are used.

m

The number of loci, required if X is a function (ignored otherwise). In particular, m is obtained from X when it is a BEDMatrix or a regular R matrix.

loci_on_cols

Determines the orientation of the genotype matrix (by default, FALSE, loci are along the rows). If X is a BEDMatrix object, the input value is ignored (set automatically to TRUE internally).

mem_factor

Proportion of available memory to use for loading and processing genotypes. Ignored if mem_lim is not NA.

mem_lim

Memory limit in GB, used to break up genotype data into chunks for very large datasets. Note that memory usage is somewhat underestimated and is not strictly controlled. The default on Linux and Windows is mem_factor times the free system memory; on OSX and other systems it is 1GB.

m_chunk_max

Sets the maximum number of loci to process at a time. The actual number of loci loaded may be lower if memory is limiting.
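As a small illustration of the labs and pops arguments, the sketch below contrasts the alphabetical default with an explicit ordering. The labels are made up, and treating the default as equivalent to sort(unique(labs)) is an assumption based on the description of pops, not a statement about the package internals.

```r
# hypothetical subpopulation labels, one per individual
labs_toy <- c("YRI", "CEU", "CHB", "CEU", "YRI", "CHB")

# assumed equivalent of the default ordering (alphabetical)
pops_default <- sort(unique(labs_toy))

# an explicit order that could be passed as pops instead
pops_custom <- c("YRI", "CEU", "CHB")

# sanity check one might run: the custom order covers exactly the labels present
stopifnot(setequal(pops_custom, unique(labs_toy)))

# number of individuals per subpopulation, in the custom order
counts <- table(factor(labs_toy, levels = pops_custom))
```

Passing pops = pops_custom would then be expected to order the rows and columns of the returned matrix as YRI, CEU, CHB rather than alphabetically.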

Value

A symmetric matrix of FST estimates between every pair of subpopulations. The diagonal has zero values.

See Also

The popkin package.

Examples

# dimensions of simulated data
n_ind <- 100
m_loci <- 1000
k_subpops <- 10
n_data <- n_ind * m_loci

# missingness rate
miss <- 0.1

# simulate ancestral allele frequencies
# uniform (0,1)
# it'll be ok if some of these are zero
p_anc <- runif(m_loci)

# simulate some binomial data
X <- rbinom(n_data, 2, p_anc)

# sprinkle random missingness
# (sample positions in X, not genotype values)
X[ sample(n_data, n_data * miss) ] <- NA

# turn into a matrix
X <- matrix(X, nrow = m_loci, ncol = n_ind)

# create subpopulation labels
# k_subpops groups of equal size (n_ind / k_subpops individuals each)
labs <- ceiling( (1 : n_ind) * k_subpops / n_ind )

# estimated pairwise FST matrix using the "Hudson" formula
fst_hudson_matrix <- fst_hudson_pairwise(X, labs)


OchoaLab/popkinsuppl documentation built on May 17, 2022, 9:50 a.m.