sim_trait: Simulate a complex trait from genotypes



Description

Simulate a complex trait y given a SNP genotype matrix and model parameters (the desired heritability and the true ancestral allele frequencies used to generate the genotypes, or alternatively the kinship matrix of the individuals). Users can choose the number of causal loci and minimum marginal allele frequency requirements for the causal loci. The code selects random loci to be causal, draws random Normal effect sizes for these loci (scaled appropriately) and random independent non-genetic effects. Below let there be m loci and n individuals.

Usage

sim_trait(X, m_causal, herit, p_anc, kinship, mu = 0, sigma_sq = 1,
  maf_cut = 0.05, loci_on_cols = FALSE, mem_factor = 0.7,
  mem_lim = NA)

Arguments

X

The m-by-n genotype matrix (if loci_on_cols = FALSE, transposed otherwise), or a BEDMatrix object. This is a numeric matrix consisting of reference allele counts (in c(0, 1, 2, NA) for a diploid organism).

m_causal

The number of causal loci desired.

herit

The desired heritability (proportion of trait variance due to genetics).

p_anc

The length-m vector of true ancestral allele frequencies. This is the recommended way to adjust the simulated trait to achieve the desired heritability and covariance structure. Either this or kinship must be specified.

kinship

The n-by-n kinship matrix of the individuals in the data. This offers an alternative way to adjust the simulated parameters to achieve the desired covariance structure for real genotypes, since p_anc is only known for simulated data. Either this or p_anc must be specified.

mu

The desired parametric mean value of the trait (default zero). The sample mean of the trait will not be exactly mu, but will instead have an expectation of mu (with potentially large variance depending on the kinship matrix and the heritability).

sigma_sq

The desired parametric variance factor of the trait (default 1). This factor corresponds to the variance of an outbred individual (see cov_trait).

maf_cut

The optional minimum allele frequency threshold (default 5%). This prevents rare alleles from being causal in the simulation. Note that this threshold is applied to the sample allele frequencies and not their true parametric values (p_anc), even if these are available.

loci_on_cols

If TRUE, X has loci on columns and individuals on rows; if FALSE (the default), loci are on rows and individuals on columns. If X is a BEDMatrix object, loci are taken to be on the columns (regardless of the value of loci_on_cols).

mem_factor

BEDMatrix-specific. Sets the proportion of available memory to use when loading genotypes. Ignored if mem_lim is not NA.

mem_lim

BEDMatrix-specific. Sets the total memory to use when loading genotypes, in GB. If NA (the default), a proportion mem_factor of the available memory will be used.
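
As an illustration of what maf_cut screens out, the following base R sketch computes sample allele frequencies for a made-up genotype matrix (loci on rows, as with loci_on_cols = FALSE) and keeps the loci whose minor allele frequency reaches the threshold. The data are invented, and the exact comparison used inside sim_trait (for example, >= versus >) is an assumption here.

```r
# Toy genotype matrix: 4 loci (rows) by 5 individuals (columns), one missing value
X <- matrix(
  c(0, 1, 2, 1, 0,
    0, 0, 0, 0, 0,
    2, 2, 1, 2, NA,
    1, 1, 0, 2, 1),
  nrow = 4,
  byrow = TRUE
)
maf_cut <- 0.05

# Sample reference allele frequency per locus, ignoring missing genotypes.
# With loci on columns instead, colMeans(X, na.rm = TRUE) / 2 would be the analogue.
p_hat <- rowMeans(X, na.rm = TRUE) / 2

# Minor allele frequency: fold the frequency onto [0, 0.5]
maf <- pmin(p_hat, 1 - p_hat)

# Loci eligible to be causal under the threshold (assumed inclusive)
candidates <- which(maf >= maf_cut)
candidates
```

Here the second locus is fixed for one allele in the sample, so it is excluded regardless of its true ancestral frequency.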

Details

In order to center and scale the trait and locus effect size vector correctly to the desired parameters (mean, variance factor, and heritability), the parametric ancestral allele frequencies (p_anc) must be known. This is necessary since in the context of heritability the genotypes are themselves random variables (with means given by p_anc and a covariance structure given by p_anc and the kinship matrix), so the parameters of the genotypes must be taken into account. If p_anc is indeed known (true for simulated genotypes), then the trait will have the specified mean and covariance matrix in agreement with cov_trait.
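
The centering and scaling described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes the genetic variance of an outbred individual is sum(2 * p * (1 - p) * beta^2) over the causal loci, that non-genetic effects are independent Normal draws with variance (1 - herit) * sigma_sq, and that there are no missing genotypes. All data are simulated toy values.

```r
set.seed(1)

# Toy inputs (all values are made up for illustration)
m <- 100                                         # number of loci
n <- 50                                          # number of individuals
p_anc <- runif(m, 0.1, 0.9)                      # "true" ancestral allele frequencies
X <- matrix(rbinom(m * n, 2, p_anc), nrow = m)   # loci on rows; locus i is drawn with p_anc[i]
herit <- 0.8
sigma_sq <- 1
mu <- 0
m_causal <- 10

# Pick causal loci at random and draw raw Normal effect sizes
causal_indexes <- sample(m, m_causal)
p <- p_anc[causal_indexes]
beta <- rnorm(m_causal)

# Rescale so the assumed genetic variance of an outbred individual,
# sum(2 * p * (1 - p) * beta^2), equals herit * sigma_sq
beta <- beta * sqrt(herit * sigma_sq / sum(2 * p * (1 - p) * beta^2))

# Genetic component, centered with the parametric mean genotype 2 * p
G <- drop(crossprod(X[causal_indexes, , drop = FALSE] - 2 * p, beta))

# Independent non-genetic effects carry the remaining variance
E <- rnorm(n, sd = sqrt((1 - herit) * sigma_sq))

trait <- mu + G + E
```

Centering with 2 * p rather than the sample mean genotype is what ties the trait to the parametric mean mu; the sample mean of trait will generally differ from mu.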

To simulate a trait using real genotypes, where p_anc is unknown, a compromise that works well in practice is possible if the kinship matrix (kinship) is known (see package vignette). The kinship matrix can be estimated accurately using the popkin package.
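
The package vignette documents the actual compromise. As a rough sketch of why a kinship matrix can stand in for p_anc, the base R example below uses sample allele frequencies as plug-in estimates and corrects the downward bias of p_hat * (1 - p_hat) by dividing by one minus the mean kinship. That correction factor is an assumption made for this illustration; the data and the kinship matrix (unrelated, outbred individuals) are made up.

```r
set.seed(2)

# Toy genotypes: 5 loci (rows) by 20 unrelated individuals (columns)
n <- 20
m <- 5
X <- matrix(rbinom(m * n, 2, 0.3), nrow = m)

# Hypothetical kinship matrix for unrelated, outbred individuals:
# self-kinship of 1/2 on the diagonal, zero elsewhere
kinship <- diag(0.5, n)

# Sample allele frequencies replace the unknown p_anc
p_hat <- rowMeans(X, na.rm = TRUE) / 2

# Mean kinship over all pairs, including the diagonal
mean_kinship <- mean(kinship)

# Plug-in estimate of p * (1 - p), corrected for its downward bias
# (assumed correction, see the lead-in)
pq_hat <- p_hat * (1 - p_hat) / (1 - mean_kinship)
```

For real data, kinship would come from an estimator such as the one in the popkin package rather than being written down by hand.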

Value

A list containing the simulated trait (length n), the vector of causal locus indexes causal_indexes (length m_causal), and the locus effect size vector causal_coeffs (length m_causal) at the causal loci. However, if herit = 0 then causal_indexes and causal_coeffs will have zero length regardless of m_causal.
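
To show how the returned pieces relate to the genotypes, the sketch below builds a hypothetical list shaped like the return value (element names trait, causal_indexes, and causal_coeffs; every number is made up) and computes the uncentered genetic contribution of the causal loci for each individual, assuming loci on rows.

```r
# Dummy genotype matrix: 3 loci (rows) by 3 individuals (columns)
X <- matrix(
  c(0, 1, 2,
    1, 2, 1,
    0, 0, 1),
  nrow = 3,
  byrow = TRUE
)

# Hypothetical return value, shaped as described above (values are made up)
obj <- list(
  trait = c(0.3, -1.2, 0.8),
  causal_indexes = c(1, 3),
  causal_coeffs = c(0.9, -0.4)
)

# Uncentered genetic contribution per individual:
# sum over causal loci of genotype times effect size
G_raw <- drop(crossprod(X[obj$causal_indexes, , drop = FALSE], obj$causal_coeffs))
G_raw
```

This quantity is not centered, so it differs from the genetic component of the trait by a constant that depends on the allele frequencies.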

Examples

# construct a dummy genotype matrix
X <- matrix(
           data = c(0,1,2,1,2,1,0,0,1),
           nrow = 3,
           byrow = TRUE
           )
# made up ancestral allele frequency vector for example
p_anc <- c(0.5, 0.6, 0.2)

# create simulated trait and associated data
obj <- sim_trait(X = X, m_causal = 2, herit = 0.8, p_anc = p_anc)

# trait vector
obj$trait
# randomly-picked causal locus indexes
obj$causal_indexes
# locus effect size vector
obj$causal_coeffs

OchoaLab/simtrait documentation built on Oct. 18, 2019, 5:42 a.m.