geno_last_gen_admix_recomb: Simulate an admixed family efficiently with founders with LD

View source: R/geno_last_gen_admix_recomb.R

geno_last_gen_admix_recombR Documentation

Simulate an admixed family efficiently with founders with LD

Description

This function in essence combines pop_recomb() to simulate founders of known ancestries with LD (following a Li-Stephens-like model), draws recombination breaks of focal last-generation descendants from the specified pedigree using recomb_last_gen(), and their genomes from the founders variants using recomb_haplo_inds(). However, since a limited portion of founder sequences is actually inherited, the simulation is made much more efficient by simulating only those subsequences that were inherited, which saves time, and utilizing sparse matrices, which saves memory too. See below for a more detailed algorithm.

Usage

geno_last_gen_admix_recomb(
  anc_haps,
  bim,
  map,
  G,
  fam,
  ids,
  founders_anc,
  indexes_chr_ends = NULL,
  loci_on_cols = FALSE,
  missing_vals = c("", 0)
)

Arguments

anc_haps

A named list that maps the code used for each ancestry to its haplotype matrix. Each of the haplotype matrices the argument haps passed to pop_recomb(), namely is a regular matrix or BEDMatrix object of haplotype values, one row per locus, one column per haplotype (half individual), or transposed if loci_on_cols = TRUE and for BEDMatrix objects. Here, these values must be numeric (recommended are zeroes and ones, indicating absence or presence of reference allele).

bim

The table of variants of haps, which is a data.frame/tibble with at least two columns: chr (must be numeric between 1 and the maximum chromosome in map below for map to work) and pos (base pair position, usually an integer).

map

The genetic map, a list of chromosomes each of which is a data.frame/tibble with columns pos for base pair position and posg for genetic position.

G

Number of generations since most recent common ancestor of population (to multiply standard recombination rate)

fam

The pedigree data.frame, in plink FAM format. Only columns id, pat, and mat are required. id must be unique and non-missing. Founders must be present, and their pat and mat values must be missing (see below). Non-founders must have both their parents be non-missing. Parents must appear earlier than their children in the table.

ids

A list containing vectors of IDs for each generation. All these IDs must be present in fam$id. If IDs in fam and ids do not fully agree, the code processes the IDs in the intersection, which is helpful when fam is pruned but ids is the original (larger) set.

founders_anc

a named vector that maps every founder haplotype (the names of this vector) to its ancestry code. Ancestry codes must match the codes used in anc_haps above. Founder haplotypes are the founder individual IDs from the pedigree (values in ids[[1]]) appearing twice, suffixed with "_pat" and "_mat", respectively (so the parents of the founders are unadmixed, though founders be first generation admixed this way).

indexes_chr_ends

Optional vector mapping each chromosome (index in vector) to the last index in the bim table; missing chromosomes must be mapped to NA values. If NULL (default), this vector is calculated from the provided bim table. However, if this function is called repeatedly on the same bim data, it can improve efficiency to calculate it outside just once (use indexes_chr() for this).

loci_on_cols

If TRUE, haps has loci on columns and individuals on rows; if FALSE (default), loci are on rows and individuals on columns. If haps is a BEDMatrix object, loci_on_cols is ignored (set automatically to TRUE internally).

missing_vals

The list of ID values treated as missing. NA is always treated as missing. By default, the empty string (”) and zero (0) are also treated as missing (remove values from here if this is a problem).

Details

This function wraps around several exported package functions to achieve its objectives, which are roughly grouped into the following 4 phases. Phase 1 simulates recombination in the family without explicit sequences. In particular, it initializes the founder haplotype structure (without variants yet) using recomb_init_founders(), then simulates recombination breaks along the pedigree and identifies all the founder haplotype blocks in the focal individuals using recomb_last_gen(), and maps recombination breaks in cM to basepairs using recomb_map_inds(). Phase 2 reorganizes this data to identify the unique founder blocks that were inherited, first by making the data tidy with tidy_recomb_map_inds(), then applying recomb_founder_blocks_inherited(). Phase 3 initializes founder haplotypes using sparse matrices from the package Matrix, and draws inherited founder subsequences according to their known ancestries and using the Li-Stephens-like haplotype model of pop_recomb(). Phase 4 constructs the genotype matrices of focal individuals using the haplotypes of the founders drawn in phase 3 and the known origin of focal blocks from founders from phase 1, first constructing this data at the phased haplotype level with recomb_haplo_inds(), reencoding as unphased genotypes using recomb_geno_inds(), and constructing the corresponding local ancestry dosages using recomb_admix_inds().

Value

A named list with three elements:

  • X: the genotype matrix of the focal individuals, as returned by recomb_geno_inds().

  • Ls: a list, mapping each ancestry to its matrix of local ancestry dosages, as returned by recomb_admix_inds().

  • haplos: a phased version of the haplotypes and local ancestries of the focal individuals, structured as nested lists, as returned by recomb_haplo_inds().

See Also

recomb_init_founders(), recomb_last_gen(), recomb_map_inds(), tidy_recomb_map_inds(), recomb_founder_blocks_inherited(), pop_recomb(), recomb_haplo_inds(), recomb_geno_inds(), recomb_admix_inds().

Examples

library(tibble)

# simulate random haplotypes for example
# this toy data has 10 SNPs per chromosome, in fixed positions for simplicity
bim <- tibble( chr = rep( 1 : 22, each = 10 ), pos = rep( (1:10) * 1e6, 22 ) )
# and random haplotype data to go with this
n_ind_hap <- 10
m_loci <- nrow( bim )
# NOTE ancestry labels can be anything but must match `founders_anc` below
anc_haps <- list(
    'AFR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.5 ), nrow = m_loci, ncol = n_ind_hap ),
    'EUR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.2 ), nrow = m_loci, ncol = n_ind_hap )
)

# now simulate a very small family with one individual, 2 parents, 4 implicit grandparents
data <- fam_ancestors( 2 )
fam <- data$fam
ids <- data$ids
# select ancestries for each of the 4 grandparents / founder haplotypes (unadmixed)
founders_anc <- c('AFR', 'AFR', 'AFR', 'EUR')
# set names of founders with _pat/mat, needed to match recombination structure
# order is odd but choices were random so that doesn't matter
names( founders_anc ) <- c(
    paste0( ids[[1]], '_pat' ),
    paste0( ids[[1]], '_mat' )
)

# this performs the simulation!
data <- geno_last_gen_admix_recomb( anc_haps, bim, recomb_map_hg38, 10, fam, ids, founders_anc )
# this is the genotype matrix for the one admixed individual
data$X
# the corresponding local ancestry dosage matrices
# names match input labels
data$Ls$AFR
data$Ls$EUR
# if desired, a more complete but more complicated structure holding phased haplotypes
# and phased local ancestry information
data$haplos


OchoaLab/simfam documentation built on Jan. 11, 2025, 12:23 a.m.