geno_last_gen_admix_recomb: Simulate an admixed family efficiently with founders with LD
In OchoaLab/simfam: Simulate and Model Family Pedigrees with Structured Founders

View source: R/geno_last_gen_admix_recomb.R

geno_last_gen_admix_recomb

R Documentation

Simulate an admixed family efficiently with founders with LD

Description

This function in essence combines pop_recomb() to simulate founders of known ancestries with LD (following a Li-Stephens-like model), draws recombination breaks of focal last-generation descendants from the specified pedigree using recomb_last_gen(), and their genomes from the founders variants using recomb_haplo_inds(). However, since a limited portion of founder sequences is actually inherited, the simulation is made much more efficient by simulating only those subsequences that were inherited, which saves time, and utilizing sparse matrices, which saves memory too. See below for a more detailed algorithm.

Usage

geno_last_gen_admix_recomb(
  anc_haps,
  bim,
  map,
  G,
  fam,
  ids,
  founders_anc,
  indexes_chr_ends = NULL,
  loci_on_cols = FALSE,
  missing_vals = c("", 0)
)

Arguments

`anc_haps`	A named list that maps the code used for each ancestry to its haplotype matrix. Each of the haplotype matrices the argument `haps` passed to `pop_recomb()`, namely is a regular matrix or `BEDMatrix` object of haplotype values, one row per locus, one column per haplotype (half individual), or transposed if `loci_on_cols = TRUE` and for `BEDMatrix` objects. Here, these values must be numeric (recommended are zeroes and ones, indicating absence or presence of reference allele).
`bim`	The table of variants of `haps`, which is a data.frame/tibble with at least two columns: `chr` (must be numeric between 1 and the maximum chromosome in `map` below for map to work) and `pos` (base pair position, usually an integer).
`map`	The genetic map, a list of chromosomes each of which is a data.frame/tibble with columns `pos` for base pair position and `posg` for genetic position.
`G`	Number of generations since most recent common ancestor of population (to multiply standard recombination rate)
`fam`	The pedigree data.frame, in plink FAM format. Only columns `id`, `pat`, and `mat` are required. `id` must be unique and non-missing. Founders must be present, and their `pat` and `mat` values must be missing (see below). Non-founders must have both their parents be non-missing. Parents must appear earlier than their children in the table.
`ids`	A list containing vectors of IDs for each generation. All these IDs must be present in `fam$id`. If IDs in `fam` and `ids` do not fully agree, the code processes the IDs in the intersection, which is helpful when `fam` is pruned but `ids` is the original (larger) set.
`founders_anc`	a named vector that maps every founder haplotype (the names of this vector) to its ancestry code. Ancestry codes must match the codes used in `anc_haps` above. Founder haplotypes are the founder individual IDs from the pedigree (values in `ids[[1]]`) appearing twice, suffixed with "_pat" and "_mat", respectively (so the parents of the founders are unadmixed, though founders be first generation admixed this way).
`indexes_chr_ends`	Optional vector mapping each chromosome (index in vector) to the last index in the `bim` table; missing chromosomes must be mapped to `NA` values. If `NULL` (default), this vector is calculated from the provided `bim` table. However, if this function is called repeatedly on the same `bim` data, it can improve efficiency to calculate it outside just once (use `indexes_chr()` for this).
`loci_on_cols`	If `TRUE`, `haps` has loci on columns and individuals on rows; if `FALSE` (default), loci are on rows and individuals on columns. If `haps` is a `BEDMatrix` object, `loci_on_cols` is ignored (set automatically to `TRUE` internally).
`missing_vals`	The list of ID values treated as missing. `NA` is always treated as missing. By default, the empty string (”) and zero (0) are also treated as missing (remove values from here if this is a problem).

Details

This function wraps around several exported package functions to achieve its objectives, which are roughly grouped into the following 4 phases. Phase 1 simulates recombination in the family without explicit sequences. In particular, it initializes the founder haplotype structure (without variants yet) using recomb_init_founders(), then simulates recombination breaks along the pedigree and identifies all the founder haplotype blocks in the focal individuals using recomb_last_gen(), and maps recombination breaks in cM to basepairs using recomb_map_inds(). Phase 2 reorganizes this data to identify the unique founder blocks that were inherited, first by making the data tidy with tidy_recomb_map_inds(), then applying recomb_founder_blocks_inherited(). Phase 3 initializes founder haplotypes using sparse matrices from the package Matrix, and draws inherited founder subsequences according to their known ancestries and using the Li-Stephens-like haplotype model of pop_recomb(). Phase 4 constructs the genotype matrices of focal individuals using the haplotypes of the founders drawn in phase 3 and the known origin of focal blocks from founders from phase 1, first constructing this data at the phased haplotype level with recomb_haplo_inds(), reencoding as unphased genotypes using recomb_geno_inds(), and constructing the corresponding local ancestry dosages using recomb_admix_inds().

Value

A named list with three elements:

X: the genotype matrix of the focal individuals, as returned by recomb_geno_inds().
Ls: a list, mapping each ancestry to its matrix of local ancestry dosages, as returned by recomb_admix_inds().
haplos: a phased version of the haplotypes and local ancestries of the focal individuals, structured as nested lists, as returned by recomb_haplo_inds().

Examples

library(tibble)

# simulate random haplotypes for example
# this toy data has 10 SNPs per chromosome, in fixed positions for simplicity
bim <- tibble( chr = rep( 1 : 22, each = 10 ), pos = rep( (1:10) * 1e6, 22 ) )
# and random haplotype data to go with this
n_ind_hap <- 10
m_loci <- nrow( bim )
# NOTE ancestry labels can be anything but must match `founders_anc` below
anc_haps <- list(
    'AFR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.5 ), nrow = m_loci, ncol = n_ind_hap ),
    'EUR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.2 ), nrow = m_loci, ncol = n_ind_hap )
)

# now simulate a very small family with one individual, 2 parents, 4 implicit grandparents
data <- fam_ancestors( 2 )
fam <- data$fam
ids <- data$ids
# select ancestries for each of the 4 grandparents / founder haplotypes (unadmixed)
founders_anc <- c('AFR', 'AFR', 'AFR', 'EUR')
# set names of founders with _pat/mat, needed to match recombination structure
# order is odd but choices were random so that doesn't matter
names( founders_anc ) <- c(
    paste0( ids[[1]], '_pat' ),
    paste0( ids[[1]], '_mat' )
)

# this performs the simulation!
data <- geno_last_gen_admix_recomb( anc_haps, bim, recomb_map_hg38, 10, fam, ids, founders_anc )
# this is the genotype matrix for the one admixed individual
data$X
# the corresponding local ancestry dosage matrices
# names match input labels
data$Ls$AFR
data$Ls$EUR
# if desired, a more complete but more complicated structure holding phased haplotypes
# and phased local ancestry information
data$haplos

OchoaLab/simfam documentation built on Jan. 11, 2025, 12:23 a.m.