simulate_missing_data_array: Simulate Qijs across multiple relevant levels of missing data

View source: R/simulate_missing_data_array.R

simulate_missing_data_array    R Documentation

Simulate Qijs across multiple relevant levels of missing data

Description

Some data sets are missing data at many markers. In that case, running the simulations as if no data were missing gives misleading error rates. This function wraps several steps that together give more accurate "first-pass" false positive rates (FPRs) and false negative rates (FNRs) when a lot of data are missing. The steps are:

  • Tabulate the distribution of the number of informative (i.e., not missing in either member of the pair) markers, across all pairs. (Note, this requires that you have an actual data set that you are trying to do relationship inference in.)

  • Estimate missingness rates per locus, and from that calculate the rate of missingness in pairs, under a simple independence assumption.

  • Simulate Q_ij values at a series of different numbers of non-missing loci to calculate FPRs and FNRs for those.
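The first two steps can be sketched in base R on a toy data set. This is only an illustration of the idea, not the package's internal code: the data frame LG_toy, its values, and every variable name below are made up, and the pair-level probability simply applies the independence assumption, so a locus with missingness rate m is informative in a pair with probability (1 - m)^2.

```r
# Toy long-format genotypes: 4 individuals x 3 loci x 2 gene copies.
# All names and values here are hypothetical.
LG_toy <- expand.grid(
  gene_copy = 1:2,
  Locus = c("L1", "L2", "L3"),
  Indiv = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
LG_toy$Allele <- "A"
# Make individual "a" missing at L1 and individual "b" missing at L2.
LG_toy$Allele[LG_toy$Indiv == "a" & LG_toy$Locus == "L1"] <- NA
LG_toy$Allele[LG_toy$Indiv == "b" & LG_toy$Locus == "L2"] <- NA

# Indiv x Locus logical matrix: TRUE when the genotype is missing.
miss <- tapply(is.na(LG_toy$Allele), list(LG_toy$Indiv, LG_toy$Locus), any)

# Step 1: number of informative loci (non-missing in both members)
# for every pair of individuals, then the distribution of that count.
pair_ids <- combn(rownames(miss), 2)
n_informative <- apply(pair_ids, 2, function(p) {
  sum(!miss[p[1], ] & !miss[p[2], ])
})
informative_table <- table(n_informative)

# Step 2: per-locus missingness rate, and the probability that a locus is
# informative in a pair if missingness is independent between individuals.
miss_rate <- colMeans(miss)
pair_informative_prob <- (1 - miss_rate)^2
```

Here informative_table plays the role of the tabulated distribution in the first step, and pair_informative_prob is the pair-level quantity from the second.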

Usage

simulate_missing_data_array(LG, C, num_points = 11, num_cores = 1, ...)

Arguments

LG

the genotypes in long format. It must have the columns Indiv (unique IDs of the individuals), Locus, gene_copy (must be 1 or 2, denoting which of the two gene copies in a diploid each allele is), and Allele, which must be a character vector. Any missing genotypes in the data frame must appear as NA in the Allele column.
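As an illustration of this layout, here is a small made-up data frame in the required long format, together with a few sanity checks. The individual, locus, and allele names are invented, and check_LG is a hypothetical helper, not a CKMRsim function.

```r
# A tiny example of the long format described above (values are made up).
# Each individual has two rows per locus, one for each gene copy.
LG_example <- data.frame(
  Indiv = rep(c("fish_1", "fish_2"), each = 4),
  Locus = rep(rep(c("loc_A", "loc_B"), each = 2), times = 2),
  gene_copy = rep(1:2, times = 4),
  Allele = c("101", "105", "C", "T", "101", "101", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of CKMRsim): basic checks on the format.
check_LG <- function(LG) {
  stopifnot(
    all(c("Indiv", "Locus", "gene_copy", "Allele") %in% names(LG)),
    all(LG$gene_copy %in% c(1, 2)),
    is.character(LG$Allele)
  )
  invisible(TRUE)
}

check_LG(LG_example)
```

In this example fish_2 has a missing genotype at loc_B, recorded as NA in both of its Allele rows.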

C

the ckmr object upon which to base the simulations.

num_points

the number of different values at which simulations will be performed. The values span the range from the lowest to the highest observed number of pairwise non-missing genotypes, inclusive.
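A minimal sketch of what such a grid of values could look like, assuming evenly spaced points rounded to whole numbers of loci. The even spacing and the rounding are assumptions made for illustration, as are the endpoints lo and hi; the function may choose its points differently.

```r
# Hypothetical observed range of pairwise non-missing genotype counts.
lo <- 40
hi <- 100
num_points <- 11

# Assumption: evenly spaced values from lo to hi, inclusive, rounded to
# whole numbers of loci. unique() drops repeats when the range is narrow.
n_loci_grid <- unique(round(seq(lo, hi, length.out = num_points)))
n_loci_grid
```

With a narrow observed range, rounding can make neighboring points coincide, so the grid may hold fewer than num_points distinct values.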

num_cores

Number of cores to parallelize the simulations over (using mclapply from the parallel package). mclapply parallelizes by forking, which is not available on Windows, so this must remain equal to 1 there.
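The pattern below shows one way to guard a call to mclapply so that it falls back to a single core on Windows. It is a sketch only: run_one is a placeholder standing in for a single simulation run, and the input values are arbitrary.

```r
library(parallel)

num_cores <- 2
# Forking is not available on Windows, so fall back to a single core there.
if (.Platform$OS.type == "windows") num_cores <- 1

# Placeholder for one simulation run at a given number of non-missing loci.
run_one <- function(n_loci) n_loci^2

# Apply run_one to each value, spread over num_cores forked processes.
results <- mclapply(c(40, 70, 100), run_one, mc.cores = num_cores)
```

mclapply returns a list with one element per input value, in the same order as the inputs.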

...

Arguments passed on to simulate_Qij

sim_relats

a vector of names of the relationship IDs to simulate from (these were the rownames of the kappa_matrix argument to create_ckmr). For each relationship ID in sim_relats, genotype values are simulated from the Y_l_true values in C.

calc_relats

a vector of names of the relationship IDs under which to calculate the genotype log probabilities of the simulated genotypes. Genotype log probabilities are calculated using the Y_l matrices.

reps

the number of pairs to simulate for each relationship in sim_relats.

unlinked

A logical indicating whether to simulate the markers as unlinked. By default this is TRUE. If FALSE, then genotypes at linked markers will be simulated using the program MENDEL and genotyping errors will be applied to them, but the Q_ij values themselves will still be computed under the assumption of no linkage. The exception is the relationships "U", "PO", and "MZ": genotypes for those are still simulated under the no-linkage model because, in the absence of LD, pairs with those relationships are not affected by physical linkage.

pedigree_list

If you specify unlinked = FALSE, then you must supply a pedigree_list.

Value

This function returns a list. Its elements are not yet documented here.

Examples

# this is just here for testing at the moment
library(readr)  # provides read_rds()
LG <- read_rds("/tmp/LG.rds")
C <- read_rds("/tmp/C.rds")

eriqande/CKMRsim documentation built on June 12, 2025, 1:15 p.m.