sim_and_write_plink: Simulate and write genotypes to plink format on the fly

View source: R/sim_and_write_plink.R

sim_and_write_plinkR Documentation

Description

To save memory, simulate small chunks of variants at the time and write them to file as you go. This is a wrapper around write_plink() and readr::write_lines() (for ancestral allele frequencies, optional) with append = TRUE that simplifies looping somewhat. The function always appends to the output, so it can be called several times if convenient, for example to simulate separate chromosomes.

Usage

sim_and_write_plink(
  sim_chunk,
  m_loci,
  fam,
  file,
  file_p_anc = NA,
  n_data_cut = 10^6
)

Arguments

sim_chunk

A function that generates a small number of loci at the time, to be called iteratively until the whole genome is complete. It should accept a single parameter, the number of loci to simulate at one time, and returns a list with these named elements:

  • X: the simulated genotype matrix, with the desired number of loci and the same individuals in every call. Required.

  • bim: the simulated variant table for the loci that were just simulated. Required.

  • p_anc: the vector of ancestral allele frequencies (required for simulating traits with correctly specified heritabilities). Optional.

m_loci

Total number of loci to simulate.

fam

Sample table of simulation to write.

file

Output file path, without extensions (each of .bed, .bim, .fam extensions will be added automatically as needed).

file_p_anc

Complete file path (with extensions) of vector of ancestral allele frequencies, if sim_chunk generates them (optional). This file is created with readr::write_lines(), so it is a plain text file with each line being the ancestral allele frequency of each locus in order, and it may be compressed if this file has a .gz extension.

n_data_cut

Number of cells (individuals times loci) to aim to simulate at the time. Actual number may be smaller to ensure that the number of loci is an integer, except if the number of individuals is greater than n_data_cut then a single locus will be simulated at the time (and the number of cells will be greater than n_data_cut).

See Also

write_plink()

Examples


# some global constants that will be accessed by simulator function
n <- 10
# and a global variable updated as we go
m_last <- 0

# define a trivial but complete genotype simulator function
my_sim_chunk <- function( m_chunk ) {
    # construct ancestral allele frequencies
    p_anc <- runif( m_chunk )
    # simulate genotypes from HWE
    X <- matrix( rbinom( m_chunk * n, 2, p_anc ), m_chunk, n )
    # construct a trivial BIM table
    bim <- make_bim( n = m_chunk )
    # but make sure count continues across chunks without repeats
    # (so IDs and positions don't clash!)
    bim$id <- m_last + ( 1 : m_chunk )
    # update global value (use <<-) for next round
    m_last <<- m_last + m_chunk
    # return all of these elements in a named list!
    return( list( X = X, bim = bim, p_anc = p_anc ) )
}

# the fam table is created fully now
fam <- make_fam( n = n )
# set other parameters
m_loci <- 100

# this is only necessary for example files to be in a *temporary* location
# (don't use `tempfile` in real cases)
# plink files path without extension
file <- tempfile('test')
# p_anc file should have extension
filep <- tempfile('test-p-anc.txt.gz')

# simulate and write as we go!
sim_and_write_plink( my_sim_chunk, m_loci, fam, file, filep )

# clean up: delete sample outputs
delete_files_plink( file )
file.remove( filep )


OchoaLab/genio documentation built on Feb. 22, 2025, 4:13 a.m.