#'
#' @title Simulation with random family sizes
#' @md
#' @description Simulate genetic data, including genotypes,
#' phenotype status and liabilities, for individuals with family history, where
#' each individual is given a random number of siblings.
#'
#' @details
#' Parents' genotypes are simulated and used for creating the genotypes of
#' the individuals and their siblings. For the methodology behind the
#' simulation, see `vignette("liability-distribution")`.\cr
#' The number of siblings generated by either a multinomial distribution or
#' a Poisson distribution. The function then calls `sim_varied_family()`
#' with the appropriate `n` and `dist` parameters.\cr
#' The choice of distribution depends on the input given to `sib_fert`. If a
#' a single number is given, the function uses the Poisson distribution to
#' randomly select how many siblings the different individuals have. In this
#' case, `dist` should not be specified.\cr
#' If a vector is given, then the function uses `sib_fert` as probabilities
#' of a multinomial distribution to select how many individuals have the number
#' of siblings specified in each entry of `dist`.\cr
#' E.g., `sib_fert = c(1/6, 1/6, 2/3)` and `dist = c(0, 2, 3)` means that the
#' probability of having 0 siblings is 1/6, the probability of having 2
#' siblings is 1/6 and the probability of having 3 siblings is 2/3.\cr
#' Since individuals have a different number of siblings, some entries in
#' `phenotypes.txt` will be missing, denoted by -9.\cr
#' E.g., an individual with 1 sibling in a dataset where the maximum number of
#' siblings is 3, would have -9 in all columns relating to sibling 2 and
#' sibling 3.\cr
#' \code{sim_random_family} makes use of parallel computation in order to
#' decrease the running time. As one CPU core is left unused, the user
#' should be able to do other work while the simulation is running.
#'
#' @param n number of genotypes (individuals).
#' @param m number of SNPS per genotype.
#' @param q number of causal SNPs, i.e. SNPs that effect chances of having
#' the phenotype.
#' @param hsq squared heritability parameter.
#' @param k prevalence of phenotype.
#' @param sib_fert either the distribution vector or a fertility rate. See
#' details section.
#' @param dist if `sib_fert` is a distribution vector, then `dist` is used to
#' specify how many siblings the probabilities in `sib_fert` correspond to.
#' @param path directory where the files will be stored. If nothing is
#' specified, \code{sim_random_family} writes its files in the current
#' working directory.
#'
#' @return A list where first entry is the number of individuals, i.e. the `n`
#' parameter supplied to `sim_varied_family()`, and the second entry is the
#' number of siblings those individuals have, i.e. the `dist` parameter
#' supplied to `sim_varied_family()`.\cr
#' This is returned once the simulation is done printing the following five
#' files to the \code{path} parameter specified in the function call:
#' * Three text files:
#' * beta.txt - a file of \code{m} rows with one column. The i'th row is
#' the true effect of the i'th SNP.
#' * MAFs.txt - a file of \code{m} rows with one column. The i'th row is
#' the true Minor Allelle Frequency of the i'th SNP.
#' * phenotypes.txt - a file of \code{n} rows, number of columns depend on
#' number of siblings. The file contains the phenotype status and liability
#' of each individual as well as information on the liabilities and
#' phenotype status of their parents and siblings.
#' * genotypes.map - a file created such that PLINK will work with the genotype
#' data.
#' * genotypes.ped - the simulated genotypes in a PLINK-readable format.
#' Note: The function only saves genotype data for the target individual.
#'
#' @section Warning:
#' Simulating large datasets takes time and generates large files. For details
#' on time complexity and required disk space, see
#' `vignette("sim-benchmarks")`.\cr
#' The largest file generated is `genotypes.ped`. See `convert_geno_file()` to convert it
#' to another file format, thereby reducing its size significantly.
#'
#' @import stats
#'
#' @export
sim_random_family <- function(n, m, q, hsq, k, sib_fert, dist = 0, path = "") {
stopifnot("sib_fert and dist needs to have same length" =
length(sib_fert) == length(dist),
"n needs to be a positive integer" =
(n > 0 && is.numeric(n) && n == round(n)),
"m needs to be a positive integer" =
(m > 0 && is.numeric(m) && m == round(m)),
"q needs to be a positive integer and smaller than m" =
(q > 0 && is.numeric(q) && q == round(q) && length(q) == 1
&& q <= m),
"hsq needs to be a number between 0 and 1" =
(hsq > 0 && hsq < 1 && is.numeric(hsq) && length(hsq) == 1),
"k needs to be a number between 0 and 1" =
(k > 0 && k < 1 && is.numeric(k) && length(k) == 1),
"sib_fert should either be a probability vector or a positive integer" =
(is.numeric(sib_fert) && all(sib_fert > 0) &&
((length(sib_fert) > 1 && sum(sib_fert) == 1) ||
length(sib_fert) == 1)),
"dist needs to be a vector of non-negative integers" =
(all(dist >= 0) && is.numeric(dist) && all(dist == round(dist))),
"path needs to be default or a valid path ending with '/' or '\\\\'"
= (path == "" || (dir.exists(path))
&& (substr(path, nchar(path), nchar(path)) == "/" ||
substr(path, nchar(path), nchar(path)) == "\\")))
if (length(sib_fert) > 1) {
individual_distribution <- c(rmultinom(1, n, sib_fert))
}
else {
vals <- rpois(n, sib_fert)
individual_distribution <- as.vector(table(vals))
dist <- as.numeric(names(table(vals)))
}
sim_varied_family(n = individual_distribution, m = m, q = q, hsq = hsq, k = k, dist = dist, path = path)
return(list("simulated_individuals(n)" = individual_distribution, "number_of_siblings(dist)" = dist))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.