R/sim_random_family.R

Defines functions sim_random_family

Documented in sim_random_family

#'
#' @title Simulation with random family sizes
#' @md
#' @description Simulate genetic data, including genotypes,
#' phenotype status and liabilities, for individuals with family history, where
#' each individual is given a random number of siblings.
#' 
#' @details
#' Parents' genotypes are simulated and used for creating the genotypes of
#' the individuals and their siblings. For the methodology behind the
#' simulation, see `vignette("liability-distribution")`.\cr
#' The number of siblings generated by either a multinomial distribution or
#' a Poisson distribution. The function then calls `sim_varied_family()`
#' with the appropriate `n` and `dist` parameters.\cr
#' The choice of distribution depends on the input given to `sib_fert`. If a
#' a single number is given, the function uses the Poisson distribution to
#' randomly select how many siblings the different individuals have. In this
#' case, `dist` should not be specified.\cr
#' If a vector is given, then the function uses `sib_fert` as probabilities
#' of a multinomial distribution to select how many individuals have the number
#' of siblings specified in each entry of `dist`.\cr
#' E.g., `sib_fert = c(1/6, 1/6, 2/3)` and `dist = c(0, 2, 3)` means that the
#' probability of having 0 siblings is 1/6, the probability of having 2
#' siblings is 1/6 and the probability of having 3 siblings is 2/3.\cr
#' Since individuals have a different number of siblings, some entries in
#' `phenotypes.txt` will be missing, denoted by -9.\cr
#' E.g., an individual with 1 sibling in a dataset where the maximum number of
#' siblings is 3, would have -9 in all columns relating to sibling 2 and
#' sibling 3.\cr
#' \code{sim_random_family} makes use of parallel computation in order to
#' decrease the running time. As one CPU core is left unused, the user
#' should be able to do other work while the simulation is running.
#'
#' @param n number of genotypes (individuals).
#' @param m number of SNPS per genotype.
#' @param q number of causal SNPs, i.e. SNPs that effect chances of having
#' the phenotype.
#' @param hsq squared heritability parameter.
#' @param k prevalence of phenotype.
#' @param sib_fert either the distribution vector or a fertility rate. See
#' details section.
#' @param dist if `sib_fert` is a distribution vector, then `dist` is used to
#' specify how many siblings the probabilities in `sib_fert` correspond to.
#' @param path directory where the files will be stored. If nothing is
#' specified, \code{sim_random_family} writes its files in the current
#' working directory.
#'
#' @return A list where first entry is the number of individuals, i.e. the `n`
#' parameter supplied to `sim_varied_family()`, and the second entry is the
#' number of siblings those individuals have, i.e. the  `dist` parameter
#' supplied to `sim_varied_family()`.\cr
#' This is returned once the simulation is done printing the following five
#' files to the \code{path} parameter specified in the function call:
#' * Three text files:
#'     * beta.txt - a file of \code{m} rows with one column. The i'th row is
#'     the true effect of the i'th SNP.
#'     * MAFs.txt - a file of \code{m} rows with one column. The i'th row is
#'     the true Minor Allelle Frequency of the i'th SNP.
#'     * phenotypes.txt - a file of \code{n} rows, number of columns depend on
#'     number of siblings. The file contains the phenotype status and liability
#'     of each individual as well as information on the liabilities and
#'     phenotype status of their parents and siblings.
#' * genotypes.map - a file created such that PLINK will work with the genotype
#' data.
#' * genotypes.ped - the simulated genotypes in a PLINK-readable format.
#' Note: The function only saves genotype data for the target individual.
#'
#' @section Warning:
#' Simulating large datasets takes time and generates large files. For details
#' on time complexity and required disk space, see
#' `vignette("sim-benchmarks")`.\cr
#' The largest file generated is `genotypes.ped`. See `convert_geno_file()` to convert it
#' to another file format, thereby reducing its size significantly.
#'
#' @import stats
#'
#' @export

sim_random_family <- function(n, m, q, hsq, k, sib_fert, dist = 0, path = "") {
  stopifnot("sib_fert and dist needs to have same length" =
              length(sib_fert) == length(dist),
            "n needs to be a positive integer" =
              (n > 0 && is.numeric(n) && n == round(n)),
            "m needs to be a positive integer" =
              (m > 0 && is.numeric(m) && m == round(m)),
            "q needs to be a positive integer and smaller than m" =
              (q > 0 && is.numeric(q) && q == round(q) && length(q) == 1
               && q <= m),
            "hsq needs to be a number between 0 and 1" =
              (hsq > 0 && hsq < 1 && is.numeric(hsq) && length(hsq) == 1),
            "k needs to be a number between 0 and 1" =
              (k > 0 && k < 1 && is.numeric(k) && length(k) == 1),
            "sib_fert should either be a probability vector or a positive integer" =
              (is.numeric(sib_fert) && all(sib_fert > 0) &&
              ((length(sib_fert) > 1 && sum(sib_fert) == 1) ||
                 length(sib_fert) == 1)),
            "dist needs to be a vector of non-negative integers" =
              (all(dist >= 0) && is.numeric(dist) && all(dist == round(dist))),
            "path needs to be default or a valid path ending with '/' or '\\\\'"
            = (path == "" || (dir.exists(path))
               && (substr(path, nchar(path), nchar(path)) == "/" ||
                     substr(path, nchar(path), nchar(path)) == "\\")))

  if (length(sib_fert) > 1) {
    individual_distribution <- c(rmultinom(1, n, sib_fert))
  }
  else {
    vals <- rpois(n, sib_fert)
    individual_distribution <- as.vector(table(vals))
    dist <- as.numeric(names(table(vals)))
  }
  sim_varied_family(n = individual_distribution, m = m, q = q, hsq = hsq, k = k, dist = dist, path = path)
  return(list("simulated_individuals(n)" = individual_distribution, "number_of_siblings(dist)" = dist))
}
FireGutter/geneference documentation built on Dec. 17, 2021, 8:27 p.m.