hapSim: Simulation of genetic data in Haplin format

View source: R/hapSim.R

hapSimR Documentation

Simulation of genetic data in Haplin format

Description

Simulates genetic data in Haplin format, consisting of fetal effects, maternal effects and/or parent-of-origin effects. Allows for simulation of both autosomal and X-linked markers, assuming Hardy-Weinberg equilibrium. Enables stratified simulations for gene-environment interaction analyses, i.e the input (relative risks, number of cases etc) may vary across different strata.

Usage

hapSim(nall, n.strata = 1, cases, controls, haplo.freq,
RR, RRcm, RRcf, RRstar, RR.mat, RRstar.mat, 
gen.missing.cases = NULL, gen.missing.controls = NULL,
n.sim = 1000, xchrom = F, sim.comb.sex = "double", BR.girls, dire = "simfiles", 
ask = TRUE, verbose = TRUE, cpus = 1)

Arguments

nall

A vector of the number of alleles at each locus.

n.strata

The number of strata.

cases

A list of the number of case families. Each element is a vector of the number of families of the specified family design(s) in the corresponding stratum. The possible family designs, i.e., the possible names of the elements, are "mfc" (full triad), "mc" (mother-child dyad), "fc" (father-child dyad) or "c" (a single case child). See Details for a thorough description.

controls

A list of the number of control families. Each element is a vector of the number of families of the specified family design(s) in the corresponding stratum. The possible family designs are "mfc" (full triad), "mc" (mother-child-dyad), "fc" (father-child dyad), "mf" (mother-father dyad), "c" (a single control child), "m" (a single control mother) or "f" (a single control father). See Details for a thorough description.

haplo.freq

A list of which each element is a numeric vector of the haplotype frequencies in each stratum. The frequencies will be normalized so that they sum to one. The Details section shows how to implement this argument in agreement with the possible haplotypes.

RR

A list of which each element is a numeric vector of the relative risks in each stratum. The Details section shows how to implement this argument in agreement with the possible haplotypes.

RRcm

A list of numeric vectors. Each vector contains the relative risks associated with the haplotypes transmitted from the mother for this stratum. See Details for description of how to implement this argument in agreement with the possible haplotypes.

RRcf

A list of numeric vectors. Each vector contains the relative risks associated with the haplotypes transmitted from the father for this stratum. See Details for description of how to implement this argument in agreement with the possible haplotypes.

RRstar

A list of numeric vectors. Estimates how much double-dose children would deviate from the risk expected in a multiplicative dose-response relationship.

RR.mat

The interpretation is similar to RR when simulating genetic data with maternal effects.

RRstar.mat

The interpretation is parallel to RRstar when simulating genetic data with maternal effects.

gen.missing.cases

Generates missing values at random for the case families. Set to NULL by default, i.e., no missing values generated. See Details for description of how to implement this argument.

gen.missing.controls

Generates missing values at random for the control families. Set to NULL by default, i.e., no missing values generated. See Details for description of how to implement this argument.

n.sim

The number of simulations, i.e., the number of simulated data files.

xchrom

Logical. Equals FALSE by default, which indicates simulation of autosomal markers. If TRUE, hapSim simulates X-linked genes.

sim.comb.sex

To be used with xchrom = TRUE. A character value which specifies how to handle gender differences on the X-chromosome. If "single", the effect of a (single) allele in males is equal to the effect of a single allele dose in females, and similarly, if "double", a single allele in males has the same effect as a double allele dose in females. Default is "double", which corresponds to X-inactivation.

BR.girls

To be used with xchrom = TRUE. Gives the ratio of baseline risk for females to the baseline risk for males.

dire

Gives the directory of the simulated data files.

ask

Logical. If TRUE, hapSim will ask before overwriting the files in an already existing directory.

verbose

Logical. Default is TRUE, which means that the file name is displayed for each iteration. Works only when cpus = 1.

cpus

Allows simulations to be performed in parallel. The cpus argument should preferably be set to the number of available cores. If set lower, it will save some capacity for other processes to run. Setting it too high should not cause any serious problems.

Details

hapSim simulates allele values for case and control families at multiple markers (typically in LD) simultaneously. The number of markers/SNPs involved will typically be in the range 1 to 6. Data are simulated to produce relative risks of disease as specified by the user input. Simulations can be performed separately over a number of strata so as to simulate gene-environment interactions.

Specifying haplotype risks:
The number of haplotypes used in the simulations is determined by the nall argument, since prod(nall) different haplotypes can be made from the specified number of markers, length(nall). The arguments haplo.freq, RR, RRcm, RRcf, RRstar, RR.mat, and RRstar.mat are all lists where each element represents a stratum. Within each stratum, the arguments are vectors of length equal to the number of haplotypes, specifying the relative risk etc. associated with each haplotype. The stratum specific arguments may be simplified if the number of strata is one, or if the arguments are equal across all strata.

The haplotypes are determined by creating all possible haplotypes from the given markers, in a sequence where the first marker varies mostly quickly. For instance, if nall = c(3,2), the first marker has 3 alleles, the second has 2, and 6 haplotypes are possible. Taken in order, the haplotypes are 1-1, 2-1, 3-1, 1-2, 2-2, and 3-2. When specifying, say, RR = c(1,2,1,1,1,1) the haplotype 2-1 has a double risk compared to the rest. With, for instance, two strata, the specification RR = list(c(1,2,1,1,1,1), c(1,1,1,1,1,1)) would mean that the risk associated with 2-1 is elevated only in the first stratum, not the second.

The simplest example would be with nall = c(2) and RR = c(1,2), which would simulate a single SNP where the second allele has a double risk.

Output file format:
The format of the simulated files is relatively flexible and allows multi-allelic markers and various designs. If both case and control families are present, the simulated files contain a leading column of the case/control status (1/0). If xchrom=TRUE, the neighboring column to the left of the genetic data contains the sex information (1 = male, 2 = female). Each line represents genotypes for a case or control triad.

There are six columns for each locus, two for the mother (M), two for the father (F) and two for the child (C). The columns are placed in the following sequence: M11 M12 F11 F12 C11 C12 M21 M22 F21 F22 C21 C22... etc, where the first number indicates marker, and the second number indicates the first or second allele at this locus. Columns are separated by a single white space, and missing data are coded as NA.

Intermediate designs, for instance mother-child dyads, are represented as full triads with columns of absent family members set to missing. In the case of a pure case-control design, however, each line represents a single individual, and there are no columns representing mothers and fathers.

There are no row or column names in the files.

Some examples are given below. See https://haplin.bitbucket.io/docu/haplin_data_format.html for a thorough description of the Haplin format. Note that this description separates the two alleles for an individual within a locus by a semi-colon, such as 1;2. This is, however, not necessary.
Confer the document https://haplin.bitbucket.io/docu/Haplin_power.pdf for details and examples on how to perform the simulations.

gen.missing.cases and gen.missing.controls are flexible arguments. By default, both equal NULL, which means that no missing data are generated at random. If the arguments are single numbers, missing data are generated at random with this proportion for all cases and/or controls. If the arguments are vectors of length equal to the number of loci, missing data are generated with the corresponding proportion for each locus. The arguments can also be matrices with the number of rows equal to the number of loci and three columns. Each row corresponds to a locus, and the columns correspond to mothers, fathers and children, respectively.

Author(s)

Miriam Gjerdevik,
with Hakon K. Gjessing
Professor of Biostatistics
Division of Epidemiology
Norwegian Institute of Public Health

hakon.gjessing@uib.no

References

Web Site: https://haplin.bitbucket.io

See Also

haplin, hapRun, hapPower

Examples

## Not run: 
## Simulate genetic data (100 files) at two diallelic markers, consisting of fetal effects 
## corresponding to haplo.freq = rep(0.25, 4), RR = c(2,1,1,1) and RRstar = c(1,1,1,1),
## for the combination of 1000 case triads and 1000 control triads with no missing data.
## Only one stratum.
hapSim(nall = c(2,2), n.strata = 1, cases = c(mfc=1000),
controls = c(mfc=1000), haplo.freq = rep(0.25,4),
RR = c(2,1,1,1), RRstar = c(1,1,1,1), n.sim = 100, dire = "simfiles")

## Simulate genetic data (100 files) at two diallelic markers,
## consisting of fetal and maternal effects corresponding to 
## haplo.freq = rep(0.25, 4), RR = c(2,1,1,1), RRstar = c(1,1,1,1),
## RR.mat = c(2,1,1,1) and RRstar.mat = c(1,1,1,1),
## for 1000 case triads and zero control families. 
## One percent of the case triads are missing at random. One stratum only.
hapSim(nall = c(2,2), n.strata=1, cases = c(mfc=1000),
controls = c(mfc=0), haplo.freq = rep(0.25,4), RR = c(2,1,1,1),
RRstar = c(1,1,1,1), RR.mat = c(2,1,1,1), RRstar.mat = c(1,1,1,1), 
gen.missing.cases = 0.01, n.sim = 100, dire = "simfiles")

## Simulate genetic data (100 files) at two diallelic markers. In the first stratum, 
## we have a combination of 500 case triads and 500 control triads with
## haplo.freq = rep(0.25, 4), RR = c(2,1,1,1) and RRstar = c(1,1,1,1).
## In the second stratum, we have 300 case triads and 500 control triads with
## haplo.freq = rep(0.25, 4), RR = c(1,1,1,1) and RRstar = c(1,1,1,1).
## One percent of the control triads are missing at random in the first stratum.
hapSim(nall = c(2,2), n.strata= 2, cases = list(c(mfc=500),c(mfc=300)), 
controls = c(mfc=500),haplo.freq = rep(0.25,4),
RR = list(c(2,1,1,1),c(1,1,1,1)), RRstar = c(1,1,1,1),
gen.missing.controls = list(0.01,NULL), n.sim = 100, dire = "simfiles")

## End(Not run)

Haplin documentation built on Sept. 11, 2024, 7:13 p.m.