Generate familial time-to-event data

Description

This function generates familial time-to-event data for a specified study design, genetic model, and source of residual familial correlation. The generated data frame also contains the family structure (individual's id, father id, mother id, relationship to proband, generation), gender, current age, and genotypes of the major and second genes.

Usage

simfam(N.fam, design="pop", variation="none", depend=1, 
       base.dist="Weibull", frailty.dist="gamma", base.parms, vbeta, 
       allelefreq=c(0.02, 0.2), dominant.m=TRUE, dominant.s=TRUE,
       mrate=0, hr=0, age1=c(65,2.5), age2=c(45,2.5), agemin=20)

Arguments

N.fam

Number of families to generate.

design

Family-based study design used in the simulations. Possible choices are "pop", "pop+", "cli", "cli+" or "twostage". "pop" is the population-based design, in which families are ascertained through affected probands; "pop+" is like "pop" but the probands are also mutation carriers; "cli" is the clinic-based design, in which the affected proband has at least one affected parent and one affected sibling; "cli+" is like "cli" but the probands are also mutation carriers; "twostage" is the two-stage design, which randomly samples families from the population in the first stage and oversamples high-risk families (those with at least two affected members) in the second stage. Default is "pop".

variation

Source of residual familial correlation. Possible choices are: "frailty" for frailty shared within families, "secondgene" for second gene variation, or "none" for no residual familial correlation. Default is "none".

depend

Variance of the frailty distribution. Dependence within families increases with the value of depend. Default value is 1.

base.dist

Choice of baseline hazard distribution. Possible choices are: "Weibull", "loglogistic", "Gompertz", "lognormal", or "gamma". Default is "Weibull".

frailty.dist

Choice of frailty distribution. Possible choices are: "gamma" for gamma distribution or "lognormal" for log normal distribution. Default is "gamma".

base.parms

Vector of parameter values for baseline hazard function.

base.parms=c(lambda, rho), where lambda and rho are the shape and scale parameters, respectively.

vbeta

Vector of parameter values for gender, majorgene, and secondgene.

allelefreq

Vector of population allele frequencies of the major and second disease gene alleles. Frequencies must be between 0 and 1. Default frequencies are 0.02 for the major gene allele and 0.2 for the second gene allele, allelefreq=c(0.02, 0.2).

dominant.m

logical; if TRUE, the genetic model of major gene is dominant, otherwise recessive.

dominant.s

logical; if TRUE, the genetic model of second gene is dominant, otherwise recessive.

mrate

Proportion of missing genotypes, value between 0 and 1. Default value is 0.

hr

Proportion of high-risk families, which include at least two affected members, to include in the sample under two-stage sampling. This value should be specified when design="twostage" is used and should lie between 0 and 1. Default value is 0.

age1

Vector of mean and standard deviation for the current age of generation 1 or grandparents. Default values are mean of 65 years and standard deviation of 2.5 years, age1=c(65,2.5).

age2

Vector of mean and standard deviation for the current age of generation 2 or proband generation. Default values are mean of 45 years and standard deviation of 2.5 years, age2=c(45,2.5).

agemin

Minimum age of disease onset. Default is 20 years of age.
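
As a rough illustration of how allelefreq interacts with dominant.m and dominant.s, the sketch below computes the expected proportion of carriers in the population. It assumes Hardy-Weinberg equilibrium, which is an assumption of this sketch and is not stated above; the helper name carrier_prob is hypothetical and is not part of the package.

```r
# Hypothetical helper: expected proportion of carriers in the population,
# assuming Hardy-Weinberg equilibrium (an assumption of this sketch).
carrier_prob <- function(q, dominant = TRUE) {
  stopifnot(q >= 0, q <= 1)
  if (dominant) {
    1 - (1 - q)^2   # at least one disease allele
  } else {
    q^2             # two copies of the disease allele
  }
}

carrier_prob(0.02, dominant = TRUE)   # major gene, default frequency
carrier_prob(0.2,  dominant = FALSE)  # second gene under a recessive model
```

Under a dominant model even a rare allele yields roughly 2q carriers, whereas a recessive model yields far fewer, which affects how many carrier probands a design such as "pop+" can draw on.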

Details

The design argument defines the type of family-based design to be simulated. Two variants of the population-based and clinic-based designs can be chosen: "pop" when the proband is affected, "pop+" when the proband is an affected mutation carrier, "cli" when the proband is affected and at least one parent and one sibling are affected, and "cli+" when the proband is an affected mutation carrier and at least one parent and one sibling are affected. The two-stage design, "twostage", is used to oversample high-risk families, where the proportion of high-risk families to include in the sample is specified by hr. High-risk families are those with multiple (at least two) affected members.
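
To make the clinic-based criterion concrete, here is a minimal sketch that checks whether a single family satisfies the "cli" condition, using the relation codes (1 = proband, 2 = sibling, 4 = parent) and the status column of the returned data frame. The helper meets_cli and the toy family are hypothetical and only illustrate the rule; they do not reproduce how simfam performs ascertainment internally.

```r
# Hypothetical helper: does one family meet the "cli" criterion
# (affected proband, at least one affected parent, at least one affected sib)?
meets_cli <- function(fam1) {
  affected <- fam1$status == 1
  any(affected & fam1$relation == 1) &&
    any(affected & fam1$relation == 4) &&
    any(affected & fam1$relation == 2)
}

# Toy family made up for illustration (not simfam output):
# two parents, the proband, one sibling, one child
toy <- data.frame(
  relation = c(4, 4, 1, 2, 3),
  status   = c(1, 0, 1, 1, 0)
)
meets_cli(toy)
```

Applied per family (for example with split() on famID), a check of this kind can confirm that a simulated sample matches the intended design.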

Age at onset is generated from the penetrance model, where residual familial correlation is induced by either a latent random variable called "frailty" or a second gene shared by family members.

The penetrance model with a shared frailty model has the form

h(t|Z) = h_0(t-t_0) Z \exp(β_s x_s + β_{g1} x_{g1})

where Z represents a frailty shared within families and follows either a gamma or log-normal distribution; t_0 is a minimum age of disease onset; x_s indicates males (1) and females (0) and x_{g1} indicates carriers (1) and non-carriers (0) of major gene mutation.
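
The shared frailty model can be sketched with inverse-transform sampling. The code below is an illustration under stated assumptions, not the package's implementation: it assumes a Weibull baseline with cumulative hazard H0(t) = (lambda * t)^rho (this parameterization is an assumption), a gamma frailty with mean 1 and variance equal to depend, and a made-up 50% carrier proportion for the toy data.

```r
set.seed(1)

# Illustrative values borrowed from the Examples section
lambda <- 0.01; rho <- 3         # base.parms (Weibull form assumed, see above)
beta_s <- -1.13; beta_g <- 2.35  # vbeta: gender and major gene effects
t0     <- 20                     # agemin
k      <- 1                      # depend: frailty variance

n_fam <- 3; n_ind <- 4           # 3 toy families of 4 members each
famID <- rep(seq_len(n_fam), each = n_ind)

# One gamma frailty per family (mean 1, variance k), shared by its members
Z <- rgamma(n_fam, shape = 1 / k, rate = 1 / k)[famID]

x_s <- rbinom(n_fam * n_ind, 1, 0.5)  # 1 = male, 0 = female
x_g <- rbinom(n_fam * n_ind, 1, 0.5)  # 1 = carrier (toy proportion)

# Solve S(t | Z) = U for t, where S(t | Z) = exp(-Z * exp(lp) * H0(t - t0))
lp <- beta_s * x_s + beta_g * x_g
U  <- runif(n_fam * n_ind)
ageonset <- t0 + (-log(U) / (Z * exp(lp)))^(1 / rho) / lambda

data.frame(famID, x_s, x_g, Z = round(Z, 3), ageonset = round(ageonset, 1))
```

Because every member of a family shares the same Z, a family with a large frailty tends to have earlier onsets across all members, which is what induces the within-family correlation.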

The penetrance model with a second gene variation has the form

h(t) = h_0(t-t_0) \exp(β_s x_s + β_{g1} x_{g1} + β_{g2} x_{g2})

where x_{g2} indicates carriers (1) and non-carriers (0) of a second gene mutation.
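
For the second-gene model, the cumulative risk (penetrance) by age t follows from the hazard as 1 - exp(-H_0(t-t_0) exp(linear predictor)). The sketch below evaluates it assuming a Weibull baseline with H0(t) = (lambda * t)^rho; that parameterization, the helper name penetrance, and the second-gene coefficient 0.5 are all assumptions made for illustration.

```r
# Hypothetical helper: cumulative risk by age t under the second-gene model,
# assuming a Weibull baseline with H0(t) = (lambda * t)^rho.
penetrance <- function(t, x_s, x_g1, x_g2, beta, lambda, rho, t0 = 20) {
  lp <- beta[1] * x_s + beta[2] * x_g1 + beta[3] * x_g2
  H0 <- (lambda * pmax(t - t0, 0))^rho   # no risk before the minimum onset age
  1 - exp(-H0 * exp(lp))
}

# Illustrative coefficients; the third one (second gene) is made up
beta <- c(-1.13, 2.35, 0.5)
ages <- c(20, 40, 60, 80)

# Female carrier of both genes versus female non-carrier
p_carrier    <- penetrance(ages, x_s = 0, x_g1 = 1, x_g2 = 1,
                           beta = beta, lambda = 0.01, rho = 3)
p_noncarrier <- penetrance(ages, x_s = 0, x_g1 = 0, x_g2 = 0,
                           beta = beta, lambda = 0.01, rho = 3)
round(rbind(ages, p_carrier, p_noncarrier), 3)
```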

The current ages for each generation are simulated assuming normal distributions. However, the probands' ages are generated using a left-truncated normal distribution, as their ages cannot be less than the minimum age of onset. The mean age difference between each generation and its parents is set to at least 20 years.
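
A left-truncated normal draw can be sketched in base R with the inverse-CDF method. The helper rtnorm_left is hypothetical and shows one common way to sample such ages; the package may use a different method internally.

```r
set.seed(2)

# Hypothetical helper: draw from a normal distribution truncated below at
# `lower`, using the inverse-CDF method.
rtnorm_left <- function(n, mean, sd, lower) {
  p_low <- pnorm(lower, mean, sd)      # probability mass below the cutoff
  qnorm(runif(n, p_low, 1), mean, sd)  # sample only from the region above it
}

# Proband ages with the defaults age2 = c(45, 2.5) and agemin = 20
proband_age <- rtnorm_left(1000, mean = 45, sd = 2.5, lower = 20)
range(proband_age)

# A cutoff close to the mean makes the truncation visible
range(rtnorm_left(1000, mean = 45, sd = 2.5, lower = 44))
```

With the default values the cutoff lies ten standard deviations below the mean, so truncation has practically no effect; it matters when the mean of age2 is set close to agemin.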

Value

The function returns a data frame which contains:

famID

Family identification number (id).

indID

Individual id.

gender

Gender indicator: 1 for males, 0 for females.

motherID

Mother id number.

fatherID

Father id number.

proband

Proband indicator: 1 if the individual is the proband, 0 otherwise.

generation

Individual's generation: 1 = parents of probands, 2 = probands and siblings, 3 = children of probands and siblings.

majorgene

Genotype of major gene: 1=AA, 2=Aa, 3=aa where A is the disease allele.

secondgene

Genotype of second gene: 1=BB, 2=Bb, 3=bb where B is the disease allele.

ageonset

Age at disease onset.

currentage

Current age.

time

Observed time: the minimum of current age and age at onset.

status

Disease status: 1 for affected and 0 for unaffected (censored).

mgene

Carrier status of major gene, which can possibly be missing: 1 for carrier, 0 for non-carrier, NA for missing carrier status.

relation

Family members' relationship with the proband is as follows

1 Proband (self)
2 Brother or sister
3 Son or daughter
4 Parent
5 Nephew or niece
6 Husband
7 Brother or sister in law

fsize

Family size including parents, siblings and children of the proband and the siblings.

naff

Number of affected members in family.

weight

Sampling weights.
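
The relationship between ageonset, currentage, time and status can be illustrated with three rows copied from the sample output in the Examples section. The derivation below is consistent with that output; whether simfam computes the columns exactly this way, including how ties are handled, is an assumption.

```r
# Three rows copied from the sample output in the Examples section
d <- data.frame(
  ageonset   = c(70, 110, 36),
  currentage = c(68, 68, 40)
)

# Observed time is whichever comes first: onset or current age
d$time <- pmin(d$ageonset, d$currentage)

# Affected only if onset has already occurred by the current age
# (treating a tie as affected is an assumption of this sketch)
d$status <- as.integer(d$ageonset <= d$currentage)

d
```

The first two individuals are censored at their current age of 68, while the third, with onset at 36 and current age 40, is affected.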

Author(s)

Yun-Hee Choi, Wenqing He

References

Choi, Y.-H., Kopciuk, K. and Briollais, L. (2008) Estimating Disease Risk Associated with Mutated Genes in Family-Based Designs, Human Heredity 66, 238-251.

Choi, Y.-H. and Briollais, L. (2011) An EM Composite Likelihood Approach for Multistage Sampling of Family Data with Missing Genetic Covariates, Statistica Sinica 21, 231-253.

See Also

summary.simfam, plot.simfam, penplot

Examples

## Example 1: simulate family data from population-based design using
#  a Weibull distribution for the baseline hazard and inducing 
#  residual familial correlation through a shared gamma frailty.

fam <- simfam(N.fam=100, design="pop+", variation="frailty", 
       base.dist="Weibull", frailty.dist="gamma", depend=1, 
       allelefreq=0.02, base.parms=c(0.01,3), vbeta=c(-1.13, 2.35))

head(fam) 

#   famID indID gender motherID fatherID proband generation majorgene secondgene
# 1     1     1      1        0        0       0          1         2          0
# 2     1     2      0        0        0       0          1         3          0
# 3     1     3      0        2        1       1          2         2          0
# 4     1     4      1        0        0       0          0         3          0
# 5     1     7      0        3        4       0          3         2          0
# 6     1     8      1        3        4       0          3         3          0
#   ageonset currentage time status mgene relation fsize naff weight
# 1       70         68   68      0     1        4    11    1      1
# 2      110         68   68      0     0        4    11    1      1
# 3       36         40   36      1     1        1    11    1      1
# 4      212         50   50      0     0        6    11    1      1
# 5       79         19   19      0     1        3    11    1      1
# 6      169         16   16      0     0        3    11    1      1

summary(fam)

plot(fam, famid=c(1:2)) # pedigree plots for families with IDs=1 and 2

## Example 2: simulate family data from two stage design to include 
#  30% of high risk families in the sample. 

fam <- simfam(N.fam=100, design="twostage", variation="frailty", 
       base.dist="Weibull", frailty.dist="gamma", depend=1, hr=0.3,
       base.parms=c(0.01,3), vbeta=c(-1.13, 2.35), allelefreq=0.02)

summary(fam)