simfam: Generate familial time-to-event data


simfam    R Documentation


Description

Generates familial time-to-event data for a specified study design, genetic model, and source of residual familial correlation. The generated data frame also contains the family structure (individual ID, father ID, mother ID, relationship to proband, generation), gender, current age, and genotypes of the major and second genes.

Usage

simfam(N.fam, design = "pop", variation = "none", interaction = FALSE, depend = NULL, 
       base.dist = "Weibull", frailty.dist = NULL, base.parms, vbeta, 
       allelefreq = c(0.02, 0.2), dominant.m = TRUE, dominant.s = TRUE,
       mrate = 0, hr = 0, probandage = c(45, 2), agemin = 20, agemax = 100)

Arguments

N.fam

Number of families to generate.

design

Family-based study design used in the simulations. Possible choices are "pop", "pop+", "cli", "cli+" or "twostage": "pop" is the population-based design, in which families are ascertained through affected probands; "pop+" is similar to "pop" but with mutation carrier probands; "cli" is the clinic-based design, which includes affected probands with at least one parent and one sib affected; "cli+" is similar to "cli" but with mutation carrier probands; "twostage" is the two-stage design, which randomly samples families from the population in the first stage and oversamples high-risk families (those with at least two affected members) in the second stage. Default is "pop".

variation

Source of residual familial correlation. Possible choices are: "frailty" for frailty shared within families, "secondgene" for second gene variation, or "none" for no residual familial correlation. Default is "none".

interaction

Logical; if TRUE, allows the interaction between gender and mutation status. Default is FALSE.

depend

Variance of the frailty distribution. Dependence within families increases with depend value. Default is NULL. Value should be specified as a positive real number when variation="frailty".

base.dist

Choice of baseline hazard distribution. Possible choices are: "Weibull", "loglogistic", "Gompertz", "lognormal", "gamma", "logBurr". Default is "Weibull".

frailty.dist

Choice of frailty distribution. Possible choices are: "gamma" or "lognormal" when variation="frailty". Default is NULL.

base.parms

Vector of parameter values for the specified baseline hazard function. base.parms=c(lambda, rho) should be specified for base.dist="Weibull", "loglogistic", "Gompertz", "gamma", and "lognormal". For base.dist="logBurr", three parameters should be specified base.parms = c(lambda, rho, eta).

vbeta

Vector of regression coefficients for gender, majorgene, interaction between gender and majorgene (if interaction = TRUE), and secondgene (if variation = "secondgene").

allelefreq

Vector of population allele frequencies of major and second disease gene alleles. Frequencies must be between 0 and 1. Default frequencies are 0.02 for major gene allele and 0.2 for second gene allele, allelefreq = c(0.02, 0.2).

dominant.m

Logical; if TRUE, the genetic model of major gene is dominant, otherwise recessive.

dominant.s

Logical; if TRUE, the genetic model of second gene is dominant, otherwise recessive.

mrate

Proportion of missing genotypes, value between 0 and 1. Default value is 0.

hr

Proportion of high risk families, which include at least two affected members, to be sampled from the two stage sampling. This value should be specified when design="twostage". Default value is 0. Value should lie between 0 and 1.

probandage

Vector of mean and standard deviation for the proband age. Default values are mean of 45 years and standard deviation of 2 years, probandage = c(45, 2).

agemin

Minimum age of disease onset or minimum age. Default is 20 years of age.

agemax

Maximum age of disease onset or maximum age. Default is 100 years of age.
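The length of vbeta depends on the interaction and variation arguments described above. A minimal sketch of that rule follows; the helper name expected_vbeta_length is hypothetical and is not part of FamEvent.

```r
# Hypothetical helper (not part of FamEvent): number of regression
# coefficients that vbeta should contain, following the argument
# description (gender, majorgene, optional interaction, optional secondgene).
expected_vbeta_length <- function(interaction = FALSE, variation = "none") {
  n <- 2L                                               # gender and majorgene
  if (isTRUE(interaction)) n <- n + 1L                  # gender x majorgene
  if (identical(variation, "secondgene")) n <- n + 1L   # secondgene
  n
}

len_default <- expected_vbeta_length()
len_inter   <- expected_vbeta_length(interaction = TRUE)
len_full    <- expected_vbeta_length(interaction = TRUE,
                                     variation = "secondgene")
```

For instance, vbeta = c(-1.13, 2.35) supplies the gender and majorgene coefficients only, which fits interaction = FALSE with variation = "none" or "frailty".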

Details

The design argument defines the type of family-based design to be simulated. Two variants of the population-based and clinic-based designs can be chosen: "pop" when the proband is affected, "pop+" when the proband is an affected mutation carrier, "cli" when the proband is affected and at least one parent and one sibling are affected, and "cli+" when the proband is an affected mutation carrier and at least one parent and one sibling are affected. The two-stage design, "twostage", is used to oversample high-risk families, where the proportion of high-risk families to include in the sample is specified by hr. High-risk families are those that include at least two affected members.

The ages at onset are generated from one of the following penetrance models, depending on the choice of variation = "none", "frailty" or "secondgene". When variation = "none", the ages at onset are generated independently from the proportional hazard model conditional on gender and carrier status of the major gene mutation, X = c(xs, xg). Ages at onset that are correlated within families are generated from the shared frailty model (variation = "frailty") or the two-gene model (variation = "secondgene"), where the residual familial correlation is induced by a frailty or a second gene, respectively, shared within the family.

The proportional hazard model

h(t|X) = h0(t - t0) exp(βs * xs + βg * xg),

where h0(t) is the baseline hazard function, t0 is the minimum age of disease onset, and xs and xg indicate male (1) or female (0) and carrier (1) or non-carrier (0) of the major gene of interest, respectively.
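As an illustration only, the proportional hazard model above can be sketched in base R. The Weibull form h0(t) = lambda * rho * (lambda * t)^(rho - 1) is an assumption made for this sketch and should be checked against the package source; the function names hazard_ph and sim_onset_ph are hypothetical.

```r
# Assumed Weibull baseline (an assumption of this sketch):
#   h0(t) = lambda * rho * (lambda * t)^(rho - 1),  H0(t) = (lambda * t)^rho
h0_weibull <- function(t, lambda, rho) lambda * rho * (lambda * t)^(rho - 1)

# h(t | X) = h0(t - t0) * exp(beta_s * xs + beta_g * xg); zero before t0
hazard_ph <- function(t, xs, xg, lambda, rho, beta, t0 = 20) {
  lin <- beta[1] * xs + beta[2] * xg
  ifelse(t > t0, h0_weibull(pmax(t - t0, 0), lambda, rho) * exp(lin), 0)
}

# Inverse-transform sampling: solve H0(T - t0) * exp(lin) = -log(U) for T
sim_onset_ph <- function(xs, xg, lambda, rho, beta, t0 = 20) {
  lin <- beta[1] * xs + beta[2] * xg
  u <- runif(length(xs))
  t0 + (-log(u) / exp(lin))^(1 / rho) / lambda
}

set.seed(1)
xs <- c(1, 1, 0, 0)   # male, male, female, female
xg <- c(1, 0, 1, 0)   # carrier, non-carrier, carrier, non-carrier
beta <- c(-1.13, 2.35)
onset <- sim_onset_ph(xs, xg, lambda = 0.01, rho = 3, beta = beta)

# Hazard ratio of carriers to non-carriers at age 50, among females
hr_carrier <- hazard_ph(50, 0, 1, 0.01, 3, beta) /
  hazard_ph(50, 0, 0, 0.01, 3, beta)
```

Under this model the carrier hazard ratio equals exp(βg) at every age above t0, which is what hr_carrier evaluates.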

The shared frailty model

h(t|X,Z) = h0(t - t0) Z exp(βs * xs + βg * xg),

where h0(t) is the baseline hazard function, t0 is the minimum age of disease onset, Z represents a frailty shared within families and follows either a gamma or log-normal distribution, and xs and xg indicate male (1) or female (0) and carrier (1) or non-carrier (0) of the major gene of interest, respectively.
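The shared frailty model can be sketched in the same way. This sketch assumes a mean-one gamma frailty with variance depend (shape = 1/depend, rate = 1/depend) and a Weibull baseline with cumulative hazard (lambda * t)^rho; both parameterizations are assumptions, not taken from the package source, and the covariates are drawn at random purely for illustration.

```r
set.seed(2)
lambda <- 0.01; rho <- 3; t0 <- 20     # assumed Weibull baseline, minimum age
beta   <- c(-1.13, 2.35)               # beta_s, beta_g
depend <- 1                            # frailty variance

n_fam <- 5; fam_size <- 4
famID <- rep(seq_len(n_fam), each = fam_size)

# One frailty per family (mean 1, variance depend), shared by all members
z_fam <- rgamma(n_fam, shape = 1 / depend, rate = 1 / depend)
z     <- z_fam[famID]

# Illustrative covariates only; not generated by Mendelian transmission
xs <- rbinom(length(famID), 1, 0.5)
xg <- rbinom(length(famID), 1, 0.5)

# Inverse transform: H0(T - t0) * Z * exp(lin) = -log(U)
lin   <- beta[1] * xs + beta[2] * xg
u     <- runif(length(famID))
onset <- t0 + (-log(u) / (z * exp(lin)))^(1 / rho) / lambda

sim <- data.frame(famID, xs, xg, z, onset)
```

Because Z multiplies the hazard of every member of a family, a large Z shifts all onset ages in that family earlier, which is the source of the within-family correlation.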

The two-gene model

h(t|X) = h0(t - t0) exp(βs * xs + β1 * x1 + β2 * x2),

where xs indicates male (1) or female (0), and x1 and x2 indicate carriers (1) and non-carriers (0) of the major gene mutation and of the second gene mutation, respectively.
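The carrier indicators depend on the genetic model (dominant.m, dominant.s). A rough sketch, assuming the genotype coding used in the returned data frame (1 = AA, 2 = Aa, 3 = aa, with A the disease allele) and Hardy-Weinberg proportions for unrelated individuals; it ignores transmission within families, and the function names are hypothetical.

```r
# Hypothetical helpers (not part of FamEvent).
# Genotype codes: 1 = AA, 2 = Aa, 3 = aa, where A has population frequency q.
sim_genotype <- function(n, q) {
  # Hardy-Weinberg proportions: an assumption for unrelated individuals
  sample(1:3, n, replace = TRUE,
         prob = c(q^2, 2 * q * (1 - q), (1 - q)^2))
}

# Dominant model: one copy of A suffices; recessive model: two copies needed
carrier <- function(genotype, dominant = TRUE) {
  if (dominant) as.integer(genotype %in% c(1, 2)) else as.integer(genotype == 1)
}

set.seed(3)
g1 <- sim_genotype(1000, q = 0.02)   # major gene
g2 <- sim_genotype(1000, q = 0.2)    # second gene
x1 <- carrier(g1, dominant = TRUE)
x2 <- carrier(g2, dominant = TRUE)
```

Under the dominant model, genotypes AA and Aa both give a carrier indicator of 1, which agrees with the example output below, where majorgene = 2 corresponds to mgene = 1 and majorgene = 3 to mgene = 0.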

The current ages for each generation are simulated from normal distributions. The probands' ages, however, are generated from a left-truncated normal distribution, since they cannot be less than the minimum age of onset. The average age difference between successive generations is set to 20 years.
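One common way to draw from a left-truncated normal distribution is the inverse-CDF method, sketched below in base R. This is not necessarily how the package samples proband ages; the function name r_left_trunc_norm is hypothetical.

```r
# Hypothetical helper: draw n values from Normal(mean, sd) truncated to
# [lower, Inf) using the inverse-CDF method.
r_left_trunc_norm <- function(n, mean, sd, lower) {
  p_lower <- pnorm(lower, mean, sd)        # probability mass below the bound
  u <- runif(n, min = p_lower, max = 1)    # uniform on the retained mass
  qnorm(u, mean, sd)
}

set.seed(4)
# Defaults from the Usage section: probandage = c(45, 2), agemin = 20
ages_default <- r_left_trunc_norm(1000, mean = 45, sd = 2, lower = 20)

# A case where the truncation actually matters
ages_young <- r_left_trunc_norm(1000, mean = 25, sd = 10, lower = 20)
```

With the default probandage = c(45, 2) and agemin = 20, the bound lies more than 12 standard deviations below the mean, so truncation has essentially no effect; it matters when the mean proband age is close to agemin.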

Note that simulating family data under the clinic-based designs ("cli" or "cli+") or the two-stage design can be slower, since the ascertainment criteria for high-risk families are harder to meet. In particular, the "cli" design can be slower than the "cli+" design: under "cli", the proband's mutation status is sampled at random from the diseased population, so family members are less likely to be mutation carriers and less likely to be affected, whereas under "cli+" all probands are mutation carriers, so their family members are more likely to be carriers and to be affected. The "cli" design therefore requires more iterations to sample high-risk families than the "cli+" design. Simulations under any design with variation = "frailty" can also be slower, because families must be generated with the familial correlation induced by the chosen frailty distribution.

Value

Returns an object of class 'simfam', a data frame which contains:

famID

Family identification (ID) numbers.

indID

Individual ID numbers.

gender

Gender indicators: 1 for males, 0 for females.

motherID

Mother ID numbers.

fatherID

Father ID numbers.

proband

Proband indicators: 1 if the individual is the proband, 0 otherwise.

generation

Individuals' generation: 1 = parents of probands, 2 = probands and siblings, 3 = children of probands and siblings.

majorgene

Genotypes of major gene: 1 = AA, 2 = Aa, 3 = aa, where A is the disease allele.

secondgene

Genotypes of second gene: 1 = BB, 2 = Bb, 3 = bb, where B is the disease allele.

ageonset

Ages at disease onset in years.

currentage

Current ages in years.

time

Ages at disease onset for the affected or ages of last follow-up for the unaffected.

status

Disease statuses: 1 for affected, 0 for unaffected (censored).

mgene

Major gene mutation indicators: 1 for mutated gene carriers, 0 for mutated gene noncarriers, or NA if missing.

relation

Family members' relationship with the proband:

1 Proband (self)
2 Brother or sister
3 Son or daughter
4 Parent
5 Nephew or niece
6 Spouse
7 Brother or sister in law

fsize

Family size including parents, siblings and children of the proband and the siblings.

naff

Number of affected members in family.

weight

Sampling weights.
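The relationship between ageonset, currentage, time and status can be illustrated with the first three rows of the example output shown under Examples. The rule below is inferred from those printed values and the column descriptions; the package may apply further conditions (for example involving agemax) that are not shown here.

```r
# Values copied from the first three rows of the Example 1 output
ageonset   <- c(103.76925, 64.88982, 45.84891)
currentage <- c(69.19250, 67.31119, 47.57119)

# Observed time: onset age if onset occurred by the current age,
# otherwise the current age (age at last follow-up)
time <- pmin(ageonset, currentage)

# Status: 1 = affected (onset observed), 0 = unaffected (censored)
status <- as.integer(ageonset <= currentage)

obs <- data.frame(ageonset, currentage, time, status)
```

The resulting time and status columns match the printed output: the first individual is censored at the current age, while the second and third have observed onset ages.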

Author(s)

Yun-Hee Choi, Wenqing He

References

Choi, Y.-H., Briollais, L., He, W. and Kopciuk, K. (2021) FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs, Journal of Statistical Software 97 (7), 1-30. doi:10.18637/jss.v097.i07

Choi, Y.-H., Kopciuk, K. and Briollais, L. (2008) Estimating Disease Risk Associated Mutated Genes in Family-Based Designs, Human Heredity 66, 238-251.

Choi, Y.-H. and Briollais, L. (2011) An EM Composite Likelihood Approach for Multistage Sampling of Family Data with Missing Genetic Covariates, Statistica Sinica 21, 231-253.

See Also

summary.simfam, plot.simfam, penplot

Examples


## Example 1: simulate family data from population-based design using
#  a Weibull distribution for the baseline hazard and inducing 
#  residual familial correlation through a shared gamma frailty.

set.seed(4321)
fam <- simfam(N.fam = 10, design = "pop+", variation = "frailty", 
       base.dist = "Weibull", frailty.dist = "gamma", depend=1, 
       allelefreq = 0.02, base.parms = c(0.01, 3), vbeta = c(-1.13, 2.35))

head(fam) 

## Not run: 
  famID indID gender motherID fatherID proband generation majorgene secondgene
1     1     1      1        0        0       0          1         2          0
2     1     2      0        0        0       0          1         2          0
3     1     3      0        2        1       1          2         2          0
4     1     4      1        0        0       0          0         3          0
5     1     9      0        3        4       0          3         2          0
6     1    10      1        3        4       0          3         3          0
   ageonset currentage     time status mgene relation fsize naff weight
1 103.76925   69.19250 69.19250      0     1        4    18    2      1
2  64.88982   67.31119 64.88982      1     1        4    18    2      1
3  45.84891   47.57119 45.84891      1     1        1    18    2      1
4 269.71990   47.37403 47.37403      0     0        6    18    2      1
5  69.78355   27.80081 27.80081      0     1        3    18    2      1
6 192.09392   25.34148 25.34148      0     0        3    18    2      1

## End(Not run)

summary(fam)

plot(fam, famid = c(1:2)) # pedigree plots for families with IDs = 1 and 2

## Example 2: simulate family data from two stage design to include 
#  30% of high risk families in the sample. 

set.seed(4321)
fam <- simfam(N.fam = 50, design = "twostage", variation = "none", base.dist = "Weibull", 
       base.parms = c(0.01, 3), vbeta = c(-1.13, 2.35), hr = 0.3, allelefreq = 0.02)

summary(fam)


FamEvent documentation built on Nov. 17, 2022, 5:06 p.m.