EM algorithm for estimating the penetrance model with missing genotypes

Share:

Description

This function fits family data with missing genotypes via the EM algorithm and provides model parameter estimates and corresponding gender- and genotype-specific penetrance estimates.

Usage

1
2
penmodelEM(parms, vbeta, data, design="pop", base.dist="Weibull", 
           method="data", mode="dominant", q=0.02)

Arguments

parms

Vector of initial values for baseline parameters. parms=c(lambda, rho), where lambda and rho are the initial values for the scale and shape parameters, respectively; lambda > 0, rho > 0.

vbeta

Vector of initial values for the regression coefficients for gender and majorgene, vbeta=c(beta.s, beta.g).

data

Family data structure should follow the format of the data generated from simfam. data should contain specific variables: famID, indID, generation, gender, currentage, mgene, time, status and weight; attr(data,"agemin") should be specified.

design

The study design of the family data. Possible choices are: "pop", "pop+", "cli", "cli+" or "twostage", where "pop" is for the population-based design with affected probands whose mutation status can be either carrier or non-carrier, "pop+" is similar to "pop" but with mutation carrier probands, "cli" is for the clinic-based design that includes affected probands with at least one parent and one sib affected, "cli+" is similar to "cli" but with mutation carrier probands, and "twostage" is for the two-stage design with oversampling of high risks families. Default is "pop".

base.dist

Choice of baseline hazard distribution to fit. Possible choices are: "Weibull", "loglogistic", "Gompertz", "lognormal", or "gamma". Default is "Weibull".

method

Choice of methods for calculating the carrier probabilities for individuals with missing mutation status. Possible choices are "data" for empirical calculation of the carrier probabilities based on the observed carriers' statuses in the entire sample, specific to generation and proband's mutation status or "mendelian" for calculating carrier probabilities based on Mendelian transmission probabilies with the given allele frequency and mutation statuses observed in the family. Default is "data".

If method="mendelian", specify both mode of the inheritance and the allele frequency q.

mode

Choice of modes of inheritance for calculating carrier probabilies for individuals with missing mutation status. Possible choices are "dominant" for dominant model or "recessive" for recessive model. Default is "dominant".

q

Frequency of the disease causing allele used for calculating carrier pobabilities. The value should be between 0 and 1. If NULL, the estimated allele frequency from data will be used. Default value is 0.02.

Details

The expectation and maximization (EM) algorithm is applied for making inference about the missing genotypes. In the expectation step, for individuals with unknown carrier status, we first compute their carrier probabilities given their family's observed phenotype and genotype information based on current estimates of parameters θ

w_{fi} = P(X_{fi}=1|Y_{fi}, X^o_f) ,

where X_{fi} represents the mutation carrier status and Y_{fi} represents the phenotype (t_{fi}, δ_{fi}) in terms of age at onset t_{fi} and disease status δ_{fi} for individual i in family f and X^o_f represents the observed genotypes in family f.

Then, we obtain the conditional expectation of the log-likelihood function of the complete data given the observed data as a weighted log-likelihood, which has the form

E_{θ} [\ell (θ) | Y, X^o)] = ∑_f^n ∑_i^{n_f} \ell_{fi}(θ | X_{fi}=1) w_{fi} + \ell_{fi}(θ | X_{fi}=0) (1-w_{fi}),

In the maximization step, the updated parameter estimates are obtained by maximizing the weighted log likelihood computed in the E-step.

These expectation and maximization steps iterate until convergence to obtain the maximum likelihood estimates.

See more details in Choi and Briollais (2011) or Choi et al. (2014).

Value

An object of class penmodel, a list including elements

parms.est

Parameter estimates of baseline parameters (λ, ρ) and regression coefficients for gender and mutation status (β_s, β_g) including their standard errors and also their robust standard errors.

parms.cov

Covariance matrix of parameter estimates.

parms.se

Standard errors of parameter estimates.

parms.rcov

Robust (sandwich) covariance matrix of parameter estimates.

parms.rse

Robust standard errors of parameter estimates.

pen70.est

Penetrance estimates by age 70 specific to gender and mutation-status subgroups.

pen70.se

Standard errors of penetrance estimates by age 70 specific to gender and mutation-status subgroups.

pen70.ci

95% confidence interval estimates of penetrance by age 70 specific to gender and mutation-status subgroups.

ageonset

Vector of ages of onset ranging from agemin to 80 years.

pen.maleCarr

Vector of penetrance estimates for male carriers from agemin to 80 years.

pen.femaleCarr

Vector of penetrance estimates for female carriers from agemin to 80 years.

pen.maleNoncarr

Vector of penetrance estimates for male non-carriers from agemin to 80 years.

pen.femaleNoncarr

Vector of penetrance estimates for female non-carriers from agemin to 80 years.

Author(s)

Yun-Hee Choi

References

Choi, Y.-H. and Briollais, L. (2011) An EM composite likelihood approach for multistage sampling of family data with missing genetic covariates, Statistica Sinica 21, 231-253.

Choi, Y.-H., Briollais, L., Green, J., Parfrey, P., and Kopciuk, K. (2014) Estimating successive cancer risks in Lynch Syndrome families using a progressive three-state model, Statistics in Medicine 33, 618-638.

See Also

simfam, penmodel, link{summary.penmodel}, plot.penmodel, carrierprob

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
# Family data simulated with 30% of members missing their genetic information.

fam <- simfam(N.fam=100, design="pop+", base.dist="Weibull", base.parms=c(0.01,3),
       vbeta=c(-1.13, 2.35, 0.5), agemin=20, allelefreq=0.02, mrate=0.3)
 
# EM algorithm for fitting family data with missing genotypes

fit <- penmodelEM(parms=c(0.01, 3), vbeta=c(-1.13, 2.35), data=fam, design="pop+",
       base.dist="Weibull", method="mendelian", mode="dominant", q=NULL)

# Summary of the model parameter and penetrance estimates from model fit 
# by penmodelEM 

summary(fit)

# Generate the lifetime penetrance curves from model fit for gender and 
# mutation status groups along with their non-parametric penetrance curves 
# based on observed data
 
plot(fit)