Generalized additive model estimation of gene-environment interaction using data from case-parent trios

Share:

Description

trioGxE estimates statistical interaction (GxE) between a single nucleotide polymorphism (SNP) and a continuous environmental or non-genetic attributes in case-parent trio data by fitting a generalized additive model (GAM) using a penalized iteratively re-weighted least squares algorithm.

Usage

1
2
3
4
5
trioGxE(data, pgenos, cgeno, cenv, 
        penmod = c("codominant","dominant","additive","recessive"), 
        k = NULL, knots = NULL, sp = NULL, lsp0 = NULL, lsp.grid = NULL, 
        control = list(maxit = 100, tol = 1e-07, trace = FALSE), 
        testGxE = FALSE, return.data = TRUE, ...)

Arguments

data

a data frame with columns for parental genotypes, child genotypes and child environmental/non-genetic attribute. See ‘Details’ section for the required format.

pgenos

a length-2 vector of character strings specifying the names of the two columns in data that hold parental genotypes.

cgeno

a character string specifying the name of the column in data that holds the child genotypes.

cenv

a character string specifying the name of the column in data that holds the non-genetic attribute being examined for interaction with genotype.

penmod

the penetrance mode of the genetic and interaction effects: "codominant" (default), "dominant", "additive", or "recessive".

k

an optional vector or single value specifying the desired number(s) of knots to be used for the cubic spline basis construction of smooth function(s) representing GxE. When penmod="codominant", a length-2 vector with positive integers must be provided to specify the numbers of knots (or basis dimensions) for the two interaction functions. Otherwise, a single positive integer must be provided. The minimum value for each integer is 3. The default basis dimension is either k=c(5,5) or k=5. See ‘Details’ section for more information.

knots

knot positions for the cubic spline basis construction. When penmod="codominant", a list of two vectors must be provided. For the other penetrance modes, a single vector must be provided. When NULL (default), knots will be placed at equally-spaced quantiles of the distribution of E within trios from appropriate parental mating types. If both knots and k are provided, the argument k is ignored. See ‘Details’ section for more information.

sp

smoothing parameters for the interaction functions. When penmod="codominant", a vector with two non-negative numbers must be provided. Otherwise, a single non-negative number must be provided. When NULL (default), a double (under the co-dominant mode) or a single (under a non-co-dominant mode) 1-dimensional grid search finds the optimal smoothing parameter values.

lsp0

an optional length-2 numeric vector or a single numeric value used for choosing trial values of log smoothing parameters in the grid search for the optimal smoothing parameters. When NULL (default), trioGxE takes the log of smoothing parameter estimates obtained by applying a likelihood approach that makes inference of GxE conditional on the parental genotypes, non-genetic attribute and partial information on child genotypes.

lsp.grid

trial values of log smoothing parameters used in the grid search for smoothing parameters. When penmod= "codominant", a list of two vectors of lengths ≥q 2 must be provided. As the vector is longer, the grid becomes more refined. When the penetrance mode is not co-dominant, a single vector must be provided. When lsp.grid=NULL (default), the function take the vectors of length 6 obtained by using the truncated normal distributions constructed based on lsp0.

control

a list of convergence parameters for the penalized iteratively re-weighted least squares (PIRLS) procedure:

maxit: positive integer giving the maximal number of PIRLS iterations
tol: positive convergence tolerance in terms the relative difference in penalized
deviances (pdev) between iterations: {|\code{pdev} - \code{pdev}_{old}|}/{(|\code{pdev}| + 0.1)} < \code{tol}
trace: logical indicating if output should be produced for each PIRLS iteration.
testGxE

a logical specifying whether the fitting is for testing interaction. Default is FALSE. User should not modify this argument.

return.data

a logical specifying whether the original data should be returned. If TRUE (default), the data is returned.

...

sets the arguments for control, which includes maxit, tol or trace.

Details

trioGxE fits data from case-parent trios to a GAM with smooth functions representing gene-environment interaction (G \times E).

The data input must be a data frame containing the following 4 columns (of any order):

[,1] number (0, 1 or 2) of copies of a putative risk allele carried by the mother
[,2] number of copies of a putative risk allele carried by the father
[,3] number of copies of a putative risk allele carried by the affected child (G)
[,4] value of a continuous environmental/non-genetic attribute measured on the child (E)

The function trioGxE does basic error checking to ensure that only the trios that are consistent with Mendelian segregation law with complete genotype, environment and parental genotype information. The function determines which trios are from informative parental mating types. An informative parental pair has at least one heterozygote; such parental pair can have offspring that are genetically different. Under the assumption that the parents transmit the alleles to their child under Mendel's law, with no mutation, there are three types of informative mating types G_p=1,2,3:

G_p=1: if one parent is heterozygous, and the other parent is homozygous for the non-risk allele
G_p=2: if one parent is heterozygous, and the other parent is homozygous for the risk allele
G_p=3: if both parents are heterozygous

Since GxE occurs when genotype relative risks vary with non-genetic attribute values E=e, GxE inference is based on the attribute-specific genotype relative risks, GRR_h(e), expressed as

{\rm GRR}_h(e) = \frac{P(D=1|G=h,E=e)}{P(D=1|G=h-1,E=e)} = \exp{(γ_h + f_h(e))}, ~~h=1,2,

where D=1 indicates the affected status, γ_1 and γ_2 represent genetic main effect, and f_1(e) and f_2(e) are unknown smooth functions. The functions f_h(e) represent GxE since GRRs vary with E=e only when f_1(e)\neq 0 or f_2(e)\neq 0 vary with E. The expressions are followed by assuming a log-linear model of disease risk in Shin et al. (2010).

Under the co-dominant penetrance mode (i.e., penetrance="codominant"), GRR_1(e) and GRR_2(e) are estimated using the information in the trios from the informative mating types G_p=1, 3 and those from G_p=2, 3, respectively. Under a non-co-dominant penetrance mode, only one GRR function, GRR(e)=γ+f(e), is estimated from an appropriate set of informative trios. Under the dominant penetrance mode (i.e., penetrance="dominant"), because GRR_2(e) is 1 (i.e., γ_2=0 and f_2(e)=0), GRR(e)\equiv GRR_1(e) is estimated based on the trios from G_p=1 and 3. Under the recessive penetrance mode (i.e., penetrance="recessive"), because GRR_1(e) is 1 (i.e., γ_1=0 and f_1(e)=0), GRR(e)\equiv GRR_2(e) is estimated based on the trios from G_p=2 and 3. Under the multiplicative or log-additive penetrance model (penetrance="additive"), since GRR_1(e) = GRR_2(e) (i.e., γ_1=γ_2 and f_1=f_2), GRR(e)\equiv GRR_1(e) \equiv GRR_2(e) is estimated based on all informative trios.

The interaction functions are approximated by cubic regression spline functions defined by the knots specified through the arguments k and knots. For each interaction function, at least three knots are chosen within the range of the observed non-genetic attribute values. Under the co-dominant mode, k[1] and k[2] knots, respectively, located at knots[[1]] and knots[[2]] are used to construct the basis for f_1(e) and f_2(e), respectively. By default, a total of 5 knots are placed for each interaction function: three interior knots at the 25th, 50th and 75th quantiles and two boundary knots at the endpoints of the data in trios from G_p=1 or 3, for f_1(e), and in trios from G_p=2 or 3, for f_2(e). Similarly, under a non-co-dominant penetrance mode, when knots=NULL, k knots are chosen based on the data in trios from G_p=1 and 3 (dominant mode); in trios from all informative mating types (log-additive mode); and in trios from G_p=2 or 3 (recessive mode). A standard model identifiability constraint is imposed on each interaction function, which involves the sum of the interaction function over all observed attribute values of cases in the appropriate set of informative trios.

For smoothing parameter estimation, trioGxE finds the optimal values using either a double (if co-dominant) or a single 1-dimensional grid search constructed based on the arguments lsp0 and lsp.grid. When lsp0 = NULL, trioGxE takes the log smoothing parameter estimates obtained from a likelihood approach that makes inference of GxE conditional on parental mating types, non-genetic attribute and partial information on child genotypes (Duke, 2007). When lsp.grid = NULL, trioGxE takes the following 6 numbers to be the trial values of the log-smoothing parameter for each interaction function: -20 and 20, lsp0 and the quartiles of the truncated normal distributions constructed based on lsp0. At each trial value in lsp.grid, the prediction error criterion function, UBRE (un-biased risk estimator, is minimized to find the optimal smoothing parameter. For more details on how to estimate the smoothing parameters, see Appendix B.3 in Shin (2012).

For variance estimation, trioGxE uses a Bayesian approach (Wood, 2006); the resulting Bayesian credible intervals have been shown to have good frequentist coverage probabilities as long as the bias is a relatively small fraction of the mean squared error (Marra and Wood, 2012)

Value

coefficients

a vector holding the spline coefficient estimates for \hat{f}_{1} and/or \hat{f}_{2}. The length of the vector is equal to the total number of knots used for constructing the bases of the interaction curves. For example, under the default basis dimension with co-dominant penetrance mode, the vector has size 10 (i.e., 5 for f_1 and the other 5 for f_2).

control

a list of convergence parameters used for the penalized iteratively re-weighted least squares (PIRLS) procedure

data

original data passed to trioGxE() as an argument: returned when return.data=TRUE

edf

a vector of effective degrees of freedom for the model parameters (see page 171 in Wood, 2006 for details).

Gp

a vector containing the values of parental mating types G_p (see ‘Details’ for the definition)

lsp0

log smoothing parameter(s) used in the grid search. Not returned (i.e., NULL) when smoothing parameters were not estimated but provided by the user.

lsp.grid

trial values of the log smoothing parameter(s) used in the grid search. Not returned when smoothing parameters were not estimated but provided by the user.

penmod

the penetrance mode under which the data were fitted.

qrc

a list containing the QR-decomposition results used for imposing the identifiability constraints. See qr for the list of values.

smooth

a list with components:

model.mat: The design matrix of the problem. The number of rows is equal to
n_1+n_2+2n_3, where n_m is the number of mth informative
mating types. The number of columns is equal to the size of coefficient.
pen.mat: penalty matrix with size equal to the size of coefficient.
bs.dim: number of knots used for basis construction.
knots: knot positions used for basis construction.
sp

optimal smoothing parameter values calculated from UBRE optimization or smoothing parameter values provided by the user.

sp.user

logical whether or not the smoothing parmeter values were provided by the user. If FALSE, sp contains the smoothing parameter values estimated by the UBRE optimization.

terms

list of character strings of column names in data corresponding to the child genotypes, parental genotypes and child's non-genetic attributes.

triodata

Formatted data passed into the internal fitting functions. To be used in test.trioGxE function.

ubre

the minimum value of the un-biased risk estimator (UBRE), measure of predictability for the interaction function estimators \hat{f}_{1} or \hat{f}_{2}. Not returned when smoothing parameters were not estimated but provided by the user.

ubre.val

a list or a vector of ubre values corresponding to the trial values of smoothing parameter(s) in lsp.grid.

Vp

Bayesian posterior variance-covariance matrix for the coefficients. The size the matrix is the same as that of coefficient.

Author(s)

Ji-Hyung Shin <shin@sfu.ca>, Brad McNeney <mcneney@sfu.ca>, Jinko Graham <jgraham@sfu.ca>

References

Duke, L. (2007): A graphical tool for exploring SNP-by-environment interaction in case-parent trios, M.Sc. Thesis, Statistics and Actuarial Science, Simon Fraser University:
URL http://www.stat.sfu.ca/content/dam/sfu/stat/alumnitheses/Duke-2007.pdf.

Marra, G., Wood, S.N. (2012). Coverage properties of confidence intervals for generalized additive model components. Scand J Stat, 39: 53-74.

Shin, J.-H., McNeney, B. and Graham, J. (2010). On the use of allelic transmission rates for assessing gene-by-environment interaction in case-parent trios. Ann Hum Gen, 74: 439-51.

Shin, J.-H. (2012): Inferring gene-environment interaction from case-parent trio data: evaluation of and adjustment for spurious G\times E and development of a data-smoothing method to uncover true G\times E, Ph.D. Thesis, Statistics and Actuarial Science, Simon Fraser University: URL https://theses.lib.sfu.ca/sites/all/files/public_copies/etd7214-j-shin-etd7214jshin.pdf.

Wood, S. (2006): Generalized Additive Models: An Introduction with R, Boca Raton, FL: Chapman & Hall/CRC.

See Also

trioSim, plot.trioGxE, test.trioGxE

Examples

1
2
3
4
5
6
7
8
## fitting a co-dominant model
data(hypoTrioDat)
simfit <- trioGxE(data=hypoTrioDat,pgenos=c("parent1","parent2"),cgeno="child",cenv="attr",
                  k=c(5,5),knots=NULL,sp=NULL)

## fitting a dominant model to the hypothetical data
simfit.dom <- trioGxE(data=hypoTrioDat,pgenos=c("parent1","parent2"),cgeno="child",cenv="attr",
                      penmod="dom",k=5,knots=NULL,sp=NULL)