Logistic regression analysis for a single SNP
Description
Performs logistic regression including a particular SNP (G) and a set of covariates (X) that could include environmental covariates or/and other genetic variables. Included are three analysis options: (i) Unconstrained maximumlikelihood: This method is equivalent to prospective logistic regression analysis and corresponds to maximumlikelihood analysis of casecontrol data allowing the joint distribution of all the factors (the SNP of interest and all other covariates) of the model to be completely unrestricted (nonparametric) (ii) Constrained maximumlikelihood: This method performs maximumlikelihood analysis of casecontrol data under the assumption of HWE and indepenence between the SNP and other factors of the model.The analysis allows the assumptions of HWE and independence to be valid only conditional on certain stratification variables (S), such as self reported ethnicity or principal compoenets of population stratification. (iii) EmpiricalBayes: This method uses an empiricalBayes type "shrinkage estimation" technique to tradeoff bias and variance between the constrained and unconstrained maximumlikelihood estimators.
Usage
1 2 
Arguments
data 
Data frame containing all the data. No default. 
response.var 
Name of the binary response variable coded as 0 (controls) and 1 (cases). No default. 
snp.var 
Name of the SNP variable, which must be coded 012 (or 01). The SNP will be included as a main effect in the model. No default. 
main.vars 
Character vector of variable names or a formula for all covariates of interest which need to be included in the model as main effects. The default is NULL, so that only the SNP variable will be included as a main effect in the model. 
int.vars 
Character vector of variable names or a formula for all covariates of interest that will interact with the SNP variable. The default is NULL, so that no interactions will be in the model. 
strata.var 
Name of the stratification variable or a formula (see details for more info).
If 
op 
A list with names 
Details
Note: Nondummy continuous variables should be scaled for stability of the algorithm.
The scale
function can be used for this.
The data is first fit using standard logistic regression. The estimated
parameters from the standard logistic regression are then used as the initial
estimates for the constrained model. For this,
the optim()
function is used to compute the maximum likelihood estimates and
the estimated covariance matrix. The empirical Bayes estimates are then computed by
combining both sets of estimated parameters (see below). The "strata" option, that is relevent for the
CML and EB method, allows
the assumption of HWE and GX independence to be valid only conditional on a given set of other factors.
If a single categorical variable name is provided, then the unique levels of the variable will be used to define categorical strata.
Otherwise it is assumed that strata.var
defines a parametric model for variation of allele
frequency of the SNP as a function of the variables included. No assumption
is made about the relationship between X and S. Typically, S would include self reported ethnicity,
study, center/geographic region and principal components of population stratification. The CML method with the "strata"
defined by principal compoenents of population stratification can be viewed as a generalization of adjusted caseonly method
described in Bhattacharjee et al. (2010). More details of the individual methods follow.
Definition of the likelihood under the geneenvironment independence assumption:
Let D = 0, 1 be the casecontrol status, G = 0, 1, 2 denote the SNP genotype, S denote the stratification variable(s) and X denote the set of all other factors to be included in the regression model. Suppose the risk of the disease (D), given G, X and S can be described by a logistic regression model of the form
log(Pr(D=1)/Pr(D=0)) = alpha + Z*beta
where Z is the entire design matrix (including G, X, possibly S and their interaction with X) and beta is the vector of associated regression coefficients. The CML method assumes Pr(GX,S)=Pr(GS), i.e., G and X are conditionally independent given S. The current implementation of the CML method also assume the SNP genotype frequency follows HWE given S=s, although this is not necessary in general. Thus, if f_s denotes the allele frequency given S=s, then
P(G = 0) = (1  f_s)^2,
P(G = 1) = 2*f_s*(1  f_s),
P(G = 2) = (f_s)^2.
If xi_s = log(f_S/(1  f_s)), then
log(P(G=1)/P(G=0)) = log(2) + xi_s
and
log(P(G=2)/P(G=0)) = 2*xi_s.
Chatterjee and Carroll (2005) showed that under the above constraints, the maximumlikelihood estimate for the beta coefficients under casecontrol design can be obtained based on a simple conditional likelihood of the form
P`(D=d, G=g  Z, S) = exp(theta_s(d,gZ))/SUM[exp(theta_s(d,gZ))]
where the sum is taken over the 6 combinations of d and g and theta_s(d,g) = d*alpha` + d*Z*beta + I(g=1)*log(2) + g*xi_s. If S is a single categorical variable, then a separate xi_s is allowed for each S=s. Otherwise it is assumed xi_s=V_s*gamma, where V_s is the design matrix associated with the stratification and gamma is the vector of stratification parameters. If for example, S is specified as "strata=~PC1+PC2+...PCK" where PCk's denote principal components of population stratification, then it is assumed that the allele frequency of the SNP varies in directions of the different principal components in a logistic linear fashion.
Definition of the empirical bayes estimates:
Let beta_UML be the parameter estimates from standard logistic regression, and
let eta = (beta_CML, xi_CML)
be the estimates under the geneenvironment independence assumption.
Let psi = beta_UML  beta_CML, and
phi^2 be the vector of variances of beta_UML.
Define diagonal matrices of weights to be
W1 = diag(psi^2/(psi^2 + phi^2)) and
W2 = diag(phi^2/(psi^2 + phi^2)),
where
psi^2 is the elementwise product of the vector
psi.
Now, the empirical bayes parameter estimates are
beta_EB = W1*beta_UML + W2*beta_CML.
For the estimated covariance matrix, define the diagonal matrix
A = diag(phi^2*(phi^2  psi^2)/(phi^2 + psi^2)^2),
where again the exponentiation is the elementwise product of the vectors. If I is the pxp identity matrix and we define the px2p matrix C = (A, I  A), then the estimated covariance matrix is
VAR(beta_EB) = C*COV(beta_UML, beta_CML)*C'.
The covariance term COV(beta_UML, beta_CML) is obtained using
an influence function method (see Chen YH, Chatterjee N, and Carroll R. for details about
the above formulation of the empiricalBayes method).
Options list:
Below are the names for the options list op
. All names have default values
if they are not specified.

genetic.model
03: The genetic model for the SNP. 0=additive, 1=dominant, 2=recessive, 3=general (codominant). 
reltol
Stopping tolerance. The default is 1e8. 
maxiter
Maximum number of iterations. The default is 100. 
optimizer
One of "BFGS", "CG", "LBFGSB", "NelderMead", "SANN". The default is "BFGS".
Value
A list containing sublists with names UML (unconstrained maximum likelihood), CML (constrained maximum likelihood), and EB (empirical Bayes). Each sublist contains the parameter estimates (parms) and covariance matrix (cov). The lists UML and CML also contain the loglikelihood (loglike). The list CML also contains the results for the stratum specific allele frequencies under the HWE assumption (strata.parms and strata.cov). The EB sublist contains the joint UMLCML covariance matrix.
References
Mukherjee B, Chatterjee N. Exploiting geneenvironment independence in analysis of casecontrol studies: An empirical Bayes approach to tradeoff between bias and efficiency. Biometrics 2008, 64(3):68594.
Mukherjee B et al. Tests for geneenvironment interaction from casecontrol data: a novel study of type I error, power and designs. Genetic Epidemiology, 2008, 32:61526.
Chatterjee, N. and Carroll, R. Semiparametric maximum likelihood estimation exploting geneenvironment independence in casecontrol studies. Biometrika, 2005, 92, 2, pp.399418.
Chen YH, Chatterjee N, Carroll R. Shrinkage estimators for robust and efficient inference in haplotypebased casecontrol studies. Journal of the American Statistical Association, 2009, 104: 220233.
Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N
Using Principal Components of Genetic Variation for Robust and Powerful Detection of GeneGene Interactions in CaseControl and CaseOnly studies.
American Journal of Human Genetics, 2010, 86(3):331342.
See Also
snp.score
, snp.matched
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  # Use the ovarian cancer data
data(Xdata, package="CGEN")
# Fit using a stratification (categorical) variable
ret < snp.logistic(Xdata, "case.control", "BRCA.status",
main.vars=c("oral.years", "n.children"),
int.vars=c("oral.years", "n.children"),
strata.var=~factor(ethnic.group))
# Compute a summary table for the models
getSummary(ret)
# Compute a Wald test for the main effect of the SNP and interaction
getWaldTest(ret, c("BRCA.status", "BRCA.status:oral.years", "BRCA.status:n.children"))
# Fit the same model as above using formulas
ret2 < snp.logistic(Xdata, "case.control", "BRCA.status",
main.vars=~oral.years + n.children,
int.vars=~oral.years + n.children,
strata.var=~factor(ethnic.group))
