fitNullModel: Fit a Model Under the Null Hypothesis
Package: GENESIS (GENetic EStimation and Inference in Structured samples). Statistical methods for analyzing genetic data from samples with population structure and/or relatedness.

fitNullModel fits a regression model or a mixed model with random effects specified by their covariance structures; this allows for the inclusion of a polygenic random effect using a kinship matrix or genetic relationship matrix (GRM). The output of fitNullModel can be used to estimate genetic heritability and can be passed to assocTestSingle or assocTestAggregate for the purpose of genetic association testing.

nullModelInvNorm applies an inverse normal transform to the residuals of a previously fit null model and refits the model.

nullModelSmall returns a small version of the null model with no NxN matrices.

isNullModelSmall returns TRUE if a null model is small; FALSE otherwise.

## S4 method for signature 'data.frame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", start = NULL,
            AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0, drop.zeros = TRUE,
            return.small = FALSE, verbose = TRUE)
## S4 method for signature 'AnnotatedDataFrame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, sample.id = NULL, ...)
## S4 method for signature 'SeqVarData'
fitNullModel(x, ...)
## S4 method for signature 'ScanAnnotationDataFrame'
fitNullModel(x, ...)
## S4 method for signature 'GenotypeData'
fitNullModel(x, ...)

nullModelInvNorm(null.model, cov.mat = NULL, norm.option = c("by.group", "all"),
                 rescale = c("none", "model", "residSD"),
                 AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0, verbose = TRUE)

nullModelSmall(null.model)

isNullModelSmall(null.model)

`x`	An object of class `data.frame`, `AnnotatedDataFrame`, `SeqVarData`, `ScanAnnotationDataFrame`, or `GenotypeData` containing the outcome and covariate data for the samples to be used for the analysis.
`outcome`	A character string specifying the name of the outcome variable in `x`.
`covars`	A vector of character strings specifying the names of the fixed effect covariates in `x`; an intercept term is automatically included. If `NULL` (default) the only fixed effect covariate is the intercept term.
`cov.mat`	A matrix or list of matrices specifying the covariance structures of the random effects terms. Objects from the Matrix package are supported. See 'Details' for more information.
`group.var`	A character string specifying the name of a categorical variable in `x` that is used to fit heterogeneous residual error variances. This argument can only be used when `family = "gaussian"`. If `NULL` (default), then a standard LMM with constant residual variance for all samples is fit. See 'Details' for more information.
`sample.id`	A vector of IDs for samples to include in the analysis. If `NULL`, all samples in `x` are included. This argument is ignored if `x` is a data.frame; see 'Details'.
`family`	A description of the error distribution to be used in the model. The default `"gaussian"` fits a linear model; see `family` for further options, and see 'Details' for more information.
`start`	A vector of starting values for the variance component estimation procedure. The function will pick reasonable starting values when left `NULL` (default). See 'Details' for more information.
`AIREML.tol`	The convergence threshold for the Average Information REML (AIREML) procedure used to estimate the variance components of the random effects. See 'Details' for more information.
`max.iter`	The maximum number of iterations allowed to reach convergence.
`EM.iter`	The number of EM iterations to run prior to AIREML; default is 0.
`drop.zeros`	Logical indicator of whether variance component terms that converge to 0 should be removed from the model; the default is TRUE. See 'Details' for more information.
`return.small`	Logical for whether to return a small version of the null model without NxN matrices. Default is FALSE; only set to TRUE for use in association tests with `test = "BinomiRare"` or `test = "CMP"` and `recalc.pval.thresh = 1`.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.
`...`	Arguments to pass to other methods.
`null.model`	The output of `fitNullModel`.
`norm.option`	Whether the normalization should be done separately within each value of `group.var` (`"by.group"`) or with all samples together (`"all"`).
`rescale`	Controls whether to rescale the variance after inverse-normal transform, restoring it to the original variance before the transform. `"none"` for no rescaling of the residuals; `"model"` for model-based rescaling, and `"residSD"` to rescale to the standard deviation of the marginal residuals. See 'Details' for more information.

If x is a data.frame, the rownames of x must match the row and column names of cov.mat (if cov.mat is specified). If x is an AnnotatedDataFrame or other object containing an AnnotatedDataFrame, x will be re-ordered (if necessary) so that sample.id or scanID is in the same order as the row and column names of cov.mat.

The code checks for multicollinearity of covariates by checking that the rank of the design matrix is equal to the number of columns; if the rank is smaller, it fails with an error.
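
The rank check described above can be illustrated in base R. This is a minimal sketch on made-up covariates (x1, x2, x3 are hypothetical names), not the package's internal code; it only shows what a rank-deficient design matrix looks like.

```r
# Toy covariates: x3 is an exact linear combination of x1 and x2,
# so the design matrix is rank deficient.
set.seed(1)
dat <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
dat$x3 <- dat$x1 + 2 * dat$x2

# Design matrix with an intercept column, as in a fixed-effects model
W <- model.matrix(~ x1 + x2 + x3, data = dat)

# Rank from a QR decomposition, compared with the number of columns
design_rank <- qr(W)$rank
is_full_rank <- design_rank == ncol(W)
```

Here is_full_rank is FALSE, which is the situation in which the function would be expected to stop with an error; dropping one of the collinear covariates restores full rank.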

cov.mat is used to specify the covariance structures of the random effects terms in the model. For example, to include a polygenic random effect, one matrix in cov.mat could be a kinship matrix or a genetic relationship matrix (GRM). As another example, to include household membership as a random effect, one matrix in cov.mat should be a 0/1 matrix with a 1 in the [i,j] and [j,i] entries if individuals i and j are in the same household and 0 otherwise; the diagonals of such a matrix should all be 1.
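
A household matrix of the form described above could be built in base R as follows. This is a sketch under assumed inputs: the sample IDs and household labels are invented for illustration.

```r
# Hypothetical sample IDs and household labels
sample.id <- c("s1", "s2", "s3", "s4", "s5")
household <- c("A", "A", "B", "C", "B")

# Entry [i, j] is 1 when samples i and j share a household, else 0;
# the diagonal is 1 because each sample shares a household with itself.
hh.mat <- outer(household, household, FUN = "==") * 1

# Row and column names must identify the samples
dimnames(hh.mat) <- list(sample.id, sample.id)
```

Such a matrix could then be supplied together with a GRM as a list, for example cov.mat = list(grm, hh.mat), where grm stands for a hypothetical GRM with the same row and column names.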

When family is not gaussian, the penalized quasi-likelihood (PQL) approximation to the generalized linear mixed model (GLMM) is fit following the procedure of GMMAT (Chen et al., 2016).
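
For intuition, the linearization behind PQL can be sketched for a binary outcome with a logit link. The code below shows the textbook working-outcome formula (Breslow and Clayton, 1993) on invented values; it is an assumption that this matches the package's internal computation step for step.

```r
# Hypothetical binary outcomes and current linear predictor values
y   <- c(0, 1, 1, 0, 1)
eta <- c(-1.0, 0.5, 2.0, -0.3, 0.0)

# Inverse logit gives the fitted means
mu <- plogis(eta)

# Working outcome: eta + (y - mu) * d(eta)/d(mu), where
# d(eta)/d(mu) = 1 / (mu * (1 - mu)) for the logit link
working_y <- eta + (y - mu) / (mu * (1 - mu))

# Working weights used when the linear mixed model is refit
working_w <- mu * (1 - mu)
```

In an iterative fit, the working outcome and weights would be recomputed from the updated linear predictor until the estimates stabilize.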

For some outcomes, there may be evidence that different groups of observations have different residual variances, and the standard LMM assumption of homoscedasticity is violated. When group.var is specified, separate (heterogeneous) residual variance components are fit for each unique value of group.var.
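
One informal way to look for such evidence is to compare residual variances by group after an ordinary regression. The sketch below uses simulated data with hypothetical column names (study, age, pheno); it is a diagnostic illustration only, not part of fitNullModel.

```r
# Simulated data in which study "B" has a larger residual SD than study "A"
set.seed(4)
dat <- data.frame(
  study = rep(c("A", "B"), each = 200),
  age   = rnorm(400, mean = 50, sd = 10)
)
dat$pheno <- 0.05 * dat$age +
  rnorm(400, sd = ifelse(dat$study == "A", 1, 3))

# Ordinary linear regression ignoring the grouping
fit <- lm(pheno ~ age, data = dat)

# Residual variance within each study
resid_var_by_group <- tapply(resid(fit), dat$study, var)
```

A clear difference between the entries of resid_var_by_group, as in this simulation, suggests that passing group.var = "study" may be appropriate.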

Let m be the number of matrices in cov.mat and let g be the number of categories in the variable specified by group.var. The length of the start vector must be (m + 1) when family is gaussian and group.var is NULL; (m + g) when family is gaussian and group.var is specified; or m when family is not gaussian.
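
The length rule above can be written as a small helper. The function name expected_start_length is hypothetical and is not part of GENESIS; it simply encodes the three cases stated in the text.

```r
# Hypothetical helper: expected length of `start`
#   m      number of matrices in cov.mat
#   g      number of categories in group.var (NULL if group.var is NULL)
#   family name of the family, e.g. "gaussian" or "binomial"
expected_start_length <- function(m, family = "gaussian", g = NULL) {
  if (family != "gaussian") return(m)  # no residual variance component
  if (is.null(g)) return(m + 1)        # one shared residual variance
  m + g                                # one residual variance per group
}

expected_start_length(m = 2)                       # 3
expected_start_length(m = 2, g = 3)                # 5
expected_start_length(m = 2, family = "binomial")  # 2
```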

A Newton-Raphson iterative procedure with Average Information REML (AIREML) is used to estimate the variance components of the random effects. The algorithm declares convergence when the absolute change between the new and previous estimates is less than var(outcome)*AIREML.tol for every variance component. Sometimes a variance component may approach the boundary of the parameter space at 0; step-halving is used to prevent any component from becoming negative. However, when a variance component gets near the 0 boundary, the algorithm can get "stuck", preventing the other variance components from converging. If drop.zeros is TRUE, variance components that converge to a value less than AIREML.tol are dropped from the model and the estimation procedure continues with the remaining variance components.
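
The convergence criterion can be sketched in a few lines of base R. The estimates and component names below are invented for illustration, and the code shows only the stopping rule, not the AIREML update itself.

```r
# Hypothetical outcome and two successive variance component estimates
set.seed(2)
outcome <- rnorm(100, sd = 2)
sigma2_prev <- c(V_A = 1.20000, V_resid = 2.80000)
sigma2_new  <- c(V_A = 1.20001, V_resid = 2.79999)

# Threshold scales the tolerance by the variance of the outcome
AIREML.tol <- 1e-4
threshold <- var(outcome) * AIREML.tol

# Converged only when every component moved by less than the threshold
converged <- all(abs(sigma2_new - sigma2_prev) < threshold)
```

Scaling the tolerance by var(outcome) makes the criterion insensitive to the units in which the outcome is measured.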

After inverse-normal transformation, the variance rescaling is done with the same grouping; i.e. if norm.option == "by.group", rescaling is done within each group, and if norm.option == "all", rescaling is done with all samples.
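
The difference between the two norm.option settings can be illustrated with a rank-based inverse normal transform. The definition of rank_norm below is one common choice and is an assumption here, as are the simulated residuals and group labels; the rescaling shown is a simplified stand-in for rescaling to the residual standard deviation.

```r
# Rank-based inverse normal transform (one common definition; an assumption)
rank_norm <- function(x) qnorm((rank(x) - 0.5) / length(x))

# Simulated residuals from two groups with different spread
set.seed(3)
resid <- c(rnorm(50, sd = 1), rnorm(50, sd = 3))
group <- rep(c("g1", "g2"), each = 50)

# norm.option = "all": transform all samples together
inv_all <- rank_norm(resid)

# norm.option = "by.group": transform separately within each group
inv_by_group <- ave(resid, group, FUN = rank_norm)

# Rescale to the original residual SD, using the same grouping as the transform
rescaled_all      <- inv_all * sd(resid)
rescaled_by_group <- inv_by_group * ave(resid, group, FUN = sd)
```

With "by.group", each group is normalized and rescaled on its own, so differences in residual variance between groups are preserved after rescaling.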

An object of class 'GENESIS.nullModel' or 'GENESIS.nullMixedModel'. A list including:

`family`	A character string specifying the family used in the analysis.
`hetResid`	A logical indicator of whether heterogeneous residual variance components were used in the model (specified by `group.var`).
`varComp`	The variance component estimates. There is one variance component for each random effect specified in `cov.mat`. When `family` is gaussian, there are additional residual variance components; one residual variance component when `group.var` is `NULL`, and as many residual variance components as there are unique values of `group.var` when it is specified.
`varCompCov`	The estimated covariance matrix of the variance component estimates given by `varComp`. This can be used for hypothesis tests regarding the variance components.
`fixef`	A data.frame with effect size estimates (betas), standard errors, chi-squared test statistics, and p-values for each of the fixed effect covariates specified in `covars`.
`betaCov`	The estimated covariance matrix of the effect size estimates (betas) of the fixed effect covariates. This can be used for hypothesis tests regarding the fixed effects.
`fitted.values`	The fitted values from the model; i.e. W*beta where W is the design matrix and beta are the effect size estimates for the fixed effects.
`resid.marginal`	The marginal residuals from the model; i.e. Y - W*beta where Y is the vector of outcome values.
`resid.conditional`	The conditional residuals from the model; i.e. Y - W*beta - Z*u.
`logLik`	The log-likelihood value.
`logLikR`	The restricted log-likelihood value.
`AIC`	The Akaike Information Criterion value.
`workingY`	The "working" outcome vector. When `family` is gaussian, this is just the original outcome vector. When `family` is not gaussian, this is the PQL linearization of the outcome vector. This is used by `assocTestSingle` or `assocTestAggregate` for genetic association testing. See 'Details' for more information.
`outcome`	The original outcome vector, as a 1-column matrix with column name. When `family` is gaussian, this is equal to `workingY`.
`model.matrix`	The design matrix for the fixed effect covariates used in the model.
`group.idx`	If `group.var` is not `NULL`, a list of indices for samples in each group.
`cholSigmaInv`	The Cholesky decomposition of the inverse of the estimated outcome covariance structure. This is used by `assocTestSingle` or `assocTestAggregate` for genetic association testing.
`converged`	A logical indicator of whether the AIREML procedure for estimating the random effects variance components converged.
`zeroFLAG`	A vector of logicals the same length as `varComp` specifying whether the corresponding variance component estimate was set to 0 by the function due to convergence to the boundary in the AIREML procedure.
`RSS`	The residual sum of squares from the model fit. When `family` is gaussian, this will typically be 1 since the residual variance component is estimated separately.
`sample.id`	A vector of IDs for the samples used in the analysis.
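
As an illustration of how the variance component estimates relate to heritability, the proportion of variance attributed to each component can be computed from a vector shaped like varComp. The values and element names below are hypothetical; for real analyses, varCompCI also provides confidence intervals.

```r
# Hypothetical variance component estimates, shaped like `varComp`:
# one polygenic component and one residual component
varComp <- c(V_A = 0.4, V_resid.var = 0.6)

# Proportion of total variance explained by each component;
# the polygenic proportion is a heritability estimate
prop_var <- varComp / sum(varComp)
```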

Matthew P. Conomos, Stephanie M. Gogarten, Tamar Sofer, Ken Rice, Chaoyu Yu

Chen H, Wang C, Conomos MP, Stilp AM, Li Z, Sofer T, Szpiro AA, Chen W, Brehm JM, Celedon JC, Redline S, Papanicolaou GJ, Thornton TA, Laurie CC, Rice K and Lin X. (2016). Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies Using Logistic Mixed Models. American Journal of Human Genetics, 98(4):653-666.

Breslow NE and Clayton DG. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association, 88(421):9-25.

Gilmour AR, Thompson R and Cullis BR. (1995). Average Information REML: An Efficient Algorithm for Variance Parameter Estimation in Linear Mixed Models. Biometrics, 51(4):1440-1450.

varCompCI for estimating confidence intervals for the variance components and the proportion of variability (heritability) they explain; assocTestSingle or assocTestAggregate for running genetic association tests using the output from fitNullModel.

library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat,
                divobj = HapMap_ASW_MXL_KINGmat)

# run PC-Relate
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData, snpBlock=20000)
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
                    training.set = mypcair$unrels)
close(HapMap_genoData)

# generate a phenotype
set.seed(4)
pheno <- 0.2*mypcair$vectors[,1] + rnorm(mypcair$nsamp, mean = 0, sd = 1)

annot <- data.frame(sample.id = mypcair$sample.id,
                    pc1 = mypcair$vectors[,1], pheno = pheno)

# make covariance matrix
cov.mat <- pcrelateToMatrix(mypcrel, verbose=FALSE)[annot$sample.id, annot$sample.id]

# fit the null mixed model
nullmod <- fitNullModel(annot, outcome = "pheno", covars = "pc1", cov.mat = cov.mat)