or relatedness

fitNullModel

R Documentation

Fit a Model Under the Null Hypothesis

Description

fitNullModel fits a regression model or a mixed model with random effects specified by their covariance structures; this allows for the inclusion of a polygenic random effect using a kinship matrix or genetic relationship matrix (GRM). The output of fitNullModel can be used to estimate genetic heritability and can be passed to assocTestSingle or assocTestAggregate for the purpose of genetic association testing.

nullModelInvNorm does an inverse normal transform of a previously fit null model.

nullModelSmall returns a small version of the null model with no NxN matrices.

isNullModelSmall returns TRUE if a null model is small; FALSE otherwise.

fitNullModelFastScore fits a null model that can be used for association testing with the fast approximation to the score standard error (SE).

calcScore calculates the score, its true SE, and the fast SE for a set of variants; used to compute the SE correction factor used for the fast approximation.

nullModelFastScore updates a previously fit null model so that it can be used for association testing with the fast approximation to the score SE.

isNullModelFastScore returns TRUE if a null model can be used for association testing with the fast approximation to the score SE; FALSE otherwise.

Usage

## S4 method for signature 'data.frame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", two.stage = FALSE,
            norm.option = c("all", "by.group"), rescale = c("residSD", "none", "model"),
            start = NULL, AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
            drop.zeros = TRUE, return.small = FALSE, verbose = TRUE)
## S4 method for signature 'AnnotatedDataFrame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, sample.id = NULL, ...)
## S4 method for signature 'SeqVarData'
fitNullModel(x, ...)
## S4 method for signature 'ScanAnnotationDataFrame'
fitNullModel(x, ...)
## S4 method for signature 'GenotypeData'
fitNullModel(x, ...)

nullModelInvNorm(null.model, cov.mat = NULL,
                 norm.option = c("all", "by.group"),
                 rescale = c("residSD", "none", "model"),
                 AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
                 drop.zeros = TRUE, return.small = FALSE, verbose = TRUE)

nullModelSmall(null.model)

isNullModelSmall(null.model)

## S4 method for signature 'SeqVarData'
fitNullModelFastScore(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", two.stage = FALSE,
            norm.option = c("all", "by.group"), rescale = c("residSD", "none", "model"),
            start = NULL, AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
            drop.zeros = TRUE, return.small = TRUE,
            variant.id = NULL, nvar = 100, min.mac = 20, sparse = TRUE,
            imputed = FALSE, male.diploid = TRUE, genome.build = c("hg19", "hg38"),
            verbose = TRUE)

calcScore(x, null.model, variant.id = NULL, nvar = 100, min.mac = 20, sparse = TRUE,
          imputed = FALSE, male.diploid = TRUE, genome.build = c("hg19", "hg38"),
          verbose = TRUE)

nullModelFastScore(null.model, score.table, return.small = TRUE, verbose = TRUE)

isNullModelFastScore(null.model)

Arguments

`x`	An object of class `data.frame`, `AnnotatedDataFrame`, or `SeqVarData` containing the outcome and covariate data for the samples to be used for the analysis. Must be class `SeqVarData` when using `fitNullModelFastScore` or `calcScore`. See 'Details' for more information.
`outcome`	A character string specifying the name of the outcome variable in `x`.
`covars`	A vector of character strings specifying the names of the fixed effect covariates in `x`; an intercept term is automatically included. If `NULL` (default), the only fixed effect covariate is the intercept term.
`cov.mat`	A matrix or list of matrices specifying the covariance structures of the random effects terms. Objects from the Matrix package are supported. If `NULL` (default), no random effects terms are included. See 'Details' for more information.
`group.var`	This variable can only be used when `family = "gaussian"`. A character string specifying the name of a categorical variable in `x` that is used to fit heterogeneous residual error variances. If `NULL` (default), then a standard LMM with constant residual variance for all samples is fit. See 'Details' for more information.
`sample.id`	A vector of IDs for samples to include in the analysis. If `NULL`, all samples in `x` are included. This argument is ignored if `x` is a data.frame; see 'Details'.
`family`	A description of the error distribution to be used in the model. The default `"gaussian"` fits a linear model; `"binomial"` and `"poisson"` are also supported. See 'Details' for more information.
`two.stage`	Logical indicator of whether to use a fully-adjusted two-stage rank normalization procedure for fitting the model. Can only be used when `family = "gaussian"`. See 'Details' for more information.
`norm.option`	Specifies whether the rank normalization should be done separately within each value of `group.var` (`"by.group"`) or with all samples together (`"all"`) when using `two.stage` or `nullModelInvNorm.`
`rescale`	Specifies how to rescale the residuals after rank normalization when using `two.stage` or `nullModelInvNorm.`: `"none"` for no rescaling of the residuals; `"model"` to rescale by the model-based variance components, and `"residSD"` (default) to rescale by the standard deviation of the original marginal residuals. Rescaling is done with the same grouping as the rank normalization, as specified by `norm.option`.
`start`	A vector of starting values for the variance component estimation procedure. The function will pick reasonable starting values when left `NULL` (default). See 'Details' for more information.
`AIREML.tol`	The convergence threshold for the Average Information REML (AIREML) procedure used to estimate the variance components of the random effects. See 'Details' for more information.
`max.iter`	The maximum number of iterations allowed to reach convergence.
`EM.iter`	The number of EM iterations to run prior to AIREML; default is 0.
`drop.zeros`	Logical indicator of whether variance component terms that converge to 0 should be removed from the model; the default is `TRUE`. See 'Details' for more information.
`return.small`	Logical for whether to return a small version of the null model without NxN matrices. Default for `fitNullModel` is `FALSE`; only set to TRUE for use in association tests with `test = "BinomiRare"` or `test = "CMP"` and `recalc.pval.thresh = 1`. Default for `fitNullModelFastScore` is `TRUE`, as NxN matrices are not needed for the fast score SE approximation.
`null.model`	The output of `fitNullModel`.
`variant.id`	Optional list of variant.ids in `x` specifying which variants to use for computing the SE correction factor; if `NULL`, a random selection of `nvar` variants with minor allele count at least `min.mac` is used.
`nvar`	The number of random variants to select from `x` for computing the SE correction factor; ignored if `variant.id` is specified.
`min.mac`	The minimum minor allele count allowed for the random variants selected from `x` for computing the SE correction factor; ignored if `variant.id` is specified.
`sparse`	Logical indicator of whether to read genotypes as sparse Matrix objects; the default is `TRUE`. Set this to `FALSE` if the alternate allele dosage of the genotypes in the test are not expected to be mostly 0.
`imputed`	Logical indicator of whether to read dosages from the "DS" field containing imputed dosages instead of counting the number of alternate alleles.
`male.diploid`	Logical for whether males on sex chromosomes are coded as diploid.
`genome.build`	A character sting indicating genome build; used to identify pseudoautosomal regions on the X and Y chromosomes.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.
`...`	Arguments to pass to other methods.
`score.table`	The output of `calcScore`.

Details

If x is a data.frame, the rownames of x must match the row and column names of cov.mat (if cov.mat is specified). If x is an AnnotatedDataFrame or other object containing an AnnotatedDataFrame, x will be re-ordered (if necessary) so that sample.id or scanID is in the same order as the row and column names of cov.mat.

If any covariates have the same value for all samples, they will be dropped from the model with a message. Note that the 'model' and 'covars' element in the output object will still include that covariate.

The code checks for multicollinearity of covariates by checking that the rank of the design matrix is equal to the number of columns; if the rank is smaller, it fails with an error.

cov.mat is used to specify the covariance structures of the random effects terms in the model. For example, to include a polygenic random effect, one matrix in cov.mat could be a kinship matrix or a genetic relationship matrix (GRM). As another example, to include household membership as a random effect, one matrix in cov.mat should be a 0/1 matrix with a 1 in the [i,j] and [j,i] entries if individuals i and j are in the same household and 0 otherwise; the diagonals of such a matrix should all be 1. If cov.mat is a list without names, the code will assign sequential letters as names. If some elements are named but not others, it will produce an error.

For some outcomes, there may be evidence that different groups of observations have different residual variances, and the standard LMM assumption of homoscedasticity is violated. When group.var is specified, separate (heterogeneous) residual variance components are fit for each unique value of group.var. This parameter is only applicable when family = "gaussian".

When family is not gaussian, the penalized quasi-likelihood (PQL) approximation to the generalized linear mixed model (GLMM) is fit following the procedure of GMMAT (Chen et al., 2016).

Let m be the number of matrices in cov.mat and let g be the number of categories in the variable specified by group.var. The length of the start vector must be (m + 1) when family is gaussian and group.var is NULL; (m + g) when family is gaussian and group.var is specified; or m when family is not gaussian.

A Newton-Raphson iterative procedure with Average Information REML (AIREML) is used to estimate the variance components of the random effects. When the absolute change between all of the new and previous variance component estimates is less than var(outcome)*AIREML.tol, the algorithm declares convergence of the estimates. Sometimes a variance component may approach the boundary of the parameter space at 0; step-halving is used to prevent any component from becomming negative. However, when a variance component gets near the 0 boundary, the algorithm can sometimes get "stuck", preventing the other variance components from converging; if drop.zeros is TRUE, then variance components that converge to a value less than AIREML.tol will be dropped from the model and the estimation procedure will continue with the remaining variance components.

When two.stage = TRUE, the fully-adjusted two-stage rank normalization procedure from Sofer et. al. (2019) is used. With this procedure: the stage 1 model is fit as usual, with the specified outcome, covars, cov.mat, and group.var; the marginal residuals from the stage 1 model are rank normalized as specified by norm.option and then rescaled as specified by rescale; the stage 2 model is then fit using the rank normalized and rescaled residuals as the new outcome, with the same covars, cov.mat, and group.var as the stage 1 model. The output of this stage 2 model is saved and should be used for association testing. This procedure is only applicable when family = "gaussian". The nullModelInvNorm function takes a previously fit null model as input and does the rank normalization, rescaling, and stage 2 model fitting.

The function fitNullModelFastScore fits a null model that can be used for testing variant association with the fast approximation to the score standard error (SE) implemented by Zhou et al. (2018) in their SAIGE software. This approximation may be much faster than computing the true score SE in large samples, as it replaces the full covariance matrix in the calculation with the product of a diagonal matrix and a scalar correction factor (se.correction in the null model output); see the option fast.Score.SE in assocTestSingle for further details. A null model previously fit with fitNullModel can be updated to this format by using calcScore to compute the necessary statistics on a set of variants, followed by nullModelFastScore to calculate the se.correction factor and update the null model format.

p-values that are calculated using pchisq and are smaller than .Machine\$double.xmin are set to .Machine\$double.xmin.

Value

An object of class 'GENESIS.nullModel' or 'GENESIS.nullMixedModel'. A list including:

model: A list with elements describing the model that was fit. See below for explanations of the elements in this list.
varComp: The variance component estimates. There is one variance component for each random effect specified in cov.mat. When family is gaussian, there are additional residual variance components; one residual variance component when group.var is NULL, and as many residual variance components as there are unique values of group.var when it is specified.
varCompCov: The estimated covariance matrix of the variance component estimates given by varComp. This can be used for hypothesis tests regarding the variance components.
fixef: A data.frame with effect size estimates (betas), standard errors, chi-squared test statistics, and p-values for each of the fixed effect covariates specified in covars.
betaCov: The estimated covariance matrix of the effect size estimates (betas) of the fixed effect covariates. This can be used for hypothesis tests regarding the fixed effects.
fit: A data frame with the outcome, fitted values, and residuals. See below for explanations of the columns of this data frame.
logLik: The log-likelihood value.
logLikR: The restricted log-likelihood value.
AIC: The Akaike Information Criterion value.
model.matrix: The design matrix for the fixed effect covariates used in the model.
group.idx: If group.var is not NULL, a list of indices for samples in each group.
cholSigmaInv: The Cholesky decomposition of the inverse of the estimated outcome covariance structure. This is used by assocTestSingle or assocTestAggregate for genetic association testing.
W: The diagonal weight matrix with terms given by the variance function for the specified family. This is the inverse of portion of the estimated outcome covariance structure not attributed to random effects specified with cov.mat. This matrix is used in place of the inverse of the estimated outcome covariance structure when using fast.score.SE for association testing with assocTestSingle.
converged: A logical indicator of whether the AIREML procedure for estimating the random effects variance components converged.
zeroFLAG: A vector of logicals the same length as varComp specifying whether the corresponding variance component estimate was set to 0 by the function due to convergence to the boundary in the AIREML procedure.
RSS: The residual sum of squares from the model fit. When family is gaussian, this will typically be 1 since the residual variance component is estimated separately.
RSS0: the sum of resid.cholesky squared. This is the sum of squared residuals under the null hypothesis of no genetic effect for the covariate and random effect adjusted model using the Frisch-Waugh-Lovell theorem.
CX: a matrix needed for calculating association test statistics
CXCXI: a matrix needed for calculating association test statistics
se.correction: The scalar correction factor for the fast approximation to the score SE; the average of the se.ratio values in score.table.
score.table: A data frame with information about the variants used to compute the se.correction factor.

The fit data frame contains the following columns, depending on which type of model was fit:

outcome: The original outcome vector.
workingY: The "working" outcome vector. When family is gaussian, this is just the original outcome vector. When family is not gaussian, this is the PQL linearization of the outcome vector. This is used by assocTestSingle or assocTestAggregate for genetic association testing. See 'Details' for more information.
fitted.values: The fitted values from the model. For mixed models, this is X*beta where X is the design matrix and beta is the vector of effect size estimates for the fixed effects. For non-mixed models, this is the inverse link function applied to X*beta (e.g., expit(X*beta) for logistic regression when family = "binomial").
resid.marginal: The marginal residuals from the model; i.e. Y - X*beta where Y is the vector of outcome values.
linear.predictor: The linear predictor from the model; i.e. X*beta + Z*u, where Z*u represents the effects captured by the random effects specified with the cov.mat input.
resid.conditional: The conditional residuals from the model; i.e. Y - X*beta - Z*u, where Z*u represents the effects captured by the random effects specified with the cov.mat input.
resid.cholesky: The Cholesky residuals from the model; a transformation of the marginal residuals using the estimated model covariance structure such that they are uncorrelated and follow a standard multivariate Normal distribution with mean 0 and identity covariance matrix asymptotically. Linear regression of the Cholesky residual vector on an equivalently transformed genotype vector provides the same estimates as fitting the full GLS model (by the Frisch-Waugh-Lovell theorem).
resid.PY: The outcome vector (Y) pre-multiplied by a projection matrix (P) that adjusts for covariates, random effects, and model covariance structure. These projected outcome values are essentially what are correlated with genotype values for association testing.
sample.id: A vector of IDs for the samples used in the analysis, if called with an AnnotatedDataFrame.

The model list element contains the following elements:

outcome: The outcome variable name.
covars: A vector of covariate names
formula: The model string.
family: A character string specifying the family used in the analysis.
hetResid: A logical indicator of whether heterogeneous residual variance components were used in the model (specified by group.var).

The score.table data frame contains the following columns:

variant.id: The variant ID
chr: The chromosome value
pos: The base pair position
allele.index: The index of the alternate allele. For biallelic variants, this will always be 1.
n.obs: The number of samples with non-missing genotypes
freq: The estimated alternate allele frequency
MAC: The minor allele count. For multiallelic variants, "minor" is determined by comparing the count of the alternate allele specified by allele.index with the sum of all other alleles.
Score: The value of the score function
Score.SE: The estimated true standard error of the Score
Score.SE.fast: The estimated fast standard error of the Score (before scalar correction)
se.ratio: The ratio of Score.SE to Score.SE.fast; these values are averaged across varaints to estimate se.correction in nullModelFastScore.

Author(s)

Matthew P. Conomos, Stephanie M. Gogarten, Tamar Sofer, Ken Rice, Chaoyu Yu

References

Chen H, Wang C, Conomos MP, Stilp AM, Li Z, Sofer T, Szpiro AA, Chen W, Brehm JM, Celedon JC, Redline S, Papanicolaou GJ, Thornton TA, Laurie CC, Rice K and Lin X. (2016) Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies Using Logistic Mixed Models. American Journal of Human Genetics, 98(4):653-66.

Sofer, T., Zheng, X., Gogarten, S. M., Laurie, C. A., Grinde, K., Shaffer, J. R., ... & Rice, K. M. (2019). A fully adjusted two-stage procedure for rank-normalization in genetic association studies. Genetic epidemiology, 43(3), 263-275.

Zhou, W., Nielsen, J. B., Fritsche, L. G., Dey, R., Gabrielsen, M. E., Wolford, B. N., ... & Bastarache, L. A. (2018). Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature genetics, 50(9), 1335.

Breslow NE and Clayton DG. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88: 9-25.

Gilmour, A.R., Thompson, R., & Cullis, B.R. (1995). Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 1440-1450.

Examples

library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat,
                divobj = HapMap_ASW_MXL_KINGmat)

# run PC-Relate
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData, snpBlock=20000)
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
    			training.set = mypcair$unrels,
                        BPPARAM = BiocParallel::SerialParam())
close(HapMap_genoData)

# generate a phenotype
set.seed(4)
pheno <- 0.2*mypcair$vectors[,1] + rnorm(mypcair$nsamp, mean = 0, sd = 1)

annot <- data.frame(sample.id = mypcair$sample.id,
                    pc1 = mypcair$vectors[,1], pheno = pheno)

# make covariance matrix
cov.mat <- pcrelateToMatrix(mypcrel, verbose=FALSE)[annot$sample.id, annot$sample.id]

# fit the null mixed model
nullmod <- fitNullModel(annot, outcome = "pheno", covars = "pc1", cov.mat = cov.mat)

smgogarten/GENESIS documentation built on Feb. 4, 2025, 4:28 p.m.