GridLMM_GWAS_set: GridLMM GWAS set
In deruncie/GridLMM: Efficient Mixed Models for GWAS with multiple Random Effects

View source: R/GridLMM_GWAS_set.R

GridLMM GWAS set

Description

Performs a GWAS of set-tests using the GridLMM algorithm. Can perform LRTs or Wald tests, or calculate Bayes Factors. By default, it uses the targeted grid search heuristic (the fast algorithm), though it can perform a full grid search as well.

Usage

GridLMM_GWAS_set(
  formula,
  data,
  weights = NULL,
  X,
  X_ID = "ID",
  set_matrix,
  centerX = FALSE,
  scaleX = FALSE,
  fillNAX = FALSE,
  X_map = NULL,
  relmat = NULL,
  normalize_relmat = TRUE,
  h2_step = 0.01,
  h2_start = NULL,
  h2_start_tolerance = 0.001,
  max_steps = 100,
  method = c("REML"),
  algorithm = c("Fast", "Full"),
  inv_prior_X = NULL,
  target_prob = 0.99,
  proximal_markers = NULL,
  proximal_Xs = NULL,
  V_setup = NULL,
  save_V_folder = NULL,
  diagonalize = T,
  mc.cores = my_detectCores(),
  verbose = T
)

Arguments

`formula`	A two-sided linear formula, as used in `lmer`, describing the fixed effects and random effects of the model on the RHS and the response on the LHS. Note: correlated random effects are not implemented, so using one vertical bar (`|`) or two (`||`) is identical. At least one random effect is needed. Unlike `lmer`, random effects can have as many levels as there are observations.
`data`	A data frame containing the variables named in `formula`.
`weights`	An optional vector of observation-specific weights.
`X`	Matrix of markers with `p` columns. Each column of X is used as a separate association test. Should have row names that correspond to the `X_ID` column of `data`. Colnames are used as IDs for each test, and should align with names of `proximal_markers` if provided.
`X_ID`	Column of `data` that identifies the row of `X` that corresponds to each observation. Multiple observations may reference the same row of `X`.
`centerX`, `scaleX`, `fillNAX`	TRUE/FALSE for each. Applied to the `X` matrix before using `X` to form any GRMs.
`X_map`	Optional. Data frame with information on each marker, such as chromosome and position. Will be appended to the results.
`relmat`	Either: 1) A list of matrices that are proportional to the (within) covariance structures of the group level effects. 2) A list of lists with elements (`K`, `p`) with a covariance matrix and an integer listing the number of markers used to estimate the covariance matrix. This is used for appropriate downdating of `V` to remove proximal markers for each test. The names of the matrices / list elements should correspond to the columns in `data` that are used as grouping factors. All levels of the grouping factor should appear as rownames of the corresponding matrix.
`h2_step`	Step size of the grid.
`h2_start`	Optional. Matrix with each row a vector of `h^2` parameters defining starting values for the grid. Typically ML/REML solutions for the null model. If null, will be calculated using GridLMM_ML.
`h2_start_tolerance`	Optional. Grid size for GridLMM_ML in finding ML/REML solutions for the null model.
`max_steps`	Maximum iterations of the heuristic algorithm per marker.
`method`	One of 'REML', 'ML', or 'BF'. 'REML' implies a Wald test. 'ML' implies Maximum Likelihood evaluation, with the LRT. 'BF' does posterior evaluation and calculates Bayes Factors.
`algorithm`	Either 'Fast' or 'Full'. See details.
`inv_prior_X`	Vector of values for the prior precision of each of the fixed effects (including an intercept). Will be recycled if necessary.
`target_prob`	See Details.
`proximal_markers`	A list of integer vectors with length equal to the number of columns of `X`. Each element is a vector of indices of markers that should be removed from any GRMs before the current test is calculated. If `proximal_Xs` is provided, then the indices correspond to columns of `proximal_Xs`. Otherwise, the indices correspond to columns of `X`. If null, no downdating will be performed.
`proximal_Xs`	Optional. A list of matrices to be used for downdating GRMs. If multiple GRMs are calculated from markers, this list can have multiple elements. Each matrix should have rownames like `X` corresponding to the levels of `X_ID` in `data`.
`V_setup`	Optional. A list produced by a GridLMM function containing the pre-processed V decompositions for each grid vertex, or the information necessary to create this. Generally saved from a previous run of GridLMM on the same data.
`save_V_folder`	Optional. A character vector giving a folder in which to save pre-processed V decomposition files for future or repeated use. If null, V decompositions are stored in memory.
`diagonalize`	If TRUE and the model includes only a single random effect, the "GEMMA" trick will be used to diagonalize V. This is done by calculating the SVD of K, which can be slow for large samples.
`mc.cores`	Number of processor cores used for parallel evaluations. Note that this uses 'mclapply', so the memory requirements grow rapidly with `mc.cores`, because the marker matrix gets duplicated in memory for each core.
`verbose`	Should progress be printed to the screen?
`test_formula`	One-sided formula for the alternative model (ML or BF), or full model (REML), to be applied to each test (i.e. marker, or column of `X`). Each term on the RHS will be multiplied by a column of `X` to form a new covariate. E.g. `~1` specifies an intercept for each marker; `~1+cov` specifies an intercept and a slope on `cov` for each marker.
`reduced_formula`	One-sided formula for the reduced model. Same format as `test_formula`. Should have fewer degrees of freedom than `test_formula`. Not used for REML models.
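
The layout of the inputs described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the column names in `X_map` (`snp`, `Chr`, `pos`) and the `window` size are hypothetical, and the relationship matrix is a simple centered cross-product rather than anything GridLMM computes internally. The call to `GridLMM_GWAS_set` itself is not shown.

```r
set.seed(1)
n <- 20   # individuals
p <- 6    # markers

# Hypothetical phenotype data; `ID` links each observation to a row of X.
data <- data.frame(ID = paste0("line_", seq_len(n)), y = rnorm(n))

# Marker matrix: one column per marker, row names matching data$ID.
X <- matrix(sample(0:2, n * p, replace = TRUE), nrow = n,
            dimnames = list(data$ID, paste0("m", seq_len(p))))

# Optional marker information of the kind that could be passed as X_map
# (column names here are assumptions).
X_map <- data.frame(snp = colnames(X),
                    Chr = rep(1:2, each = 3),
                    pos = c(100, 200, 900, 150, 160, 800))

# A simple relationship matrix from centered markers. The list name matches
# the grouping column (`ID`) and the row names cover all of its levels.
Xc <- scale(X, center = TRUE, scale = FALSE)
K <- tcrossprod(Xc) / p
relmat <- list(ID = K)

# proximal_markers: for each marker, the indices of the columns of X on the
# same chromosome and within `window` position units (an assumed cutoff).
window <- 100
proximal_markers <- lapply(seq_len(p), function(j) {
  which(X_map$Chr == X_map$Chr[j] & abs(X_map$pos - X_map$pos[j]) <= window)
})
names(proximal_markers) <- colnames(X)

# Basic alignment checks before fitting.
stopifnot(all(data$ID %in% rownames(X)),
          all(data$ID %in% rownames(relmat$ID)),
          length(proximal_markers) == ncol(X))
```

Running the alignment checks up front is a cheap way to catch mismatched row names before any expensive decomposition is attempted.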

Details

GridLMM performs approximate likelihood or posterior-based inference for linear mixed models efficiently by finding solutions to many models in parallel. Rather than optimizing to high precision for each separate model, GridLMM finds "good enough" solutions that satisfy many tests at once, so the expensive calculations can be re-used. It does this by trying variance components on a grid and selecting the best grid cell for each model. The Full algorithm performs a full grid search over all variance component parameters. The Fast algorithm uses heuristics to reduce the number of grid cells that need to be evaluated: it starts from the maximum likelihood solutions under a null model with no markers and then works outward to neighboring grid cells.
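
To make the grid idea concrete, here is a minimal base-R sketch of a full grid search for a model with a single random effect, assuming the covariance is parameterized as `V = h2*K + (1-h2)*I`. It is not GridLMM's implementation: the data are simulated, `loglik_h2` is a hypothetical helper, and the eigendecomposition of `K` stands in for the kind of one-time expensive calculation that can be re-used across grid cells.

```r
set.seed(2)
n <- 60
Z <- matrix(rnorm(n * 200), n)          # simulated markers, used only to build K
K <- tcrossprod(scale(Z)) / ncol(Z)     # relationship matrix
x <- rnorm(n)                           # one test covariate
u <- drop(t(chol(K + 1e-6 * diag(n))) %*% rnorm(n))   # random effect ~ N(0, K)
y <- 0.5 * x + sqrt(0.6) * u + sqrt(0.4) * rnorm(n)

# One-time expensive step: eigendecomposition of K. With V = h2*K + (1-h2)*I,
# rotating by the eigenvectors makes V diagonal for every value of h2.
eig <- eigen(K, symmetric = TRUE)
y_rot <- drop(crossprod(eig$vectors, y))
X_rot <- crossprod(eig$vectors, cbind(1, x))

# Profile ML log-likelihood at one grid value of h2 (hypothetical helper).
loglik_h2 <- function(h2) {
  d <- h2 * eig$values + (1 - h2)       # diagonal of the rotated V
  w <- 1 / d
  XtWX <- crossprod(X_rot, w * X_rot)
  beta <- solve(XtWX, crossprod(X_rot, w * y_rot))
  r <- y_rot - drop(X_rot %*% beta)
  s2 <- sum(w * r^2) / n                # ML estimate of the total variance
  -0.5 * (n * log(2 * pi * s2) + sum(log(d)) + n)
}

# Evaluate every grid cell and keep the best one.
h2_grid <- seq(0, 0.99, by = 0.01)
ll <- vapply(h2_grid, loglik_h2, numeric(1))
h2_hat <- h2_grid[which.max(ll)]
```

Each grid cell costs only a few vector operations once the rotation has been done, which is why evaluating many cells (and many markers per cell) is affordable.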

Posterior inference involves an adaptive grid search. Generally, we start with a very coarse grid (with as few as 2-3 vertices per variance component) and then progressively increase the grid resolution, focusing only on regions of high posterior probability. This is controlled by h2_divisions, target_prob, thresh_nonzero, and thresh_nonzero_marginal. The sampling algorithm is as follows:

1. Start by evaluating the posterior at each vertex of a trial grid with resolution m.
2. Find the minimum number of vertices needed to sum to target_prob of the current (discrete) posterior. Repeat for the marginal posteriors of each variance component.
3. If these numbers are smaller than thresh_nonzero or thresh_nonzero_marginal, respectively, form a new grid by increasing the grid resolution to m/2. Otherwise, STOP.
4. Begin evaluating the posterior at the new grid only at those grid vertices that are adjacent (in any dimension) to any of the top grid vertices in the old grid.
5. Re-evaluate the distribution of the posterior over the new grid. If any new vertices contribute to the top target_prob fraction of the overall posterior, include these in the "top" set and return to step 4. Note: the prior weights for the grid vertices must be updated each time the grid increases in resolution.
6. Repeat steps 4-5 until no new grid vertices contribute to the "top" set.
7. Repeat steps 2-6 until a STOP is reached at step 3.
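
Two pieces of this procedure can be sketched in base R for a single variance component: finding the smallest set of vertices that reaches `target_prob`, and selecting the vertices of a refined grid that are adjacent to that set. The function `top_vertices`, the toy posterior values, and the numerical tolerance are all assumptions made for illustration; this is not the package's code.

```r
# Smallest set of grid vertices whose posterior mass reaches target_prob.
top_vertices <- function(post, target_prob = 0.99) {
  post <- post / sum(post)                  # normalize the discrete posterior
  ord <- order(post, decreasing = TRUE)     # most probable vertices first
  n_needed <- which(cumsum(post[ord]) >= target_prob)[1]
  sort(ord[seq_len(n_needed)])
}

# Toy posterior over a coarse grid of h2 values (made-up numbers).
h2_grid <- seq(0, 0.9, by = 0.1)
post <- c(0.001, 0.004, 0.02, 0.15, 0.40, 0.30, 0.10, 0.02, 0.004, 0.001)
top <- top_vertices(post, target_prob = 0.9)
h2_grid[top]

# Refinement in one dimension: halve the step and keep only the new vertices
# that lie within one new step of a top vertex of the old grid.
new_step <- 0.05
new_grid <- seq(0, 0.9, by = new_step)
keep <- vapply(new_grid,
               function(h) any(abs(h - h2_grid[top]) <= new_step + 1e-8),
               logical(1))
new_grid[keep]
```

Only the vertices flagged by `keep` would need a posterior evaluation on the finer grid, which is what keeps the search cheap as the resolution increases.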

Value

A list with two elements:

`results`	A data frame in which each row holds the results of the association test for a column of `X`, plus associated parameter values and statistics.
`setup`	A list with several objects needed for re-running the model, including `V_setup` and `downdate_Xs`. These can be re-passed to this function (or other GridLMM functions) to re-fit the model to the same data.