GridLMM_GWAS_set: GridLMM GWAS set

View source: R/GridLMM_GWAS_set.R

GridLMM_GWAS_set R Documentation

GridLMM GWAS set

Description

Performs a GWAS of set-tests using the GridLMM algorithm. Can perform LRTs or Wald tests, or calculate Bayes Factors. By default, uses the targeted grid search heuristic (the Fast algorithm), but can also perform a full grid search.

Usage

GridLMM_GWAS_set(
  formula,
  data,
  weights = NULL,
  X,
  X_ID = "ID",
  set_matrix,
  centerX = FALSE,
  scaleX = FALSE,
  fillNAX = FALSE,
  X_map = NULL,
  relmat = NULL,
  normalize_relmat = TRUE,
  h2_step = 0.01,
  h2_start = NULL,
  h2_start_tolerance = 0.001,
  max_steps = 100,
  method = c("REML"),
  algorithm = c("Fast", "Full"),
  inv_prior_X = NULL,
  target_prob = 0.99,
  proximal_markers = NULL,
  proximal_Xs = NULL,
  V_setup = NULL,
  save_V_folder = NULL,
  diagonalize = T,
  mc.cores = my_detectCores(),
  verbose = T
)

Arguments

formula

A two-sided linear formula, as used in lmer, describing the fixed effects and random effects of the model on the RHS and the response on the LHS. Note: correlated random effects are not implemented, so using one vertical bar (|) or two (||) is identical. At least one random effect is needed. Unlike lmer, random effects can have as many levels as there are observations.

data

A data frame containing the variables named in formula.

weights

An optional vector of observation-specific weights.

X

Matrix of markers with p columns. Each column of X is used as a separate association test. Should have row names that correspond to the X_ID column of data. Colnames are used as IDs for each test, and should align with names of proximal_markers if provided.

X_ID

Column of data that identifies the row of X that corresponds to each observation. Multiple observations may reference the same row of X.
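As an illustration of how data, X, and X_ID fit together, here is a minimal base-R sketch. The toy data, line names, and marker names are hypothetical, and the final expansion step only illustrates what "multiple observations reference the same row of X" means; it is not taken from the GridLMM source.

```r
# Hypothetical toy data: 3 lines (genotypes), 2 observations each.
set.seed(1)
data <- data.frame(
  ID = rep(c("line1", "line2", "line3"), each = 2),
  y  = rnorm(6)
)

# Marker matrix: one row per genotype, one column per marker.
# Row names must match the values found in data$ID (the X_ID column).
X <- matrix(
  sample(0:2, 3 * 4, replace = TRUE), nrow = 3,
  dimnames = list(c("line1", "line2", "line3"), paste0("m", 1:4))
)

# Every value in the X_ID column should name a row of X.
stopifnot(all(data$ID %in% rownames(X)))

# Illustration only: expand X to one row per observation, so replicated
# lines share identical marker rows.
X_obs <- X[as.character(data$ID), , drop = FALSE]
dim(X_obs)
```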

centerX, scaleX, fillNAX

TRUE/FALSE for each. Applied to the X matrix before using X to form any GRMs.

X_map

Optional. Data frame with information on each marker, such as chromosome and position. Will be appended to the results.

relmat

Either: 1) A list of matrices that are proportional to the (within) covariance structures of the group level effects. 2) A list of lists with elements (K, p) with a covariance matrix and an integer listing the number of markers used to estimate the covariance matrix. This is used for appropriate downdating of V to remove proximal markers for each test. The names of the matrices / list elements should correspond to the columns in data that are used as grouping factors. All levels of the grouping factor should appear as rownames of the corresponding matrix.
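The two accepted shapes of relmat can be sketched in base R as follows. This assumes a grouping factor named ID in data and a genomic relationship matrix computed as the cross-product of centered markers divided by the marker count; that estimator, the object names, and the simulated markers are all illustrative assumptions rather than a requirement of GridLMM.

```r
set.seed(2)
n <- 5   # number of lines (levels of the grouping factor)
p <- 20  # number of markers used to estimate the relationship matrix

# Hypothetical marker matrix with line names as row names.
X <- matrix(rbinom(n * p, 2, 0.4), n, p,
            dimnames = list(paste0("line", 1:n), paste0("m", 1:p)))

# Center each marker, then form K = Z Z' / p (one common GRM estimator).
Z <- scale(X, center = TRUE, scale = FALSE)
K <- tcrossprod(Z) / p
dimnames(K) <- list(rownames(X), rownames(X))

# Form 1: a list of matrices, named by the grouping-factor column of data.
relmat_simple <- list(ID = K)

# Form 2: a list of lists with elements K and p, so that proximal markers
# can be removed (downdated) with the correct scaling.
relmat_with_p <- list(ID = list(K = K, p = p))
```

The second form carries p alongside K because removing a handful of markers from a matrix that was averaged over p markers requires knowing p.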

h2_step

Step size of the grid.

h2_start

Optional. Matrix in which each row is a vector of h^2 parameters defining starting values for the grid. Typically the ML/REML solutions for the null model. If NULL, will be calculated using GridLMM_ML.

h2_start_tolerance

Optional. Grid size for GridLMM_ML in finding ML/REML solutions for the null model.

max_steps

Maximum iterations of the heuristic algorithm per marker.

method

One of 'REML', 'ML', or 'BF'. 'REML' implies a Wald test. 'ML' implies Maximum Likelihood evaluation, with the LRT. 'BF' does posterior evaluation and calculates Bayes Factors.

algorithm

Either 'Fast' or 'Full'. See details.

inv_prior_X

Vector of values for the prior precision of each of the fixed effects (including an intercept). Will be recycled if necessary.

target_prob

See Details.

proximal_markers

A list of integer vectors with length equal to the number of columns of X. Each element is a vector of indices of markers that should be removed from any GRMs before the current test is calculated. If proximal_Xs is provided, then the indices correspond to columns of proximal_Xs. Otherwise, the indices correspond to columns of X. If NULL, no downdating will be performed.
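One way to build such a list is from a marker map, taking as "proximal" every marker on the same chromosome within a fixed distance of the tested marker. The sketch below is base R only; the map columns (snp, Chr, pos) and the window size are hypothetical choices, not values prescribed by GridLMM.

```r
# Hypothetical marker map, one row per column of X (same order).
map <- data.frame(
  snp = paste0("m", 1:6),
  Chr = c(1, 1, 1, 1, 2, 2),
  pos = c(100, 150, 400, 900, 120, 160)
)
window <- 100  # assumed window (same units as pos)

# For marker i, collect indices of markers on the same chromosome whose
# position lies within `window` of marker i (including marker i itself).
proximal_markers <- lapply(seq_len(nrow(map)), function(i) {
  which(map$Chr == map$Chr[i] & abs(map$pos - map$pos[i]) <= window)
})

# Names should align with colnames(X).
names(proximal_markers) <- map$snp
proximal_markers
```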

proximal_Xs

Optional. A list of matrices to be used for downdating GRMs. If multiple GRMs are calculated from markers, this list can have multiple elements. Each matrix should have rownames like X corresponding to the levels of X_ID in data.

V_setup

Optional. A list produced by a GridLMM function containing the pre-processed V decompositions for each grid vertex, or the information necessary to create this. Generally saved from a previous run of GridLMM on the same data.

save_V_folder

Optional. A character vector giving a folder in which to save pre-processed V decomposition files for future or repeated use. If NULL, V decompositions are stored in memory.

diagonalize

If TRUE and the model includes only a single random effect, the "GEMMA" trick will be used to diagonalize V. This is done by calculating the SVD of K, which can be slow for large samples.

mc.cores

Number of processor cores used for parallel evaluations. Note that this uses 'mclapply', so memory requirements grow rapidly with mc.cores, because the marker matrix is duplicated in memory for each core.

verbose

Should progress be printed to the screen?

test_formula

One-sided formula for the alternative model (ML or BF), or full model (REML), to be applied to each test (i.e., marker, or column of X). Each term on the RHS will be multiplied by a column of X to form a new covariate. E.g., ~1 specifies an intercept for each marker; ~1+cov specifies an intercept and a slope on cov for each marker.
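The phrase "each term on the RHS will be multiplied by a column of X" can be illustrated in base R with model.matrix. This is a sketch of the described construction under assumed toy values (cov and x_j are hypothetical), not GridLMM's internal code.

```r
# Hypothetical covariate and a single marker column aligned to rows of data.
data <- data.frame(cov = c(0.5, 1.0, 1.5, 2.0))
x_j  <- c(0, 1, 2, 1)

# Design matrix for the RHS terms of ~ 1 + cov.
M <- model.matrix(~ 1 + cov, data)

# Multiply every column of M elementwise by the marker. x_j has one value
# per row, so R recycles it down each column.
X_test <- M * x_j
colnames(X_test) <- paste0("marker:", colnames(M))

# Column 1 is the marker's main effect (~1); column 2 is marker-by-cov.
X_test
```

With ~1 alone, only the first column would be produced, giving a single-degree-of-freedom test per marker.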

reduced_formula

One-sided formula for the reduced model. Same format as test_formula. Should have fewer degrees of freedom than test_formula. Not used for REML models.

Details

GridLMM performs approximate likelihood or posterior-based inference for linear mixed models efficiently by finding solutions to many models in parallel. Rather than optimizing to high precision for each separate model, GridLMM finds "good enough" solutions that satisfy many tests at once - so the expensive calculations can be re-used. It does this by trying variance components on a grid, and selecting the best grid cell for each model. The Full algorithm performs a full grid search over all variance component parameters. The Fast algorithm uses heuristics to reduce the number of grid cells that need to be evaluated - focusing from the maximum likelihood solutions under a null model with no markers, and then working out to neighboring grid cells from there.
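The idea of "trying variance components on a grid" can be sketched for the simplest case: one random effect, an intercept-only null model, and a full grid over h^2. The code below is base R on simulated data and evaluates the profile ML log-likelihood at each grid vertex; the simulation settings and the function name profile_loglik are hypothetical, and this is not the package's implementation.

```r
set.seed(3)
n <- 60
Zm <- matrix(rnorm(n * 200), n, 200)
K  <- tcrossprod(Zm) / 200            # simulated relationship matrix
true_h2 <- 0.6
u  <- drop(t(chol(K)) %*% rnorm(n)) * sqrt(true_h2)
y  <- 1 + u + rnorm(n, sd = sqrt(1 - true_h2))
X0 <- matrix(1, n, 1)                 # null model: intercept only

# Profile log-likelihood of y ~ N(X0 b, s2 * (h2 * K + (1 - h2) * I)),
# with b and s2 replaced by their ML estimates for the given h2.
profile_loglik <- function(h2) {
  V  <- h2 * K + (1 - h2) * diag(n)
  R  <- chol(V)                                  # V = R'R
  ys <- backsolve(R, y,  transpose = TRUE)       # whitened response
  Xs <- backsolve(R, X0, transpose = TRUE)       # whitened design
  s2 <- sum(lm.fit(Xs, ys)$residuals^2) / n
  -0.5 * (n * log(2 * pi * s2) + 2 * sum(log(diag(R))) + n)
}

# Evaluate every grid vertex and keep the best one.
h2_grid <- seq(0, 0.99, by = 0.01)
ll      <- vapply(h2_grid, profile_loglik, numeric(1))
h2_hat  <- h2_grid[which.max(ll)]
h2_hat
```

The Cholesky factor of V depends only on h2, not on the marker being tested, which is why the expensive step at each vertex can be shared across many tests.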

Posterior inference involves an adaptive grid search. Generally, we start with a very coarse grid (with as few as 2-3 vertices per variance component) and then progressively increase the grid resolution, focusing only on regions of high posterior probability. This is controlled by h2_divisions, target_prob, thresh_nonzero, and thresh_nonzero_marginal. The sampling algorithm is as follows:

  1. Start by evaluating the posterior at each vertex of a trial grid with resolution m.

  2. Find the minimum number of vertices needed to sum to target_prob of the current (discrete) posterior. Repeat for the marginal posteriors of each variance component.

  3. If these numbers are smaller than thresh_nonzero or thresh_nonzero_marginal, respectively, form a new grid by increasing the grid resolution to m/2. Otherwise, STOP.

  4. Begin evaluating the posterior at the new grid only at those grid vertices that are adjacent (in any dimension) to any of the top grid vertices in the old grid.

  5. Re-evaluate the distribution of the posterior over the new grid. If any new vertices contribute to the top target_prob fraction of the overall posterior, include these in the "top" set and return to step 4. Note: the prior weights for the grid vertices must be updated each time the grid increases in resolution.

  6. Repeat steps 4-5 until no new grid vertices contribute to the "top" set.

  7. Repeat steps 2-6 until a STOP is reached at step 3.
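The second step of the algorithm above, finding the smallest set of vertices whose posterior mass reaches target_prob, can be sketched in base R. The log-posterior values are made up for illustration, and the variable names are hypothetical; the package's own bookkeeping may differ.

```r
# Hypothetical unnormalized log-posterior values at six grid vertices.
log_post    <- c(-10.2, -3.1, -1.0, -0.4, -2.5, -8.0)
target_prob <- 0.99

# Normalize on the log scale first for numerical stability.
post <- exp(log_post - max(log_post))
post <- post / sum(post)

# Sort vertices by posterior mass and find how many are needed for the
# cumulative mass to reach target_prob.
ord      <- order(post, decreasing = TRUE)
n_needed <- which(cumsum(post[ord]) >= target_prob)[1]

# Indices of the "top" vertices, reported in grid order.
top_vertices <- sort(ord[seq_len(n_needed)])
n_needed
top_vertices
```

Comparing n_needed against a threshold such as thresh_nonzero is then what decides whether the grid is refined further: few vertices carrying almost all the mass suggests the grid is still too coarse.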

Value

A list with two elements:

results

A data frame in which each row holds the results of the association test for a column of X, plus associated parameter values and statistics.

setup

A list with several objects needed for re-running the model, including V_setup and downdate_Xs. These can be re-passed to this function (or other GridLMM functions) to re-fit the model to the same data.


deruncie/GridLMM documentation built on May 2, 2023, 7:18 p.m.