run_genomic_prediction: Interface with BGLR to run genomic prediction with snpRdata...

View source: R/association_functions.R

run_genomic_prediction    R Documentation

Interface with BGLR to run genomic prediction with snpRdata objects.

Description

Run genomic prediction on a single response variable (usually a phenotype) using the BGLR function. Unlike most other snpR functions, this returns a list containing the model results alongside the updated snpRdata object rather than the snpRdata object alone, so overwrite with caution.

Usage

run_genomic_prediction(
  x,
  facets = NULL,
  response,
  iterations,
  burn_in,
  thin,
  model = "BayesB",
  interpolate = "bernoulli",
  ncp = NULL,
  ncp.max = 5,
  par = FALSE,
  verbose = FALSE,
  ...
)

Arguments

x

snpRdata object

facets

character, default NULL. Categorical metadata variables by which to break up analysis. See Facets_in_snpR for more details.

response

character. Name of the column containing the response variable of interest. Must match a column name in sample metadata.

iterations

numeric. Number of iterations to run the MCMC chain for.

burn_in

numeric. Number of burn-in iterations to run prior to the MCMC chain.

thin

numeric. Number of iterations to discard between each recorded data point.

model

character, default "BayesB". Prediction model to use, see description for the ETA argument in BGLR.

interpolate

character, default "bernoulli". Interpolation method for missing data. Options:

  • bernoulli: binomial draws for the minor allele.

  • af: insertion of the average allele frequency.

  • iPCA: a slower but more accurate alternative to "af" interpolation. This is an iterative PCA approach that interpolates based on SNP/SNP covariance via imputePCA. If the ncp argument is not defined, the number of components used for interpolation will be estimated using estim_ncpPCA. In this case, this method is much slower than the other methods, especially for large datasets. Setting an ncp of 2-5 generally results in reasonable interpolations without the time constraint.

ncp

numeric or NULL, default NULL. Used only if iPCA interpolation is selected. Number of components to consider when interpolating missing data in sn format with iPCA. If NULL, the optimum number will be estimated, with the maximum specified by ncp.max. This can be very slow.

ncp.max

numeric, default 5. Used only if iPCA interpolation is selected. Maximum number of components to check for when determining the optimum number of components to use when interpolating sn data using the iPCA approach.

par

numeric or FALSE, default FALSE. If a number, specifies the number of processing cores to use across facet levels. Not used if there is only one facet level.

verbose

logical, default FALSE. If TRUE, some progress updates will be printed to the console.

...

additional arguments passed to BGLR
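
The interaction of iterations, burn_in, and thin can be checked with a little arithmetic. The sketch below assumes these three arguments are passed straight through to BGLR's nIter, burnIn, and thin, and that BGLR counts burn-in iterations as part of the total and keeps every thin-th iteration after burn-in. That pass-through is an assumption about the wrapper, not something stated on this page.

```r
# Assumed mapping: iterations -> nIter, burn_in -> burnIn, thin -> thin.
iterations <- 1000
burn_in <- 100
thin <- 10

# Iterations that fall after burn-in and land on a thinning step.
i <- seq_len(iterations)
n_retained <- sum(i > burn_in & i %% thin == 0)
n_retained
```

If the retained count looks too small for stable posterior summaries, increasing iterations or lowering thin are the obvious levers.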

Details

This function is provided as a wrapper to plug snpRdata objects into the BGLR function in order to easily run genomic prediction on a simple model where a single, sample-specific metadata variable is provided as the response variable. To do so, this function formats the data into a transposed "sn" format, as described in format_snps, using the bernoulli method by default to interpolate missing genotypes. Several different prediction models are available; see the documentation for the ETA argument in BGLR for details. Defaults to the "BayesB" model, which assumes a "spike-slab" prior for allele effects on phenotype where most markers have a very small effect size and a few can have a much larger effect.
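
To make the "bernoulli" and "af" interpolation options more concrete, here is a minimal base R sketch on a toy genotype matrix. It is an illustration of the general idea, not snpR's own implementation: the helper names interpolate_af and interpolate_bernoulli are hypothetical, the matrix layout (rows as samples, columns as loci, entries as minor allele counts) is assumed, and "af" is assumed to mean filling with the expected allele count (twice the allele frequency).

```r
set.seed(42)

# Toy matrix: 4 samples (rows) x 3 loci (columns), minor allele counts 0/1/2.
sn <- matrix(c(0, 1, 2, NA,
               1, NA, 0, 1,
               2, 2, NA, 0), nrow = 4)

# Hypothetical helper: fill missing calls with the expected allele count (2 * p).
interpolate_af <- function(g) {
  p <- mean(g, na.rm = TRUE) / 2   # minor allele frequency at this locus
  g[is.na(g)] <- 2 * p
  g
}

# Hypothetical helper: fill missing calls with binomial draws of two alleles.
interpolate_bernoulli <- function(g) {
  p <- mean(g, na.rm = TRUE) / 2
  miss <- is.na(g)
  g[miss] <- rbinom(sum(miss), size = 2, prob = p)
  g
}

# Apply each method locus by locus.
sn_af <- apply(sn, 2, interpolate_af)
sn_bern <- apply(sn, 2, interpolate_bernoulli)
```

The "af" style fill is deterministic and can produce non-integer genotypes, whereas the binomial draws keep genotypes in 0, 1, or 2 but vary between runs unless a seed is set.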

Unlike most snpR functions, this function does not support facets, since each run can be very slow. Instead, an individual facet and facet level of interest should be selected with subset_snpR_data. See examples.

See documentation for BGLR for more details and for a full list of references.
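
For orientation, the kind of BGLR call that this wrapper builds can be sketched directly. The example below uses simulated genotypes and phenotypes rather than a snpRdata object, and it is only an approximation of what the wrapper does: the object names (X, y, fit, effects, fitted) and the saveAt prefix are illustrative choices, and the data are random.

```r
library(BGLR)

set.seed(1)
n <- 50    # samples
m <- 200   # loci

# Simulated complete genotype matrix (samples x loci) and a random phenotype.
X <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n)
y <- rnorm(n)

# One marker term with a BayesB prior; MCMC output files go to a temp prefix.
fit <- BGLR(y = y,
            ETA = list(list(X = X, model = "BayesB")),
            nIter = 1000, burnIn = 100, thin = 10,
            saveAt = file.path(tempdir(), "gp_sketch_"),
            verbose = FALSE)

# Posterior mean marker effects and fitted values (intercept included).
effects <- fit$ETA[[1]]$b
fitted <- data.frame(phenotype = y, fitted = fit$yHat)
```

Because the phenotype here is pure noise, any apparent agreement between phenotype and fitted values reflects overfitting rather than real predictive ability.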

Value

A list containing two parts:

  • x: The provided snpRdata object with effect sizes merged in.

  • models: Other model results, a list containing:

    • model: The model output from BGLR. See BGLR.

    • h2: Estimated heritability of the response variable.

    • predictions: A data.frame containing the provided phenotypes and the predicted Breeding Values (BVs) for those phenotypes.

Author(s)

William Hemstrom

References

Pérez, P., and de los Campos, G. (2014). Genetics.

Examples

# run and plot a basic prediction
## add some dummy phenotypic data.
dat <- stickSNPs
sample.meta(dat) <- cbind(weight = rnorm(ncol(stickSNPs)), 
                          sample.meta(stickSNPs))
## run prediction
gp <- run_genomic_prediction(dat, response = "weight", iterations = 1000, 
                             burn_in = 100, thin = 10)
## plot dummy phenotypes against their predicted Breeding Values.
## weight was randomly assigned, so any apparent fit reflects overfitting!
with(gp$models$.base_.base$predictions, plot(phenotype, predicted_BV))
## fetch estimated loci effects
get.snpR.stats(gp$x, stats = "genomic_prediction")

## Not run: 
# with facets, not run
gp <- run_genomic_prediction(gp$x, facets = "pop", response = "weight", 
                             iterations = 1000, burn_in = 100, thin = 10)
get.snpR.stats(gp$x, facets = "pop", stats = "genomic_prediction")

## End(Not run)

hemstrow/snpR documentation built on July 15, 2024, 7:14 p.m.