run_random_forest: Run a RANGER random forest using snpRdata for a given...

View source: R/association_functions.R

run_random_forest    R Documentation

Run a RANGER random forest using snpRdata for a given phenotype or model.

Description

Creates random forest machine learning models from snpRdata objects via an interface with ranger. Models can be created either for a specific phenotype with no covariates or using a formula that follows the basic format specified in formula.

Usage

run_random_forest(
  x,
  facets = NULL,
  response,
  formula = NULL,
  num.trees = 10000,
  mtry = NULL,
  importance = "impurity_corrected",
  interpolate = "bernoulli",
  ncp = NULL,
  ncp.max = 5,
  pvals = TRUE,
  par = FALSE,
  ...
)

Arguments

x

snpRdata object.

facets

character, default NULL. Categorical metadata variables by which to break up analysis. See Facets_in_snpR for more details.

response

character. Name of the column containing the response variable of interest. Must match a column name in sample metadata.

formula

character, default NULL. Model for the response variable, as described in formula. If NULL, the model will be equivalent to response ~ 1.

num.trees

numeric, default 10000. Number of trees to grow. Higher numbers will increase model accuracy, but increase calculation time. See ranger for details.

mtry

numeric, default is the square root of the number of SNPs. Number of variables (SNPs) to consider as candidates for splitting at each node. See ranger for details.

importance

character, default "impurity_corrected". The method by which SNP importance is determined. Options:

  • impurity

  • impurity_corrected

  • permutation

See ranger for details.

interpolate

character, default "bernoulli". Interpolation method for missing data. Options:

  • bernoulli: binomial draws for the minor allele.

  • af: insertion of the average allele frequency.

  • iPCA: a slower but more accurate alternative to "af" interpolation. This is an iterative PCA approach that interpolates based on SNP/SNP covariance via imputePCA. If the ncp argument is not defined, the number of components used for interpolation will be estimated using estim_ncpPCA, which makes this method much slower than the others, especially for large datasets. Setting an ncp of 2-5 generally results in reasonable interpolations without the added run time.


ncp

numeric or NULL, default NULL. Used only if iPCA interpolation is selected. Number of components to consider for iPCA sn format interpolations of missing data. If NULL, the optimum number will be estimated, with the maximum specified by ncp.max. This estimation can be very slow.

ncp.max

numeric, default 5. Used only if iPCA interpolation is selected. Maximum number of components to check for when determining the optimum number of components to use when interpolating sn data using the iPCA approach.

pvals

logical, default TRUE. Determines if p-values should be calculated for importance values. If the response variable is quantitative, no p-values will be returned, since they must be calculated via permutation and this is very slow. For details, see importance_pvalues.

par

numeric or FALSE, default FALSE. Number of parallel computing cores to use for computing RFs across multiple facet levels or within a single facet if only a single category is run (either a one-category facet or no facet).

...

Additional arguments passed to ranger.

Details

Random forest models can be created across multiple facets of the data at once following the typical snpR framework explained in Facets_in_snpR. Since RF models are calculated without allowing for any SNP-specific categories (e.g. independent of chromosome), any SNP-level facets provided will be ignored. As usual, if facets is set to NULL, an RF will be calculated for all samples without splitting across any sample metadata categories.

Since the ranger RF implementation can behave unexpectedly when given incomplete data, missing genotypes will be imputed. Imputation can occur via insertion of the average allele frequency, via binomial draws for the minor allele, or via iterative PCA, using the "af", "bernoulli", or "iPCA" options for the "interpolate" argument, respectively.
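
As a rough illustration of the "af" and "bernoulli" options, the sketch below imputes a toy genotype matrix in base R. It assumes genotypes are coded as minor allele counts (0, 1, or 2) with NA for missing calls, and that "af" fills in the expected minor allele count. The helper impute_snp is hypothetical and written only for this example; it is not snpR's internal implementation, which may differ in detail.

```r
# Toy genotype matrix: rows are samples, columns are SNPs, entries are
# minor allele counts (0, 1, or 2); NA marks a missing genotype.
set.seed(42)
geno <- matrix(rbinom(40, size = 2, prob = 0.3), nrow = 10, ncol = 4)
geno[cbind(c(2, 5, 9), c(1, 3, 4))] <- NA

# Hypothetical helper (not part of snpR): fill missing values for one SNP.
impute_snp <- function(g, method = c("bernoulli", "af")) {
  method <- match.arg(method)
  miss <- is.na(g)
  if (!any(miss)) return(g)
  # Minor allele frequency estimated from the observed genotypes.
  maf <- mean(g, na.rm = TRUE) / 2
  if (method == "af") {
    # Insert the expected minor allele count (assumed meaning of "af").
    g[miss] <- 2 * maf
  } else {
    # Draw each missing genotype as two Bernoulli trials for the minor allele.
    g[miss] <- rbinom(sum(miss), size = 2, prob = maf)
  }
  g
}

geno_af <- apply(geno, 2, impute_snp, method = "af")
geno_bern <- apply(geno, 2, impute_snp, method = "bernoulli")
```

Under these assumptions, "af" yields fractional values at imputed positions, while "bernoulli" keeps every genotype an integer count at the cost of adding random noise.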

Extra arguments can be passed to ranger.

In general, random forest parameters should be tuned to reduce the out-of-bag error rate (OOB-ER). This value is available in the returned object under the model lists: printing a specific model displays the OOB-ER, and it is also stored under the 'prediction.error' name in the model. For details on tuning RF models, we recommend Goldstein et al. (2011).
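
A minimal sketch of this kind of tuning is shown below. It calls ranger directly on simulated data rather than going through run_random_forest, and it assumes the ranger package is installed. The data, the mtry_grid values, and the object names are invented for illustration. For a quantitative response such as this one, prediction.error is the out-of-bag mean squared error rather than a misclassification rate.

```r
library(ranger)

# Simulated data: 100 samples, 50 SNPs coded as minor allele counts, and a
# quantitative phenotype influenced by the first SNP only.
set.seed(1)
n <- 100
p <- 50
geno <- as.data.frame(matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n))
geno$pheno <- geno$V1 + rnorm(n)

# Candidate mtry values to compare (illustrative choices, not recommendations).
mtry_grid <- c(2, 7, 25)

# Fit one forest per mtry value and record its out-of-bag prediction error.
oob <- sapply(mtry_grid, function(m) {
  fit <- ranger(dependent.variable.name = "pheno", data = geno,
                num.trees = 500, mtry = m, seed = 1)
  fit$prediction.error
})

tuning <- data.frame(mtry = mtry_grid, oob_error = oob)
tuning
```

The same comparison could be made with run_random_forest by varying mtry or num.trees across calls and reading prediction.error from the returned models.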

For more detail on the random forest model and ranger arguments, see ranger.

Value

A list containing:

  • data: A snpRdata object with RF importance values merged in to the stats slot.

  • models: A named list containing both the models and data.frames of predicted vs. observed phenotypes.

Author(s)

William Hemstrom

References

Wright, Marvin N and Ziegler, Andreas. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software.

Goldstein et al. (2011). Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology.

See Also

ranger, predict.ranger.

Examples

## Not run: 
# run and plot a basic rf
## add some dummy phenotypic data.
dat <- stickSNPs
sample.meta(dat) <- cbind(weight = rnorm(ncol(stickSNPs)), sample.meta(stickSNPs))
## run rf
rf <- run_random_forest(dat, response = "weight", pvals = FALSE)
rf$models
## dummy phenotypes vs. predicted
with(rf$models$.base_.base$predictions, plot(pheno, predicted)) # not overfit

## End(Not run)


hemstrow/snpR documentation built on March 20, 2024, 7:03 a.m.