View source: R/association_functions.R
run_random_forest | R Documentation |
Creates random forest machine learning models from snpRdata objects via an
interface with ranger. Models can be created either for a specific phenotype
with no covariates or using a formula which follows the basic format
specified in formula.
run_random_forest(
x,
facets = NULL,
response,
formula = NULL,
num.trees = 10000,
mtry = NULL,
importance = "impurity_corrected",
interpolate = "bernoulli",
ncp = NULL,
ncp.max = 5,
pvals = TRUE,
par = FALSE,
...
)
x |
snpRdata object. |
facets |
character, default NULL. Categorical metadata variables by
which to break up analysis. See Facets_in_snpR. |
response |
character. Name of the column containing the response variable of interest. Must match a column name in sample metadata. |
formula |
character, default NULL. Model for the response variable, as
described in formula. |
num.trees |
numeric, default 10000. Number of trees to grow. Larger
values generally improve model accuracy but increase calculation time. See
ranger. |
mtry |
numeric, default is the square root of the number of SNPs. Number
of variables (SNPs) considered as split candidates at each node. See
ranger. |
importance |
character, default "impurity_corrected". The method by which SNP importance is determined. See
ranger for the available methods. |
interpolate |
character, default "bernoulli". Interpolation method for missing data. Options include "af" and "bernoulli", as described in the details below. |
ncp |
numeric or NULL, default NULL. Used only if |
ncp.max |
numeric, default 5. Used only if |
pvals |
logical, default TRUE. Determines if p-values should be
calculated for importance values. If the response variable is quantitative,
no p-values will be returned, since they must be calculated via permutation
and this is very slow. For details, see
|
par |
numeric, default FALSE. Number of parallel computing cores to use for computing RFs across multiple facet levels or within a single facet if only a single category is run (either a one-category facet or no facet). |
... |
Additional arguments passed to ranger. |
Random forest models can be created across multiple facets of the data at
once following the typical snpR framework explained in Facets_in_snpR. Since
RF models are calculated without allowing for any SNP-specific categories
(e.g. independent of chromosome etc.), any SNP-level facets provided will be
ignored. As usual, if facets is set to NULL, an RF will be calculated for all
samples without splitting across any sample metadata categories.
Since the ranger RF implementation can behave unexpectedly when given
incomplete data, missing genotypes will be imputed. Imputation can occur
either via the insertion of the average allele frequency or via binomial
draws for the minor allele, using the "af" or "bernoulli" options for the
"interpolate" argument, respectively.
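The two imputation strategies can be illustrated with a small base R sketch. This is not snpR's internal code: the genotype encoding (minor-allele counts of 0, 1, or 2, with NA for missing calls) and the helper names impute_af and impute_bernoulli are assumptions made for illustration only, and the values snpR actually inserts may be scaled differently.

```r
# Toy genotype matrix: rows are SNPs, columns are samples, entries are
# minor-allele counts (0, 1, 2), with NA marking a missing genotype.
# This encoding is an assumption made for illustration only.
set.seed(42)
geno <- matrix(rbinom(5 * 20, size = 2, prob = 0.3), nrow = 5)
geno[sample(length(geno), 10)] <- NA

# Per-SNP minor allele frequency estimated from the observed genotypes.
maf <- rowMeans(geno, na.rm = TRUE) / 2

# Hypothetical helper: fill each missing call with the expected allele
# count under the SNP's frequency (2 * maf), in the spirit of "af".
impute_af <- function(g, p) {
  miss <- which(is.na(g), arr.ind = TRUE)
  g[miss] <- 2 * p[miss[, "row"]]
  g
}

# Hypothetical helper: fill each missing call with the sum of two
# Bernoulli draws at the SNP's frequency, in the spirit of "bernoulli".
impute_bernoulli <- function(g, p) {
  miss <- which(is.na(g), arr.ind = TRUE)
  g[miss] <- rbinom(nrow(miss), size = 2, prob = p[miss[, "row"]])
  g
}

geno_af <- impute_af(geno, maf)
geno_bern <- impute_bernoulli(geno, maf)
```

The frequency-based fill is deterministic but produces non-integer values, while the binomial draws keep genotypes on the 0/1/2 scale at the cost of adding random noise.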
Extra arguments can be passed to ranger.
In general, random forest parameters should be tuned so as to reduce the out-of-bag error rate (OOB-ER). This value is visible in the returned object under the model lists: printing a specific model will output the OOB-ER, and it is also stored under the 'prediction.error' name in the model. For details on tuning RF models, we recommend Goldstein et al. (2011).
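As a rough sketch of this kind of tuning, the example below calls ranger directly on simulated data and compares the out-of-bag error across a few mtry values. The simulated predictors, the phenotype, and the candidate grid are all assumptions made for illustration; the same comparison could in principle be made by rerunning run_random_forest with different mtry or num.trees values and inspecting 'prediction.error' in each returned model.

```r
library(ranger)

# Simulated stand-in data: 200 samples, 50 SNP-like predictors coded as
# 0/1/2, and a quantitative phenotype driven by the first two predictors.
set.seed(1)
n <- 200
p <- 50
snps <- as.data.frame(matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n))
colnames(snps) <- paste0("snp", seq_len(p))
snps$pheno <- snps$snp1 - snps$snp2 + rnorm(n)

# Candidate values for the number of predictors tried at each split.
mtry_grid <- c(2, floor(sqrt(p)), 25)

# Fit one forest per candidate and record its out-of-bag prediction error
# (mean squared error for a quantitative response).
oob <- sapply(mtry_grid, function(m) {
  fit <- ranger(pheno ~ ., data = snps, num.trees = 500, mtry = m, seed = 1)
  fit$prediction.error
})

# Candidates sorted from lowest to highest out-of-bag error.
tuning <- data.frame(mtry = mtry_grid, oob_error = oob)
tuning[order(tuning$oob_error), ]
```

The setting with the lowest out-of-bag error is a reasonable starting point, although differences between nearby settings are often within the run-to-run noise of the forest.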
For more detail on the random forest model and ranger arguments, see ranger.
A list containing:
data: A snpRdata object with RF importance values merged in to the stats slot.
models: A named list containing both the models and data.frames containing the predictions vs observed phenotypes.
William Hemstrom
Wright, Marvin N and Ziegler, Andreas. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software.
Goldstein et al. (2011). Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology.
ranger, predict.ranger.
## Not run:
# run and plot a basic rf
## add some dummy phenotypic data.
dat <- stickSNPs
sample.meta(dat) <- cbind(weight = rnorm(ncol(stickSNPs)), sample.meta(stickSNPs))
## run rf
rf <- run_random_forest(dat, response = "weight", pvals = FALSE)
rf$models
## dummy phenotypes vs. predicted
with(rf$models$.base_.base$predictions, plot(pheno, predicted)) # not overfit
## End(Not run)