hyperparameter_random_forest: Estimate hyperparameters from simulations using a random...

View source: R/parameter_estimation.R

hyperparameter_random_forest    R Documentation

Estimate hyperparameters from simulations using a random forest.

Description

Use a random forest to estimate one or more hyperparameters from descriptions of GWAS p-value distributions derived from simulated effect size distributions, such as those produced by sim_gen.

Usage

hyperparameter_random_forest(
  x,
  meta,
  phenos,
  sims,
  hyperparameter_to_estimate = c("pi"),
  center = T,
  hold_percent = 0.25,
  num_trees = 1000,
  mtry = function(columns) columns,
  num_threads = NULL,
  parameter_transforms = reasonable_transform(hyperparameter_to_estimate)$forward,
  parameter_back_transforms = reasonable_transform(hyperparameter_to_estimate)$back,
  importance = "permutation",
  scheme = "gwas",
  peak_delta = 0.5,
  peak_pcut = 5e-04,
  window_sigma = 50,
  quantiles = seq(0 + 0.001, 1 - 0.001, by = 0.001),
  save_rf = FALSE,
  pass_windows = NULL,
  pass_G = NULL,
  GMMAT_infile = NULL,
  phased = FALSE,
  maf = 0.05,
  ...
)

Arguments

x

numeric matrix. Input genotypes, with SNPs as rows and individuals as columns. Genotypes are coded as 0, 1, and 2 for the major homozygote, heterozygote, and minor homozygote, respectively.

meta

data.frame. Metadata for SNPs. The first two columns must hold chromosome ID and position. Further columns are ignored.

phenos

numeric vector. Observed phenotypes, one per individual.
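As a rough illustration of the input shapes described for x, meta, and phenos, the sketch below builds a toy data set in base R. All values and names (snp1, ind1, chr, position, and so on) are made up for the example, and the column names in meta are an assumption; the documentation only requires that the first two columns hold chromosome ID and position.

```r
# Toy genotype matrix: 4 SNPs (rows) by 3 individuals (columns).
# 0 = major homozygote, 1 = heterozygote, 2 = minor homozygote.
x <- matrix(
  c(0, 1, 2,
    0, 0, 1,
    2, 1, 0,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("snp", 1:4), paste0("ind", 1:3))
)

# SNP metadata: chromosome ID first, position second (hypothetical values).
meta <- data.frame(
  chr = c("chr1", "chr1", "chr2", "chr2"),
  position = c(1500, 98000, 4200, 77000)
)

# One observed phenotype per individual (hypothetical values).
phenos <- c(10.2, 11.5, 9.8)

# Basic consistency checks between the three inputs.
stopifnot(
  all(x %in% c(0, 1, 2)),
  nrow(meta) == nrow(x),
  length(phenos) == ncol(x)
)
```

The checks at the end are not part of the package; they simply make the expected relationships between the inputs explicit.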

sims

data.frame. A data.frame matching that produced by sim_gen, containing descriptive statistics for the estimated effect sizes/association p-values from the simulated data, as well as columns containing the hyperparameters of interest for those simulations. Each row should be a single simulation. Columns not matching those expected and produced by sim_gen will be ignored.

hyperparameter_to_estimate

character vector, default "pi". Names of the hyperparameters to estimate via random forest. Must match column names in sims.

center

logical, default T. Determines if the phenotypes provided should be centered (have their mean set to 0). This should match the setting provided to sim_gen; the defaults of the two functions agree.
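Centering as described here amounts to subtracting the mean. A minimal base R sketch with made-up phenotype values follows; how the function performs this step internally is not shown in this documentation.

```r
# Hypothetical observed phenotypes.
phenos <- c(10.2, 11.5, 9.8, 12.1)

# Center so that the mean is 0.
centered <- phenos - mean(phenos)

# The centered values keep their spread but now average to (numerically) zero.
mean(centered)
```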

hold_percent

numeric < 1 and > 0, default 0.25. Proportion of sims to hold out from model estimation for use in cross-evaluation.
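To make the hold-out proportion concrete, the sketch below splits a hypothetical set of 1000 simulations into training and held-out rows. The use of sample and round is an assumption for illustration; the function's actual splitting procedure is not described here.

```r
set.seed(42)  # for a reproducible example only

n_sims <- 1000        # hypothetical number of rows in sims
hold_percent <- 0.25  # the documented default

# Number of simulations held out for cross-evaluation.
n_hold <- round(n_sims * hold_percent)

# Randomly choose held-out rows; the rest are used to fit the forest.
hold_idx <- sample(n_sims, n_hold)
train_idx <- setdiff(seq_len(n_sims), hold_idx)

c(train = length(train_idx), held_out = length(hold_idx))
```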

num_trees

numeric, default 1000. Number of trees to grow in the random forest.

mtry

function, default function(columns) columns. A function that, when given the number of columns containing distribution summary statistics, returns the number of variables to possibly split at each node during random forest. For example, function(columns) columns/2 would have an mtry equal to half the number of summary statistics. See ranger for details about mtry.
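The mtry argument is a function of the number of summary-statistic columns. The sketch below shows a few candidate functions and what they return for a hypothetical count of 40 columns. Wrapping the results in floor is a precaution added here to guarantee whole numbers, not something the documentation requires.

```r
# The documented default: consider every summary statistic at each split.
mtry_all <- function(columns) columns

# Half of the summary statistics, rounded down to a whole number.
mtry_half <- function(columns) floor(columns / 2)

# Square root of the number of statistics, rounded down.
mtry_sqrt <- function(columns) floor(sqrt(columns))

n_stats <- 40  # hypothetical number of summary-statistic columns

c(all = mtry_all(n_stats), half = mtry_half(n_stats), sqrt = mtry_sqrt(n_stats))
```

Any of these functions could be supplied as the mtry argument; smaller values decorrelate trees more strongly at the cost of weaker individual trees.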

num_threads

numeric, default NULL. Number of processing threads to use for tree growth and cross-evaluation.

importance

character, default "permutation". Determines how variable importance is computed, if at all. See ranger for details. Note that "permutation", while accurate and thus the default, can be very slow. Variable importance is generally not needed for genetic architecture prediction, and this can be set to "none" if it is not wanted.

peak_delta

numeric, default 0.5. Value used to determine spacing between called peaks during peak identification for distribution description.

peak_pcut

numeric, default 0.0005. Only p-values below this quantile will be used for peak detection during peak identification for distribution description.

window_sigma

numeric, default 50. Size of the windows in megabases to be used during distribution description.

quantiles

numeric, default seq(0 + 0.001, 1 - 0.001, by = 0.001). Density quantiles over which to estimate parameter values.
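The default quantile grid can be inspected directly in base R. The sketch below builds the default grid and, as a hypothetical alternative, a coarser grid that would request fewer quantiles.

```r
# The documented default grid of density quantiles.
quantiles <- seq(0 + 0.001, 1 - 0.001, by = 0.001)

# A coarser, hypothetical alternative in steps of 0.05.
coarse_quantiles <- seq(0.05, 0.95, by = 0.05)

# Compare the size and range of the two grids.
length(quantiles)
range(quantiles)
length(coarse_quantiles)
```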

save_rf

logical, default FALSE. If TRUE, the raw ranger random forest object is returned. This object can be extremely large and is only needed if different quantiles, predictions, etc. are required.

...

Extra arguments passed to ranger.

parameter_transforms

Named list of parameter transformation functions or NULL, default the reasonable_transform for the named hyperparameter. Transformations to use on any estimated parameters. Any estimated hyperparameters with names matching those in this list will be transformed as given. Useful if pi or other hyperparameter values in the simulations are heavily skewed, as is often the case.

parameter_back_transforms

Named list of parameter back-transformation functions or NULL, default the reasonable_transform for the named hyperparameter. Back-transformations to use on any estimated parameters. Any estimated hyperparameters with names matching those in this list will be back-transformed as given prior to being returned.
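A matched pair of transform lists might look like the sketch below. The choice of log10 and its inverse is purely illustrative and is not necessarily what reasonable_transform returns; the only points taken from the documentation are that both lists are named by hyperparameter and that the back-transform should undo the forward transform.

```r
# Hypothetical forward transform for "pi": compress a skewed range with log10.
parameter_transforms <- list(pi = function(x) log10(x))

# Matching back-transform: must invert the forward transform exactly.
parameter_back_transforms <- list(pi = function(x) 10^x)

# Made-up pi values, skewed toward 1 as simulated values often are.
pi_vals <- c(0.9, 0.99, 0.999)

# A round trip through both functions should recover the original values.
transformed <- parameter_transforms$pi(pi_vals)
round_trip <- parameter_back_transforms$pi(transformed)
all.equal(round_trip, pi_vals)
```

Checking the round trip before fitting is a cheap way to catch a mismatched pair, since a wrong back-transform would silently distort the returned estimates.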

Author(s)

William Hemstrom


hemstrow/GeneArchEst documentation built on June 10, 2025, 5:06 a.m.