In hemstrow/GeneArchEst: Estimate Genomic Architecture for Quantitative Traits

hyperparameter_random_forest

R Documentation

Estimate hyperparameters from simulations using a random forest.

Description

Estimate one or more hyperparameters with a random forest trained on descriptions of GWAS p-value distributions from simulated effect size distributions, such as those produced by sim_gen.

Usage

hyperparameter_random_forest(
  x,
  meta,
  phenos,
  sims,
  hyperparameter_to_estimate = c("pi"),
  center = T,
  hold_percent = 0.25,
  num_trees = 1000,
  mtry = function(columns) columns,
  num_threads = NULL,
  parameter_transforms = reasonable_transform(hyperparameter_to_estimate)$forward,
  parameter_back_transforms = reasonable_transform(hyperparameter_to_estimate)$back,
  importance = "permutation",
  scheme = "gwas",
  peak_delta = 0.5,
  peak_pcut = 5e-04,
  window_sigma = 50,
  quantiles = seq(0 + 0.001, 1 - 0.001, by = 0.001),
  save_rf = FALSE,
  pass_windows = NULL,
  pass_G = NULL,
  GMMAT_infile = NULL,
  phased = FALSE,
  maf = 0.05,
  ...
)
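
The `mtry` and `quantiles` defaults in the signature above are plain R expressions, so they can be evaluated on their own. The sketch below uses only base R and assumes a hypothetical count of 20 summary-statistic columns (`n_stat_columns` is an illustrative name, not part of the package) to show what each default yields next to the `function(columns) columns/2` alternative.

```r
# Hypothetical number of summary-statistic columns in `sims`;
# the real count depends on what sim_gen produced.
n_stat_columns <- 20

# Default mtry: consider every summary statistic at each split.
mtry_default <- function(columns) columns

# Alternative: consider half of the summary statistics at each split.
mtry_half <- function(columns) columns / 2

mtry_default(n_stat_columns)
mtry_half(n_stat_columns)

# Default quantile grid: 0.001 to 0.999 in steps of 0.001.
quantiles <- seq(0 + 0.001, 1 - 0.001, by = 0.001)
length(quantiles)
range(quantiles)
```

Passing a function rather than a fixed number lets `mtry` scale with however many summary statistics the simulations happen to contain.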

Arguments

`x`	numeric matrix. Input genotypes, with SNPs as rows and individuals as columns. Genotypes are coded as 0, 1, and 2 for the major homozygote, heterozygote, and minor homozygote, respectively.
`meta`	data.frame. Metadata for SNPs. The first two columns must hold chromosome ID and position. Further columns are ignored.
`phenos`	numeric vector. Observed phenotypes, one per individual.
`sims`	data.frame. A data.frame matching that produced by `sim_gen`, containing descriptive statistics for the estimated effect sizes/association p-values from the simulated data, as well as columns holding the hyperparameters of interest for those simulations. Each row should be a single simulation. Columns not matching those expected and produced via `sim_gen` will be ignored.
`hyperparameter_to_estimate`	character vector, default "pi". Names of the hyperparameters to estimate via random forest. Must match column names in sims.
`center`	logical, default T. Determines if the phenotypes provided should be centered (have their means set to 0). This should match the setting passed to `sim_gen`; the defaults of the two functions agree.
`hold_percent`	numeric < 1 and > 0, default .25. Proportion of sims to hold out from model estimation for use in cross-evaluation.
`num_trees`	numeric, default 1000. Number of trees to grow during the random forest.
`mtry`	function, default function(columns) columns. A function that, given the number of columns containing distribution summary statistics, returns the number of variables to consider for splitting at each node of the random forest. For example, function(columns) columns/2 would set mtry to half the number of summary statistics. See `ranger` for details about mtry.
`num_threads`	numeric, default NULL. Number of processing threads to use for tree growth and cross-evaluation.
`importance`	character, default "permutation". Determines how variable importance is computed, if at all. See `ranger` for details. Note that "permutation", while accurate and thus the default, can be very slow. Variable importance is generally not needed for genetic architecture prediction, and can be set to "none" if not wanted.
`peak_delta`	numeric, default 0.5. Value used to determine spacing between called peaks during peak identification for distribution description.
`peak_pcut`	numeric, default 0.0005. Only p-values below this quantile will be used for peak detection during peak identification for distribution description.
`window_sigma`	numeric, default 50. Size of the windows in megabases to be used during distribution description.
`quantiles`	numeric, default seq(0 + 0.001, 1 - 0.001, .001). Density quantiles over which to estimate parameter values.
`save_rf`	logical, default FALSE. If TRUE, the raw ranger random forest object is returned. This object can be extremely large and is only needed if different quantiles, predictions, etc. are required later.
`...`	Extra arguments passed to `ranger`.
`parameter_transforms`	Named list of parameter transformation functions or NULL, default the `reasonable_transform` for the named hyperparameter. Transformations to use on any estimated parameters. Any estimated hyperparameters with names matching those in this list will be transformed as given. Useful if pi or other hyperparameter values in simulations are heavily skewed, as is often likely.
`parameter_back_transforms`	Named list of parameter back transformation functions or NULL, default the `reasonable_transform` for the named hyperparameter. Back transformations to use on any estimated parameters. Any estimated hyperparameters with names matching those in this list will be back-transformed as given prior to being returned.
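
The argument formats described above can be illustrated with toy data in base R. This is only a sketch under stated assumptions: the sizes, the column names in `meta`, and the `log`/`exp` transform pair are hypothetical choices for illustration and are not necessarily what `reasonable_transform` returns. The `sims` argument is left out because it must come from `sim_gen`.

```r
set.seed(1)  # reproducible toy data

# Hypothetical sizes, for illustration only.
n_snps <- 100
n_inds <- 30

# x: SNPs as rows, individuals as columns, genotypes coded 0/1/2.
x <- matrix(sample(0:2, n_snps * n_inds, replace = TRUE),
            nrow = n_snps, ncol = n_inds)

# meta: chromosome ID in the first column, position in the second.
# The column names used here are assumptions.
meta <- data.frame(
  chr = rep(c("chr1", "chr2"), each = n_snps / 2),
  position = rep(seq(1000, by = 1000, length.out = n_snps / 2), times = 2)
)

# phenos: one observed phenotype per individual.
phenos <- rnorm(n_inds, mean = 10)

# What centering describes: phenotypes shifted to a mean of 0.
centered <- phenos - mean(phenos)

# Named lists of transforms; the names match hyperparameter_to_estimate.
# log/exp is an assumed pair chosen only to show the list structure.
parameter_transforms <- list(pi = log)
parameter_back_transforms <- list(pi = exp)

# Round trip on a toy value of pi close to 1.
pi_sim <- 0.999
pi_back <- parameter_back_transforms$pi(parameter_transforms$pi(pi_sim))
```

Each back transformation should invert the forward transformation of the same name, so that estimates are returned on the original scale of the hyperparameter.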