aklimate: aklimate


View source: R/aklimate.R

Description

AKLIMATE : Algorithm for Kernel Learning with Approximating Tree Ensembles

Usage

aklimate(dat, dat_grp, lbls, fsets, always_add = NULL,
  rf_pars = list(), akl_pars = list(), store_kernels = FALSE,
  verbose = FALSE)

Arguments

dat

A samples x features data frame; columns may be of different types.

dat_grp

A list of vectors, each holding suffixes for the data types used in dat. Each vector is one combination of data types that will be tested for every component RF; only the best-performing combination for a given feature set is retained. The suffixes must be distinct enough that none is a proper substring of another: c('cnv','cnv_gistic') is not OK, but c('MUTA:HOT','MUTA:NONSENSE') is. This argument is considered experimental; we recommend supplying a list of length 1 whose single entry is a vector of all possible suffixes.
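The substring rule can be checked before calling aklimate. The sketch below is illustrative only: find_suffix_conflicts is a hypothetical helper, not part of the package, and it assumes that a plain fixed-string match is the right notion of "substring".

```r
# Hypothetical helper (not part of aklimate): report suffix pairs in which
# one suffix is a proper substring of another.
find_suffix_conflicts <- function(suffixes) {
  conflicts <- character(0)
  for (a in suffixes) {
    for (b in suffixes) {
      # fixed = TRUE treats `a` as a literal string, not a regular expression
      if (a != b && grepl(a, b, fixed = TRUE)) {
        conflicts <- c(conflicts, paste(a, "in", b))
      }
    }
  }
  conflicts
}

# The two examples from the argument description
bad  <- find_suffix_conflicts(c("cnv", "cnv_gistic"))
good <- find_suffix_conflicts(c("MUTA:HOT", "MUTA:NONSENSE"))

# Recommended form: a list of length 1 holding all suffixes
dat_grp <- list(c("MUTA:HOT", "MUTA:NONSENSE"))
```

An empty result from the helper means the suffix set satisfies the rule as stated above.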

lbls

vector of training data labels

fsets

list of prior knowledge feature sets

always_add

Vector of dat column names to be included with every feature set in fsets.

rf_pars

List of parameters for the RF base kernel run, with the following entries:

ntree

Number of trees for RF kernel construction. Default is 1000.

min_node_prop

Minimal size of leaf nodes (unit is proportion of training set size). Default is 0.01.

min_nfeat

Minimal size of feature set (across all data modalities) for an RF to be constructed. Default is 15.

mtry_prop

Proportion of features to be considered for each splitting decision. Default is 0.25.

regression_q

For regression predictions only. Quantile of the per-sample empirical distribution of absolute differences between RF sample predictions and the sample label. Used to binarize sample predictions during best RF selection. Default is 0.05.

replace

TRUE/FALSE. Is subsampling to be done with replacement? Default is FALSE.

sample_frac

Fraction of training data points to subsample for each tree. Default is 0.5 for sampling without replacement and 1 for bootstrapping.

ttype

Type of learning task - choices are "binary","multiclass", and "regression". Default is "binary".

split_rule

Splitting criterion. Choices are "gini", "hellinger", "variance", and "beta". See the ranger documentation for more details. Default is "gini".

importance

Rule for calculating feature and feature set importance - choices are "impurity_corrected","permutation",and "impurity". Default is "impurity_corrected".

metric

Metric for ranking RF base learner performance used in the selection of best RFs. Choices are "roc","pr","acc","bacc","mar","rmse","rsq","mae","pearson", and "spearman". Default is "roc".

unordered_factors

How to treat unordered factors. Choices are "order","ignore", and "partition". See ranger for more details. Default is "order".

oob_cv

A data frame of parameters to tune while training all RF base learners, with OOB performance on the chosen metric (from the choices above) used to select the best combination. Each row is a different combination of RF hyperparameters. The data frame must contain at least two columns, one of which is "ntree". Too many hyperparameter combinations can slow computation significantly. Default is a one-row data frame using the "min_node_prop", "mtry_prop", and "ntree"/2 values of the rf_pars list. This argument is experimental; we recommend using the default setting.
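As an illustration of the rf_pars entries above, the following sketch builds the list with the documented defaults and attaches a small oob_cv grid. Only the "ntree" column name is required by the description; using "min_node_prop" and "mtry_prop" as the other column names is an assumption based on the stated default, and the grid values are arbitrary.

```r
# rf_pars with a subset of the documented defaults spelled out
rf_pars <- list(
  ntree         = 1000,
  min_node_prop = 0.01,
  mtry_prop     = 0.25,
  replace       = FALSE,
  sample_frac   = 0.5,          # documented default when replace = FALSE
  ttype         = "binary",
  split_rule    = "gini",
  importance    = "impurity_corrected",
  metric        = "roc"
)

# One row per hyperparameter combination; must include an "ntree" column.
# Column names other than "ntree" are assumed, and the values are illustrative.
rf_pars$oob_cv <- expand.grid(
  min_node_prop = c(0.01, 0.05),
  mtry_prop     = c(0.25, 0.5),
  ntree         = rf_pars$ntree / 2
)
```

This grid has four rows, so every RF base learner would be tuned over four combinations; larger grids multiply the training cost accordingly.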

akl_pars

List of parameters for best RF kernel selection and the MKL meta-learner, with the following entries:

topn

Number of RF kernels (ranked by the metric specified in rf_pars) that correctly predict a given sample and are to be included in the best RF list. Default is 5.

cvlen

Number of random MKL hyperparameter combinations to be tested during MKL CV step. Default is 100.

nfold

Number of folds to be used in MKL CV. Default is 5.

lamb

Interval bounds from which random MKL hyperparameter combinations are drawn (log2 units). Default is (-20,0).

subsetCV

TRUE/FALSE. When TRUE, the MKL CV step randomly varies the number of RF kernels in addition to the MKL regularization hyperparameters. It does so by training on a subset of K kernels, with K drawn at random from the interval (0, number of best RF kernels). Once K is selected, the top K kernels (ranked by the metric specified in rf_pars) are included in the current CV run. Default is TRUE.

type

Type of predictions - possible choices are "response" and "probability". Default is "response".

celnet

Hyperparameters for MKL elastic net run. Should be a vector of length 2. Default is NULL - hyperparameters are tuned via internal cross-validation.
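The akl_pars entries above can be assembled as below, followed by a rough sketch of what the lamb and subsetCV settings describe. The sampling code is not taken from the package: it assumes uniform draws in log2 space, two hyperparameters per combination (matching the length-2 celnet vector), and a hypothetical count of best RF kernels (n_best).

```r
# akl_pars with the documented defaults; celnet is left out so that the
# elastic net hyperparameters are tuned by internal cross-validation
akl_pars <- list(
  topn     = 5,
  cvlen    = 100,
  nfold    = 5,
  lamb     = c(-20, 0),
  subsetCV = TRUE,
  type     = "response"
)

set.seed(1)

# Assumed sampling scheme: cvlen random pairs, uniform on the log2 scale
log2_draws <- matrix(
  runif(2 * akl_pars$cvlen, min = akl_pars$lamb[1], max = akl_pars$lamb[2]),
  ncol = 2
)
lambdas <- 2^log2_draws   # back-transform from log2 units

# subsetCV sketch: draw K, then keep the top K kernels by rank
n_best    <- 12                          # hypothetical number of best RF kernels
K         <- sample(seq_len(n_best), 1)  # random subset size
top_k_idx <- seq_len(K)                  # ranks of the kernels used in this CV run
```

With lamb = c(-20, 0), the back-transformed values lie between 2^-20 and 1, which is why the bounds are easier to state in log2 units.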

store_kernels

TRUE/FALSE. Should the model store the training RF kernels? Default is FALSE.

verbose

TRUE/FALSE. Should the model print verbose progress statements? Default is FALSE.

Value

A model of class AKLIMATE with the following fields:

rf_stats

List of metrics and predictions from training run on all RF base learners.

kernels

RF kernels used in MKL training step. NULL if store_kernels is set to FALSE.

kern_cv

If akl_pars$celnet is NULL, the hyperparameter vectors examined during MKL cross-validation, along with their matching metric scores.

rf_models

Set of RF base learners used to produce RF kernels for stacked MKL.

akl_model

Trained spicer MKL model, with either user-supplied elastic net hyperparameters, or the hyperparameters selected via CV tuning.

rf_pars_global

rf_pars argument

rf_pars_local

Optimal RF parameters for each RF base learner. These are the same as the rf_pars_global parameters (with the exception of ntree) unless rf_pars$oob_cv was specified by the user.

akl_pars

akl_pars argument

dat_grp

dat_grp argument

idx_train

Vector of training data instances.

preds_train

AKLIMATE predictions on training set.


VladoUzunangelov/aklimate documentation built on Aug. 17, 2020, 4:40 a.m.