splits_selection: Split dataset and select variables
In APML: An Approach for Machine-Learning Modelling

splits_selection

R Documentation

Split dataset and select variables

Description

Split dataset into training data and testing data and select variables based on relative importance.

Usage

splits_selection(data,split_ratio,split_seed,
feature_model,imbalance,nfolds,
RAN_type,RAN.seed,smote.seed,
xcol_enter,distribution)

Arguments

`data`	A data.frame used to build models
`split_ratio`	A numeric value giving the proportion of total rows assigned to the training split. Must be less than 1.
`split_seed`	Random seed for splitting
`feature_model`	Name of the model used for feature selection. Currently only "gbm" (gradient boosted trees) and "rf" (random forest) are allowed.
`imbalance`	Logical, or "SMOTE" (for a categorical response). TRUE balances the class counts of the training data via over/under-sampling when building the model. "SMOTE" applies SMOTE and returns the SMOTE training data.
`nfolds`	Number of folds for K-fold cross-validation. Default: 5.
`RAN_type`	"both", "binominal" or "normal". "both" generates both binominal and normal random terms for feature selection; "binominal" or "normal" generates only that type of random term. Categorical or continuous variables whose relative importance is greater than that of the corresponding random term(s) are selected.
`RAN.seed`	Random seed for random term(s)
`smote.seed`	Random seed for SMOTE. Only used if argument "imbalance"="SMOTE"
`xcol_enter`	A character vector of variables that are required to enter the model, also called "forced entry". If xcol_enter contains the names of all independent variables, random terms are not used to select variables.
`distribution`	Distribution type. Must be one of: "AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom". Defaults to AUTO.
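
How split_ratio and split_seed interact can be sketched in base R. This is only an illustration, not the function's own code: it uses the built-in mtcars data and sample() as stand-ins, whereas splits_selection relies on h2o, whose splitting may behave differently (for example, it may not return exactly this number of rows).

```r
# Illustrative seeded train/test split in base R (not the APML/h2o implementation).
split_ratio <- 0.8   # plays the role of the split_ratio argument
split_seed  <- 123   # plays the role of the split_seed argument

set.seed(split_seed)                     # same seed, same split
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(split_ratio * n))

train_data <- mtcars[train_idx, ]        # rows drawn for training
test_data  <- mtcars[-train_idx, ]       # all remaining rows
```

Rerunning the block with the same seed reproduces the same partition; changing the seed changes which rows land in each split but not their counts.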

Details

This function uses random terms to select variables. Randomly generated terms are added as extra predictors, and only variables whose relative importance is greater than that of the corresponding random term are treated as truly important.
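
The idea of selecting against random terms can be sketched in base R. This is a simplified illustration under stated assumptions, not the function's implementation: the simulated data, the column names RAN_normal and RAN_binominal, and the use of absolute correlation as the importance score are all hypothetical stand-ins for the gbm or rf relative importance that the function obtains through h2o.

```r
# Illustrative random-term variable selection (base R only).
set.seed(1)
n  <- 200
x1 <- rnorm(n)              # informative continuous predictor
x2 <- rnorm(n)              # uninformative continuous predictor
g1 <- rbinom(n, 1, 0.5)     # informative binary predictor
y  <- 2 * x1 + 1.5 * g1 + rnorm(n)
df <- data.frame(y, x1, x2, g1)

# Add one random term of each type (hypothetical column names).
df$RAN_normal    <- rnorm(n)
df$RAN_binominal <- rbinom(n, 1, 0.5)

# Stand-in importance score: absolute correlation with the response.
predictors <- setdiff(names(df), "y")
importance <- sapply(predictors, function(v) abs(cor(df[[v]], df$y)))

# Compare binary variables with the binominal random term and
# continuous variables with the normal random term.
is_binary <- sapply(df[predictors], function(col) length(unique(col)) == 2)
threshold <- ifelse(is_binary,
                    importance["RAN_binominal"],
                    importance["RAN_normal"])

# Keep real variables that beat their corresponding random term.
selected <- predictors[importance > threshold & !grepl("^RAN_", predictors)]
print(selected)
```

A variable that scores no better than pure noise of the same type gives no evidence of carrying signal, which is why the random term serves as the cutoff.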

Value

`importance`	A data.frame containing the relative importance scores of selected variables.
`train_data`	Training dataset. If "imbalance"="SMOTE", it returns the SMOTE training set.
`test_data`	Testing dataset.
`raw_traindata`	Original training dataset. If "imbalance"="SMOTE", it is the training set before SMOTE was applied; otherwise it is the same as train_data.

Note

This function is based on the h2o package, so h2o.init() must be run before calling it. The response variable must be the first column of the data.

Examples


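library(APML)
library(survival)
library(h2o)
library(performanceEstimation)
data("lung")
# Preprocess the data (dummy coding and rescaling)
data <- datatrans(lung,factor_dummy = 'dummy',rescale = TRUE)
# Reorder the columns so that the response variable comes first (see Note)
data <- data[,c(3,1,2,4:14)]
# Start the h2o cluster; splits_selection() requires it
h2o.init()
selection <- splits_selection(data,imbalance = 'SMOTE')
# Shut the cluster down and give it a moment to exit
h2o.shutdown(prompt=FALSE)
Sys.sleep(2)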

APML documentation built on May 12, 2022, 9:06 a.m.