Train and Test datasets.

Description

This standard function allows multiple machine learning algorithms to be applied to the same data to determine which algorithm may be the most appropriate.

Usage

blkbox(data, labels, holdout, holdout.labels, ntrees, mTry, Kernel, Gamma,
  exclude, max.depth, xgtype = "binary:logistic", seed)

Arguments

data

Data partitioned into a list, or a data frame of training data where the features correspond to columns and the samples to rows. As data size increases, the memory required and the run time of some algorithms may grow exponentially.

labels

a character or numeric vector that contains the training class identifiers for the samples in the data frame, in the same order as the rows. Does not need to be specified if using a partitioned data list.

holdout

a data frame of holdout or testing data where the features correspond to columns and the samples to rows. Does not need to be specified if using a partitioned data list.

holdout.labels

a character or numeric vector that contains the holdout or testing class identifiers for the samples in the holdout data frame. Does not need to be specified if using a partitioned data list.

ntrees

The number of trees used in the ensemble-based learners (randomforest, bigrf, party, bartmachine). default = 500.

mTry

The number of features sampled at each node in the trees of the ensemble-based learners (randomforest, bigrf, party, bartmachine). default = sqrt(number of features).

Kernel

The type of kernel used in the support vector machine algorithm (linear, radial, sigmoid, polynomial). default = "linear".

Gamma

Advanced parameter; defines how far the influence of a single training example reaches. A low Gamma produces an SVM with softer boundaries; as Gamma increases, the boundaries eventually become restricted to their individual support vectors. default = 1/(ncol - 1).

exclude

removes certain algorithms from the analysis; for example, to exclude random forest, set exclude = "randomforest". The algorithms each have their own character identifier: randomforest = "randomforest", knn = "kknn", bartmachine = "bartmachine", party = "party", glmnet = "GLM", pam = "PamR", nnet = "nnet", svm = "SVM", xgboost = "xgboost".
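A brief sketch of excluding more than one learner at once. Whether exclude accepts a character vector of identifiers (rather than a single string) is an assumption here; the objects reuse the iris-based setup from the Examples section.

```r
# Prepare the same partitioned iris subset used in the Examples section.
my_data <- iris[1:100, 1:4]
my_labels <- as.character(iris[1:100, 5])
my_partition <- Partition(data = my_data, labels = my_labels)

# Assumption: exclude can take a vector of algorithm identifiers.
# This would run blkbox without random forest or xgboost.
model_subset <- blkbox(data = my_partition,
                       exclude = c("randomforest", "xgboost"))
```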

max.depth

The maximum depth of the trees in the xgboost model. default = sqrt(ncol(data)).

xgtype

Either "binary:logistic" or "reg:linear", for logistic regression or linear regression respectively.

seed

Sets the seed for the bartMachine model.

Author(s)

Zachary Davies, Boris Guennewig

Examples

my_data <- iris[1:100, 1:4]
my_labels <- as.character(iris[1:100, 5])
my_partition <- Partition(data = my_data, labels = my_labels)
model_1 <- blkbox(data = my_partition)
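The example above uses the partitioned-list interface. A sketch of the manual interface, supplying data, labels, holdout and holdout.labels explicitly, is shown below; the 70/30 split and the argument values are illustrative assumptions, not package defaults.

```r
# Manually split the first 100 iris rows into training and holdout sets.
set.seed(1)
idx <- sample(1:100, 70)

train_data <- iris[idx, 1:4]
train_labels <- as.character(iris[idx, 5])
test_data <- iris[setdiff(1:100, idx), 1:4]
test_labels <- as.character(iris[setdiff(1:100, idx), 5])

# Illustrative call: train on the 70-sample set, evaluate on the 30-sample
# holdout, with the documented defaults made explicit.
model_2 <- blkbox(data = train_data, labels = train_labels,
                  holdout = test_data, holdout.labels = test_labels,
                  ntrees = 500, Kernel = "linear")
```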