This package provides functions for preparing data, training models, testing models, and evaluating the outcomes. Its goal is to be an extendable application programming interface (API) for creating and testing predictions made through machine learning.
The package aims to create and run tests for two different machine learning problems: classification and regression. Classification predicts a categorical outcome, whereas regression predicts a continuous outcome. The two problems require different approaches to data preparation, modeling and evaluation.
The main package functions fall into one of six categories:
The required parameters for creating a test with createtest are a dataset, a dependent variable, a problem, a method, a name, and an index of the rows to use as the training set.
createtest checks that its inputs are not missing and are in the right shape. If possible, it will try to fix issues. For example, classification requires the dependent variable to be a factor, but createtest will convert other vector types. Furthermore, createtest tries to provide informative messages when it cannot fix an issue.
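The kind of input check described above can be sketched in base R. This is a minimal illustration, not the package's code: the function name check_dependent_sketch and its messages are hypothetical, and only the factor conversion for classification is taken from the text.

```r
# Hypothetical helper, not part of crtests: validates the dependent variable
# and converts it to a factor when the problem is classification.
check_dependent_sketch <- function(data, dependent, problem) {
  if (!dependent %in% names(data)) {
    stop("Column '", dependent, "' not found in data.")
  }
  if (problem == "classification" && !is.factor(data[[dependent]])) {
    message("Converting '", dependent, "' to a factor for classification.")
    data[[dependent]] <- as.factor(data[[dependent]])
  }
  data
}

# cyl is numeric in mtcars, so it is converted
cars <- check_dependent_sketch(mtcars, "cyl", "classification")
is.factor(cars$cyl)
```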
It returns a list of the class determined by the problem parameter. This class determines, amongst others, which preparation, training, testing and evaluation functions will be called by
The structure of the return value (the test) is as follows:
dependent : The fully specified column name of the dependent variable.
problem : The type of problem for the test (classification or regression). This is therefore the same as the test class.
data : The list of train and holdout data.frames.
name : The name of the test. This could be used when printing the test results.
description : A description of the test.
method : The method that should be used for solving the problem: a single-item list of class method. This is used to determine which functions are to be called for creating a model and making predictions with that model.
extra.args : Extra arguments that should be passed to the runtest method executing the test. These are not used in the package, but exist for extendability.
: The call to the createtest function. Again, this could be used when printing the test results, so it is obvious at a glance what the test was about.
The package provides the runtest generic function, which allows execution to be redirected based on the class of the test. Although there are perhaps cases where this distinction is meaningful, the package only provides
The default implementation first calls prepare. It then calls train_model with the data from prepare, and make_predictions with the model from train_model. Finally, it calls evaluate with the predictions from make_predictions and the holdout set from prepare, and returns the result.
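The four-step flow of the default implementation can be illustrated with a self-contained sketch. All function names ending in _sketch are hypothetical stand-ins, the model is a plain lm, and the test object is reduced to the two fields the sketch needs; the package's own functions may differ in signature and behaviour.

```r
# Hypothetical stand-ins for prepare, train_model, make_predictions and evaluate.
prepare_sketch <- function(test) {
  # Drop rows with missing values from both sets
  list(train = na.omit(test$data$train), holdout = na.omit(test$data$holdout))
}
train_sketch <- function(data, test) {
  lm(as.formula(paste(test$dependent, "~ .")), data = data$train)
}
predict_sketch <- function(model, data) {
  predict(model, newdata = data$holdout)
}
evaluate_sketch <- function(predictions, data, test) {
  errors <- data$holdout[[test$dependent]] - predictions
  c(rmse = sqrt(mean(errors^2)))
}

# The pipeline: each step consumes the output of the previous one
run_sketch <- function(test) {
  data <- prepare_sketch(test)
  model <- train_sketch(data, test)
  predictions <- predict_sketch(model, data)
  evaluate_sketch(predictions, data, test)
}

set.seed(1)
train_index <- sample(nrow(iris), 100)
test <- list(
  dependent = "Sepal.Width",
  data = list(train = iris[train_index, ], holdout = iris[-train_index, ])
)
result <- run_sketch(test)
result
```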
The main preparation function is prepare. It is a generic function, which allows for different implementations for the different classes of problems. However, the package itself only provides prepare.default, as it provides generic data preparation that is suitable for both regression and classification.
prepare.default calls a helper function prepare_data, which removes all missing values and provides informative messages about what happened. Furthermore, this function handles an issue some algorithms cannot deal with: different levels for factors in the train and holdout sets. To handle this, apply_levels goes through the data column by column and replaces the levels of the factors in the holdout set with those of the same column in the training set.
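Aligning factor levels column by column could look roughly like the sketch below. The name apply_levels_sketch is hypothetical, and the choice to turn holdout values that were never seen in training into NA is an assumption of this sketch, not something the text specifies.

```r
# Hypothetical illustration: give every factor in the holdout set
# the levels of the same column in the training set.
apply_levels_sketch <- function(train, holdout) {
  for (column in names(train)) {
    if (is.factor(train[[column]]) && column %in% names(holdout)) {
      # Values not present in the training levels become NA (sketch assumption)
      holdout[[column]] <- factor(holdout[[column]], levels = levels(train[[column]]))
    }
  }
  holdout
}

train <- data.frame(colour = factor(c("red", "green", "red")))
holdout <- data.frame(colour = factor(c("green", "blue")))

fixed <- apply_levels_sketch(train, holdout)
levels(fixed$colour)
is.na(fixed$colour)
```

After the call, both data frames share the same level set, which is what algorithms that compare factor codes between training and prediction data expect.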
method_prepare is a generic function that makes it possible to have machine learning method-specific data preparation. These functions are called with the method attribute of the test object. Most methods do not need specific preparation, so method_prepare.default simply returns identity(data). Random Forests do need method-specific preparation, as they cannot deal with categorical predictors with more than 32 categories. Therefore, group_levels is used to group infrequent factor levels, so that no factor has more than 32 levels.
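Grouping infrequent levels can be sketched as follows. The function name group_levels_sketch, the max_levels argument and the "other" label are all hypothetical; the sketch keeps the most frequent levels and merges the rest into one catch-all level, which is one plausible reading of the description above.

```r
# Hypothetical illustration: keep the (max_levels - 1) most frequent levels
# and merge all remaining levels into a single catch-all level.
group_levels_sketch <- function(x, max_levels = 32, other = "other") {
  x <- as.factor(x)
  if (nlevels(x) <= max_levels) {
    return(x)
  }
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(max_levels - 1)]
  values <- as.character(x)
  values[!is.na(values) & !(values %in% keep)] <- other
  factor(values, levels = c(keep, other))
}

set.seed(42)
x <- factor(sample(sprintf("level%02d", 1:40), 500, replace = TRUE))
grouped <- group_levels_sketch(x)
nlevels(grouped)
```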
The train_model functions train or fit a model to the data through some form of machine learning or statistical technique. There exist two implementations: train_model.classification for creating a classification model and train_model.regression for fitting a regression model.
The model itself is created by classification_model, of which classification_model.default calls a machine learning method with a standard set of parameters. classification_model exists so that method-specific implementations can be made that need other parameters. For example, rpart requires a formula, and classification_model.rpart makes one from the dependent variable of the test object.
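Building a model formula from the name of the dependent variable takes one line in base R. The sketch below is an illustration under assumptions: it uses all other columns as predictors, and it only fits a tree when the rpart package happens to be installed.

```r
# Build "Species ~ ." from the name of the dependent variable
dependent <- "Species"
model_formula <- as.formula(paste(dependent, "~ ."))
model_formula

# Fit a classification tree with that formula, if rpart is available
if (requireNamespace("rpart", quietly = TRUE)) {
  model <- rpart::rpart(model_formula, data = iris, method = "class")
  class(model)
}
```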
train_model.regression is the analogue of train_model.classification for regression. It calls regression_model with parameters that include data. Again, methods can have their own implementations.
The make_predictions functions serve as a wrapper to predict, with the holdout set as newdata and the model produced by train_model. If a machine learning method requires different parameters, the solution is an implementation of make_predictions for that method. For example, predict.rpart needs a type argument, so make_predictions.rpart was developed.
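The same pattern, a default wrapper around predict plus a method-specific variant that passes an extra argument, can be shown with base R models. The generic make_predictions_sketch is hypothetical, and glm stands in for rpart here so the example runs without extra packages: glm predictions are on the link scale unless type = "response" is given.

```r
# Hypothetical generic that dispatches on the class of the model
make_predictions_sketch <- function(model, holdout, ...) {
  UseMethod("make_predictions_sketch")
}

# Default: a thin wrapper around predict
make_predictions_sketch.default <- function(model, holdout, ...) {
  predict(model, newdata = holdout)
}

# Method-specific variant: glm needs type = "response" to return probabilities
make_predictions_sketch.glm <- function(model, holdout, ...) {
  predict(model, newdata = holdout, type = "response")
}

train <- mtcars[1:24, ]
holdout <- mtcars[25:32, ]

linear <- lm(mpg ~ wt, data = train)
logistic <- glm(am ~ wt, data = train, family = binomial)

linear_predictions <- make_predictions_sketch(linear, holdout)
probabilities <- make_predictions_sketch(logistic, holdout)
probabilities
```

Because a glm object has class c("glm", "lm"), S3 dispatch picks the glm method first, while the lm object falls through to the default.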
The evaluate functions evaluate the performance of a model. evaluate.default extracts the dependent variable from the holdout set (the observations) and the predictions from make_predictions, and calls its helper method. The regression helper, evaluate_problem.regression, returns a result computed from observations - predictions.
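As an illustration of what can be computed from observations - predictions, the sketch below derives three common error summaries. The function name and the choice of metrics are assumptions for this example; the text does not say which summary the package returns.

```r
# Hypothetical illustration: summarise the errors of a regression model
evaluate_regression_sketch <- function(observations, predictions) {
  errors <- observations - predictions
  c(
    mean_error = mean(errors),       # signed bias
    mae = mean(abs(errors)),         # mean absolute error
    rmse = sqrt(mean(errors^2))      # root mean squared error
  )
}

observations <- c(3.0, 2.5, 4.0, 3.5)
predictions <- c(2.5, 3.0, 4.0, 3.0)
scores <- evaluate_regression_sketch(observations, predictions)
scores
```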
As a convenience, the package provides a function to run multiple instances of a test: multitest. Most of its parameters are simply passed along to create_and_run_test. The other parameters govern the sampling method used, as the goal of multiple runs of a test is usually to reduce sample bias.
There are two options: cross-fold, which makes sure every row of data is used as holdout exactly once; and random, which randomly selects a percentage of the data for training. These are implemented by
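The two sampling options can be sketched in base R. Both function names and the train_fraction argument are hypothetical; the sketch only shows how training indices for each iteration might be drawn, not how the package does it.

```r
# Cross-fold: assign each row to one fold; the training index of a fold
# is every row that is NOT in that fold.
cross_fold_indices_sketch <- function(n, folds) {
  fold_of_row <- sample(rep(seq_len(folds), length.out = n))
  lapply(seq_len(folds), function(fold) which(fold_of_row != fold))
}

# Random: draw a fresh random training sample for every iteration.
random_indices_sketch <- function(n, iterations, train_fraction = 0.7) {
  lapply(seq_len(iterations), function(i) sample(n, size = round(train_fraction * n)))
}

set.seed(123)
n <- nrow(iris)
cv <- cross_fold_indices_sketch(n, folds = 10)
rnd <- random_indices_sketch(n, iterations = 5)

# With cross-fold sampling, the holdout sets partition the rows
holdouts <- lapply(cv, function(train_index) setdiff(seq_len(n), train_index))
all_holdout_rows <- sort(unlist(holdouts))
```

With cross-fold sampling each row lands in exactly one holdout set, whereas random sampling may hold out some rows several times and others never.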
    library(crtests)
    library(randomForest)
    library(caret)
    data(iris)

    # A classification test
    test <- createtest(
      data = iris,
      dependent = "Species",
      problem = "classification",
      method = "randomForest",
      name = "An example classification test",
      train_index = sample(150, 100)
    )
    runtest(test)

    # A regression test
    test <- createtest(
      data = iris,
      dependent = "Sepal.Width",
      problem = "regression",
      method = "randomForest",
      name = "An example regression test",
      train_index = sample(150, 100)
    )
    runtest(test)
    library(crtests)
    library(randomForest)
    library(rpart)
    library(caret)
    library(stringr)

    # A classification multitest
    summary(
      multitest(
        data = iris,
        dependent = "Species",
        problem = "classification",
        method = "randomForest",
        name = "An example classification multitest",
        iterations = 10,
        cross_validation = TRUE,
        preserve_distribution = TRUE
      )
    )

    # A regression multitest
    summary(
      multitest(
        data = iris,
        dependent = "Sepal.Width",
        problem = "regression",
        method = "rpart",
        name = "An example regression multitest",
        iterations = 15,
        cross_validation = FALSE
      )
    )