ezXg - Easy Xgboost Implementation
ezXg is a simple R utility, designed as a package, whose goal is to simplify model calibration with the xgboost R library and make trained models directly usable on new data.
It is inspired by Uber's Ludwig project, which makes it easy to train models with TensorFlow.
It provides five functions to prepare data, train a model and make predictions:
- `xg_load_data`: load and clean the data.
- `xg_train`: train the model.
- `xg_gs`: grid search for hyperparameter selection.
- `xg_predict`: prediction using the model.
- `xg_auto_ml`: auto ML feature for running a model in one line.
The package can be installed from its GitHub repository with the
install_github function from the devtools package.
It relies on the xgboost library.
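A typical installation might look like the sketch below; the GitHub account name is a placeholder, since the repository owner is not stated here.

```r
# install.packages("devtools")       # if devtools is not yet available
# Replace "user" with the actual GitHub account hosting the package.
devtools::install_github("user/ezXg")
library(ezXg)
```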
For example purposes, we will use the famous Kaggle Titanic dataset, which ships with the package:
titanic <- system.file("extdata", "titanic.csv", package = "ezXg")
The data can easily be loaded with the xg_load_data function:
d <- xg_load_data(titanic, inputs = c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"), output = "Survived", train.size = 0.8)
The model can be trained with the xg_train function, which is a thin wrapper around the xgboost training routine:
md <- xg_train(d)
The xg_gs function implements a two-step calibration process to find an optimal set of hyperparameters for the model:
gs <- xg_gs(d)
The xg_predict function takes new data (as a data.frame or data.table) as input and makes predictions on these values:
library(data.table)  # for fread
new_data <- fread(titanic)
p <- xg_predict(md, new_data)
The auto ML feature is simply a wrapper that loads the data, runs a grid search and trains the model.
conf <- system.file("extdata", "ex_param.yml", package = "ezXg")
md <- xg_auto_ml(titanic, conf)
It is fed the path to the dataset and a YAML configuration file, whose structure is shown below.
global:
  seed: 1              # Seed for the model
  train.size: 0.8      # Training size for validation
  max.levels: 50       # Maximum number of levels for factors
  nthread: 2           # Number of threads for the training
  verbose: true        # Should information be printed?
  retrain.full: true   # If set to true, the model is trained on the full dataset at the end
model:
  inputs:              # Input columns (either a list or set to "auto")
    - "Pclass"
    - "Sex"
    - "Age"
    - "SibSp"
    - "Parch"
    - "Fare"
    - "Embarked"
  output: "Survived"   # Output column
  inputs.class: "auto" # Input classes (either a list or set to "auto")
  output.class: "auto" # Output class (either "cat", "num" or "auto")
  na.handle: "mean"    # Way to handle NA for numeric values
param:
  eta:                 # Eta parameter, either one value or a list if cv > 1
    - 0.05
    - 0.1
    - 0.15
  gamma:               # Gamma parameter, either one value or a list if cv > 1
    - 0.0
    - 0.1
    - 0.2
    - 0.3
  max_depth:           # Max_depth parameter, either one value or a list if cv > 1
    - 5
    - 6
    - 8
  colsample_bytree:    # Colsample_bytree parameter, either one value or a list if cv > 1
    - 0.8
    - 0.9
    - 1.0
  min_child_weight:    # Min_child_weight parameter, either one value or a list if cv > 1
    - 1
    - 3
  nrounds: 100         # Number of rounds for the training
  objective: "auto"    # Objective function
  cv: 5                # Number of folds for cross-validation. If set to 1, there is none.
The fields are simply the different arguments of the functions contained in the library, grouped under three categories:
- global: global parameters;
- model: model-related parameters, such as the input and output columns;
- param: the parameters and hyperparameters for model calibration.
md <- xg_auto_ml(system.file("extdata", "titanic.csv", package = "ezXg"),
                 system.file("extdata", "ex_param.yml", package = "ezXg"))
Most of the fields are optional and have default values.
The only required field is the name of the output column.
The following file structure is thus valid:
model:
  output: "Survived"