xg_load_data: Load Data


View source: R/xgez.R

Description

xg_load_data loads a data file and returns a list containing all the elements prepared for xgboost modelling. The model relies principally on two categories of input: numeric (num) and categorical (cat).

Usage

xg_load_data(file, inputs = "auto", output, inputs.class = "auto",
  output.class = "auto", train.size = 1, seed = 1,
  na.handle = "inf", max.levels = 50)

Arguments

file

Character. Path or link to the file containing the data. The data are imported with the fread function from the data.table package, so the file must be in a format that fread can read, such as CSV.

inputs

Character vector. Names of the columns to use as inputs of the model. Only those columns will be used for the model. Setting it to "auto" uses every column of the table as an input except the one designated as output.

output

Character. A single string specifying the name of the output column for the model training.

inputs.class

Character vector. The classes of the input columns. If set to "auto", the classes are determined from the column types returned by the fread function. Otherwise, it must be a vector with exactly one element per input, each equal to num (for numerical inputs) or cat (for categorical inputs).

output.class

Character. Class of the output. If set to "auto", the class is determined from the column type returned by the fread function. Otherwise, it must be equal to num or cat, for a numerical or categorical output.
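The mapping from column types to the num and cat classes can be illustrated with a short base R sketch. It is an assumption about how "auto" detection might behave (numeric columns become num, everything else becomes cat), not the package's actual code; guess_class and the small data frame are hypothetical names introduced for illustration.

```r
# Hypothetical helper, not part of ezXg: assumes numeric columns map to
# "num" and all other column types map to "cat".
guess_class <- function(col) {
  if (is.numeric(col)) "num" else "cat"
}

# Toy table standing in for the result of fread.
toy <- data.frame(
  Age = c(22, 38, 26),
  Sex = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

# One class per column, in column order.
classes <- vapply(toy, guess_class, character(1))
print(classes)
```

Passing an explicit inputs.class vector instead of "auto" is the way to override a guess of this kind, for example to treat an integer-coded column as categorical.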

train.size

Numeric. Proportion of the data assigned to the training set for the future model. Ranges from 0 (no training set: will produce an error) to 1 (no test set).

seed

Numeric. Seed for reproducibility of the results.
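How train.size and seed interact can be sketched in base R as follows. This is a minimal illustration that assumes a simple random split of row indices; split_rows is a hypothetical helper, and the package may split the data differently.

```r
# Hypothetical helper, not part of ezXg: assumes a simple random split.
split_rows <- function(n, train.size = 1, seed = 1) {
  # A proportion of 0 would leave no training rows, so reject it.
  stopifnot(train.size > 0, train.size <= 1)
  set.seed(seed)                       # same seed, same split
  n_train <- round(train.size * n)
  train_idx <- sort(sample.int(n, n_train))
  list(train = train_idx,
       test  = setdiff(seq_len(n), train_idx))
}

idx <- split_rows(10, train.size = 0.8, seed = 1)
length(idx$train)  # number of training rows
length(idx$test)   # number of held-out rows
```

With train.size = 1 the test set in this sketch is empty, which matches the "no test set" case described above.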

na.handle

Character. How to handle NA values in numeric inputs. Five options are implemented:

  • inf: replace missing values with Inf.

  • mean: replace missing values with the mean of the column.

  • median: replace missing values with the median of the column.

  • max: replace missing values with the max of the column.

  • min: replace missing values with the min of the column.
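The five options listed above can be sketched in base R as a single replacement step. impute_na is a hypothetical helper written for illustration, and the assumption that the statistic is computed on the non-missing values of the column is mine; the package's own implementation may differ.

```r
# Hypothetical helper, not part of ezXg: illustrates the five na.handle
# options on a single numeric vector.
impute_na <- function(x, na.handle = c("inf", "mean", "median", "max", "min")) {
  na.handle <- match.arg(na.handle)
  # Statistics are computed on the non-missing values only (assumption).
  fill <- switch(na.handle,
    inf    = Inf,
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE)
  )
  x[is.na(x)] <- fill
  x
}

age <- c(22, NA, 38, 26, NA)
impute_na(age, "inf")     # missing values become Inf
impute_na(age, "median")  # missing values become the median of 22, 38, 26
```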

max.levels

Numeric. Maximum number of levels allowed for a categorical input. This parameter ensures that the model does not end up with too many input columns once the data are transformed into a one-hot encoded matrix.
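The reason for capping the number of levels is that one-hot encoding creates one column per level. The base R sketch below shows this with model.matrix; check_levels is a hypothetical helper, and what xg_load_data actually does with a column that exceeds max.levels is not stated here.

```r
# Hypothetical helper, not part of ezXg: TRUE when the number of distinct
# non-missing values stays within the limit.
check_levels <- function(x, max.levels = 50) {
  length(unique(x[!is.na(x)])) <= max.levels
}

embarked <- c("S", "C", "S", "Q")
check_levels(embarked)                  # within the default limit
check_levels(embarked, max.levels = 2)  # three levels exceed a limit of 2

# One-hot encoding with base R: one column per distinct level.
one_hot <- model.matrix(~ embarked - 1)
ncol(one_hot)
```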

Value

A list with following values:

Examples

d <- xg_load_data(system.file("extdata", "titanic.csv", package = "ezXg"),
               inputs = c("Pclass", "Sex", "Age", "SibSp",
                          "Parch", "Fare", "Embarked"),
               output = "Survived")

ArnaudBu/ezXg documentation built on Oct. 30, 2019, 4:59 a.m.