xg_load_data: Load Data
In ArnaudBu/ezXg: Easy xgboost wrapper


View source: R/xgez.R

xg_load_data loads the data and returns a list with all the elements prepared for xgboost modelling. The model relies principally on two categories of input: numeric (num) and categorical (cat).

xg_load_data(file, inputs = "auto", output, inputs.class = "auto",
  output.class = "auto", train.size = 1, seed = 1,
  na.handle = "inf", max.levels = 50)

`file`	Character. The path or link to the file containing the data. The data are imported with the `fread` function from the `data.table` package, so the format must be consistent with a CSV file.
`inputs`	Character vector. Vector of the column names for the inputs of the model. Only those columns will be used for the model. Using the "auto" value will use as inputs all the columns from the table except the one labelled as output.
`output`	Character. A single string specifying the name of the output column for the model training.
`inputs.class`	Character vector. A vector specifying the classes of the input columns. If set to "auto", the classes will be determined from the output of the fread function. Otherwise, it must be a vector whose length is exactly the number of inputs and whose values can only be num (for numerical inputs) and cat (for categorical inputs).
`output.class`	Character. Class of the output. If set to "auto", the class will be determined from the output of the fread function. Otherwise, it must be equal to num or cat, for a numerical or categorical output respectively.
`train.size`	Numeric. Share of the data assigned to the training set for the future model. Can go from 0 (no training set: will produce an error) to 1 (no test set).
`seed`	Numeric. Seed for reproducibility of the results.
`na.handle`	Character. How to handle NA values in numeric inputs. Five options are implemented: inf (replace missing values with `Inf`); mean (replace missing values with the mean of the column); median (replace missing values with the median of the column); max (replace missing values with the max of the column); min (replace missing values with the min of the column).
`max.levels`	Numeric. Maximum number of levels allowed for a category. This parameter is here to make sure that the model does not have too many input columns once the data are transformed into a one-hot encoded matrix.
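
The `na.handle`, `train.size` and `seed` arguments can be illustrated with a small base R sketch. This is not the ezXg source: `impute_na` is a hypothetical helper written for this illustration, and the way the split is drawn (flooring the training count and sampling row indices) is an assumption about one reasonable implementation.

```r
# Hypothetical helper (not part of ezXg): applies one of the documented
# na.handle options to a single numeric vector.
impute_na <- function(x, na.handle = c("inf", "mean", "median", "max", "min")) {
  na.handle <- match.arg(na.handle)
  fill <- switch(na.handle,
    inf    = Inf,
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE)
  )
  x[is.na(x)] <- fill  # only the missing entries are replaced
  x
}

age <- c(22, NA, 38, 26, NA, 54)
age_inf    <- impute_na(age, "inf")     # NAs become Inf
age_median <- impute_na(age, "median")  # NAs become the median of 22, 26, 38, 54

# Assumed split logic: fix the seed, then flag a share `train.size` of rows.
set.seed(1)
train.size <- 0.8
n <- 10
n_train <- floor(train.size * n)
train_flag <- seq_len(n) %in% sample(n, n_train)  # TRUE for training rows
```

Fixing the seed before sampling is what makes the split reproducible from one call to the next.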

A list with the following values:

train: training set for the model, with a matrix for the input values and a vector for the target variables.
test: test set for the model, in the same format as the training set.
formula: the formula used for constructing the model matrix and that is applied when running the model.
template: an empty data.table that has saved all the input values and that is used to appropriately format data when using the prediction function.
data: A data.table with the cleaned data and an additional logical column, train, that indicates which data are used in the training data set.
na.handle: the na.handle value, passed along so that the same handling of missing values can be reapplied at prediction time.
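
To make the role of the formula and of `max.levels` more concrete, here is a minimal base R sketch of one-hot encoding through a formula and `model.matrix`. Whether ezXg builds its matrix exactly this way is an assumption; the toy data frame `df` and the variable names below are made up for the illustration.

```r
# Toy data: one numeric input and two categorical inputs (hypothetical values).
df <- data.frame(
  Age      = c(22, 38, 26, 35),
  Sex      = factor(c("male", "female", "female", "male")),
  Embarked = factor(c("S", "C", "S", "Q"))
)

# Formula without intercept; every input column appears on the right-hand side.
f <- ~ . - 1

# Ask for full indicator (one-hot) coding of every factor instead of the
# default treatment contrasts, which would drop one level per factor.
is_cat <- sapply(df, is.factor)
full_contrasts <- lapply(df[is_cat], contrasts, contrasts = FALSE)

x <- model.matrix(f, data = df, contrasts.arg = full_contrasts)

# One column per numeric input plus one column per factor level.
n_cols <- ncol(x)
n_expected <- sum(!is_cat) + sum(sapply(df[is_cat], nlevels))
```

Each categorical input contributes as many columns as it has levels, which is why capping the number of levels keeps the encoded matrix from growing too wide.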

d <- xg_load_data(system.file("extdata", "titanic.csv", package = "ezXg"),
               inputs = c("Pclass", "Sex", "Age", "SibSp",
                          "Parch", "Fare", "Embarked"),
               output = "Survived")