This function helps with feature engineering when your variables are already conditioned well enough for 2-way or deeper interactions and you are looking for non-linear relationships. It fits a decision tree (Classification and Regression Trees) and supports factor, integer, and numeric variables.
FeatureLookup(data, label, ban = NULL, antiban = FALSE, type = "auto",
  split = "information", folds = 5, seed = 0, verbose = TRUE,
  plots = TRUE, max_depth = 4, min_split = max(20, nrow(data)/1000),
  min_bucket = round(min_split/3), min_improve = 0.01,
  competing_splits = 2, surrogate_search = 5, surrogate_type = 2,
  surrogate_style = 0)
data
  Type: data.frame (preferred) or data.table. Your data, preferably a data.frame, although a data.table should also work.
label
  Type: vector. Your labels.
ban
  Type: vector of characters or of numerics. The names (or column numbers) of variables to be banned from the decision tree. Defaults to NULL.
antiban
  Type: boolean. Whether banned variable selection should be inverted, which means if
type
  Type: character. The type of problem to solve. Either classification (
split
  Type: character. If a classification task has been requested (
folds
  Type: integer or list of vectors. The folds to use for cross-validation. If you intend to keep the same folds over and over, it is preferable to provide your own list of folds. A numeric vector matching the length of
seed
  Type: integer. The random seed applied to the decision tree and the fold generation (if required).
verbose
  Type: boolean. Whether to print debug information about the model. For each node, a maximum of
plots
  Type: boolean. Whether to plot debug information about the model. If using knitr / Rmarkdown, you will have two plots printed: the complexity plot, and the decision tree. Without knitr / Rmarkdown, make sure you look at both. Defaults to TRUE.
max_depth
  Type: numeric. The maximum depth of the decision tree. Do not set to large values if the intent is for analysis. Defaults to 4.
min_split
  Type: integer. The minimum number of observations in a node to allow a split to be made. If this number is not reached in a node, the node is kept but any other potential splits are cancelled. Keep it large to avoid overfitting. Defaults to max(20, nrow(data)/1000).
min_bucket
  Type: integer. The minimum number of observations in a leaf. If this number is not reached in a leaf, the leaf is destroyed. Defaults to round(min_split/3).
min_improve
  Type: numeric. The minimum fitting improvement to create a node (complexity parameter in Classification and Regression Trees). For regression, the requirement for a leaf to be created and kept is an R-squared increase by at least
competing_splits
  Type: numeric. The number of best splitting rules retained per split. When using
surrogate_search
  Type: numeric. The number of surrogate splits to look for. A greater number means more surrogates will be looked for, but increased computation time is required. They are also printed when
surrogate_type
  Type: numeric. Controls the surrogate creation, with three possible values. If set to
surrogate_style
  Type: numeric. Controls the selection of the best surrogate, with two values. If set to
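The default expressions for min_split and min_bucket in the Usage signature scale with the number of rows. As a small worked example, the helper below simply re-evaluates those two expressions for a given row count; `default_sizes` is a hypothetical name used for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): re-evaluates the default
# expressions shown in the Usage signature for a given number of rows.
default_sizes <- function(n_rows) {
  min_split <- max(20, n_rows / 1000)   # default for min_split
  min_bucket <- round(min_split / 3)    # default for min_bucket
  c(min_split = min_split, min_bucket = min_bucket)
}

default_sizes(5000)    # min_split = 20, min_bucket = 7
default_sizes(120000)  # min_split = 120, min_bucket = 40
```

With fewer than 20,000 rows the floor of 20 applies, so the defaults only grow on larger data sets.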
To use this function properly, set max_depth to a very small value (such as 3). This ensures interpretability.
Moreover, if you have a sparse frame (with many missing values), keep an eye on surrogate_type and surrogate_style, as they dictate whether a split point is made depending on the missing values. The default values are chosen to handle missing values appropriately. However, if your intent is to penalize missing values (for instance, if missing values are anomalies), changing them to 0 and 1 respectively is recommended.
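The effect of the two surrogate settings can be sketched directly with rpart. This is a minimal illustration, not the function's own code: it assumes that surrogate_type and surrogate_style are passed through to the `usesurrogate` and `surrogatestyle` arguments of `rpart.control()` (their defaults match), and it uses the built-in airquality data, which has missing values in Ozone and Solar.R.

```r
library(rpart)

# airquality ships with R; Ozone and Solar.R contain missing values.
data(airquality)

# Assumed equivalent of the defaults (surrogate_type = 2, surrogate_style = 0):
# surrogates are used, and if every surrogate is missing the observation
# follows the majority direction.
fit_default <- rpart(Temp ~ ., data = airquality, method = "anova",
                     control = rpart.control(maxdepth = 3,
                                             usesurrogate = 2,
                                             surrogatestyle = 0))

# Assumed equivalent of surrogate_type = 0, surrogate_style = 1:
# surrogates are only displayed, so an observation missing the primary
# split variable is not sent further down the tree.
fit_penalize <- rpart(Temp ~ ., data = airquality, method = "anova",
                      control = rpart.control(maxdepth = 3,
                                              usesurrogate = 0,
                                              surrogatestyle = 1))

print(fit_default)
print(fit_penalize)
```

Comparing the two printed trees shows how node sizes change once observations with a missing split variable stop at the parent node.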
The fitted rpart model.
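Because the return value is an rpart object, the usual rpart tools should apply to it. The sketch below fits a small tree directly with `rpart()` on the built-in iris data as a stand-in for a FeatureLookup result (an assumption made so the example runs without the package), then inspects it.

```r
library(rpart)

# Stand-in for a FeatureLookup result: a shallow classification tree
# using the information split criterion.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(maxdepth = 3, cp = 0.01))

print(fit)               # split rules, node by node
printcp(fit)             # complexity table with cross-validated error
fit$variable.importance  # named vector of variable importances

# plot(fit); text(fit)   # draws the tree when a graphics device is available
```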
## Not run:
# An example of a heavily regularized decision tree
# Settings are intentionally difficult enough for a decision tree
# This way, only great split points are reported
FeatureLookup(data,
              label,
              ban = c("CAR", "TOBACCO"),
              antiban = FALSE,
              type = "anova",
              folds = 20,
              seed = 11111,
              verbose = TRUE,
              plots = TRUE,
              max_depth = 3,
              min_split = 1000,
              min_bucket = 200,
              min_improve = 0.10,
              competing_splits = 10,
              surrogate_search = 10,
              surrogate_type = 2,
              surrogate_style = 0)
## End(Not run)