partial_dep.obs_all: Partial Dependency Observation, Contour (multiple...
In Laurae2/Laurae: Advanced High Performance Data Science Toolbox for R


This function computes the partial dependency of a supervised machine learning model over a range of values, for multiple observations. It does not work for multiclass problems. Beware that the amount of data can explode, especially when features have many distinct values: the number of rows you end up with grows linearly with the number of observations times the sum, over the features, of the number of sampled values per feature (at most `accuracy` values each). For instance, 100 features with 50 unique values and 1000 observations ends up as 5,000,000 observations (5000 times the initial dataset).
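The arithmetic above can be checked with a few lines of base R. This is only a sketch: `expanded_rows` is a hypothetical helper, not a function of the package, and it assumes that every observation is repeated once per sampled value of each feature.

```r
# Hypothetical helper (not part of Laurae): expected number of rows in the
# expanded grid, assuming one copy of each observation per sampled value.
expanded_rows <- function(n_obs, points_per_feature) {
  n_obs * sum(points_per_feature)
}

# Scenario from the description:
# 100 features, 50 sampled values each, 1000 observations
n_rows <- expanded_rows(n_obs = 1000, points_per_feature = rep(50, 100))
n_rows          # 5,000,000 rows
n_rows / 1000   # 5000 times the initial dataset
```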


partial_dep.obs_all(model, predictor, data, observation,
  column = colnames(data), accuracy = min(length(data), 10),
  exact_only = TRUE, label_name = "Target", comparator_name = "Evolution")

`model`	Type: unknown. The model to pass to `predictor`.
`predictor`	Type: function(model, data). The predictor function, which takes a model and data as inputs and returns predictions. `data` is provided as a data.table for maximum performance.
`data`	Type: data.table (mandatory). The data to sample from for the partial dependency.
`observation`	Type: data.table (mandatory). The observation we want to get partial dependence from. You can put the same as `data` if you wish to get partial dependence from it.
`column`	Type: character. The column we want partial dependence from. You can specify two or more columns as a vector, but using many columns is strongly discouraged because the cost is multiplicative: think of it as `O(length(column) * accuracy)` per observation. For instance, `accuracy = 100`, `length(column) = 100`, and `nrow(data) = 1000000` leads to `1e+10` theoretical observations, which could exhaust the memory of any computer.
`accuracy`	Type: integer. The accuracy of the partial dependence, expressed as the number of points sampled by percentile from the `column` of the `data`. Defaults to `min(length(data), 10)`, intended as either 10 samples or all samples of `data` if the latter has fewer than 10 observations (note that `length()` of a data.table returns its number of columns).
`exact_only`	Type: logical. Whether to select only exact values for data sampling. Defaults to `TRUE`.
`label_name`	Type: character. The column name given to the predicted values in the output table. Defaults to `"Target"`, which assumes you do not have a column called `"Target"` in your `column` vector.
`comparator_name`	Type: character. The column name given to the evolution value (`"Increase"`, `"Fixed"`, `"Decrease"`) in the output table. Defaults to `"Evolution"`, which assumes you do not have a column called `"Evolution"` in your `column` vector.
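As an illustration of the `predictor` contract, `function(model, data)` returning predictions, here is a minimal sketch for a base R linear model. `predictor_lm` is a hypothetical name and is not shipped with the package; the sketch only assumes that `data` can be converted with `as.data.frame()`.

```r
# Hypothetical predictor for a base R linear model (not part of Laurae).
# It follows the documented contract: function(model, data) -> predictions.
predictor_lm <- function(model, data) {
  # data may arrive as a data.table; convert it before calling predict()
  as.numeric(predict(model, newdata = as.data.frame(data)))
}

data(mtcars)
lm_model <- lm(mpg ~ ., data = mtcars)
preds <- predictor_lm(lm_model, mtcars[, -1])
length(preds)  # one prediction per row of the input
```

A function of this shape could then be passed as `predictor` alongside the matching `model`.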

A list with the following elements: `grid_init` for the grid before expansion, `grid_exp` for the expanded grid with predictions, `preds` for the predictions, and `obs` for the original predictions on `data`.
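To make the roles of the initial grid, the expanded grid and the comparator column more concrete, the following base R sketch rebuilds the idea by hand for one observation and one column. It is not the package's implementation: the linear model, the variable names, and the rule used to label `"Increase"`, `"Fixed"` and `"Decrease"` (sign of the change relative to the original prediction) are all assumptions made for illustration.

```r
# Illustrative re-creation for one observation and one column.
# NOT the package's code; all names below are hypothetical.
data(mtcars)
model <- lm(mpg ~ ., data = mtcars)
obs <- mtcars[32, -1]                                   # observation to analyze
baseline <- as.numeric(predict(model, newdata = obs))   # original prediction

# Grid before expansion: the unique values of one column (here "cyl")
grid_init <- list(cyl = sort(unique(mtcars$cyl)))

# Expanded grid: the observation repeated once per grid value
grid_exp <- obs[rep(1, length(grid_init$cyl)), ]
grid_exp$cyl <- grid_init$cyl
grid_exp$Target <- as.numeric(predict(model, newdata = grid_exp))

# Comparator: direction of change relative to the original prediction
# (assumed meaning of "Increase" / "Fixed" / "Decrease")
delta <- grid_exp$Target - baseline
grid_exp$Evolution <- ifelse(delta > 0, "Increase",
                             ifelse(delta < 0, "Decrease", "Fixed"))
grid_exp[, c("cyl", "Target", "Evolution")]
```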

## Not run: 
# Load the packages used below
library(data.table)
library(xgboost)
library(Laurae)

# Let's load a dummy dataset
data(mtcars)
setDT(mtcars) # Transform to data.table for easier manipulation

# We train an xgboost model on all 32 observations
set.seed(0)
xgboost_model <- xgboost(data = data.matrix(mtcars[, -1]),
                         label = mtcars$mpg,
                         nrounds = 20)

# Perform partial dependence grid prediction to analyze the behavior of every observation
# `column` is left unspecified, so all columns are checked, including:
# => horsepower (hp)
# => number of cylinders (cyl)
# => transmission (am)
# => number of carburetors (carb)
preds_partial <- partial_dep.obs_all(model = xgboost_model,
                                     predictor = predictor_xgb, # Default for xgboost
                                     data = mtcars[, -1], # train data
                                     observation = mtcars[, -1], # train data
                                     # when column is not specified => all columns
                                     accuracy = 20, # Up to 20 unique values per column
                                     exact_only = TRUE, # Not allowing approximations
                                     label_name = "mpg", # Predictions are stored in column "mpg"
                                     comparator_name = "evo") # Comparator +/-/eq for analysis

# How many observations? 3360, that's a lot coming from original 32 observations.
nrow(preds_partial$grid_exp)

# How many observations analyzed per column?
summary(preds_partial$grid_init)
#      Length Class  Mode   
# cyl   3     -none- numeric
# disp 19     -none- numeric
# hp   16     -none- numeric
# drat 16     -none- numeric
# wt   19     -none- numeric
# qsec 19     -none- numeric
# vs    2     -none- numeric
# am    2     -none- numeric
# gear  3     -none- numeric
# carb  6     -none- numeric

# Plot the partial dependence
partial_dep.plot(preds_partial$grid_exp,
                 backend = c("plotly", "line"),
                 label_name = "mpg",
                 comparator_name = "evo")

# Get statistics to analyze fast
partial_dep.feature(preds_partial$grid_exp, metric = "emp", in_depth = FALSE)

# Get in-depth statistics to analyze; this is very slow on large data
# Note: unreliable for a large number of observations due to asymptotic infinites
partial_dep.feature(preds_partial$grid_exp, metric = "emp", in_depth = TRUE)

## End(Not run)
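The figure of 3360 rows in the example can be recovered from the grid lengths printed by `summary(preds_partial$grid_init)`. The short check below copies those lengths by hand, so it runs without the package; it assumes that each of the 32 observations is expanded over every grid point of every column.

```r
# Grid lengths copied from the summary(preds_partial$grid_init) output above
grid_lengths <- c(cyl = 3, disp = 19, hp = 16, drat = 16, wt = 19,
                  qsec = 19, vs = 2, am = 2, gear = 3, carb = 6)

total_points <- sum(grid_lengths)   # 105 grid points across all columns
total_rows <- 32 * total_points     # 3360 rows for 32 observations
total_rows
```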