partial_dep.obs: Partial Dependency Observation, Contour (single observation)
In Laurae2/Laurae: Advanced High Performance Data Science Toolbox for R


This function computes the partial dependency of a supervised machine learning model over a range of values for a single observation. It does not work for multiclass problems! See predictor_xgb for an example predictor, which you can use as a template to write your own.
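As an illustration of the predictor contract described above, here is a minimal sketch of a custom predictor for a base R linear model. The name `predictor_lm` is hypothetical and not part of the package; the only assumption is that a predictor takes `(model, data)` and returns a numeric vector of predictions.

```r
# Hypothetical predictor for a base R linear model (not part of the package).
# It follows the documented contract: function(model, data) -> predictions.
predictor_lm <- function(model, data) {
  # `data` is documented to arrive as a data.table; converting it to a plain
  # data.frame keeps predict.lm() happy with either input type.
  as.numeric(predict(model, newdata = as.data.frame(data)))
}

# Quick check on the built-in mtcars dataset:
# fit on the first 31 rows, predict the last one.
fit <- lm(mpg ~ hp + cyl, data = mtcars[-32, ])
pred_32 <- predictor_lm(fit, mtcars[32, ])
```

A predictor written this way could presumably be passed as the `predictor` argument in place of `predictor_xgb`, together with the matching model object.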

partial_dep.obs(model, predictor, data, observation, column,
  accuracy = min(length(data), 100), safeguard = TRUE,
  safeguard_val = 1048576, exact_only = TRUE, label_name = "Target",
  comparator_name = "Evolution")

`model`	Type: unknown. The model to pass to `predictor`.
`predictor`	Type: function(model, data). The predictor function, which takes a model and data as inputs and returns predictions. `data` is provided as a data.table for maximum performance.
`data`	Type: data.table (mandatory). The data to sample from when computing the partial dependency for `observation`.
`observation`	Type: data.table (mandatory). The observation to compute partial dependence for. It must be a data.table so that column names are retained.
`column`	Type: character. The column to compute partial dependence for. You can pass two or more columns as a vector, but using many columns is strongly discouraged because the grid size grows exponentially, as `accuracy^length(column)`. For instance, `accuracy = 100` and `length(column) = 10` lead to `1e+20` theoretical observations, which exceeds the memory of any computer.
`accuracy`	Type: integer. The accuracy of the partial dependence, expressed as the number of points sampled by percentile from each `column` of `data`. Defaults to `min(length(data), 100)`, which means either 100 samples or all samples of `data` if the latter has fewer than 100 observations.
`safeguard`	Type: logical. Whether to cap the `accuracy^length(column)` value at `safeguard_val` observations. If `TRUE`, that value is prevented from exceeding `safeguard_val` (and `accuracy` is adjusted accordingly). Note that if `safeguard` is disabled, you may end up with fewer observations than initially expected, because duplicates are removed.
`safeguard_val`	Type: integer. The maximum number of observations allowed when `safeguard` is `TRUE`. Defaults to `1048576`, which is `4^10`.
`exact_only`	Type: logical. Whether to select only exact values for data sampling. Defaults to `TRUE`.
`label_name`	Type: character. The column name given to the predicted values in the output table. Defaults to `"Target"`, which assumes you do not have a column called `"Target"` in your `column` vector.
`comparator_name`	Type: character. The column name given to the evolution value (`"Increase"`, `"Fixed"`, `"Decrease"`) in the output table. Defaults to `"Evolution"`, which assumes you do not have a column called `"Evolution"` in your `column` vector.
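The interplay between `accuracy`, `exact_only`, and `safeguard_val` can be sketched in base R. The helpers `sample_points` and `capped_accuracy` below are hypothetical and only approximate the documented behavior; in particular, mapping `exact_only` to a quantile type and lowering `accuracy` one step at a time are assumptions, not the package's actual implementation.

```r
# Hypothetical helper: sample up to `accuracy` points of a column by percentile.
sample_points <- function(x, accuracy, exact_only = TRUE) {
  probs <- seq(0, 1, length.out = accuracy)
  # Assumption: "exact" means only observed values are returned.
  # quantile type 1 never interpolates, whereas type 7 (the default) may.
  q_type <- if (exact_only) 1 else 7
  sort(unique(quantile(x, probs = probs, type = q_type, names = FALSE)))
}

# Hypothetical helper: lower `accuracy` until the theoretical grid size
# accuracy^n_columns fits within `safeguard_val`.
capped_accuracy <- function(accuracy, n_columns, safeguard_val = 1048576) {
  while (accuracy > 1 && accuracy^n_columns > safeguard_val) {
    accuracy <- accuracy - 1
  }
  accuracy
}

# Sampled points are real values of the column, at most 20 of them
hp_points <- sample_points(mtcars$hp, accuracy = 20)

# Theoretical grid size for accuracy = 100 over 10 columns
theoretical <- 100^10

# With the default cap, accuracy drops to 4 because 4^10 == 1048576
capped <- capped_accuracy(100, 10)
```

Removing duplicates after sampling is why the final grid can contain fewer points per column than `accuracy` suggests.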

A list with four elements: `grid_init`, the grid before expansion; `grid_exp`, the expanded grid with predictions; `preds`, the predictions; and `obs`, the prediction for the original observation.
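To make the shape of the returned list more concrete, the following base R sketch reproduces the general idea of a single-observation partial dependence grid. It is an illustration under stated assumptions, not the package's code: the model is a plain `lm`, every unique value of each column is used instead of percentile sampling, and the column names `mpg` and `evo` are chosen for the example.

```r
train <- mtcars[-32, ]   # data to sample values from
obs   <- mtcars[32, ]    # single observation to analyze
fit   <- lm(mpg ~ hp + cyl + wt, data = train)
columns <- c("hp", "cyl")

# Grid before expansion: candidate values for each analyzed column
grid_init <- lapply(train[columns], function(x) sort(unique(x)))

# Expanded grid: every combination of the candidate values
grid_exp <- expand.grid(grid_init)

# All other features stay fixed at the values of the observation
fixed <- setdiff(names(obs), c(columns, "mpg"))
grid_exp[fixed] <- obs[rep(1, nrow(grid_exp)), fixed]

# Predictions on the grid and on the untouched observation
preds    <- as.numeric(predict(fit, newdata = grid_exp))
obs_pred <- as.numeric(predict(fit, newdata = obs))

# Label and comparator columns, mirroring label_name / comparator_name
grid_exp$mpg <- preds
grid_exp$evo <- ifelse(preds > obs_pred, "Increase",
                       ifelse(preds < obs_pred, "Decrease", "Fixed"))

result <- list(grid_init = grid_init, grid_exp = grid_exp,
               preds = preds, obs = obs_pred)
```

The number of rows in the expanded grid is the product of the number of candidate values per column, which is why adding columns makes the grid grow so quickly.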

## Not run: 
# Load the packages used in this example
library(data.table)
library(xgboost)
library(Laurae)

# Let's load a dummy dataset
data(mtcars)
setDT(mtcars) # Transform to data.table for easier manipulation

# We train an xgboost model on the first 31 observations and keep the last one to analyze later
set.seed(0)
xgboost_model <- xgboost(data = data.matrix(mtcars[-32, -1]),
                         label = mtcars$mpg[-32],
                         nrounds = 20)

# Perform partial dependence grid prediction to analyze the behavior of the 32nd observation
# We want to check how it behaves with:
# => horsepower (hp)
# => number of cylinders (cyl)
# => transmission (am)
# => number of carburetors (carb)
preds_partial <- partial_dep.obs(model = xgboost_model,
                                 predictor = predictor_xgb, # Default for xgboost
                                 data = mtcars[-32, -1], # train data = first 31 observations
                                 observation = mtcars[32, -1], # 32nd observation to analyze
                                 column = c("hp", "cyl", "am", "carb"),
                                 accuracy = 20, # Up to 20 unique values per column
                                 safeguard = TRUE, # Prevent high memory usage
                                 safeguard_val = 1048576, # No more than 1048576 observations
                                 exact_only = TRUE, # Not allowing approximations
                                 label_name = "mpg", # Predictions are stored in column "mpg"
                                 comparator_name = "evo") # Comparator +/-/eq for analysis

# How many observations? 300
nrow(preds_partial$grid_exp)

# How many observations analyzed per column? hp=10, cyl=3, am=2, carb=5
summary(preds_partial$grid_init)

# When cyl decreases, mpg increases!
partial_dep.plot(grid_data = preds_partial$grid_exp,
                 backend = "tableplot",
                 label_name = "mpg",
                 comparator_name = "evo")

# Another way of plotting... hp/mpg relationship is not obvious
partial_dep.plot(grid_data = preds_partial$grid_exp,
                 backend = "car",
                 label_name = "mpg",
                 comparator_name = "evo")

# Do NOT do this on >1k samples, this will kill RStudio
# Histograms make it obvious when decrease/increase happens.
partial_dep.plot(grid_data = preds_partial$grid_exp,
                 backend = "plotly",
                 label_name = "mpg",
                 comparator_name = "evo")

# Get statistics to analyze fast
partial_dep.feature(preds_partial$grid_exp, metric = "emp", in_depth = FALSE)

# Get in-depth statistics to analyze; this is very slow on large data
# Note: unreliable for large numbers of observations due to asymptotic infinities
partial_dep.feature(preds_partial$grid_exp, metric = "emp", in_depth = TRUE)

## End(Not run)