partial_dep.obs: Partial Dependency Observation, Contour (single observation)


Description

This function computes the partial dependency of a supervised machine learning model over a range of values for a single observation. It does not work for multiclass problems. See predictor_xgb for an example predictor that you can use as a template for your own.
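
As a rough illustration of what a custom predictor might look like, the sketch below wraps a base R linear model. The name predictor_lm is hypothetical and not part of the package; the only thing taken from this page is the function(model, data) contract, which returns a vector of predictions.

```r
# Hypothetical predictor for an lm model (not part of the Laurae package).
# It follows the documented contract: take a model and data, return predictions.
predictor_lm <- function(model, data) {
  # as.data.frame() also accepts a data.table, which is what the data argument is
  as.numeric(predict(model, newdata = as.data.frame(data)))
}

# Fit on the first 31 rows of mtcars, then predict the 32nd
cars <- as.data.frame(mtcars)
lm_model <- lm(mpg ~ hp + cyl, data = cars[-32, ])
pred_32 <- predictor_lm(lm_model, cars[32, ])
```

A predictor for another model type would presumably follow the same shape: convert data to whatever the model expects, call its prediction routine, and return a plain numeric vector.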

Usage

partial_dep.obs(model, predictor, data, observation, column,
  accuracy = min(length(data), 100), safeguard = TRUE,
  safeguard_val = 1048576, exact_only = TRUE, label_name = "Target",
  comparator_name = "Evolution")

Arguments

model

Type: unknown. The model to pass to predictor.

predictor

Type: function(model, data). The predictor function, which takes a model and data as inputs and returns predictions. data is provided as a data.table for maximum performance.

data

Type: data.table (mandatory). The data from which candidate values are sampled to compute the partial dependency for observation.

observation

Type: data.table (mandatory). The observation we want to get partial dependence from. It is mandatory to use a data.table to retain column names.

column

Type: character. The column we want partial dependence from. You can specify two or more columns as a vector, but using many columns is strongly discouraged because the grid size grows exponentially, as accuracy^length(column). For instance, accuracy = 100 and length(column) = 10 leads to 1e+20 theoretical observations, which will exhaust the memory of any computer.

accuracy

Type: integer. The accuracy of the partial dependence, expressed as the number of points sampled by percentile from the column in data. Defaults to min(length(data), 100), which means either 100 samples or all samples of data if the latter has fewer than 100 observations.

safeguard

Type: logical. Whether to cap the accuracy^length(column) value at safeguard_val observations. If TRUE, it prevents that value from exceeding safeguard_val (and adjusts the accuracy value accordingly). Note that if safeguard is disabled, you might end up with fewer observations than you initially expected, because a cleaning step is performed for uniqueness.

safeguard_val

Type: integer. The maximum number of observations allowed when safeguard is TRUE. Defaults to 1048576, which is 4^10.

exact_only

Type: logical. Whether to select only exact values for data sampling. Defaults to TRUE.

label_name

Type: character. The column name given to the predicted values in the output table. Defaults to "Target", which assumes you do not have a column called "Target" in your column vector.

comparator_name

Type: character. The column name given to the evolution value ("Increase", "Fixed", "Decrease") in the output table. Defaults to "Evolution", which assumes you do not have a column called "Evolution" in your column vector.
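
To make the interaction between accuracy, column, and safeguard_val more concrete, here is a small base R sketch. The helper names grid_size and cap_accuracy are hypothetical, and the capping rule (lowering accuracy one step at a time until the grid fits) is only an assumption for illustration; the package may adjust accuracy differently.

```r
# Theoretical grid size: accuracy candidate values for each analyzed column
grid_size <- function(accuracy, n_columns) accuracy^n_columns

# Assumed capping rule (illustrative only): lower accuracy until the grid fits
cap_accuracy <- function(accuracy, n_columns, safeguard_val = 1048576) {
  while (accuracy > 1 && grid_size(accuracy, n_columns) > safeguard_val) {
    accuracy <- accuracy - 1
  }
  accuracy
}

big_grid <- grid_size(100, 10)     # the 1e+20 case quoted for the column argument
capped   <- cap_accuracy(100, 10)  # 4, because 4^10 = 1048576 just fits
uncapped <- cap_accuracy(20, 4)    # 20, because 20^4 = 160000 is under the cap
```

Under this assumed rule, ten columns leave room for only four values per column, which is one way to see why analyzing many columns at once is discouraged.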

Value

A list with different elements: grid_init for the grid before expansion, grid_exp for the expanded grid with predictions, preds for the predictions, and obs for the original prediction on observation.
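
The roles of these elements can be approximated in base R. The sketch below is not the package's implementation: toy_model and toy_predictor are hypothetical stand-ins, candidate values are simply the unique training values, and the way the labels are derived from the observation's own prediction is an assumption based on the "Increase", "Fixed", and "Decrease" values described above.

```r
# Toy stand-ins (all hypothetical): a linear model plays the role of `model`,
# and a thin wrapper around predict() plays the role of `predictor`.
cars <- as.data.frame(mtcars)
train <- cars[-32, ]
obs_row <- cars[32, ]
toy_model <- lm(mpg ~ hp + cyl + am, data = train)
toy_predictor <- function(model, data) as.numeric(predict(model, newdata = data))

# Candidate values per analyzed column (analogous to grid_init)
columns <- c("cyl", "am")
grid_init <- lapply(train[, columns, drop = FALSE], function(x) sort(unique(x)))

# Every combination of candidate values (analogous to grid_exp before prediction)
grid_exp <- expand.grid(grid_init)

# Hold all other features fixed at the observation's values
other <- setdiff(names(obs_row), columns)
fixed <- obs_row[rep(1, nrow(grid_exp)), other, drop = FALSE]
rownames(fixed) <- NULL
newdata <- cbind(grid_exp, fixed)

# Predict on the grid and on the original observation, then label the change
grid_exp$mpg <- toy_predictor(toy_model, newdata)
obs_pred <- toy_predictor(toy_model, obs_row)
grid_exp$evo <- ifelse(grid_exp$mpg > obs_pred, "Increase",
                       ifelse(grid_exp$mpg < obs_pred, "Decrease", "Fixed"))
```

With three cylinder counts and two transmission types, the expanded grid has six rows, each carrying a prediction and a label relative to obs_pred.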

Examples

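## Not run: 
library(data.table) # setDT
library(xgboost)    # xgboost
library(Laurae)     # partial_dep.obs, predictor_xgb, partial_dep.plot, partial_dep.feature

# Let's load a dummy dataset
data(mtcars)
setDT(mtcars) # Transform to data.table for easier manipulation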

# We train an xgboost model on 31 observations and keep the last one to analyze later
set.seed(0)
xgboost_model <- xgboost(data = data.matrix(mtcars[-32, -1]),
                         label = mtcars$mpg[-32],
                         nrounds = 20)

# Perform partial dependence grid prediction to analyze the behavior of the 32nd observation
# We want to check how it behaves with:
# => horsepower (hp)
# => number of cylinders (cyl)
# => transmission (am)
# => number of carburetors (carb)
preds_partial <- partial_dep.obs(model = xgboost_model,
                                 predictor = predictor_xgb, # Default for xgboost
                                 data = mtcars[-32, -1], # train data = first 31 observations
                                 observation = mtcars[32, -1], # 32nd observation to analyze
                                 column = c("hp", "cyl", "am", "carb"),
                                 accuracy = 20, # Up to 20 unique values per column
                                 safeguard = TRUE, # Prevent high memory usage
                                 safeguard_val = 1048576, # No more than 1048576 observations
                                 exact_only = TRUE, # Not allowing approximations
                                 label_name = "mpg", # Predictions are stored in column "mpg"
                                 comparator_name = "evo") # Comparator +/-/eq for analysis

# How many observations? 300
nrow(preds_partial$grid_exp)

# How many observations analyzed per column? hp=10, cyl=3, am=2, carb=5
summary(preds_partial$grid_init)

# When cyl decreases, mpg increases!
partial_dep.plot(grid_data = preds_partial$grid_exp,
                 backend = "tableplot",
                 label_name = "mpg",
                 comparator_name = "evo")

# Another way of plotting... hp/mpg relationship is not obvious
partial_dep.plot(grid_data = preds_partial$grid_exp,
                 backend = "car",
                 label_name = "mpg",
                 comparator_name = "evo")

# Do NOT do this on >1k samples, this will kill RStudio
# Histograms make it obvious when decrease/increase happens.
partial_dep.plot(grid_data = preds_partial$grid_exp,
                 backend = "plotly",
                 label_name = "mpg",
                 comparator_name = "evo")

# Get statistics to analyze fast
partial_dep.feature(preds_partial$grid_exp, metric = "emp", in_depth = FALSE)

# Get in-depth statistics to analyze, but this is very slow on large data
# Note: unreliable for a large number of observations due to asymptotic infinites
partial_dep.feature(preds_partial$grid_exp, metric = "emp", in_depth = TRUE)

## End(Not run)

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.