View source: R/prob-lift_curve.R
lift_curve (R Documentation)
lift_curve() constructs the full lift curve and returns a tibble. See gain_curve() for a closely related concept.
lift_curve(data, ...)

## S3 method for class 'data.frame'
lift_curve(
  data,
  truth,
  ...,
  na_rm = TRUE,
  event_level = yardstick_event_level(),
  case_weights = NULL
)
data
A data.frame containing the columns specified by truth and ....

...
A set of unquoted column names or one or more dplyr selector functions to
choose which variables contain the class probabilities. If truth is binary,
only one column should be selected, and it should correspond to the value of
event_level. Otherwise, there should be as many columns as factor levels of
truth, in the same order as those levels.

truth
The column identifier for the true class results (that is a factor). This
should be an unquoted column name.

na_rm
A logical value indicating whether NA values should be stripped before the
computation proceeds.

event_level
A single string. Either "first" or "second" to specify which level of truth
to consider as the "event". The default is to use the first level.

case_weights
The optional column identifier for case weights. This should be an unquoted
column name that evaluates to a numeric column in data.
There is a ggplot2::autoplot()
method for quickly visualizing the curve.
This works for binary and multiclass output, and also works with grouped data
(i.e. from resamples). See the examples.
A tibble with class lift_df or lift_grouped_df having columns:

.n
The index of the current sample.

.n_events
The index of the current unique sample. Values with repeated estimate values
are given identical indices in this column.

.percent_tested
The cumulative percentage of values tested.

.lift
First calculate the cumulative percentage of true results relative to the
total number of true results. Then divide that by .percent_tested.
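As a quick check of these columns, the returned object can be inspected directly. This is a minimal sketch using the built-in two_class_example data; the printed output is omitted here.

library(yardstick)

curve <- lift_curve(two_class_example, truth, Class1)
names(curve)
# expected columns: ".n", ".n_events", ".percent_tested", ".lift"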
If using the case_weights
argument, all of the above columns will be
weighted. This makes the most sense with frequency weights, which are integer
weights representing the number of times a particular observation should be
repeated.
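A brief, hedged sketch of supplying frequency weights follows; the freq_wt column is invented here purely for illustration.

library(yardstick)
library(dplyr)

set.seed(1)

two_class_example %>%
  # hypothetical integer frequency weights, made up for this illustration
  mutate(freq_wt = sample(1:3, n(), replace = TRUE)) %>%
  lift_curve(truth, Class1, case_weights = freq_wt)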
The motivation behind cumulative gain and lift charts is as a visual method
to determine the effectiveness of a model when compared to the results one
might expect without a model. As an example, without a model, if you were to
advertise to a random 10% of your customer base, then you might expect to
capture 10% of the total number of positive responses had you advertised to
your entire customer base. Given a model that predicts which customers are
more likely to respond, the hope is that you can more accurately target 10%
of your customer base and capture >10% of the total number of positive
responses.
The calculation to construct lift curves is as follows (a rough code sketch appears after these steps):

1. truth and estimate are placed in descending order by the estimate values
(estimate here is the single column supplied in ...).

2. The cumulative number of samples with true results relative to the entire
number of true results is found.

3. The cumulative % found is divided by the cumulative % tested to construct
the lift value. This ratio represents the factor of improvement over an
uninformed model. Values >1 represent a valuable model. This is the y-axis of
the lift chart.
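The same arithmetic can be sketched by hand with dplyr, purely to illustrate the steps above. This is not yardstick's internal implementation: the column names below are invented, and tie handling is simplified relative to what lift_curve() actually does.

library(yardstick)  # for two_class_example
library(dplyr)

two_class_example %>%
  arrange(desc(Class1)) %>%                        # step 1: sort by the estimate
  mutate(
    n_tested   = row_number(),
    n_found    = cumsum(truth == "Class1"),        # step 2: cumulative true results
    pct_tested = 100 * n_tested / n(),
    pct_found  = 100 * n_found / sum(truth == "Class1"),
    lift       = pct_found / pct_tested            # step 3: lift ratio
  ) %>%
  select(truth, Class1, pct_tested, pct_found, lift)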
If a multiclass truth column is provided, a one-vs-all approach will be taken
to calculate multiple curves, one per level. In this case, there will be an
additional column, .level, identifying the "one" column in the one-vs-all
calculation.
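For example, counting the rows per .level shows one curve per class. This small sketch assumes the built-in hpc_cv data used in the examples below.

library(yardstick)
library(dplyr)

hpc_cv %>%
  filter(Resample == "Fold01") %>%
  lift_curve(obs, VF:L) %>%
  count(.level)   # one group of rows per class level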
There is no common convention on which factor level should automatically be
considered the "event" or "positive" result when computing binary
classification metrics. In yardstick, the default is to use the first level.
To alter this, change the argument event_level to "second" to consider the
last level of the factor the level of interest. For multiclass extensions
involving one-vs-all comparisons (such as macro averaging), this option is
ignored and the "one" level is always the relevant result.
Max Kuhn
Other curve metrics: gain_curve(), pr_curve(), roc_curve()
# ---------------------------------------------------------------------------
# Two class example
# `truth` is a 2 level factor. The first level is `"Class1"`, which is the
# "event of interest" by default in yardstick. See the Relevant Level
# section above.
data(two_class_example)
# Binary metrics using class probabilities take a factor `truth` column,
# and a single class probability column containing the probabilities of
# the event of interest. Here, since `"Class1"` is the first level of
# `"truth"`, it is the event of interest and we pass in probabilities for it.
lift_curve(two_class_example, truth, Class1)
# ---------------------------------------------------------------------------
# `autoplot()`
library(ggplot2)
library(dplyr)
# Use autoplot to visualize
autoplot(lift_curve(two_class_example, truth, Class1))
# Multiclass one-vs-all approach
# One curve per level
hpc_cv %>%
filter(Resample == "Fold01") %>%
lift_curve(obs, VF:L) %>%
autoplot()
# Same as above, but with all of the resamples
hpc_cv %>%
group_by(Resample) %>%
lift_curve(obs, VF:L) %>%
autoplot()