Introduction to Lift Chart, ROC Curve and Word Cloud

In V2.7 release of DataRobot API, the following model insights have been added:

Insights provided by Lift Chart and ROC Curves are helpful in checking the performance of machine learning models. Word clouds are helpful for understanding useful words and phrases generated after applying different NLP techniques to unstructured data. We will explore each one of these in detail.

Connecting to DataRobot

To access the DataRobot modeling engine, it is necessary to establish an authenticated connection, which can be done in one of two ways. In both cases, the necessary information is an endpoint, the URL address of the specific DataRobot server being used and a token, a previously validated access token.

token is unique for each DataRobot modeling engine account and can be accessed using the DataRobot webapp in the account profile section.

endpoint depends on DataRobot modeling engine installation (cloud-based vs. on-premise) you are using. Contact your DataRobot admin for information on which endpoint to use if you do not know. The endpoint for DataRobot cloud accounts is https://app.datarobot.com/api/v2.

The first access method uses a YAML configuration file with these two elements - labeled token and endpoint - located at $HOME/.config/datarobot/drconfig.yaml. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is automatically established during library(datarobot). It is also possible to establish a connection using this YAML file via the ConnectToDataRobot() function, by specifying the configPath parameter.

The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the endpoint and token parameters.

library(datarobot)
ConnectToDataRobot(endpoint = "http://<YOUR DR SERVER>/api/v2", token = "<YOUR API TOKEN>")

Data

We will be using the Lending Club dataset, a sample dataset related to credit scoring open-sourced by LendingClub. We can create a project with this dataset like this:

lendingClubURL <- "https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv"
project <- StartProject(dataSource = lendingClubURL,
                        projectName = "AdvancedModelInsightsVignette",
                        mode = "auto",
                        target = "is_bad",
                        workerCount = "max",
                        wait = TRUE)

Once the modeling process has completed, the ListModels function returns an S3 object of class listOfModels that characterizes all of the models in a specified DataRobot project. It is important to use WaitforAutopilot before calling ListModels, as the function will return only a partial list (and a warning) if the autopilot is not yet complete.

results <- as.data.frame(ListModels(project))
saveRDS(results, "resultsModelInsights.rds")
library(knitr)
kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE)
results <- readRDS("resultsModelInsights.rds")
library(knitr)
kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE)

Lift Chart

Lift chart data can be retrieved for a specific data partition (validation, cross-validation, or holdout) or for all the data partitions using GetLiftChart and ListLiftCharts. To retrieve the data for holdout partition, it needs to be unlocked first.

Let's retrieve the validation partition data for top model using GetLiftChart. The GetLiftChart function returns data for validation partition by default. We can retrieve data for specific data partition by passing value to source parameter in GetLiftChart.

project <- GetProject("5eed0d790ef80408ae212f09")
allModels <- ListModels(project)
saveRDS(allModels, "modelsModelInsights.rds")
modelFrame <- as.data.frame(allModels)
metric <- modelFrame$validationMetric
if (project$metric %in% c('AUC', 'Gini Norm')) {
  bestIndex <- which.max(metric)
} else {
  bestIndex <- which.min(metric)
}
bestModel <- allModels[[bestIndex]]
bestModel$modelType
allModels <- readRDS("modelsModelInsights.rds")
bestModel <- allModels[[1]]
bestModel$modelType

This selects a Gradient Boosted Greedy Trees Classifier with Early Stopping model.

The lift chart data we retrieve from the server includes the mean of the model prediction and the mean of the actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.

lc <- GetLiftChart(bestModel)
saveRDS(lc, "liftChartModelInsights.rds")
head(lc)
lc <- readRDS("liftChartModelInsights.rds")
head(lc)
ValidationLiftChart <- GetLiftChart(bestModel, source = "validation")
dr_dark_blue <- "#08233F"
dr_blue <- "#1F77B4"
dr_orange <- "#FF7F0E"

# Function to plot lift chart
library(data.table)
LiftChartPlot <- function(ValidationLiftChart, bins = 10) {
  if (60 %% bins == 0) {
    ValidationLiftChart$bins <- rep(seq(bins), each = 60 / bins)
    ValidationLiftChart <- data.table(ValidationLiftChart)
    ValidationLiftChart[, actual := mean(actual), by = bins]
    ValidationLiftChart[, predicted := mean(predicted), by = bins]
    unique(ValidationLiftChart[, -"binWeight"])
  } else {
    "Please provide bins less than 60 and divisor of 60"
  }
}
LiftChartData <- LiftChartPlot(ValidationLiftChart)
saveRDS(LiftChartData, "LiftChartDataVal.rds")
par(bg = dr_dark_blue)
plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b")
# dr_dark_blue <- "#08233F"
# dr_blue <- "#1F77B4"
# dr_orange <- "#FF7F0E"
# LiftChartData <- readRDS("LiftChartDataVal.rds")
# par(bg = dr_dark_blue)
# plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b",
#      main = "Lift Chart", xlab = "Bins", ylab = "Value")
# lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b")
knitr::include_graphics("liftChartValidation.png")

All the available lift chart data can be retrieved using ListLiftCharts. Here is an example retrieving data for all the available partitions, followed by plotting the cross validation partition:

AllLiftChart <- ListLiftCharts(bestModel)
LiftChartData <- LiftChartPlot(AllLiftChart[["crossValidation"]])
saveRDS(LiftChartData, "LiftChartDataCV.rds")
par(bg = dr_dark_blue)
plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b")
# LiftChartData <- readRDS("LiftChartDataCV.rds")
# par(bg = dr_dark_blue)
# plot(LiftChartData$Actual, col = dr_orange, pch = 20, type = "b",
#      main = "Lift Chart", xlab = "Bins", ylab = "Value")
# lines(LiftChartData$Predicted, col = dr_blue, pch = 20, type = "b")
knitr::include_graphics("liftChartCrossValidation.png")

We can also plot the lift chart using ggplot2:

library(ggplot2)
lc$actual <- lc$actual / lc$binWeight
lc$predicted <- lc$predicted / lc$binWeight
lc <- lc[order(lc$predicted), ]
lc$binWeight <- NULL
lc <- data.frame(value = c(lc$actual, lc$predicted),
                 variable = c(rep("Actual", length(lc$actual)),
                              rep("Predicted", length(lc$predicted))),
                 id = rep(seq_along(lc$actual), 2))
ggplot(lc) + geom_line(aes(x = id, y = value, color = variable))

ROC Curve Data

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

ROC curve data can be generated for a specific data partition (validation, cross validation, or holdout) or for all the data partition using GetRocCurve and ListRocCurves.

To retrieve ROC curve information use GetRocCurve:

roc <- GetRocCurve(bestModel)
saveRDS(roc, "ROCCurveModelInsights.rds")
lc <- readRDS("ROCCurveModelInsights.rds")

You can then plot the results:

dr_dark_blue <- "#08233F"
dr_roc_green <- "#03c75f"
ValidationRocCurve <- GetRocCurve(bestModel)
ValidationRocPoints <- ValidationRocCurve[["rocPoints"]]
saveRDS(ValidationRocPoints, "ValidationRocPoints.rds")
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0,1), xlim = c(0,1),
     pch = 20, type = "b")
dr_dark_blue <- "#08233F"
dr_roc_green <- "#03c75f"
ValidationRocPoints <- readRDS("ValidationRocPoints.rds")
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")

All the available ROC curve data can be retrieved using ListRocCurves. Here again is an example to retrieve data for all the available partitions, followed by plotting the cross validation partition:

AllRocCurve <- ListRocCurves(bestModel)
CrossValidationRocPoints <- AllRocCurve[['crossValidation']][['rocPoints']]
saveRDS(CrossValidationRocPoints, 'CrossValidationRocPoints.rds')
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")
CrossValidationRocPoints <- readRDS("CrossValidationRocPoints.rds")
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")

You can also plot the ROC curve using ggplot2:

ggplot(
  ValidationRocPoints, 
  aes(x = falsePositiveRate, y = truePositiveRate)
) + geom_line()

Threshold operations

You can get the recommended threshold value with maximal F1 score. That is the same threshold that is preselected in DataRobot when you open the "ROC curve" tab.

threshold <- ValidationRocPoints$threshold[which.max(ValidationRocPoints$f1Score)]

You can also estimate metrics for different threshold values. This will produce the same results as updating the threshold on the DataRobot "ROC curve" tab.

ValidationRocPoints[ValidationRocPoints$threshold == tail(Filter(function(x) x > threshold,
                                                                 ValidationRocPoints$threshold),
                                                          1), ]

Word Cloud

The word cloud is a type of insight available for some text-processing models for datasets containing text columns. You can get information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.

This example will show you how to obtain word cloud data and visualize it, similar to how DataRobot visualizes the word cloud in the "Model Insights" tab interface.

The visualization example here uses the modelwordcloud package.

Now let's find our word cloud:

# Find word-based models by looking for "word" modelType
wordModels <- allModels[grep("Word", lapply(allModels, `[[`, "modelType"))]
wordModel <- wordModels[[1]]
# Get word cloud
wordCloud <- GetWordCloud(project, wordModel$modelId)
saveRDS(wordCloud, "wordCloudModelInsights.rds")
library(modelwordcloud)
wordCloud <- readRDS("wordCloudModelInsights.rds")

Now we plot it!

colors <- c(
  colormap::colormap(c("#255FEC", "#2DBEF9")), 
  colormap::colormap(
    c("#FFAC9D", "#D80909"), 
    reverse = TRUE
  )
)
saveRDS(colors, "colors.rds")
# Remove stop words
wordCloud <- wordCloud[!wordCloud$isStopword, ]

# Specify colors similar to what DataRobot produces for 
# a wordcloud in Insights
colors <- readRDS("colors.rds")

# Make word cloud
suppressWarnings(
  wordcloud(words = wordCloud$ngram,
            freq = wordCloud$frequency,
            coefficients = wordCloud$coefficient,
            colors = colors,
            scale = c(3, 0.3))
)


Try the datarobot package in your browser

Any scripts or data that you put into this service are public.

datarobot documentation built on Nov. 3, 2023, 1:07 a.m.