knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Sagemaker is great for training, but sometimes you don't want the overhead (cost and time) of predict.sagemaker or batch_predict.
Luckily, AWS Sagemaker saves every model in S3, and you can download and use it locally with the right configuration.
For xgboost models (more to come in the future), I've written sagemaker_load_model, which loads the trained Sagemaker model into your current R session.
Let's use the sagemaker::abalone dataset once again, but this time let's try classification instead of regression.
First we'll identify the classes with the highest frequency, so we can drop the rare ones.
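library(sagemaker)
library(dplyr)
library(tidyr)  # unnest(), pivot_longer(), and separate() are used below
library(rsample)
library(recipes)
library(ggplot2)

# Count each ring class and keep the 10 most frequent
top_rings <- sagemaker::abalone %>%
  group_by(rings) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  top_n(10, n)

top_rings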
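One thing to watch: top_n() keeps ties, so it can return more than 10 rows, and the number of classes has to match num_class later on. As a minimal sketch on a made-up vector (not the abalone data), base R's table() with head() always returns exactly k classes:

```r
# Toy ring values, invented for illustration only
rings <- c(9, 9, 9, 10, 10, 8, 7, 7, 7, 7)

k <- 2

# Frequency of each class, most frequent first
counts <- sort(table(rings), decreasing = TRUE)

# head() keeps exactly k entries, even if counts tie at the cutoff
top_k <- as.integer(names(head(counts, k)))
top_k
```

With ties at the cutoff this keeps only some of the tied classes, so it trades completeness for a predictable class count.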
We'll use recipes to transform the dataset into the proper format for xgboost. We must ensure that the classes are encoded as integers starting at 0 and that the outcome is the first column:
rec <- recipe(rings ~ ., data = sagemaker::abalone) %>%
  step_filter(rings %in% top_rings$rings) %>%
  step_integer(all_outcomes(), zero_based = TRUE) %>%
  prep(training = sagemaker::abalone, retain = TRUE)

abalone_class <- juice(rec) %>%
  select(rings, everything())

abalone_class %>%
  group_by(rings) %>%
  summarise(n = n()) %>%
  arrange(n)
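To make the encoding concrete, here is a rough base R sketch of what a zero-based integer encoding does. The values are invented, and I'm assuming the integers are assigned in sorted order of the original classes; this illustrates the idea and is not the recipes implementation.

```r
# Toy outcome values, invented for illustration only
rings <- c(9L, 10L, 8L, 9L, 11L, 8L)

# Sorted unique classes define the mapping: 8 -> 0, 9 -> 1, 10 -> 2, 11 -> 3
class_levels <- sort(unique(rings))

# match() gives 1-based positions, so subtract 1 for zero-based labels
rings_zero_based <- match(rings, class_levels) - 1L
rings_zero_based
```

The labels then run from 0 to the number of classes minus 1, which is the range xgboost expects for multiclass objectives.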
Then we'll split the data into training and validation sets and upload both to S3 for training.
split <- initial_split(abalone_class)

train_path <- s3(s3_bucket(), "abalone-class-train.csv")
validation_path <- s3(s3_bucket(), "abalone-class-test.csv")

write_s3(analysis(split), train_path)
write_s3(assessment(split), validation_path)
First, we need to set the xgboost hyperparameters for multiclass classification. See the xgboost documentation for the official list of parameters.
xgb_estimator <- sagemaker_xgb_estimator()

xgb_estimator$set_hyperparameters(
  eval_metric = "mlogloss",
  objective = "multi:softmax",
  num_class = 10L
)
objective = "multi:softmax" will return the predicted class, while objective = "multi:softprob" returns a tibble with probabilities for each class. (You can also use eval_metric = "merror", although "mlogloss" is usually better in practice.)
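To see how the two evaluation metrics differ, here is a small worked sketch in base R. The probabilities and labels are invented, and the formulas are the standard definitions (mean misclassification rate and mean negative log-probability of the true class), not code taken from xgboost or Sagemaker.

```r
# Toy class probabilities: one row per observation, one column per class (0, 1, 2)
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.4, 0.5, 0.1),
  nrow = 3, byrow = TRUE
)
actual <- c(0L, 1L, 0L)  # zero-based true classes

# Probability the model assigned to the true class of each row
p_true <- probs[cbind(seq_along(actual), actual + 1L)]

# mlogloss: mean negative log-probability of the true class
mlogloss <- -mean(log(p_true))

# merror: share of rows where the most likely class is wrong
pred_class <- max.col(probs, ties.method = "first") - 1L
merror <- mean(pred_class != actual)

c(mlogloss = mlogloss, merror = merror)
```

Here merror only registers that the third row is wrong, while mlogloss also reflects how confident each prediction was, which is one reason it tends to give a smoother signal during tuning.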
Next, we'll set some ranges to tune over and start training.
ranges <- list(
  max_depth = sagemaker_integer(3, 20),
  colsample_bytree = sagemaker_continuous(0, 1),
  subsample = sagemaker_continuous(0, 1)
)

tune <- sagemaker_hyperparameter_tuner(
  xgb_estimator,
  s3_split(train_path, validation_path),
  ranges,
  max_jobs = 5
)
tune <- sagemaker_attach_tuner("sagemaker-xgboost-191201-1356")
tune
Now we can download the Sagemaker model artifact from S3 and load it into the R session.
xgb <- sagemaker_load_model(tune)
For xgboost Sagemaker models, sagemaker_load_model loads a Booster object from the xgboost Python package:
class(xgb)
To use this feature, you must have the xgboost Python package installed. You can download and install it with
# not run
sagemaker::sagemaker_install_xgboost()
The Sagemaker R package loads the Booster object into the R session with reticulate. Therefore, all methods and attributes are available in R.
reticulate::py_list_attributes(xgb) %>% purrr::discard(~stringr::str_detect(., "__"))
However, you need to know the xgboost Python package to work with the local model.
The sagemaker R package also includes predict.xgboost.core.Booster to help you easily make predictions on this object:
pred <- predict(xgb, abalone_class[, -1])

glimpse(pred)
This method returns predictions as tibbles, and it attempts to conform to the tidymodels conventions.
Unfortunately, the type argument ("class" or "prob") is not supported at the moment.
If you want both probability and class predictions, I recommend you default to objective = "multi:softprob" or objective = "binary:logistic" for probabilities and convert to outcomes after the predictions are returned.
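For a binary:logistic model, converting to outcomes can be as simple as thresholding. A minimal sketch, assuming the probabilities are available as a plain numeric vector (the vector and the 0.5 cutoff are illustrative choices here, not something the package prescribes):

```r
# Toy predicted probabilities of the positive class, invented for illustration
prob <- c(0.91, 0.35, 0.50, 0.08)

# Hypothetical cutoff; tune it if false positives and false negatives differ in cost
threshold <- 0.5

# 1 when the probability reaches the cutoff, 0 otherwise
pred_class <- as.integer(prob >= threshold)
pred_class
```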
Next, we can map the data and predictions back to the true values:
ring_map <- tidy(rec, 2) %>%
  unnest(value) %>%
  select(rings_integer = integer, rings_actual = value)

rings_actual <- abalone_class %>%
  left_join(ring_map, by = c("rings" = "rings_integer"))

rings_pred <- pred %>%
  left_join(ring_map, by = c(".pred" = "rings_integer"))
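The joins above are lookups from the zero-based code back to the original ring count. As a rough base R sketch of the same idea, using an invented lookup table rather than the real output of tidy():

```r
# Hypothetical lookup table: zero-based code -> original ring count
ring_map <- data.frame(
  rings_integer = 0:3,
  rings_actual = c(8L, 9L, 10L, 11L)
)

# Toy predictions expressed as zero-based class codes
pred_integer <- c(1L, 0L, 3L, 1L)

# match() finds the row of each code in the lookup table
pred_actual <- ring_map$rings_actual[match(pred_integer, ring_map$rings_integer)]
pred_actual
```

A left join does the same work but keeps the result in a tibble alongside the other columns, which is handier for the confusion matrix.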
And calculate the confusion matrix:
table(pred = rings_pred$rings_actual, actual = rings_actual$rings_actual)
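A confusion matrix also gives you summary metrics directly. Here is a minimal sketch on invented labels that computes overall accuracy and per-class recall. Shared factor levels keep the table square even if a class is never predicted.

```r
# Toy labels, invented for illustration only
actual <- c(8, 9, 9, 10, 10, 10)
pred   <- c(8, 9, 10, 10, 10, 9)

# Shared levels guarantee matching rows and columns
lv <- sort(union(actual, pred))
cm <- table(pred = factor(pred, lv), actual = factor(actual, lv))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Recall per class: correct predictions divided by actual count (column totals)
recall <- diag(cm) / colSums(cm)

accuracy
recall
```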
We can also see the model fit:
logs <- sagemaker_training_job_logs(tune$model_name)

logs %>%
  pivot_longer(`train:mlogloss`:`validation:mlogloss`) %>%
  ggplot(aes(iteration, value, color = name)) +
  geom_line()
Let's also train a model using objective = "multi:softprob" to compare the output.
xgb_estimator2 <- sagemaker_xgb_estimator()

xgb_estimator2$set_hyperparameters(
  eval_metric = "mlogloss",
  objective = "multi:softprob",
  num_class = 10L
)

tune2 <- sagemaker_hyperparameter_tuner(
  xgb_estimator2,
  s3_split(train_path, validation_path),
  ranges,
  max_jobs = 5
)
tune2 <- sagemaker_attach_tuner("sagemaker-xgboost-191201-2049")
tune2
xgb2 <- sagemaker_load_model(tune2)
pred2 <- predict(xgb2, abalone_class[, -1])

glimpse(pred2)
In this case, the outcome is a tibble with an associated probability for each class.
Then we can derive the predicted class by finding the class with the highest probability for each row (variable) of the tibble.
pred2_class <- pred2 %>%
  mutate(variable = row_number()) %>%
  pivot_longer(.pred_0:.pred_9, names_to = "name") %>%
  separate(name, c(NA, ".pred"), sep = "_", convert = TRUE) %>%
  group_by(variable) %>%
  filter(value == max(value)) %>%
  ungroup() %>%
  select(.pred)

glimpse(pred2_class)
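If you only need the class vector, a more compact alternative is base R's max.col(). This is a sketch on an invented three-class data frame, and it assumes the probability columns are ordered .pred_0, .pred_1, and so on, so that the column position minus 1 is the class code.

```r
# Toy probabilities with the same column naming pattern, invented for illustration
probs <- data.frame(
  .pred_0 = c(0.1, 0.6, 0.2),
  .pred_1 = c(0.7, 0.3, 0.2),
  .pred_2 = c(0.2, 0.1, 0.6)
)

# Column index of the row-wise maximum; subtract 1 for zero-based class codes
pred_class <- max.col(as.matrix(probs), ties.method = "first") - 1L
pred_class
```

With ties.method = "first" this always returns exactly one class per row, whereas filtering on value == max(value) keeps every tied class and can return extra rows.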