knitr::opts_chunk$set(echo = TRUE)
The main two parameters for this model are the number of committees as well as the number of neighbors (if any) to use to adjust the model predictions. We'll use two different packages for model tuning. Each will split and resample the data with different code. Their results will be very similar but will not be equal.
Before starting, we'll use the Ames housing data again.
data(ames, package = "modeldata") # model the data on the log10 scale ames$Sale_Price <- log10(ames$Sale_Price) predictors <- c("Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type", "Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area", "Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath", "TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude") ames$Sale_Price <- log10(ames$Sale_Price) ames <- ames[, colnames(ames) %in% c("Sale_Price", predictors)]
To tune the model over different values of neighbors
and committees
, the train
function in the
caret
package can be used to optimize these parameters. For example, to split the data:
library(caret) set.seed(1) in_train <- createDataPartition(ames$Sale_Price, times = 1, list = FALSE) caret_train <- ames[ in_train,] caret_test <- ames[-in_train,]
We'll use basic 10-fold cross-validation to tune the models over the two parameters. Although train()
can make its own grid, we'll use a regular grid with four values per parameter:
grid <- expand.grid(committees = c(1, 10, 50, 100), neighbors = c(0, 1, 5, 9)) set.seed(2) caret_grid <- train( x = subset(caret_train, select = -Sale_Price), y = caret_train$Sale_Price, method = "cubist", tuneGrid = grid, trControl = trainControl(method = "cv") ) caret_grid
Note that the x/y interface was used. This keeps factor predictors intact; if the formula method had been used, factor predictors like Neighborhood
would have been converted to binary indicator columns. That would not cause an error, but Cubist usually works better without the indicators columns.
The next figure shows the profiles of the tuning parameters produced using ggplot(caret_grid)
.
ggplot(caret_grid) + theme(legend.position = "top")
The caret_grid
object selected and fit the final model (with the best results). The test data are predicted using:
predict(caret_grid, head(caret_test))
The tidymodels packages have a slightly different approach to package development. Whereas caret
contains a large number of functions, tidymodels splits the code into small packages that do a few common tasks (e.g. resampling, performance estimation). To start, load the tidymodels
package and this will attach the main set of core packages:
library(tidymodels) tidymodels_prefer()
tidymodels_prefer()
resolves the naming conflicts between it and caret
functions. For example, invoking sensitivity
will now point towards the tidymodels version (but the other function can be used via caret::sensitivity()
).
If you are new to tidymodels, we suggest taking a look at tidymodels.org
or the book Tidy Modeling with R.
We'll split the data into training and test data sets, then use the vfold_cv()
function to create cross-validation folds. There are stored in a tibble object( basically a data frame):
set.seed(3) split <- initial_split(ames, strata = Sale_Price) tm_train <- training(split) tm_test <- testing(split) set.seed(4) cv_folds <- vfold_cv(tm_train, strata = Sale_Price) cv_folds
To use this model, we define the model specification object and tag which parameters that we want to tune for this model (with a value of tune()
). the package that contains the model functions for rule-based models is called rules
; we load that first.
library(rules) cubist_spec <- cubist_rules(committees = tune(), neighbors = tune()) %>% set_engine("Cubist") cubist_spec
Now we can use the tune_grid()
function to estimate performance for all of the parameters in the grid
data frame:
tm_grid <- cubist_spec %>% tune_grid(Sale_Price ~ ., resamples = cv_folds, grid = grid) tm_grid
The results can be sown as a table or a plot:
collect_metrics(tm_grid) autoplot(tm_grid, metric = "rmse")
We can select a parameter combination as the final configuration and update our model specification object with those values.
cubist_fit <- cubist_spec %>% finalize_model(select_best(tm_grid, metric = "rmse")) %>% fit(Sale_Price ~ ., data = tm_train) predict(cubist_fit, head(tm_test))
This can be more easily done with last_fit()
:
cubist_results <- cubist_spec %>% finalize_model(select_best(tm_grid, metric = "rmse")) %>% last_fit(Sale_Price ~ ., split = split) cubist_results # test set results: collect_metrics(cubist_results) # test set predictions: collect_predictions(cubist_results)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.