Tree-based models are amazing. Here's a very simple vignette demonstrating how to use Artisanal Machine Learning's Random Forest and GBM models.

Read Abalone Data

library(ggplot2)
library(viridis)
library(rprojroot)
library(ArtisanalMachineLearning)

set.seed(1337)
abalone_data_file = system.file("external_data", "abalone_data.RDS", package="ArtisanalMachineLearning", mustWork=TRUE)
abalone_data = readRDS(abalone_data_file)

This example uses the 'Abalone' dataset, lifted from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/abalone). The goal is to predict the age of an abalone from its measured physical characteristics.

dim(abalone_data$data)
summary(abalone_data$data)

The moderately sized data set has many numeric columns and a numeric response that takes integer values from 1 to 29.

ggplot(data=data.frame(response=abalone_data$response), aes(response)) + 
    geom_histogram(breaks=seq(0, 30, by = 1), 
                   col="grey", 
                   fill="blue") + 
    labs(x="", y="Count", title="Histogram of Response") + 
    theme_bw()

Random Forest Model

random_forest = aml_random_forest(data=abalone_data$data, 
                                  response=abalone_data$response, 
                                  b=200, 
                                  m=6, 
                                  evaluation_criterion=sum_of_squares, 
                                  min_obs=5, 
                                  max_depth=16, 
                                  verbose=FALSE)
# Fitting the forest above can take a while, so do a cooking show trick and bring out an already baked rf
random_forest = readRDS(file.path(find_root('DESCRIPTION'), 'data/random_forest.RDS'))

Now that we have a random forest model, let's simply verify that it's fitting a better-than-garbage model on the training data.

random_forest_predictions = predict_all(random_forest, abalone_data$data, n_trees=200)

Mean squared error (MSE) on training data

sum((abalone_data$response - random_forest_predictions)^2) / length(abalone_data$response)
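To put that number in context, here's a quick baseline for comparison: the training MSE of a constant model that always predicts the mean response. Any model worth keeping should come in well under this.

# Baseline: MSE of a constant model that always predicts the mean response
baseline_predictions = rep(mean(abalone_data$response), length(abalone_data$response))
sum((abalone_data$response - baseline_predictions)^2) / length(abalone_data$response)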

Comparison of predicted and actual for Random Forest

plotting_data_rf = data.frame(predicted=random_forest_predictions, 
                               actual=abalone_data$response, 
                               Difference=abs(random_forest_predictions - abalone_data$response))

ggplot(plotting_data_rf, aes(x=actual, y=predicted, color=Difference)) +
    geom_jitter() + 
    scale_y_continuous(limits = c(-1, 29)) + 
    scale_color_viridis() + 
    geom_abline(intercept = 0, slope = 1, color="black", size=1.5) + 
    labs(x="Actual", y="Predicted", title="Actual vs Predicted") + 
    theme_bw()
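If you want a single number to go with the plot, the correlation between actual and predicted values is a quick (if crude) summary of how much signal the forest is capturing.

# Correlation between actual and predicted values for the random forest
cor(abalone_data$response, random_forest_predictions)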

Not bad! This model is clearly picking up some signal. Let's try out a small GBM now just for kicks.

GBM Model

gbm = aml_gbm(abalone_data$data, 
              abalone_data$response, 
              learning_rate=.1, 
              n_trees=50, 
              evaluation_criterion=sum_of_squares, 
              min_obs=10, 
              max_depth=4, 
              verbose=FALSE)
# Fitting the GBM above can take a while, so do a cooking show trick and bring out an already baked gbm
gbm = readRDS(file.path(find_root('DESCRIPTION'), 'data/gbm.RDS'))

As with the RF model, let's see if this one picks up any signal at all on the training data.

gbm_predictions = predict_all(gbm, abalone_data$data, n_trees=50)

Mean squared error (MSE) on training data

sum((abalone_data$response - gbm_predictions)^2) / length(abalone_data$response)

Comparison of predicted and actual for GBM

plotting_data_gbm = data.frame(predicted=gbm_predictions, 
                               actual=abalone_data$response, 
                               Difference=abs(gbm_predictions - abalone_data$response))

ggplot(plotting_data_gbm, aes(x=actual, y=predicted, color=Difference)) +
    geom_jitter() + 
    scale_y_continuous(limits = c(-1, 29)) + 
    scale_color_viridis() + 
    geom_abline(intercept = 0, slope = 1, color="black", size=1.5) + 
    labs(x="Actual", y="Predicted", title="Actual vs Predicted") + 
    theme_bw()

Conclusion

The RF model is outperforming the GBM, but the GBM is significantly smaller and the author didn't spend much time tuning the hyperparameters. ¯\_(ツ)_/¯
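For a quick side-by-side view of that claim, here are the two training errors computed above, collected in one place.

# Training MSE for both models, using the predictions computed earlier
data.frame(model=c("Random Forest", "GBM"),
           training_mse=c(mean((abalone_data$response - random_forest_predictions)^2),
                          mean((abalone_data$response - gbm_predictions)^2)))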

Also, this illustration only looks at statistics computed on the training data, so we can't draw any strong conclusions about out-of-sample performance. The author simply wanted to demonstrate that these hand-crafted models produce better-than-trash results, and that has been achieved.


