knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Since we are training a classifier and then giving it images to work on, the way we train the classifier has an impact on how good our results will be.
I will show several ways the classifier can be trained and how to check if the results are good.
library(clasifierrr)
params_df <- tibble::tibble(
  file = c(
    system.file("extdata", "tiny_4T1-shNT-1_layer1.png", package = "clasifierrr"),
    system.file("extdata", "tiny_4T1-shNT-1_layer2.png", package = "clasifierrr")),
  classif = c("spheroid", "bg"),
  related_file = system.file("extdata", "tiny_4T1-shNT-1.png", package = "clasifierrr"))

params_df
Well ... a reasonably good measure of prediction accuracy is the out-of-bag prediction error (OOB for short), and it is possible to modulate several training parameters that affect it.
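For context, the OOB error of a single ranger fit is stored in its prediction.error element. The snippet below is only an illustration on the built-in iris data, not part of the image workflow:

# Illustration only: where ranger stores the OOB error (iris, not image data)
fit <- ranger::ranger(Species ~ ., data = iris, num.trees = 100)
fit$prediction.error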
library(ranger)

train_params_grid <- as.data.frame(expand.grid(
  num.trees = c(50, 100, 200),
  min.node.size = c(1, 5, 20, 50),
  max.depth = c(0, 5, 10, 50, 200)))

# Since we are good scientists we will run it in triplicate ...
train_params_grid <- rbind(train_params_grid, train_params_grid, train_params_grid)
This builds a grid of parameters that we will go through, fitting a model for each combination.
trainset <- build_train_multi(
  params_df,
  train_size_each = 5000,
  filter_widths = c(3, 5))

# This makes a smaller dataset for the purpose of this tutorial ...
small_trainset <- trainset[sample(1:nrow(trainset), 1000), ]
forests <- furrr::future_pmap(
  train_params_grid,
  function(num.trees, min.node.size, max.depth) {
    ranger(
      pixel_class ~ .,
      data = small_trainset,
      num.trees = num.trees,
      importance = "impurity",
      min.node.size = min.node.size,
      max.depth = max.depth)
  },
  .progress = interactive())

train_params_grid$oob_error <- purrr::map_dbl(forests, ~ .x$prediction.error)
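Note that furrr runs sequentially unless a parallel plan has been set; if you want the grid above to be fitted in parallel, you can declare a plan before running it. This is a minimal sketch, and the number of workers should be adjusted to your machine:

# Optional sketch: set a parallel plan so the furrr calls use several workers
future::plan(future::multisession, workers = 4)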
These are just the series of plots I would make to check which combination works best ...
require(ggplot2)

g <- ggplot(train_params_grid,
            aes(y = oob_error, x = num.trees,
                colour = factor(min.node.size),
                fill = factor(min.node.size))) +
  geom_point() +
  geom_smooth(alpha = 0.1) +
  facet_wrap(~ max.depth, labeller = label_both) +
  geom_point(
    data = train_params_grid[, -3],
    alpha = 0.1,
    colour = "#666666",
    aes(y = oob_error, x = num.trees),
    inherit.aes = FALSE) +
  theme_bw()

suppressWarnings(print(g))
Now let's take a look at the prediction times ...
img <- readImageBw(params_df$related_file[[1]])
dims_use <- dim(img)
feats <- calc_features(img, filter_widths = c(3, 5))

my_class_fun <- function(classifier) {
  tt <- system.time({
    suppressMessages({
      class_img <- classify_img(
        feature_frame = feats,
        classifier = classifier,
        dims = dims_use,
        class_highlight = "bg")
    })
  })
  return(list(class_img, tt))
}

class_results <- furrr::future_map(
  forests, my_class_fun,
  .progress = interactive())
train_params_grid$predict_time <- purrr::map_dbl(class_results, ~ .x[[2]][[3]])

ggplot(train_params_grid,
       aes(y = predict_time, x = num.trees,
           colour = factor(min.node.size),
           group = interaction(min.node.size, num.trees))) +
  geom_boxplot() +
  facet_wrap(~ max.depth, labeller = label_both) +
  theme_bw()

ggplot(train_params_grid,
       aes(y = predict_time, x = factor(min.node.size),
           colour = factor(max.depth),
           group = interaction(min.node.size, max.depth))) +
  geom_boxplot() +
  theme_bw()
We can notice that increasing min.node.size does not increase prediction times, whilst increasing num.trees definitely does.
ggplot(train_params_grid,
       aes(y = oob_error, x = predict_time, colour = factor(num.trees))) +
  geom_point() +
  theme_bw()
From this plot we can see that a larger num.trees definitely makes prediction slower but not necessarily better. Nonetheless, it is more consistent across the other parameters.
ggplot(train_params_grid,
       aes(y = oob_error, x = predict_time, colour = factor(min.node.size))) +
  geom_point() +
  theme_bw()
And from this plot we can see that higher OOB errors occur with larger min.node.size, and a value of 5 seems optimal.
ggplot(train_params_grid,
       aes(y = oob_error, x = predict_time, colour = factor(max.depth))) +
  geom_density2d(alpha = 0.3) +
  theme_bw()
We can see that slower models don't necessarily make better predictions!
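With those observations in mind, one way to wrap up this exploration is to fit a single forest with the settings that looked favourable here. The values below (num.trees = 100, min.node.size = 5) are only an illustration; the ones you pick should come from your own grid results:

# Sketch: a final forest with hand-picked settings based on the plots above
final_forest <- ranger(
  pixel_class ~ .,
  data = small_trainset,
  num.trees = 100,
  min.node.size = 5,
  importance = "impurity")
final_forest$prediction.error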
caret is a package that helps you benchmark your classifiers. It implements a lot of models to choose from, and we can select one of them to train our classifier. This benchmarking is done by cross-validation in several ways; please read the caret documentation for more details. In this case, we will use it to train the default classifier used by our package, which is a random forest as implemented by the ranger package.
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 2, repeats = 5)

model_ranger <- train(
  pixel_class ~ .,
  data = small_trainset,
  method = "ranger",
  trControl = ctrl)
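Repeated cross-validation is only one of the resampling schemes caret supports; others can be swapped in through the method argument of trainControl(). The two alternatives below are purely illustrative and are not evaluated in this vignette:

# Illustrative alternatives (not run): plain k-fold CV and bootstrap resampling
ctrl_cv   <- trainControl(method = "cv", number = 5)
ctrl_boot <- trainControl(method = "boot", number = 25)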
Once the model is trained, we can see how it performed by checking the output of nameofyourmodel$results.
model_ranger$results
Furthermore, we can compare across different machine learning algorithms!
suppressWarnings({
  model_glm <- train(
    pixel_class ~ .,
    data = small_trainset,
    method = "glm",
    trControl = ctrl)
})

model_glm$results
And by comparing the values of Kappa and Accuracy, we can decide which model works better for our data! Details on how to use the model later on can be found in our other vignette, alternative_classifiers.
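If you prefer a side-by-side summary instead of reading each $results table separately, caret's resamples() helper can collect both fits. This is just a sketch; strictly speaking, the two models should share the same resampling indices (for instance via the index argument of trainControl) for the comparison to be fully fair:

# Sketch: pool the resampled Accuracy and Kappa of both models
model_comparison <- resamples(list(ranger = model_ranger, glm = model_glm))
summary(model_comparison)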