Stranded Model Tutorial
In NHSRdatasets: NHS and Healthcare-Related Data for Education and Training

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette details why the stranded_data dataset was created, shows how to load it, and gives examples of its use with the caret machine learning library.

The dataset contains:

stranded.label: Character - label indicating whether or not the patient is stranded
age: Integer - the age of the patient on admission to hospital
care.home.referral: Integer - flag to indicate referred from care home
medicallysafe: Integer - flag to indicate whether the patient is medically safe, i.e. fit to be discharged but not yet discharged
hcop: Integer - flag to indicate whether the patient is in a Health Care for Older People area
mental_health_care: Integer - flag to indicate mental health care provision
period_of_previous_care: Integer - flag to indicate previous periods of care
admit_date: Date - admit date
frailty_index: Character - specifying frailty type, if frail

First, load the data and inspect it:

library(NHSRdatasets)
library(dplyr)
library(ggplot2)
library(caret)
library(rsample)
library(varhandle)

data("stranded_data")
glimpse(stranded_data)
prop.table(table(stranded_data$stranded.label))

The output shows a relatively even split between the not stranded and stranded labels, which is a good starting point for classification. For data with a stronger class imbalance, the webinar on Advanced Modelling covers techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples), to name a few.
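As a rough illustration of what rebalancing does, the sketch below applies plain random oversampling in base R. This is a simpler stand-in, not SMOTE or ROSE. The toy data frame, its 90/10 split and the label values are invented for the example and are not taken from stranded_data.

```r
# Toy data, invented for illustration: 90 "Not Stranded" vs 10 "Stranded".
set.seed(42)
toy <- data.frame(
  age = round(runif(100, 18, 95)),
  stranded.label = factor(rep(c("Not Stranded", "Stranded"), times = c(90, 10)))
)

# Size of the largest class; smaller classes are topped up to this size.
target_n <- max(table(toy$stranded.label))

# Keep every original row and, for each class, add rows sampled with
# replacement until the class reaches target_n rows.
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$stranded.label), function(rows) {
  extra <- target_n - length(rows)
  c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
}))
balanced <- toy[idx, ]

table(balanced$stranded.label)
```

Resampling of this kind should be applied to the training partition only, so that the test set still reflects the class mix seen in practice.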

Feature engineering

The next step is to decide which features need to be engineered for our machine learning model. We will drop admit_date and recode the frailty index. Age could also be grouped into age bands, although that is not done here.

# Convert the outcome to a factor and drop the admission date
stranded_data <- stranded_data %>% 
  dplyr::mutate(stranded.label = factor(stranded.label)) %>% 
  dplyr::select(-admit_date)

Next, we select the remaining character variables (here only frailty_index) and convert them into dummy variables, i.e. a numerical encoding of a categorical variable:

cats <- select_if(stranded_data, is.character)
# Dummy-encode the frailty index; each new column gets the prefix "frail_ind"
cat_dummy <- varhandle::to.dummy(cats$frailty_index, "frail_ind")
cat_dummy <- cat_dummy %>% 
  as.data.frame() %>% 
  dplyr::select(-frail_ind.No_index_item) # Drop the reference level to avoid a redundant (collinear) column
# Drop the original frailty index from the stranded data frame and bind on the new dummy variables
stranded_data <- stranded_data %>% 
  dplyr::select(-frailty_index) %>% 
  bind_cols(cat_dummy) %>% 
  na.omit()
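For comparison, the same kind of encoding can be sketched in base R with model.matrix(), which drops the reference level automatically. The frailty values "Mobility" and "Falls" below are invented for illustration. Only "No_index_item" appears in the code above.

```r
# Toy frailty values, invented for illustration.
toy <- data.frame(
  frailty_index = c("No_index_item", "Mobility", "Falls", "Mobility"),
  stringsAsFactors = FALSE
)

# Make the baseline level explicit so it becomes the dropped reference category.
toy$frailty_index <- relevel(factor(toy$frailty_index), ref = "No_index_item")

# model.matrix() builds an intercept plus one 0/1 column per non-reference
# level; [, -1] removes the intercept column.
dummies <- model.matrix(~ frailty_index, data = toy)[, -1, drop = FALSE]
dummies
```

Each row has at most one 1, and a row of all zeros denotes the reference category.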

The data is now ready to be split into training and test sets for the machine learning model.

Splitting the data

The next step is to create a simple hold out train/test split:

set.seed(123) # Make the random split reproducible
split <- rsample::initial_split(stranded_data, prop = 3/4)
train <- rsample::training(split)
test <- rsample::testing(split)
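A purely random split can leave the two partitions with slightly different class proportions. One option, sketched here on invented toy data rather than on stranded_data, is the strata argument of rsample::initial_split(), which samples within each class.

```r
library(rsample)

# Toy data, invented for illustration: 30% "Stranded".
set.seed(123)
toy <- data.frame(
  age = round(runif(200, 18, 95)),
  stranded.label = factor(rep(c("Not Stranded", "Stranded"), times = c(140, 60)))
)

# strata keeps the class proportions similar in both partitions.
toy_split <- initial_split(toy, prop = 3/4, strata = stranded.label)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

prop.table(table(toy_train$stranded.label))
prop.table(table(toy_test$stranded.label))
```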

Create a simple logistic regression model to classify stranded patients

The next step is to create a stranded classification model with caret:

set.seed(123)
glm_class_mod <- caret::train(stranded.label ~ ., data = train, 
                 method = "glm") # Logistic regression; stranded.label is already a factor
print(glm_class_mod)

This is a very basic model and could be improved by model choice, hyperparameter selection, different resampling strategies, etc.
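As one example of a different resampling strategy, the sketch below swaps caret's default bootstrap resampling for 5-fold cross-validation via trainControl(). The toy data and the simulated relationship between age and the label are assumptions made for the example and say nothing about the real dataset.

```r
library(caret)

# Toy data, invented for illustration.
set.seed(123)
n <- 200
toy <- data.frame(
  age = round(runif(n, 18, 95)),
  medicallysafe = rbinom(n, 1, 0.5)
)
# Simulated outcome: older patients are more likely to be labelled "Stranded".
p <- plogis(-4 + 0.06 * toy$age)
toy$stranded.label <- factor(ifelse(runif(n) < p, "Stranded", "Not Stranded"))

# 5-fold cross-validation instead of the default bootstrap resampling.
ctrl <- trainControl(method = "cv", number = 5)

cv_mod <- train(stranded.label ~ ., data = toy, method = "glm", trControl = ctrl)
cv_mod$results # Cross-validated Accuracy and Kappa
```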

Predicting on the test set to validate the model

Next, we will use the test dataset to see how our model will perform in the wild:

preds <- predict(glm_class_mod, newdata = test) # Predict class
pred_prob <- predict(glm_class_mod, newdata = test, type="prob") #Predict probs

# Join prediction on to actual test data frame and evaluate in confusion matrix

predicted <- data.frame(preds, pred_prob)
test <- test %>% 
  bind_cols(predicted) %>% 
  dplyr::rename(pred_class=preds)

glimpse(test)
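The predicted probabilities are useful when a cutoff other than the usual 0.5 is wanted, for instance to flag more potentially stranded patients at the cost of more false alarms. The following is a minimal sketch in base R. The probabilities are invented and classify() is a hypothetical helper, not part of caret.

```r
# Toy predicted probabilities of the "Stranded" class, invented for illustration.
prob_stranded <- c(0.10, 0.35, 0.48, 0.62, 0.90)

# Hypothetical helper: label a case "Stranded" when its probability
# reaches the cutoff.
classify <- function(prob, cutoff = 0.5) {
  factor(ifelse(prob >= cutoff, "Stranded", "Not Stranded"),
         levels = c("Not Stranded", "Stranded"))
}

table(classify(prob_stranded))               # cutoff 0.5: 3 Not Stranded, 2 Stranded
table(classify(prob_stranded, cutoff = 0.3)) # cutoff 0.3: 1 Not Stranded, 4 Stranded
```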

Evaluating with confusion matrix

The final step is to evaluate the model:

# confusionMatrix() expects the predictions first and the true labels second
caret::confusionMatrix(data = test$pred_class, reference = test$stranded.label, positive = "Stranded")

The model performs relatively well, and could be improved with better predictors, a larger sample and class imbalance techniques.
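To make the headline figures in a confusion matrix easier to interpret, they can be recomputed by hand. The sketch below does this in base R on eight invented predictions. The labels and values are assumptions for illustration and are not results from the model above.

```r
# Toy predictions and true labels, invented for illustration.
lvls <- c("Not Stranded", "Stranded")
actual <- factor(c("Stranded", "Stranded", "Stranded", "Not Stranded",
                   "Not Stranded", "Not Stranded", "Not Stranded", "Stranded"),
                 levels = lvls)
predicted <- factor(c("Stranded", "Stranded", "Not Stranded", "Not Stranded",
                      "Not Stranded", "Stranded", "Not Stranded", "Stranded"),
                    levels = lvls)

# Rows are predictions, columns are the true labels.
cm <- table(Prediction = predicted, Reference = actual)

tp <- cm["Stranded", "Stranded"]         # correctly flagged as stranded
tn <- cm["Not Stranded", "Not Stranded"] # correctly flagged as not stranded
fp <- cm["Stranded", "Not Stranded"]     # flagged as stranded, but not
fn <- cm["Not Stranded", "Stranded"]     # missed stranded patients

accuracy    <- (tp + tn) / sum(cm) # share of all predictions that are correct
sensitivity <- tp / (tp + fn)      # share of stranded patients that are found
specificity <- tn / (tn + fp)      # share of not stranded patients that are found
```

With "Stranded" as the positive class, sensitivity measures how many truly stranded patients the model identifies, which is often the figure of most operational interest.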

Conclusion

This dataset can be used for a number of classification problems and can serve as the NHS's equivalent of the iris dataset for classification, although it is only suited to binary classification problems.