knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

glmPredict

glmPredict extends the functionality of the broom::augment.glm function by making predictions of the target variable class. The glm_predict function makes these predictions. It requires a fitted glm model object with family=binomial or family=quasibinomial as input, as well as the name of the target variable. This is analogous to making predictions on Logistic Regression model.

The class is predicted by comparing the response (the predicted probability of the positive class) to a threshold, 0.5 by default. The threshold argument can be changed to specify a higher or lower threshold for predicting the positivei class. By default, the model will predict on the data present in the glm model, but a different dataset with the same feature variables can be passed to the function with the newdata argument to make predictions on a new dataset.

This package also comes with various functions for evaluating a set of predictions. Accuracy can be evaluated with the accuracy_score function, precision can be calculated with precision_score, and recall_score calculates the recall of the predictions.

The glmPredict package also comes equipped with a toy dataset, cancer_clean, to use and test the functionality of glm_predict().

You can install the development version from GitHub with:

install.packages("devtools")
devtools::install_github("janetxinli/glmPredict")

Basic Usage

library(glmPredict)

You can learn more about modelling in R and Generalised Linear Models here. For this demo, we'll be working with the cancer_clean dataset. The cancer_clean dataset is a simplified version of the Breast Cancer Wisconsin Dataset.

You can learn more about the dataset by typing ?cancer_clean into your Rstudio console. Essentially, the dataset is a tibble with 569 rows and 11 columns. Each row represents measurements from a single breast mass biopsy image, and each column is a different metric describing the nuclei present in the image. The malignant column describes the final diagnosis of each sample, where 1 = malignant and 0 = benign.

First, we'll start by subsetting cancer_clean by row. cancer_subset will contain 75% of the rows in the original dataset, and cancer_new will contain the other 25%.

set.seed(123)
subset_len <- floor(0.75 * nrow(cancer_clean))
subset_ind <- sample(seq_len(nrow(cancer_clean)), size=subset_len)
cancer_subset <- cancer_clean[subset_ind,]
cancer_new <- cancer_clean[-subset_ind,]

Say we want to predict whether a sample will be classified as malignant or benign (y) by taking into account the mean texture and mean radius (x) of nuclei in each image. We can create a glm with the cancer_subset dataset and the formula y ~ x.

(cancer_model <- glm(malignant ~ texture_mean + radius_mean, data=cancer_subset, family="binomial"))

Now if we want to see which samples in cancer_subset our model predicts are malignant, we can use the glm_predict function:

(preds <- glm_predict(cancer_model, "malignant"))

Under the hood, glm_predict() runs augment.glm(type.predict="response") on the model. This returns the predicted probability of each sample being in the positive class, and glm_predict() compares this probability to a certain threshold to classify each sample as either positive (malignant) or negative (benign). The predictions are appended to the augment.glm tibble, and returned from the function.

The output of glm_predict() is the augment object from augment.glm, with an extra column .prediction containing the predicted label for each example.

Predicting on new data

We can make predictions on new data that isn't in the glm model object, as well. We didn't use cancer_new when creating our model, but what if we want to see what the model predicts for those observations? In this case, we can pass cancer_new by specifying the newdata argument in the glm_predict function:

(preds_new <- glm_predict(cancer_model, "malignant", newdata=cancer_new))

Changing threshold

Most often, a probability threshold of 0.5 is used to predict the positive class, but you might want to adjust this threshold if you want to be more or less strict with your predictions. In our case, the positive class is serious, and we would rather have more false positives than false negatives. We could decrease the threshold so we predict a sample as malignant more often. This can be done by specifying the threshold argument:

glm_predict(cancer_model, "malignant", threshold=0.4)

Evaluating predictions

The correctness of the predictions can be evaluated with several metrics, such as accuracy, prediction and recall. glmPredict provides functions to calculate these three metrics: accuracy_score, precision_score and recall_score. To use these functions, we need to compare the "ground truth" labels to our predictions given by glm_predict.

accuracy_score(cancer_subset[["malignant"]], preds[[".prediction"]])
precision_score(cancer_subset[["malignant"]], preds[[".prediction"]])
recall_score(cancer_subset[["malignant"]], preds[[".prediction"]])


janetxinli/glmPredict documentation built on Jan. 1, 2021, 4:28 a.m.