knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
glmPredict extends the functionality of the broom::augment.glm
function by making predictions of the target variable class. The glm_predict
function makes these predictions. It requires a fitted glm
model object with family=binomial
or family=quasibinomial
as input, as well as the name of the target variable. This is analogous to making predictions on Logistic Regression model.
The class is predicted by comparing the response (the predicted probability of the positive class) to a threshold, 0.5 by default. The threshold
argument can be changed to specify a higher or lower threshold for predicting the positivei class. By default, the model will predict on the data present in the glm
model, but a different dataset with the same feature variables can be passed to the function with the newdata
argument to make predictions on a new dataset.
This package also comes with various functions for evaluating a set of predictions. Accuracy can be evaluated with the accuracy_score
function, precision can be calculated with precision_score
, and recall_score
calculates the recall of the predictions.
The glmPredict
package also comes equipped with a toy dataset, cancer_clean
, to use and test the functionality of glm_predict()
.
You can install the development version from GitHub with:
install.packages("devtools") devtools::install_github("janetxinli/glmPredict")
library(glmPredict)
You can learn more about modelling in R and Generalised Linear Models here. For this demo, we'll be working with the cancer_clean
dataset. The cancer_clean
dataset is a simplified version of the Breast Cancer Wisconsin Dataset.
You can learn more about the dataset by typing ?cancer_clean
into your Rstudio console. Essentially, the dataset is a tibble with 569 rows and 11 columns. Each row represents measurements from a single breast mass biopsy image, and each column is a different metric describing the nuclei present in the image. The malignant
column describes the final diagnosis of each sample, where 1 = malignant and 0 = benign.
First, we'll start by subsetting cancer_clean
by row. cancer_subset
will contain 75% of the rows in the original dataset, and cancer_new
will contain the other 25%.
set.seed(123) subset_len <- floor(0.75 * nrow(cancer_clean)) subset_ind <- sample(seq_len(nrow(cancer_clean)), size=subset_len) cancer_subset <- cancer_clean[subset_ind,] cancer_new <- cancer_clean[-subset_ind,]
Say we want to predict whether a sample will be classified as malignant or benign (y) by taking into account the mean texture and mean radius (x) of nuclei in each image. We can create a glm
with the cancer_subset
dataset and the formula y ~ x
.
(cancer_model <- glm(malignant ~ texture_mean + radius_mean, data=cancer_subset, family="binomial"))
Now if we want to see which samples in cancer_subset
our model predicts are malignant, we can use the glm_predict
function:
(preds <- glm_predict(cancer_model, "malignant"))
Under the hood, glm_predict()
runs augment.glm(type.predict="response")
on the model. This returns the predicted probability of each sample being in the positive class, and glm_predict()
compares this probability to a certain threshold to classify each sample as either positive (malignant) or negative (benign). The predictions are appended to the augment.glm
tibble, and returned from the function.
The output of glm_predict()
is the augment
object from augment.glm
, with an extra column .prediction
containing the predicted label for each example.
We can make predictions on new data that isn't in the glm
model object, as well. We didn't use cancer_new
when creating our model, but what if we want to see what the model predicts for those observations? In this case, we can pass cancer_new
by specifying the newdata
argument in the glm_predict
function:
(preds_new <- glm_predict(cancer_model, "malignant", newdata=cancer_new))
Most often, a probability threshold of 0.5 is used to predict the positive class, but you might want to adjust this threshold if you want to be more or less strict with your predictions. In our case, the positive class is serious, and we would rather have more false positives than false negatives. We could decrease the threshold so we predict a sample as malignant more often. This can be done by specifying the threshold
argument:
glm_predict(cancer_model, "malignant", threshold=0.4)
The correctness of the predictions can be evaluated with several metrics, such as accuracy, prediction and recall. glmPredict
provides functions to calculate these three metrics: accuracy_score
, precision_score
and recall_score
. To use these functions, we need to compare the "ground truth" labels to our predictions given by glm_predict
.
accuracy_score(cancer_subset[["malignant"]], preds[[".prediction"]])
precision_score(cancer_subset[["malignant"]], preds[[".prediction"]])
recall_score(cancer_subset[["malignant"]], preds[[".prediction"]])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.