knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

glmPredict

Lifecycle: experimental License: MIT

glmPredict extends the functionality of the augment.glm function of the broom package to make and evaluate class predictions for a dataset with a Generalised Linear Model. This package was created for an assignment in STAT545B.

Installation

You can install the development version from GitHub with:

install.packages("devtools")
devtools::install_github("janetxinli/glmPredict")

Examples

We're going to predict the malignancy of samples in a cleaned version of the Breast Cancer Wisconsin Dataset by creating a simple Generalised Linear Model. The dataset, cancer_clean, is available in the glmPredict package; it is a slightly reduced version of the original. The final diagnosis of each sample is encoded in the malignant column, where a value of 1 means the sample was diagnosed as malignant and a value of 0 means the sample was benign.
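To get a feel for the data before modelling, you can peek at the first few rows directly (the dataset is lazily loaded with the package):

head(cancer_clean)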

First, we need to fit the glm with family=binomial or family=quasibinomial, so that the binary response is modelled appropriately. We'll also subset the data, so that we have a set of unseen data to pass to glm_predict later.

library(glmPredict)

# Start by subsetting the data
set.seed(123)
subset_len <- floor(0.75 * nrow(cancer_clean))
subset_ind <- sample(seq_len(nrow(cancer_clean)), size=subset_len)
cancer_subset <- cancer_clean[subset_ind,]
cancer_new <- cancer_clean[-subset_ind,]

# Create a glm with cancer_subset
(cancer_model <- glm(malignant ~ texture_mean + radius_mean, data=cancer_subset, family="binomial"))

Then, we can use the cancer_model to make predictions, setting the threshold for predicting the positive class (malignant in this case) to 0.6 for example purposes:

(cancer_predictions <- glm_predict(cancer_model, "malignant", threshold=0.6))

By default, glm_predict considers the positive class label to be 1 and the negative class label to be 0, which is the standard for binary classification problems, but you can provide different labels using the pos_label and neg_label arguments. Note that the response labels must still lie between 0 and 1, which is also a limitation of the glm function itself.
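For instance, the call above is equivalent to spelling out the default labels explicitly (a sketch, assuming the argument names described above):

# Equivalent to the earlier call, with the default class labels made explicit
glm_predict(cancer_model, "malignant", threshold=0.6, pos_label=1, neg_label=0)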

The function returns the broom::augment() output with an additional .prediction column containing the predicted class for each observation.

cancer_predictions[[".prediction"]]

You can also pass a different dataset (one not used to create the glm) to make predictions with glm_predict. To do so, you just need to specify the dataset with the newdata argument.

glm_predict(cancer_model, "malignant", threshold=0.6, newdata=cancer_new)

There are several metrics for evaluating predictions in a binary classification problem, such as accuracy, precision and recall. glmPredict provides functions to calculate these three scores: accuracy_score, precision_score and recall_score (the standard definitions of the three metrics are sketched after the examples below). You can score a set of predictions by providing the true labels and predicted labels to one of these functions, for example:

accuracy_score(cancer_subset[["malignant"]], cancer_predictions[[".prediction"]])

You can also save the true labels and predictions as variables and call one of the scoring functions on the individual vectors:

true_labels <- cancer_subset[["malignant"]]
predicted_labels <- cancer_predictions[[".prediction"]]
precision_score(true_labels, predicted_labels)
recall_score(true_labels, predicted_labels)
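For reference, the standard definitions these functions implement can be sketched in base R (a minimal illustration, not the package's actual implementation):

# Count outcomes against the positive class (1)
tp <- sum(true_labels == 1 & predicted_labels == 1)  # true positives
fp <- sum(true_labels == 0 & predicted_labels == 1)  # false positives
fn <- sum(true_labels == 1 & predicted_labels == 0)  # false negatives
tn <- sum(true_labels == 0 & predicted_labels == 0)  # true negatives

(tp + tn) / (tp + tn + fp + fn)  # accuracy
tp / (tp + fp)                   # precision
tp / (tp + fn)                   # recall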

Code of Conduct

Please note that the glmPredict project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


Exercise 1.1: My Development

I started by using devtools to set up the structure of my package:

library(devtools)
create_package("~/Documents/school/masters/stat545/glmPredict")

I set up Git manually through the command line. Then, I opened the glmPredict directory in RStudio and created an R file for my function with use_r("predict").

I copied my function from Assignment 1b directly into the created file, R/predict.R, and made some changes that I thought would make it more usable. Since my function has a few dependencies, I used the use_package function to explicitly tell devtools that these are required:

use_package("broom")
use_package("dplyr")
use_package("stats")
use_package("datateachr")
use_pipe()

I also changed the function calls in R/predict.R to explicitly call the functions from their respective packages (e.g. changed as.formula(f) to stats::as.formula(f)).

Next, I added the documentation skeleton to my file with Code > Insert Roxygen Skeleton, and added some information about the function parameters, return value and a couple of example function calls. Then, I added licensing information and aggregated the documentation:

use_mit_license("Janet Li")
document()
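For reference, the roxygen2 header in R/predict.R looks roughly like this (a sketch; the parameter names shown are illustrative, not the exact documentation):

#' Make class predictions with a glm
#'
#' @param model A fitted glm object.
#' @param y Name of the response column, as a string.
#' @param threshold Probability threshold for predicting the positive class.
#'
#' @return The broom::augment() output with an added .prediction column.
#' @export
#'
#' @examples
#' # model <- glm(malignant ~ texture_mean + radius_mean,
#' #              data = cancer_clean, family = "binomial")
#' # glm_predict(model, "malignant", threshold = 0.5)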

At this point, I wanted to check to make sure that I was on the right track, so I used load_all() to do a quick manual test of the function. Everything was working as expected, so I ran check()... and got an error! Turns out the @examples portion of the roxygen2 skeleton actually runs the code, and I was using variable names that didn't exist in my environment.
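The quick iteration loop at this stage was essentially:

load_all()   # load the package functions into the session for a manual test
check()      # run R CMD check, which also runs the @examples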

To ensure that the examples (and my future tests) are able to run, I used use_data_raw(name="cancer_clean") to create a data-cleaning script at data-raw/cancer_clean.R. After adding my cleaning code to this file, I ran usethis::use_data(cancer_clean, overwrite = TRUE) to add the .rda object to the data directory, then documented the dataset in R/data.R.
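In code, that step looked like:

use_data_raw(name = "cancer_clean")                 # creates data-raw/cancer_clean.R
# ...after writing the cleaning code in that script:
usethis::use_data(cancer_clean, overwrite = TRUE)   # saves data/cancer_clean.rda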

To add some tests to my package, I used use_test() and moved my tests from Assignment 1b over to tests/testthat/test-predict.R. After running the tests manually, I used test() to run the automated tests. Everything passed!
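Again in code:

use_test()   # creates tests/testthat/test-predict.R for the open R/predict.R
test()       # run all testthat tests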

To create a vignette, I used use_vignette("glmPredict"), and added some more detailed documentation to the generated vignettes/glmPredict.Rmd.

Finally, I added a code of conduct and created a package website with:

use_code_of_conduct()
use_pkgdown()
build_site()

To render this README file, I used build_readme().
