knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
glmPredict extends the functionality of the augment.glm
function of the broom
package to make and evaluate class predictions for a dataset with a Generalised Linear Model. This package was created for an assignment in STAT545B.
You can install the development version from GitHub with:
install.packages("devtools") devtools::install_github("janetxinli/glmPredict")
We're going to predict the malignancy of samples in a cleaned version of the Breast Cancer Wisconsin Dataset by creating a simple Generalised Linear Model. The dataset, cancer_clean
, is available in the glmPredict
package, where it has been slightly shrunk down from the original. The final diagnosis of the samples are encoded by the malignant
column, where a value of 1 means the sample was diagnosed as malignant, and a sample of 0 means the sample was benign.
First, we need to make the glm
, with family=binomial
or family=quasibinomial
to specify the response as a factor. The data will be subsetted as well, so we have a set of unseen data to pass into glm_predict
later.
library(glmPredict) # Start by subsetting the data set.seed(123) subset_len <- floor(0.75 * nrow(cancer_clean)) subset_ind <- sample(seq_len(nrow(cancer_clean)), size=subset_len) cancer_subset <- cancer_clean[subset_ind,] cancer_new <- cancer_clean[-subset_ind,] # Create a glm with cancer_subset (cancer_model <- glm(malignant ~ texture_mean + radius_mean, data=cancer_subset, family="binomial"))
Then, we can use the cancer_model
to make predictions, setting the threshold for predicting the positive class (malignant in this case) to 0.6 for example purposes:
(cancer_predictions <- glm_predict(cancer_model, "malignant", threshold=0.6))
By default, glm_predict
considers the positive class label to be 1
and the negative class to be 0
, which is the standard for binary classification problems, but you can provide different labels using the pos_label
and neg_label
arguments. Note that the y value label must still be between 0 and 1, which is also a limitation of the glm
function.
The function returns the augment
object, with an additional .prediction
column containing the predicted class for each observation/example.
cancer_predictions[[".prediction"]]
You can also pass a different dataset (one not used to create the glm
) to make predictions with glm_predict
. To do so, you just need to specify the dataset with the newdata
argument.
glm_predict(cancer_model, "malignant", threshold=0.6, newdata=cancer_new)
There are several metrics for evaluating predictons in a binary classification problem, such as accuracy, precision and recall. glmPredict
provides functions to calculate these three scores: accuracy_score
, precision_score
and recall_score
. You can score a set of predictions by providing the true labels and predicted labels to one of these functions, for example:
accuracy_score(cancer_subset[["malignant"]], cancer_predictions[[".prediction"]])
You can also save the true labels and predictions as variables and call one of the scoring functions on the individual vectors:
true_labels <- cancer_subset[["malignant"]] predicted_labels <- cancer_predictions[[".prediction"]] precision_score(true_labels, predicted_labels) recall_score(true_labels, predicted_labels)
Please note that the glmPredict project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
I started by using devtools to set up the structure of my package:
library(devtools) create_package("~/Documents/school/masters/stat545/glmPredict")
I set up Git manually through the command line. Then, I opened up the glmPredict directory in Rstudio and created an R file for my function with use_r("predict")
.
I copied my function from Assignment 1b directly into the created file, R/predict.R, and made some changes that I thought would make it more usable. Since my function has a few dependencies, I used the use_package function to explicitly tell devtools that these are required:
use_package("broom") use_package("dplyr") use_package("stats") use_package("datateachr") use_pipe()
I also changed the function calls in R/predict.R
to explicitly call the functions from their respective packages (i.e. changed as.formula(f)
to stats::as.formula(f)
).
Next, I added the documentation skeleton to my file with Code > Insert Roxygen Skeleton
, and added some information about the function parameters, return value and a couple of example function calls. Then, I added licensing information and aggregated the documentation:
use_mit_license("Janet Li") document()
At this point, I wanted to check to make sure that I was on the right track, so I used load_all()
to do a quick manual test of the function. Everything was working as expected, so I ran check()
... and got an error! Turns out the @examples
portion of the roxygen2 skeleton actually runs the code, and I was using variable names that didn't exist in my environment.
To ensure that the examples (and my future tests) are able to run, I used use_data_raw(name="cancer_clean")
to create an R script to clean a dataset to data-raw/cancer_clean.R
. After adding my code to this file, I ran usethis::use_data(cancer_clean, overwrite = TRUE)
to add the .rda
object to the data directory, then I documented the dataset in R/data.R
.
To add some tests to my package, I used use_test()
and moved over my tests from Assignment 1b to the tests/testthat/test-predict.R
. After running the tests manually, I used test()
to run the automated test. Everything passed!
To create a vignette, I used use_vignette("glmPredict")
, and added some more detailed documentation to vignettes/vignette.Rmd
.
Finally, I added a code of conduct and created a package website with:
use_code_of_conduct() use_pkgdown() build_site()
To render this README file, I used build_readme()
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.