In christophM/interpretable-ml-book: Interpretable machine learning

devtools::load_all()

Datasets {#data}

Throughout the book, all models and techniques are applied to real datasets that are freely available online. We will use different datasets for different tasks: Classification, regression and text classification.

Bike Rentals (Regression) {#bike-data}

This dataset contains daily counts of rented bicycles from the bicycle rental company Capital-Bikeshare in Washington D.C., along with weather and seasonal information. The data was kindly made openly available by Capital-Bikeshare. Fanaee-T and Gama (2013)[^Fanaee] added weather data and season information. The goal is to predict how many bikes will be rented depending on the weather and the day. The data can be downloaded from the UCI Machine Learning Repository.

New features were added to the dataset and not all original features were used for the examples in this book. Here is the list of features that were used:

Count of bicycles including both casual and registered users. The count is used as the target in the regression task.
The season, either spring, summer, fall or winter.
Indicator whether the day was a holiday or not.
The year, either 2011 or 2012.
Number of days since the 01.01.2011 (the first day in the dataset). This feature was introduced to take account of the trend over time.
Indicator whether the day was a working day or weekend.
The weather situation on that day. One of:
- clear, few clouds, partly cloudy, cloudy
- mist + clouds, mist + broken clouds, mist + few clouds, mist
- light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
- heavy rain + ice pallets + thunderstorm + mist, snow + mist
Temperature in degrees Celsius.
Relative humidity in percent (0 to 100).
Wind speed in km per hour.

For the examples in this book, the data has been slightly processed. You can find the processing R-script in the book's GitHub repository together with the final RData file.

YouTube Spam Comments (Text Classification) {#spam-data}

As an example for text classification we work with 1956 comments from 5 different YouTube videos. Thankfully, the authors who used this dataset in an article on spam classification made the data freely available (Alberto, Lochter, and Almeida (2015)[^Alberto]).

The comments were collected via the YouTube API from five of the ten most viewed videos on YouTube in the first half of 2015. All 5 are music videos. One of them is "Gangnam Style" by Korean artist Psy. The other artists were Katy Perry, LMFAO, Eminem, and Shakira.

Checkout some of the comments. The comments were manually labeled as spam or legitimate. Spam was coded with a "1" and legitimate comments with a "0".

library("kableExtra")
data(ycomments)
tab = knitr::kable(ycomments[1:10, c('CONTENT', 'CLASS')], booktabs = TRUE, caption = "Sample of comments from the YouTube Spam dataset")
if (is.pdf) tab = tab %>% column_spec(1, width = "10cm")
tab

You can also go to YouTube and take a look at the comment section. But please do not get caught in YouTube hell and end up watching videos of monkeys stealing and drinking cocktails from tourists on the beach. The Google Spam detector has also probably changed a lot since 2015.

Watch the view-record breaking video "Gangnam Style" here.

If you want to play around with the data, you can find the RData file along with the R-script with some convenience functions in the book's GitHub repository.

Risk Factors for Cervical Cancer (Classification) {#cervical}

The cervical cancer dataset contains indicators and risk factors for predicting whether a woman will get cervical cancer. The features include demographic data (such as age), lifestyle, and medical history. The data can be downloaded from the UCI Machine Learning repository and is described by Fernandes, Cardoso, and Fernandes (2017)[^Fernandes].

The subset of data features used in the book's examples are:

Age in years
Number of sexual partners
First sexual intercourse (age in years)
Number of pregnancies
Smoking yes or no
Smoking (in years)
Hormonal contraceptives yes or no
Hormonal contraceptives (in years)
Intrauterine device yes or no (IUD)
Number of years with an intrauterine device (IUD)
Has patient ever had a sexually transmitted disease (STD) yes or no
Number of STD diagnoses
Time since first STD diagnosis
Time since last STD diagnosis
The biopsy results "Healthy" or "Cancer". Target outcome.

The biopsy serves as the gold standard for diagnosing cervical cancer. For the examples in this book, the biopsy outcome was used as the target. Missing values for each column were imputed by the mode (most frequent value), which is probably a bad solution, since the true answer could be correlated with the probability that a value is missing. There is probably a bias because the questions are of a very private nature. But this is not a book about missing data imputation, so the mode imputation will have to suffice for the examples.

To reproduce the examples of this book with this dataset, find the preprocessing R-script and the final RData file in the book's GitHub repository.

[^Fanaee]: Fanaee-T, Hadi, and Joao Gama. "Event labeling combining ensemble detectors and background knowledge." Progress in Artificial Intelligence. Springer Berlin Heidelberg, 1–15. https://doi.org/10.1007/s13748-013-0040-3. (2013).

[^Alberto]: Alberto, Túlio C, Johannes V Lochter, and Tiago A Almeida. "Tubespam: comment spam filtering on YouTube." In Machine Learning and Applications (Icmla), Ieee 14th International Conference on, 138–43. IEEE. (2015).

[^Fernandes]: Fernandes, Kelwin, Jaime S Cardoso, and Jessica Fernandes. "Transfer learning with partial observability applied to cervical cancer screening." In Iberian Conference on Pattern Recognition and Image Analysis, 243–50. Springer. (2017).

christophM/interpretable-ml-book documentation built on March 10, 2024, 10:34 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com