knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(textclassificationexamples) library(ggplot2) library(rpart)
This package is intended to provide users with datasets and helper functions necessary to work through basic text classification and modeling examples. The vignette provides some additional introduction and background.
The Spam example lets students work with email subject lines from a model eliciting activity that was created by Joan Garfield and colleagues at the University of Minnesota (see https://serc.carleton.edu/sp/library/mea/index.html).
The clickbait example lets students work through a different classification problem using headlines from sites known to host clickbait and legitimate news sites. These headlines are derived from Kaggle (see XX).
The intent is to extract a variety of structural features from headlines associated with either valid news sources or clickbait articles to create a model/ classification scheme that accurately classifies headlines based on the clickbait variable (TRUE/FALSE).
There are three available datasets for use: articles, articles_clean, and sample_articles.
articles contains the entire dataset provided by the source (see documentation for more info). articles_clean is a marginally smaller subset of articles filtered (imperfectly) for headlines with explicit language, with sample_articles being a 2000 row sample of this subset. sample_articles contains 1000 rows of clickbait articles, and 1000 rows of news headlines. This small set can be converted to a testing and training set for modeling purposes.
head(sample_articles)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.