spam | R Documentation
Unsolicited commercial e-mail, or "spam", is diverse and includes advertisements for products or web sites, get-rich-quick schemes, chain letters, and pornography. This is a collection of spam and non-spam e-mails assembled by George Forman at Hewlett-Packard in June and July of 1999. Forman, together with a team of collaborators, also extracted 57 numeric features from the e-mails that could potentially be used to classify them.
Note that this is a personal collection, and thus some of the features are highly specific (e.g., the name "George", the phone number 650-857-7835, etc.).
y: equal to 1 if the e-mail is spam, 0 if not.
X: a matrix with 3000 rows and 57 columns:
    48 continuous features of the form word_freq_WORD that record the percentage of words in the e-mail that match WORD. For example, if word_freq_you equals 1.43, then 1.43% of the words in the e-mail are "you".
    6 continuous features of the form char_freq_CHAR that record the percentage of characters in the e-mail that match CHAR.
    capital_run_length_average: average length of uninterrupted sequences of capital letters.
    capital_run_length_longest: length of the longest uninterrupted sequence of capital letters.
    capital_run_length_total: sum of the lengths of uninterrupted sequences of capital letters (i.e., the total number of capital letters in the e-mail).
Xtest and ytest: 1601 additional instances. Training and testing sets were sampled at random from the original data set, which contained 4601 instances.
I obtained this data set from the UCI Machine Learning Repository. The data set was originally created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at Hewlett-Packard Labs in Palo Alto, CA.