spam: Classification of e-mail into spam

spam R Documentation

Classification of e-mail into spam

Description

The concept of unsolicited commercial e-mail, or “spam”, is diverse and includes such examples as advertisements for products or web sites, get-rich-quick schemes, chain letters, and pornography. This is a collection of spam and non-spam e-mails assembled by George Forman at Hewlett-Packard in June and July of 1999. Forman, together with a team of collaborators, also extracted 57 numeric features from the e-mails that could be used to classify them.

Note that this is a personal collection, and thus some of the features are highly specific (e.g., the name “George”, the phone number 650-857-7835, etc.).

Dimensions

  • n = 3,000 observations

  • p = 57 features

Outcome

  • y is equal to 1 if spam, 0 if not

Features

  • 48 continuous features of the form word_freq_WORD that record the percent of words in the e-mail that match WORD. For example, if word_freq_you equals 1.43, it means that 1.43% of words in the e-mail are “you”.

  • 6 continuous features of the form char_freq_CHAR that record the percent of characters in the e-mail that match CHAR.

  • capital_run_length_average: average length of uninterrupted sequences of capital letters

  • capital_run_length_longest: length of longest uninterrupted sequence of capital letters

  • capital_run_length_total: sum of length of uninterrupted sequences of capital letters (i.e., the total number of capital letters in the e-mail)
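
To make the feature definitions above concrete, here is a minimal Python sketch of how the three kinds of features could be computed from raw e-mail text. It is an illustration, not the code used to build the data set. The function names are hypothetical, and the tokenization (a word is a run of letters or digits, matched without regard to case) is an assumption.

```python
import re

def word_freq(text, word):
    """Percent of words in `text` that equal `word`.

    Assumptions: a word is a maximal run of letters or digits,
    and matching ignores case.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)

def char_freq(text, char):
    """Percent of characters in `text` that equal `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text):
    """Average, longest, and total length of uninterrupted runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }

# A made-up message, used only to exercise the functions.
example = "FREE money!!! Call George NOW"
george_pct = word_freq(example, "george")
bang_pct = char_freq(example, "!")
caps = capital_runs(example)
```

For the made-up message, the capital runs are "FREE", "C", "G", and "NOW", so the total equals the number of capital letters in the text, as the definition of capital_run_length_total states.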

Prediction set

  • The objects Xtest and ytest contain 1,601 additional instances. The training and test sets were sampled at random from the original data set, which contained 4,601 instances.
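
As a rough illustration of the split sizes, the sketch below randomly partitions 4601 row indices into 3000 training rows and 1601 test rows. The actual sampling procedure and seed are not documented here, so the function name `split_indices` and the seed value are hypothetical and the result will not reproduce the packaged split.

```python
import random

def split_indices(n_total=4601, n_train=3000, seed=0):
    """Randomly partition row indices into training and test sets.

    The seed is an arbitrary placeholder, not the one used for the package data.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    # The first n_train shuffled indices form the training set, the rest the test set.
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_indices()
```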

References

I obtained this data set from the UCI Machine Learning Repository. The data set was originally created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at Hewlett-Packard Labs in Palo Alto, CA.


pbreheny/hdrm documentation built on Jan. 17, 2024, 8:53 p.m.