Building a junk mail classifier based on word and character frequencies

```
data("JUNK")
```

A data frame with 4601 observations on the following 58 variables.

`Junk`

a factor with levels `Junk` and `Safe`

`make`

a numeric vector, the percentage (0-100) of words in the email that are the word `make`

`address`

a numeric vector

`all`

a numeric vector

`X3d`

a numeric vector, the percentage (0-100) of words in the email that are the word `3d`

`our`

a numeric vector

`over`

a numeric vector

`remove`

a numeric vector

`internet`

a numeric vector

`order`

a numeric vector

`mail`

a numeric vector

`receive`

a numeric vector

`will`

a numeric vector

`people`

a numeric vector

`report`

a numeric vector

`addresses`

a numeric vector

`free`

a numeric vector

`business`

a numeric vector

`email`

a numeric vector

`you`

a numeric vector

`credit`

a numeric vector

`your`

a numeric vector

`font`

a numeric vector

`X000`

a numeric vector, the percentage (0-100) of words in the email that are the word `000`

`money`

a numeric vector

`hp`

a numeric vector

`hpl`

a numeric vector

`george`

a numeric vector

`X650`

a numeric vector

`lab`

a numeric vector

`labs`

a numeric vector

`telnet`

a numeric vector

`X857`

a numeric vector

`data`

a numeric vector

`X415`

a numeric vector

`X85`

a numeric vector

`technology`

a numeric vector

`X1999`

a numeric vector

`parts`

a numeric vector

`pm`

a numeric vector

`direct`

a numeric vector

`cs`

a numeric vector

`meeting`

a numeric vector

`original`

a numeric vector

`project`

a numeric vector

`re`

a numeric vector

`edu`

a numeric vector

`table`

a numeric vector

`conference`

a numeric vector

`semicolon`

a numeric vector, the percentage (0-100) of characters in the email that are semicolons

`parenthesis`

a numeric vector

`bracket`

a numeric vector

`exclamation`

a numeric vector

`dollarsign`

a numeric vector

`hashtag`

a numeric vector

`capital_run_length_average`

a numeric vector, average length of uninterrupted sequence of capital letters

`capital_run_length_longest`

a numeric vector, length of longest uninterrupted sequence of capital letters

`capital_run_length_total`

a numeric vector, total number of capital letters in the email
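The three capital-run variables can be illustrated on a single string. The sketch below is only an illustration of the definitions given above, not the code used to build the dataset; the helper name `capital_run_features` and the choice to treat only ASCII `A-Z` as capital letters are assumptions.

```r
# Hypothetical helper: capital-run features for one email body.
# Assumption: a "capital letter" is an ASCII character in A-Z.
capital_run_features <- function(text) {
  # Extract every maximal run of consecutive capital letters.
  runs <- regmatches(text, gregexpr("[A-Z]+", text, perl = TRUE))[[1]]
  run_lengths <- nchar(runs)

  # An email with no capital letters has no runs; report zeros.
  if (length(run_lengths) == 0) {
    return(c(average = 0, longest = 0, total = 0))
  }

  c(
    average = mean(run_lengths),  # capital_run_length_average
    longest = max(run_lengths),   # capital_run_length_longest
    total   = sum(run_lengths)    # capital_run_length_total
  )
}

# The runs here are "FREE", "NOW", "C" and "HERE".
feats <- capital_run_features("FREE money NOW, Click HERE")
print(feats)
```

Because every capital letter belongs to exactly one run, the sum of the run lengths equals the total number of capital letters, which matches the description of `capital_run_length_total`.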

The junk emails came from the postmaster and from individuals who classified the email as junk. The safe emails came from work and personal emails. Note that most of the variables are percentages that can range from 0 to 100, though most values are much less than 1 (that is, less than 1%).
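To make the percentage scale concrete, the following sketch computes a word percentage and a character percentage for one short message. It is a minimal illustration under stated assumptions: the helper names `word_percentage` and `char_percentage` are hypothetical, and the tokenizer (lowercase the text, then split on anything that is not a letter or digit) is a simplification that may differ from the one used to create the dataset.

```r
# Hypothetical helper: percentage (0-100) of words equal to `word`.
# Assumption: words are runs of ASCII letters or digits, compared case-insensitively.
word_percentage <- function(text, word) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+", perl = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop any empty leading token
  if (length(tokens) == 0) return(0)
  100 * sum(tokens == tolower(word)) / length(tokens)
}

# Hypothetical helper: percentage (0-100) of characters equal to `char`.
char_percentage <- function(text, char) {
  chars <- strsplit(text, "")[[1]]  # split into single characters
  if (length(chars) == 0) return(0)
  100 * sum(chars == char) / length(chars)
}

email <- "Make money fast; free offer, free credit!"
pct_free      <- word_percentage(email, "free")  # 2 of 7 words
pct_semicolon <- char_percentage(email, ";")     # 1 of 41 characters
print(c(free = pct_free, semicolon = pct_semicolon))
```

A very short message like this one produces unusually large percentages. In a longer email a single occurrence of a word contributes far less, which is consistent with most values in the data being well below 1.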

Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
