
Building a junk mail classifier based on word and character frequencies

```
data("JUNK")
```
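As a hedged illustration, the snippet below loads the data and checks its shape. It assumes the data set ships with the `regclass` package, which this page does not state; substitute whichever package you obtained `JUNK` from.

```r
# Assumption: JUNK is provided by the 'regclass' package (not stated on this page).
has_regclass <- requireNamespace("regclass", quietly = TRUE)

if (has_regclass) {
  data("JUNK", package = "regclass")  # loads a data frame named JUNK
  print(dim(JUNK))                    # compare with the dimensions given under Format
  print(table(JUNK$Junk))             # counts of the Junk and Safe classes
} else {
  message("Package 'regclass' is not installed; skipping the example.")
}
```

The guard keeps the snippet from failing on a machine where the assumed package is missing.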

A data frame with 4601 observations on the following 58 variables.

`Junk`

a factor with levels `Junk` and `Safe`

`make`

a numeric vector, the percentage (0-100) of words in the email that are the word `make`

`address`

a numeric vector

`all`

a numeric vector

`X3d`

a numeric vector, the percentage (0-100) of words in the email that are the word `3d`

`our`

a numeric vector

`over`

a numeric vector

`remove`

a numeric vector

`internet`

a numeric vector

`order`

a numeric vector

`mail`

a numeric vector

`receive`

a numeric vector

`will`

a numeric vector

`people`

a numeric vector

`report`

a numeric vector

`addresses`

a numeric vector

`free`

a numeric vector

`business`

a numeric vector

`email`

a numeric vector

`you`

a numeric vector

`credit`

a numeric vector

`your`

a numeric vector

`font`

a numeric vector

`X000`

a numeric vector, the percentage (0-100) of words in the email that are the word `000`

`money`

a numeric vector

`hp`

a numeric vector

`hpl`

a numeric vector

`george`

a numeric vector

`X650`

a numeric vector

`lab`

a numeric vector

`labs`

a numeric vector

`telnet`

a numeric vector

`X857`

a numeric vector

`data`

a numeric vector

`X415`

a numeric vector

`X85`

a numeric vector

`technology`

a numeric vector

`X1999`

a numeric vector

`parts`

a numeric vector

`pm`

a numeric vector

`direct`

a numeric vector

`cs`

a numeric vector

`meeting`

a numeric vector

`original`

a numeric vector

`project`

a numeric vector

`re`

a numeric vector

`edu`

a numeric vector

`table`

a numeric vector

`conference`

a numeric vector

`semicolon`

a numeric vector, the percentage (0-100) of characters in the email that are semicolons

`parenthesis`

a numeric vector

`bracket`

a numeric vector

`exclamation`

a numeric vector

`dollarsign`

a numeric vector

`hashtag`

a numeric vector

`capital_run_length_average`

a numeric vector, the average length of uninterrupted sequences of capital letters

`capital_run_length_longest`

a numeric vector, the length of the longest uninterrupted sequence of capital letters

`capital_run_length_total`

a numeric vector, total number of capital letters in the email
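The variable definitions above can be sketched in a few lines of R. This is only an illustration of how such features might be computed from raw text, not the procedure the creators of the data used. The helper names `word_percent` and `capital_run_stats`, the tokenization rule (a word is a run of letters or digits, compared case-insensitively), the zero returned for text without capitals, and the example email are all assumptions made for this sketch. The last line shows why columns such as `X3d` and `X000` carry an `X` prefix: R's `make.names()` prepends it to names that start with a digit.

```r
# Percentage (0-100) of words in `text` that equal `word`.
# Assumption: a word is a maximal run of letters or digits, compared case-insensitively.
word_percent <- function(text, word) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]            # drop empty strings from leading separators
  if (length(tokens) == 0) return(0)
  100 * sum(tokens == tolower(word)) / length(tokens)
}

# Average, longest, and total length of uninterrupted runs of capital letters.
# Assumption: text with no capitals yields zeros.
capital_run_stats <- function(text) {
  runs <- regmatches(text, gregexpr("[[:upper:]]+", text))[[1]]
  lens <- nchar(runs)
  if (length(lens) == 0) return(c(average = 0, longest = 0, total = 0))
  c(average = mean(lens), longest = max(lens), total = sum(lens))
}

# Hypothetical email text, invented for this example.
example_email <- "FREE money NOW! Click HERE to make money, make it"

make_pct <- word_percent(example_email, "make")  # 2 of the 10 words are "make"
caps <- capital_run_stats(example_email)         # capital runs: FREE, NOW, C, HERE

# Names beginning with a digit are not syntactic in R, so they gain an "X" prefix.
syntactic <- make.names(c("3d", "000", "650"))
```

Here `make_pct` is 20, `caps` holds an average of 3, a longest run of 4, and a total of 12, and `syntactic` is `"X3d" "X000" "X650"`.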

The junk emails were collected from the postmaster and from individuals who had classified the email as junk. The safe emails were collected from work and personal email. Note that most of the variables are percentages and can range from 0 to 100, though most values are much less than 1 (1%).
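A minimal sketch of the kind of classifier this data supports is shown below, using logistic regression via `glm()`. To stay self-contained it runs on a small simulated data frame (`toy`) that only mimics the column layout; the simulated values, the choice of predictors, and the 0.5 cutoff are assumptions for illustration and say nothing about the real data. One detail worth noting: with factor levels ordered `Junk`, `Safe`, `glm()` treats the first level as failure, so the fitted probabilities are for `Safe`.

```r
set.seed(1)

# Simulated stand-in for the real data: same response levels, two percentage-style predictors.
n <- 200
is_junk <- rep(c(TRUE, FALSE), each = n / 2)
toy <- data.frame(
  Junk = factor(ifelse(is_junk, "Junk", "Safe"), levels = c("Junk", "Safe")),
  free = rexp(n, rate = ifelse(is_junk, 2, 6)),         # junk rows get larger values on average
  exclamation = rexp(n, rate = ifelse(is_junk, 3, 8))
)

# Logistic regression; the modelled event is the second factor level, "Safe".
fit <- glm(Junk ~ free + exclamation, data = toy, family = binomial)

# Fitted probabilities of "Safe", turned into class labels with a 0.5 cutoff.
p_safe <- predict(fit, type = "response")
predicted <- factor(ifelse(p_safe > 0.5, "Safe", "Junk"), levels = levels(toy$Junk))

confusion <- table(actual = toy$Junk, predicted = predicted)
accuracy <- mean(predicted == toy$Junk)
```

On the real data the same call pattern would apply with `data = JUNK` and a formula such as `Junk ~ .`, though accuracy should then be judged on held-out rows rather than on the rows used for fitting.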

Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
