Building a junk mail classifier based on word and character frequencies
data("JUNK")
A data frame with 4601 observations on the following 58 variables.
Junk
a factor with levels Junk and Safe
make
a numeric vector, the percentage (0-100) of words in the email that are the word make
address
a numeric vector
all
a numeric vector
X3d
a numeric vector, the percentage (0-100) of words in the email that are the word 3d
our
a numeric vector
over
a numeric vector
remove
a numeric vector
internet
a numeric vector
order
a numeric vector
mail
a numeric vector
receive
a numeric vector
will
a numeric vector
people
a numeric vector
report
a numeric vector
addresses
a numeric vector
free
a numeric vector
business
a numeric vector
email
a numeric vector
you
a numeric vector
credit
a numeric vector
your
a numeric vector
font
a numeric vector
X000
a numeric vector, the percentage (0-100) of words in the email that are the word 000
money
a numeric vector
hp
a numeric vector
hpl
a numeric vector
george
a numeric vector
X650
a numeric vector
lab
a numeric vector
labs
a numeric vector
telnet
a numeric vector
X857
a numeric vector
data
a numeric vector
X415
a numeric vector
X85
a numeric vector
technology
a numeric vector
X1999
a numeric vector
parts
a numeric vector
pm
a numeric vector
direct
a numeric vector
cs
a numeric vector
meeting
a numeric vector
original
a numeric vector
project
a numeric vector
re
a numeric vector
edu
a numeric vector
table
a numeric vector
conference
a numeric vector
semicolon
a numeric vector, the percentage (0-100) of characters in the email that are semicolons
parenthesis
a numeric vector
bracket
a numeric vector
exclamation
a numeric vector
dollarsign
a numeric vector
hashtag
a numeric vector
capital_run_length_average
a numeric vector, average length of uninterrupted sequence of capital letters
capital_run_length_longest
a numeric vector, length of longest uninterrupted sequence of capital letters
capital_run_length_total
a numeric vector, total number of capital letters in the email
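As an illustration of the three capital_run_length variables, the sketch below computes them for a single string. The helper name capital_run_stats is hypothetical, and treating only the ASCII letters A-Z as capitals is an assumption; the exact rules used to build the original data are not stated here.

```r
# Hypothetical helper (not part of the dataset or any package):
# summarise the runs of consecutive capital letters in a piece of text.
capital_run_stats <- function(text) {
  # Extract every maximal run of consecutive capitals.
  # Assumption: only ASCII A-Z count as capital letters.
  runs <- regmatches(text, gregexpr("[A-Z]+", text, perl = TRUE))[[1]]
  lens <- nchar(runs)
  if (length(lens) == 0) {
    # No capitals at all: report zeros rather than NaN / -Inf
    return(c(average = 0, longest = 0, total = 0))
  }
  c(average = mean(lens), longest = max(lens), total = sum(lens))
}

# Runs are "FREE" (4), "NOW" (3), "A" (1) and "TODAY" (5)
capital_run_stats("FREE money NOW, Act TODAY")
# average = 3.25, longest = 5, total = 13
```

Under this reading, the total equals the sum of all run lengths, which matches the description of capital_run_length_total as the total number of capital letters in the email.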
The junk emails were collected from the postmaster and from individuals who had classified the email as junk. The safe emails were collected from work and personal email. Note that most of the variables are percentages and can range from 0 to 100, though most values are much less than 1 (1%).
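A minimal sketch of how a classifier might be fitted to these data follows. It assumes the data frame is shipped with the regclass package (not stated on this page; adjust the package name if yours differs). The helper fit_junk_model, the three predictors, and the 0.5 cutoff are illustrative choices only, not a recommended model.

```r
# Hypothetical helper: logistic regression for P(Junk) on a few chosen predictors.
fit_junk_model <- function(df, predictors = c("free", "remove", "exclamation")) {
  # glm() treats the first factor level as "failure", so code the
  # response explicitly: 1 = Junk, 0 = Safe.
  df$is_junk <- as.integer(df$Junk == "Junk")
  glm(reformulate(predictors, response = "is_junk"),
      data = df, family = binomial)
}

# Assumption: JUNK is provided by the 'regclass' package.
if (requireNamespace("regclass", quietly = TRUE)) {
  data("JUNK", package = "regclass")
  fit <- fit_junk_model(JUNK)
  # Call an email junk when its fitted probability exceeds 0.5 (arbitrary cutoff)
  predicted <- ifelse(fitted(fit) > 0.5, "Junk", "Safe")
  print(table(actual = JUNK$Junk, predicted = predicted))
}
```

Because most predictors are small percentages, their coefficients can be large in magnitude; a one-unit change corresponds to a full percentage point of the words or characters in the email.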
Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)