| JUNK | R Documentation |
Building a junk mail classifier based on word and character frequencies
data("JUNK")
A data frame with 4601 observations on the following 58 variables.
Junk: a factor with levels Junk Safe
make: a numeric vector, the percentage (0-100) of words in the email that are the word make
address: a numeric vector
all: a numeric vector
X3d: a numeric vector, the percentage (0-100) of words in the email that are the word 3d
our: a numeric vector
over: a numeric vector
remove: a numeric vector
internet: a numeric vector
order: a numeric vector
mail: a numeric vector
receive: a numeric vector
will: a numeric vector
people: a numeric vector
report: a numeric vector
addresses: a numeric vector
free: a numeric vector
business: a numeric vector
email: a numeric vector
you: a numeric vector
credit: a numeric vector
your: a numeric vector
font: a numeric vector
X000: a numeric vector, the percentage (0-100) of words in the email that are the word 000
money: a numeric vector
hp: a numeric vector
hpl: a numeric vector
george: a numeric vector
X650: a numeric vector
lab: a numeric vector
labs: a numeric vector
telnet: a numeric vector
X857: a numeric vector
data: a numeric vector
X415: a numeric vector
X85: a numeric vector
technology: a numeric vector
X1999: a numeric vector
parts: a numeric vector
pm: a numeric vector
direct: a numeric vector
cs: a numeric vector
meeting: a numeric vector
original: a numeric vector
project: a numeric vector
re: a numeric vector
edu: a numeric vector
table: a numeric vector
conference: a numeric vector
semicolon: a numeric vector, the percentage (0-100) of characters in the email that are semicolons
parenthesis: a numeric vector
bracket: a numeric vector
exclamation: a numeric vector
dollarsign: a numeric vector
hashtag: a numeric vector
capital_run_length_average: a numeric vector, average length of uninterrupted sequences of capital letters
capital_run_length_longest: a numeric vector, length of the longest uninterrupted sequence of capital letters
capital_run_length_total: a numeric vector, total number of capital letters in the email
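The percentage and run-length definitions above can be sketched in base R. This is a minimal illustration, not the code used to build the dataset: the helper names word_percent, char_percent, and capital_runs are hypothetical, the sample message is made up, and the tokenization rule (a word is a maximal run of letters or digits, compared case-insensitively) is an assumption.

```r
# Hypothetical helper: percentage (0-100) of words in `text` equal to `word`.
# Assumption: a "word" is a maximal run of letters or digits, and the
# comparison ignores case.
word_percent <- function(text, word) {
  tokens <- tolower(unlist(regmatches(text, gregexpr("[[:alnum:]]+", text))))
  if (length(tokens) == 0) return(0)
  100 * sum(tokens == tolower(word)) / length(tokens)
}

# Hypothetical helper: percentage (0-100) of characters in `text` equal to
# the single character `ch`.
char_percent <- function(text, ch) {
  chars <- strsplit(text, "")[[1]]
  if (length(chars) == 0) return(0)
  100 * sum(chars == ch) / length(chars)
}

# Hypothetical helper: lengths of all uninterrupted runs of capital letters.
# Returns integer(0) when the text contains no capital letters.
capital_runs <- function(text) {
  runs <- unlist(regmatches(text, gregexpr("[[:upper:]]+", text)))
  nchar(runs)
}

# Made-up message used only to exercise the helpers.
email <- "FREE money!!! Make money now; reply to CLAIM it."

money_pct       <- word_percent(email, "money")
exclamation_pct <- char_percent(email, "!")
runs            <- capital_runs(email)

run_average <- mean(runs)  # analogue of capital_run_length_average
run_longest <- max(runs)   # analogue of capital_run_length_longest
run_total   <- sum(runs)   # analogue of capital_run_length_total
```

For this message, 2 of the 9 words are "money" (about 22.2%), 3 of the 48 characters are exclamation marks (6.25%), and the capital runs FREE, M, and CLAIM have lengths 4, 1, and 5, giving an average of about 3.33, a longest run of 5, and a total of 10.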
The junk emails came from the postmaster and from individuals who classified the email as junk. The safe emails came from work and personal emails. Note that most of the variables are percentages and can range from 0 to 100, though most values are much less than 1 (1%).
Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
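As a rough illustration of the classifier named in the title, the sketch below fits a logistic regression with glm(). Logistic regression is only one possible choice and is not a method stated on this page. To stay self-contained, the code uses a small simulated data frame, toy, whose values are invented and merely mimic the layout of JUNK (a Junk factor plus numeric percentage columns). With the real data, the same formula interface would apply to JUNK after data("JUNK").

```r
# Toy stand-in for JUNK: all values are simulated purely for illustration.
set.seed(1)
n <- 200
toy <- data.frame(
  Junk = factor(rep(c("Junk", "Safe"), each = n / 2),
                levels = c("Junk", "Safe")),
  # Simulated percentages: larger on average for the "Junk" rows.
  free        = c(rexp(n / 2, rate = 2), rexp(n / 2, rate = 20)),
  exclamation = c(rexp(n / 2, rate = 3), rexp(n / 2, rate = 15))
)

# With a factor response, glm() treats the first level ("Junk") as failure
# and the remaining level as success, so fitted probabilities are P(Safe).
fit <- glm(Junk ~ free + exclamation, data = toy, family = binomial)

# In-sample predicted probability that each email is Safe.
p_safe <- predict(fit, type = "response")

# Classify with a 0.5 cutoff and compute in-sample accuracy.
predicted <- factor(ifelse(p_safe > 0.5, "Safe", "Junk"),
                    levels = levels(toy$Junk))
accuracy <- mean(predicted == toy$Junk)
```

Because the factor levels are ordered Junk, Safe, a probability above the cutoff means "Safe" here. In-sample accuracy is optimistic, so a held-out set would be needed for an honest estimate.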