Junk-mail dataset


Description

Data for building a junk mail classifier based on the word and character frequencies of emails.

Usage

data("JUNK")

Format

A data frame with 4601 observations on the following 58 variables.

Junk

a factor with levels Junk Safe

make

a numeric vector, the percentage (0-100) of words in the email that are the word make

address

a numeric vector

all

a numeric vector

X3d

a numeric vector, the percentage (0-100) of words in the email that are the word 3d

our

a numeric vector

over

a numeric vector

remove

a numeric vector

internet

a numeric vector

order

a numeric vector

mail

a numeric vector

receive

a numeric vector

will

a numeric vector

people

a numeric vector

report

a numeric vector

addresses

a numeric vector

free

a numeric vector

business

a numeric vector

email

a numeric vector

you

a numeric vector

credit

a numeric vector

your

a numeric vector

font

a numeric vector

X000

a numeric vector, the percentage (0-100) of words in the email that are the word 000

money

a numeric vector

hp

a numeric vector

hpl

a numeric vector

george

a numeric vector

X650

a numeric vector

lab

a numeric vector

labs

a numeric vector

telnet

a numeric vector

X857

a numeric vector

data

a numeric vector

X415

a numeric vector

X85

a numeric vector

technology

a numeric vector

X1999

a numeric vector

parts

a numeric vector

pm

a numeric vector

direct

a numeric vector

cs

a numeric vector

meeting

a numeric vector

original

a numeric vector

project

a numeric vector

re

a numeric vector

edu

a numeric vector

table

a numeric vector

conference

a numeric vector

semicolon

a numeric vector, the percentage (0-100) of characters in the email that are semicolons

parenthesis

a numeric vector

bracket

a numeric vector

exclamation

a numeric vector

dollarsign

a numeric vector

hashtag

a numeric vector

capital_run_length_average

a numeric vector, average length of uninterrupted sequence of capital letters

capital_run_length_longest

a numeric vector, length of longest uninterrupted sequence of capital letters

capital_run_length_total

a numeric vector, total number of capital letters in the email

Details

The junk emails came from the postmaster and from individuals who had classified the email as junk. The safe emails came from work and personal emails. Note that most of the variables are percentages and can range from 0 to 100, though most values are much less than 1 (1%).
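The remark about the percentage columns can be checked directly. The sketch below is only an illustration: the helper name percent_summary is hypothetical, and it assumes a data frame laid out as described under Format (a factor named Junk, the percentage columns, and the three capital_run_length_* columns). The final block runs only if JUNK has already been loaded.

```r
# Summarise the percentage-type columns of a data frame laid out like JUNK.
# `percent_summary` is a hypothetical helper, not part of any package.
percent_summary <- function(df) {
  # Percentage columns: everything except the label and the capital-run columns.
  is_pct <- !(names(df) %in% "Junk") &
    !grepl("^capital_run_length_", names(df))
  pct <- as.matrix(df[, is_pct, drop = FALSE])
  c(n_columns = ncol(pct),          # how many percentage columns were found
    share_below_1 = mean(pct < 1),  # fraction of all entries below 1 (1%)
    max_value = max(pct))           # largest percentage observed
}

# Runs only when JUNK has already been loaded with data("JUNK").
if (exists("JUNK")) {
  print(percent_summary(JUNK))
  print(table(JUNK$Junk))  # class balance of Junk vs Safe
}
```

A share_below_1 close to 1 would be consistent with the note above that most values are far below 1%.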

Source

Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
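As a rough illustration of the classification task described above, the sketch below fits a logistic regression on a handful of the variables. It is a minimal example under stated assumptions, not a recommended model: the helper names fit_junk_glm and predict_junk are hypothetical, the five predictors, the 70/30 split, and the 0.5 cutoff are arbitrary choices, and the final block runs only if JUNK has already been loaded.

```r
# Fit a logistic regression for the Junk factor on chosen predictors.
# With a factor response, glm(family = binomial) models the probability
# of the SECOND factor level (here "Safe", given levels Junk Safe).
fit_junk_glm <- function(df, predictors) {
  glm(reformulate(predictors, response = "Junk"),
      data = df, family = binomial)
}

# Classify rows: a predicted probability above `cutoff` maps to the
# second level, otherwise to the first.
predict_junk <- function(fit, newdata, cutoff = 0.5) {
  lev <- levels(fit$model$Junk)
  p <- predict(fit, newdata = newdata, type = "response")
  factor(ifelse(p > cutoff, lev[2], lev[1]), levels = lev)
}

# Runs only when JUNK has already been loaded with data("JUNK").
if (exists("JUNK")) {
  set.seed(1)  # arbitrary seed so the split is reproducible
  train_rows <- sample(nrow(JUNK), size = round(0.7 * nrow(JUNK)))
  fit <- fit_junk_glm(
    JUNK[train_rows, ],
    c("remove", "free", "money", "exclamation", "dollarsign")
  )
  pred <- predict_junk(fit, JUNK[-train_rows, ])
  # Confusion table on the held-out rows.
  print(table(actual = JUNK$Junk[-train_rows], predicted = pred))
}
```

Evaluating on held-out rows rather than on the training rows gives a less optimistic picture of how the classifier would behave on new email.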
