Email Spam Data

Description

SPAM E-mail Database. See Details below.

Usage

data(spam)

Format

A data frame with 4601 observations on the following 58 variables.

A.1

a numeric vector

A.2

a numeric vector

A.3

a numeric vector

A.4

a numeric vector

A.5

a numeric vector

A.6

a numeric vector

A.7

a numeric vector

A.8

a numeric vector

A.9

a numeric vector

A.10

a numeric vector

A.11

a numeric vector

A.12

a numeric vector

A.13

a numeric vector

A.14

a numeric vector

A.15

a numeric vector

A.16

a numeric vector

A.17

a numeric vector

A.18

a numeric vector

A.19

a numeric vector

A.20

a numeric vector

A.21

a numeric vector

A.22

a numeric vector

A.23

a numeric vector

A.24

a numeric vector

A.25

a numeric vector

A.26

a numeric vector

A.27

a numeric vector

A.28

a numeric vector

A.29

a numeric vector

A.30

a numeric vector

A.31

a numeric vector

A.32

a numeric vector

A.33

a numeric vector

A.34

a numeric vector

A.35

a numeric vector

A.36

a numeric vector

A.37

a numeric vector

A.38

a numeric vector

A.39

a numeric vector

A.40

a numeric vector

A.41

a numeric vector

A.42

a numeric vector

A.43

a numeric vector

A.44

a numeric vector

A.45

a numeric vector

A.46

a numeric vector

A.47

a numeric vector

A.48

a numeric vector

A.49

a numeric vector

A.50

a numeric vector

A.51

a numeric vector

A.52

a numeric vector

A.53

a numeric vector

A.54

a numeric vector

A.55

a numeric vector

A.56

a numeric vector

A.57

a numeric vector

spam

Factor w/ 2 levels "email", "spam"

Details

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
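"Blinding" the personal indicators amounts to dropping those columns before fitting a model. The sketch below is only an illustration on a toy data frame: the column names `word_freq_george` and `word_freq_650` and all values are made up, because this help page names the predictors A.1 to A.57 and does not say which column corresponds to which word.

```r
# Toy data frame with hypothetical column names and made-up values
# (NOT the real `spam` data, whose predictors are named A.1 ... A.57).
toy <- data.frame(
  word_freq_free   = c(1.2, 0.0, 0.5),
  word_freq_george = c(0.0, 3.1, 0.0),
  word_freq_650    = c(0.0, 1.4, 0.0),
  spam = factor(c("spam", "email", "spam"), levels = c("email", "spam"))
)

# Columns that act as personal non-spam indicators (assumed names)
personal <- c("word_freq_george", "word_freq_650")

# Keep every column that is not a personal indicator
blinded <- toy[, !(names(toy) %in% personal), drop = FALSE]
names(blinded)
```

With the real data one would first have to establish which of A.1 to A.57 hold the 'george' and '650' frequencies, for example from the attribute list distributed with 'spambase.data'.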

For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.

Attribute Information: The last column of 'spambase.data' denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total = sum of the lengths of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
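The attribute definitions above can be sketched in base R for a single message. This is an illustration under stated assumptions, not the code used to build the data set: the helper name `spam_features` is hypothetical, word matching is assumed to be case-insensitive, capital letters are assumed to be ASCII A-Z, and a message without capitals is assigned run lengths of 0 (the source does not specify any of these).

```r
# Hypothetical helper: spambase-style features for one message.
# `word` and `char` are the WORD and CHAR being counted.
spam_features <- function(text, word, char) {
  # Words: maximal runs of alphanumeric characters
  tokens <- strsplit(text, "[^A-Za-z0-9]+", perl = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  word_freq <- if (length(tokens) > 0) {
    # Assumption: matching ignores case
    100 * sum(tolower(tokens) == tolower(word)) / length(tokens)
  } else 0

  # Characters: every character of the message counts toward the total
  chars <- strsplit(text, "")[[1]]
  char_freq <- if (length(chars) > 0) {
    100 * sum(chars == char) / length(chars)
  } else 0

  # Capital runs: maximal sequences of consecutive capital letters
  runs <- regmatches(text, gregexpr("[A-Z]+", text, perl = TRUE))[[1]]
  run_len <- nchar(runs)

  list(
    word_freq = word_freq,
    char_freq = char_freq,
    # Assumption: 0 when the message contains no capital letters
    capital_run_length_average = if (length(run_len) > 0) mean(run_len) else 0,
    capital_run_length_longest = if (length(run_len) > 0) max(run_len) else 0,
    capital_run_length_total   = sum(run_len)
  )
}

f <- spam_features("FREE money!!! Call NOW", word = "free", char = "!")
str(f)
```

In this message there are 4 words, 22 characters, and the capital runs "FREE", "C" and "NOW", so the word frequency is 25, the character frequency is 100 * 3 / 22, and the run-length total is 8.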

Source

(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304

(b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835

(c) Generated: June-July 1999

References

http://www.ics.uci.edu/~mlearn/MLRepository.html

Examples

str(spam) # compact overview of the 58 columns
## Not run: 
if(require(prim)){ # This uses too much time!
   X <- spam[,1:57]
   Y <- ifelse(spam$spam=="spam", 1, 0)
   spam.prim1 <- prim.box(X, Y, threshold.type=1,  verbose=TRUE)
   summary(spam.prim1)
} # use of prim package.

## End(Not run)
# The following example uses too much time and must be put inside a
# dontrun construction. Also summary(spam.earth) killed the R process
# ...
## Not run: 
if(require(earth)){
   spam.earth <- earth(spam[, 1:57], spam$spam,
         glm=list(family=binomial),
         trace=1, keepxy=TRUE, degree=1, nfold=10)
   summary(spam.earth)
} # use of earth package

## End(Not run) # end of dontrun block
if(require(mda)){
 spam.mars <- mars(spam[, 1:57],
                   ifelse(spam$spam=="spam", 1, 0),
                   degree=1, nk=50, trace.mars=TRUE)
 summary(spam.mars)
} # end require(mda) block
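Since the prim and earth examples above are flagged as slow, a quicker baseline may be useful. The following is a minimal sketch, not part of the original examples: the helper `holdout_accuracy` is hypothetical, it uses only base R (`glm` with a binomial family), and it is demonstrated on a small synthetic data frame that merely mimics the layout of `spam` (numeric predictors plus a factor `spam` with levels "email" and "spam").

```r
# Hypothetical helper: fit a logistic regression on a random training split
# and return the accuracy on the held-out rows. `d` must contain a factor
# column `spam` with levels "email" and "spam"; all other columns are
# used as predictors.
holdout_accuracy <- function(d, train_frac = 0.7, seed = 1) {
  set.seed(seed)
  n <- nrow(d)
  train_idx <- sample(n, size = floor(train_frac * n))
  fit <- glm(spam ~ ., data = d[train_idx, ], family = binomial)
  # For a factor response, glm models the probability of the second
  # level, so type = "response" gives P(spam)
  p <- predict(fit, newdata = d[-train_idx, ], type = "response")
  pred <- ifelse(p > 0.5, "spam", "email")
  mean(pred == d$spam[-train_idx])
}

# Synthetic stand-in with the same layout (NOT the real data)
set.seed(42)
toy <- data.frame(A.1 = runif(200), A.2 = runif(200))
toy$spam <- factor(
  ifelse(toy$A.1 + rnorm(200, sd = 0.3) > 0.5, "spam", "email"),
  levels = c("email", "spam")
)
acc <- holdout_accuracy(toy)
acc
```

With the data set loaded, the same helper could be called as `holdout_accuracy(spam)`.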