Description Usage Format Details Source References Examples
SPAM E-mail Database. See Details below.
1 |
A data frame with 4601 observations on the following 58 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Factor w/ 2 levels "email", "spam"
The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.
Attribute Information: The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occuring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word\_freq\_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char\_freq\_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurences) / total characters in e-mail
1 continuous real [1,...] attribute of type capital\_run\_length\_average = average length of uninterrupted sequences of capital letters
1 continuous integer [1,...] attribute of type capital\_run\_length\_longest = length of longest uninterrupted sequence of capital letters
1 continuous integer [1,...] attribute of type capital\_run\_length\_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304 (b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835 (c) Generated: June-July 1999
http://www.ics.uci.edu/~mlearn/MLRepository.html
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | head(str(spam))
## Not run:
if(require(prim)){ # This uses too much time!
X <- spam[,1:57]
Y <- ifelse(spam$spam=="spam", 1, 0)
spam.prim1 <- prim.box(X, Y, threshold.type=1, verbose=TRUE)
summary(spam.prim1)
} # use of prim package.
## End(Not run)
# The following example uses too much time and must be put inside a
# dontrun construction. Also summary(spam.earth) killed the R process
# ...
## Not run:
if(require(earth)){
spam.earth <- earth(spam[, 1:57], spam$spam,
glm=list(family=binomial),
trace=1, keepxy=TRUE, degree=1, nfold=10)
summary(spam.earth)
} # use of earth package
## End(Not run) # end of dontrun block
if(require(mda)){
spam.mars <- mars(spam[, 1:57],
ifelse(spam$spam=="spam", 1, 0),
degree=1, nk=50, trace.mars=TRUE)
summary(spam.mars)
} # end require(mda) block
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.