E-mails from DBWorld mailing list

Share:

Description

The dataset contains n= 64 bodies of e-mails in binary bag-of-words representation which Filannino manually collected from DBWorld mailing list. DBWorld mailing list announces conferences, jobs, books, software and grants. Filannino applied supervised learning algorithm to classify e-mails between “announces of conferences” and “everything else”. Out of 64 e-mails, 29 are about conference announcements and 35 are not.

Every e-mail is represented as a vector containing p binary values, where p is the size of the vocabulary extracted from the entire corpus with some constraints: the common words such as “the”, “is” or “which”, so-called stop words, and words that have less than 3 characters or more than 30 chracters are removed from the dataset. The entry of the vector is 1 if the corresponding word belongs to the e-mail and 0 otherwise. The number of unique words in the dataset is p=4702. The dataset is originally from the UCI Machine Learning Repository DBWorldData.

rawDBWorld is a list of 64 objects containing the original E-mails.

Usage

1
2

Details

See Bache K, Lichman M (2013). for details of the data descriptions. The original dataset is freely available from USIMachine Learning Repository website http://archive.ics.uci.edu/ml/datasets/DBWorld+e-mails

Author(s)

Yumi Kondo <y.kondo@stat.ubc.ca>

References

Bache K, Lichman M (2013). UCI Machine Learning Repository." http://archive.ics.uci.edu/ml/datasets

Filannino, M., (2011). 'DBWorld e-mail classification using a very small corpus', Project of Machine Learning course, University of Manchester.

Examples

1
2
3
4
5
## Not run: 
data(DBWorld)
data(rawDBWorld)

## End(Not run)