authorship: Word counts for books by various authors

Description

The authorship data set is a data.frame with 69 numeric columns and 2 factor columns. Each row gives the count of each of the 69 words, and the last two columns give the ID of the book and its author.

Usage

Format

A data frame with 840 observations on the following 71 variables.

str.a

a numeric vector giving the number of times the string 'a' appeared

str.all

a numeric vector giving the number of times the string 'all' appeared

str.also

a numeric vector giving the number of times the string 'also' appeared

str.an

a numeric vector giving the number of times the string 'an' appeared

str.and

...

str.any

...

str.are

...

str.as

...

str.at

...

str.be

...

str.been

...

str.but

...

str.by

...

str.can

...

str.do

...

str.down

...

str.even

...

str.every

...

str.for

...

str.from

...

str.had

...

str.has

...

str.have

...

str.her

...

str.his

...

str.if

...

str.in

...

str.into

...

str.is

...

str.it

...

str.its

...

str.may

...

str.more

...

str.must

...

str.my

...

str.no

...

str.not

...

str.now

...

str.of

...

str.on

...

str.one

...

str.only

...

str.or

...

str.our

...

str.should

...

str.so

...

str.some

...

str.such

...

str.than

...

str.that

...

str.the

...

str.their

...

str.then

...

str.there

...

str.things

...

str.this

...

str.to

...

str.up

...

str.upon

...

str.was

...

str.were

...

str.what

...

str.when

...

str.which

...

str.who

...

str.will

...

str.with

...

str.would

...

str.your

...

BookID

a factor with levels b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12, identifying the book

Author

a factor with levels Austen London Milton Shakespeare, identifying the author
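
Given the layout described above, the count columns can be separated from the two factors by type rather than by position. The sketch below uses a small made-up data frame (`toy`, with only three of the word columns and invented counts) as a stand-in for `authorship`; the same calls should apply to the real data if it follows the documented format.

```r
# Hypothetical stand-in for `authorship`: three word-count columns
# plus the two factor columns, with invented values.
toy <- data.frame(
  str.a   = c(12, 30, 7, 19),
  str.all = c(3, 5, 1, 4),
  str.and = c(40, 55, 21, 38),
  BookID  = factor(c("b1", "b1", "b2", "b3")),
  Author  = factor(c("Austen", "Austen", "London", "Milton"))
)

# Select the count columns by type instead of hard-coding 1:69.
is_count <- vapply(toy, is.numeric, logical(1))
counts   <- toy[, is_count, drop = FALSE]

# Number of rows per author, and total word counts per author.
rows_per_author  <- table(toy$Author)
totals_by_author <- rowsum(counts, group = toy$Author)
```

Selecting by type keeps the code valid even if the number of word columns changes, at the cost of assuming that every numeric column is a word count.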

Details

This dataset is used to illustrate text classification.
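
As a rough illustration of what text classification on word frequencies can look like, the sketch below assigns a text to the author whose average word-proportion vector is closest. This is a nearest-centroid rule in base R, not the self-organizing map used in the Examples section; the data frame `train`, the vector `new_text`, and the helper `nearest_author` are all hypothetical, with invented values.

```r
# Hypothetical training data: word proportions per text (invented values).
train <- data.frame(
  str.a   = c(0.20, 0.22, 0.05, 0.07),
  str.and = c(0.50, 0.48, 0.70, 0.68),
  str.the = c(0.30, 0.30, 0.25, 0.25),
  Author  = factor(c("Austen", "Austen", "Milton", "Milton"))
)
words <- c("str.a", "str.and", "str.the")

# One centroid (mean proportion vector) per author.
x         <- as.matrix(train[, words])
centroids <- rowsum(x, train$Author) / as.vector(table(train$Author))

# Return the author whose centroid is nearest in squared Euclidean distance.
nearest_author <- function(p, centroids) {
  d <- rowSums(sweep(centroids, 2, p)^2)
  names(which.min(d))
}

# A new, unlabeled text (invented proportions).
new_text  <- c(str.a = 0.06, str.and = 0.71, str.the = 0.23)
predicted <- nearest_author(new_text, centroids)
```

With the real data, the same idea would use all 69 word columns after converting counts to proportions.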

Source

Jeffrey S. Simonoff, Analyzing Categorical Data.

Examples

authorship2 <- authorship

# convert the 69 word counts in each row to proportions of the row total
authorship2[, 1:69] <- authorship2[, 1:69] / rowSums(authorship2[, 1:69])

# create the model
authorship.som.init <- som(
  formula = ~ .,
  data = authorship2,
  neighborhood = "gaussian",
  grid = grid(xdim = 20, ydim = 20, type = "hexagonal")
)

# train the network
authorship.som <- learn(
  authorship.som.init,
  number.iter = 1000,
  max.alpha = 0.5,
  min.alpha = 0.001,
  max.rayon = 5,
  step.eval.si = 100
)

summary(authorship.som)

plot(authorship.som, "energy")

# see the distribution on the map
plot(authorship.som, "effectif", cex.label = 0)
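
The row normalization at the start of the example can be checked in isolation. The sketch below applies the same division to a tiny made-up data frame (`toy`, with invented counts and only three word columns) and confirms that each row of the count columns then sums to 1. Note that the result is a proportion between 0 and 1 rather than a percentage on a 0 to 100 scale.

```r
# Hypothetical stand-in for `authorship` with invented counts.
toy <- data.frame(
  str.a   = c(10, 2),
  str.and = c(30, 6),
  str.the = c(60, 12),
  Author  = factor(c("Austen", "Milton"))
)
count_cols <- 1:3

# Divide every count by its row total; the row-total vector is recycled
# down each column, so row i is divided by the total of row i.
toy[, count_cols] <- toy[, count_cols] / rowSums(toy[, count_cols])

# Each row of the count columns should now sum to 1.
row_totals <- rowSums(toy[, count_cols])
```

Both rows in this toy example end up with the same proportions even though the second text is five times shorter, which is the point of normalizing before comparing texts.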

harrysouthworth/kohonen documentation built on May 17, 2019, 3:03 p.m.