authorship: Word counts for books by various authors
In harrysouthworth/kohonen: Self-organizing Maps for Data Classification

The authorship data set is a data.frame with 69 numeric columns and 2 factor columns. Each row contains the counts of each word. The last two columns give the ID of the book and its author.

data(authorship)

A data frame with 840 observations on the following 71 variables.

str.a: a numeric vector giving the number of times the string 'a' appeared
str.all: a numeric vector giving the number of times the string 'all' appeared
str.also: a numeric vector giving the number of times the string 'also' appeared
str.an: a numeric vector giving the number of times the string 'an' appeared
str.and: ...
str.any: ...
str.are: ...
str.as: ...
str.at: ...
str.be: ...
str.been: ...
str.but: ...
str.by: ...
str.can: ...
str.do: ...
str.down: ...
str.even: ...
str.every: ...
str.for: ...
str.from: ...
str.had: ...
str.has: ...
str.have: ...
str.her: ...
str.his: ...
str.if: ...
str.in: ...
str.into: ...
str.is: ...
str.it: ...
str.its: ...
str.may: ...
str.more: ...
str.must: ...
str.my: ...
str.no: ...
str.not: ...
str.now: ...
str.of: ...
str.on: ...
str.one: ...
str.only: ...
str.or: ...
str.our: ...
str.should: ...
str.so: ...
str.some: ...
str.such: ...
str.than: ...
str.that: ...
str.the: ...
str.their: ...
str.then: ...
str.there: ...
str.things: ...
str.this: ...
str.to: ...
str.up: ...
str.upon: ...
str.was: ...
str.were: ...
str.what: ...
str.when: ...
str.which: ...
str.who: ...
str.will: ...
str.with: ...
str.would: ...
str.your: ...
BookID: a factor with levels b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 identifying the book
Author: a factor with levels Austen London Milton Shakespeare identifying the author
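
The layout described above (numeric count columns followed by two factor columns) can be checked by column type rather than by position. The following is a minimal sketch on a made-up stand-in data frame; the name toy and all of its values are hypothetical and are not taken from authorship. Run against the real data after data(authorship), the same calls should report 69 numeric and 2 factor columns if the description above holds.

```r
# Hypothetical stand-in with the same layout as authorship:
# count columns first, then BookID and Author as factors.
toy <- data.frame(
  str.a   = c(12, 7, 20),
  str.all = c(3, 1, 4),
  BookID  = factor(c("b1", "b1", "b2")),
  Author  = factor(c("Austen", "Austen", "London"))
)

# Flag the count columns by type instead of relying on their position.
is_count <- vapply(toy, is.numeric, logical(1))

n_numeric <- sum(is_count)                            # number of count columns
n_factor  <- sum(vapply(toy, is.factor, logical(1)))  # number of factor columns

# Number of rows available for each author.
rows_per_author <- table(toy$Author)
```

Selecting columns by type keeps later code from breaking if the column order ever changes.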

This dataset is used to illustrate text classification.

Jeffrey S. Simonoff, Analyzing Categorical Data.

authorship2 <- authorship

# convert the word counts to proportions of each row's total
authorship2[, 1:69] <- authorship2[, 1:69] / rowSums(authorship2[, 1:69])

# create the model
authorship.som.init <- som(
	formula = ~ .,
	data = authorship2,
	neighborhood = "gaussian",
	grid = grid(xdim = 20, ydim = 20, type = "hexagonal")
)

# train the network
authorship.som <- learn(
	authorship.som.init,
	number.iter = 1000,
	max.alpha = 0.5,
	min.alpha = 0.001,
	max.rayon = 5,
	step.eval.si = 100
)

summary(authorship.som)

plot(authorship.som, "energy")

# see the distribution of the observations on the map
plot(authorship.som, "effectif", cex.label = 0)
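
The normalisation step in the example divides every count by its row total, so each row ends up as proportions that sum to 1 (multiply by 100 for percentages). A minimal sketch of that step in base R, using a made-up data frame; the name toy and its values are hypothetical and only mimic the layout of authorship:

```r
# Hypothetical stand-in: two count columns plus the two factor columns.
toy <- data.frame(
  str.a   = c(10, 30, 5),
  str.the = c(30, 10, 15),
  BookID  = factor(c("b1", "b1", "b2")),
  Author  = factor(c("Austen", "Austen", "London"))
)

# Pick the count columns by type instead of hard-coding an index range.
count_cols <- vapply(toy, is.numeric, logical(1))

toy_prop <- toy
# Dividing a data frame by a vector of length nrow() recycles the vector
# down each column, so every row is divided by its own total.
toy_prop[count_cols] <- toy[count_cols] / rowSums(toy[count_cols])

# Row totals of the normalised counts; each should equal 1.
row_totals <- rowSums(toy_prop[count_cols])
```

Working with proportions removes the effect of differing row totals, so rows are compared on relative word usage rather than on raw counts.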