Relative frequencies of tag trigrams in selected Spanish texts

Description

Relative frequencies of the 120 most frequent tag trigrams in 15 texts contributed by 3 authors.

Usage

data(spanish)

Format

A data frame with 120 observations (tag trigrams) on 15 variables (texts); the texts are documented in spanishMeta.
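
Because the trigrams are the rows and the texts are the columns, the analyses in Examples transpose the data frame first so that each text becomes one observation. A minimal sketch of that layout, using a made-up toy data frame (the names `toy`, `textA`, `textB` and all values are hypothetical, not taken from the data set):

```r
# Toy stand-in for the layout described above: rows are tag trigrams,
# columns are texts. All names and values are invented for illustration.
toy <- data.frame(
  textA = c(0.020, 0.015, 0.010),
  textB = c(0.018, 0.017, 0.009),
  row.names = c("trigram1", "trigram2", "trigram3")
)
dim(toy)         # 3 observations (trigrams) on 2 variables (texts)

# Transposing puts the texts in the rows, which is the orientation
# that prcomp() expects when the texts are the units of analysis.
toy.t <- t(toy)
dim(toy.t)
rownames(toy.t)
```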

References

Spassova, M. S. (2006) Las marcas sintácticas de atribución forense de autoría de textos escritos en español, Master's thesis, Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, Barcelona.

Examples

## Not run: 
data(spanish)
data(spanishMeta)

# principal components analysis

spanish.t = t(spanish)
spanish.pca = prcomp(spanish.t, center = TRUE, scale = TRUE)
spanish.x = data.frame(spanish.pca$x)
spanish.x = spanish.x[order(rownames(spanish.x)), ]

library(lattice)
splom(~spanish.x[ , 1:3], groups = spanishMeta$Author)

# linear discriminant analysis

library(MASS)
spanish.pca.lda = lda(spanish.x[ , 1:8], spanishMeta$Author)
plot(spanish.pca.lda)

# cross-validation

n = 8
spanish.t = spanish.t[order(rownames(spanish.t)), ]
predictedClasses = rep("", 15)
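for (i in 1:15) {
  # hold out text i and fit the PCA on the remaining 14 texts only
  training = spanish.t[-i, ]
  trainingAuthor = spanishMeta[-i, ]$Author
  training.pca = prcomp(training, center = TRUE, scale = TRUE)
  training.x = data.frame(training.pca$x)
  # train the LDA on the first n principal components, not the raw columns
  training.pca.lda = lda(training.x[ , 1:n], trainingAuthor)
  # project all texts onto the training components, classify held-out text i
  all.x = data.frame(predict(training.pca, spanish.t))
  predictedClasses[i] =
    as.character(predict(training.pca.lda, all.x[ , 1:n])$class[i])
}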

ncorrect = sum(predictedClasses==as.character(spanishMeta$Author))
ncorrect
sum(dbinom(ncorrect:15, 15, 1/3))

## End(Not run)
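
The last line of the example computes the probability of classifying at least `ncorrect` of the 15 texts correctly if each text were assigned to one of the 3 authors purely by chance (success probability 1/3). The following is a minimal sketch of that upper-tail calculation; the value `ncorrect <- 10` is hypothetical and chosen only for illustration, not a result obtained from the data, and the names `n.texts`, `p.chance`, `p.sum` and `p.tail` are likewise illustrative.

```r
# Hypothetical count of correct classifications (illustrative only,
# not a result obtained from the spanish data).
ncorrect <- 10
n.texts <- 15
p.chance <- 1 / 3  # three authors, so a random guess succeeds 1 time in 3

# Upper tail P(X >= ncorrect), summed term by term as in the example
p.sum <- sum(dbinom(ncorrect:n.texts, n.texts, p.chance))

# The same tail obtained from the cumulative distribution function
p.tail <- pbinom(ncorrect - 1, n.texts, p.chance, lower.tail = FALSE)

# Both routes should agree up to floating-point tolerance
all.equal(p.sum, p.tail)
```

A small tail probability suggests that the cross-validated accuracy is unlikely to be due to chance alone.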
