```{r setup}
knitr::opts_chunk$set(echo = TRUE)
```
When we use statistical methods, we should adhere to the rules of statistical inference. To summarize them in a few points:
Because of that, we will now spend some time talking about the statistical methods used in stylometry (more precisely, in the stylo package).
In stylometry we investigate (dis)similarities between texts using (mainly) frequency analysis. This process can be divided into the following steps:
The stylo package we will be using assumes that all texts are grouped by a primary and a secondary attribute, e.g. author and text title, or magazine title and year, etc. It also assumes that texts are plain text files and that these attributes can be obtained by splitting the file name at the first occurrence of an underscore.
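As a minimal sketch of this naming convention (the file names below are made-up examples), the primary and secondary attributes can be recovered by splitting at the first underscore:

```{r}
# Hypothetical file names following the "primary_secondary" convention
files = c('Austen_Emma.txt', 'Dickens_OliverTwist.txt')
# split each name at the first underscore only
parts = regmatches(files, regexpr('_', files), invert = TRUE)
primary   = sapply(parts, `[`, 1)                      # e.g. the author
secondary = sub('\\.txt$', '', sapply(parts, `[`, 2))  # e.g. the title
primary
secondary
```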
Because, on the one hand, we want to be able to analyze our texts in a flexible way and, on the other, we would prefer to avoid renaming files by hand, a special database and a small R package were provided.
See the "Important commands" document.
Choosing and extracting features is quite a broad topic. It can be divided into:
A feature is the primary unit of our analysis. Typical features are:
but, to be honest, you can choose any feature you want (and are able to extract from the text), e.g. "noun-adjective pairs" (if you have a POS-tagged corpus) or "sentence length".
When you choose a feature, you decide what defines "the text style" for you.
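As a small illustration of feature extraction (the toy text and the whitespace tokenization rule are assumptions, not the stylo implementation), here are two kinds of features pulled from one string:

```{r}
# Toy text; real analyses would read files from the corpus directory
txt = 'To be or not to be'
# word features: lowercase and split on whitespace
words = strsplit(tolower(txt), '\\s+')[[1]]
# word 2-gram features: each word paired with its successor
bigrams = paste(head(words, -1), tail(words, -1))
words
bigrams
```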
Let's think about it for a while - please group the figures below by a common style (there are no wrong answers!):
```{r}
plot(
  c(1, 2, 3, 4, 1, 2, 3, 4), c(0, 0, 0, 0, 1, 1, 1, 1),
  type = 'p',
  col = c(1, 1, 1, 1, 2, 2, 2, 2),
  pch = c(1, 1, 0, 0, 1, 1, 0, 0),
  cex = c(2, 1, 2, 1, 2, 1, 2, 1) * 9,
  xlim = c(0.5, 4.5), ylim = c(-0.5, 1.5),
  xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', bty = 'n'
)
```
Philosophical deliberations aside, according to the stylo package how-to:
Once we have extracted feature lists from our texts, we can compute their frequencies. We can do this:
As a result, we end up with a table of the frequency of each feature in each text.
For a sample of 7 words and 3 texts it will look like this:
```{r}
m = matrix(
  c(4.35, 4.89, 3.81, 2.22, 1.86, 2.32, 1.24, 1.5, 1.29, 2.87, 2.92, 2.8,
    1.27, 1.6, 1.25, 1.88, 2, 2.03, 1.7, 1.3, 1.4),
  nrow = 3, ncol = 7
)
colnames(m) = c('the', 'to', 'and', 'a', 'of', 'he', 'was')
rownames(m) = c('text1', 'text2', 'text3')
m
```
We might have many concerns regarding a frequency table of all features:
To address these concerns, we can tune the feature frequency table, e.g. by:
The resulting table will be the basis for the statistical analysis.
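Two common tuning steps can be sketched on a toy table: keeping only the N most frequent features and rescaling each feature to z-scores. Both the choice of N = 2 and the use of z-scores here are illustrative assumptions:

```{r}
# Toy frequency table (3 texts x 3 word features)
m = matrix(c(4.35, 4.89, 3.81, 2.22, 1.86, 2.32, 1.24, 1.5, 1.29), nrow = 3,
           dimnames = list(c('text1', 'text2', 'text3'), c('the', 'to', 'and')))
# keep the 2 features with the highest mean frequency
top = names(sort(colMeans(m), decreasing = TRUE))[1:2]
# z-score each remaining feature (column) separately
scale(m[, top])
```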
Most stylometric analyses are based on:
so first we should understand what "a distance between texts" means.
The most important thing to understand is that every feature creates a separate dimension in "the text style space".
To make it easier to understand, let's consider a 2-dimensional case:
Sample data:
```{r}
m = matrix(
  c(8, 18, 15, 17, 16, 17, 20, 13),
  nrow = 4, ncol = 2,
  dimnames = list(c('msg1', 'msg2', 'msg3', 'msg4'), c('dots', 'dashes'))
)
m
plot(m, type = 'p', xlim = c(0, 20), ylim = c(0, 20))
text(m, labels = rownames(m), pos = 1)
```
As we can see, we didn't put all frequencies on one dimension; instead, each feature has its own dimension.
Now the question is: how do we compute distances between texts?
```{r}
mm = m
mm[, 1] = (m[, 1] - mean(m[, 1])) / sd(m[, 1])
mm[, 2] = (m[, 2] - mean(m[, 2])) / sd(m[, 2])
plot(mm, type = 'p')
text(mm, labels = rownames(mm), pos = c(1, 1, 1, 3))
```
Let's compare the differences:
```{r}
base = cbind(
  expand.grid(m[, 1], m[, 1]),
  expand.grid(m[, 2], m[, 2]),
  expand.grid(rownames(m), rownames(m))
)[c(2:4, 7:8, 12), ]
rownames(base) = paste(base[, 5], base[, 6], sep = '-')
base = base[, -(5:6)]
colnames(base) = c('dotsA', 'dotsB', 'dashesA', 'dashesB')
dts = sd(m[, 'dots'])
dhs = sd(m[, 'dashes'])
r = rbind(
  sqrt((base[, 1] - base[, 2])^2 + (base[, 3] - base[, 4])^2),
  abs(base[, 1] - base[, 2]) + abs(base[, 3] - base[, 4]),
  0.5 * (abs((base[, 1] - base[, 2]) / dts) + abs((base[, 3] - base[, 4]) / dhs))
)
colnames(r) = rownames(base)
rownames(r) = c('Euclidean', 'Manhattan', 'Cl. Delta')
round(r, 3)
```
The same values expressed relative to the shortest distance (separately for each method):
```{r}
round(r / matrix(c(min(r[1, ]), min(r[2, ]), min(r[3, ])), nrow = 3, ncol = 6), 3)
```
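The Euclidean and Manhattan distances computed by hand above can be cross-checked with base R's dist() on the same dots/dashes table (redefined here so the chunk is self-contained):

```{r}
# The same toy dots/dashes table as before
m = matrix(c(8, 18, 15, 17, 16, 17, 20, 13), nrow = 4,
           dimnames = list(c('msg1', 'msg2', 'msg3', 'msg4'), c('dots', 'dashes')))
# all pairwise distances in one call
round(dist(m, method = 'euclidean'), 3)
round(dist(m, method = 'manhattan'), 3)
```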
What we can see:
What is the right way to measure distance then?
Applying PCA to our test data set gives us:
```{r}
par(mfrow = c(1, 2))
# base data
plot(m, type = 'p', xlim = c(0, 20), ylim = c(0, 20))
text(m, labels = rownames(m), pos = 1)
# pca
pca = prcomp(m, scale = FALSE, center = FALSE)
mpca = cbind(
  PC1 = m[, 'dots'] * pca$rotation['dots', 'PC1'] + m[, 'dashes'] * pca$rotation['dashes', 'PC1'],
  PC2 = m[, 'dots'] * pca$rotation['dots', 'PC2'] + m[, 'dashes'] * pca$rotation['dashes', 'PC2']
)
plot(mpca, type = 'p', xlim = c(-25, -15), ylim = c(-5, 5))
text(mpca, labels = rownames(mpca), pos = 4)
par(mfrow = c(1, 1))
summary(pca)
```
And now, my kingdom for a meaningful name for the Principal and Secondary Components :-)
Having computed the distances between each pair of documents, we can try to build a tree (dendrogram) in which similar texts are close to each other.
Again, there are many different methods which may result in different tree structures, and there is no clear rule for which one is best.
A good summary is available here.
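As a sketch on the toy dots/dashes data, base R's hclust() builds such a tree from a distance matrix; the linkage method ('ward.D2' here) is just one of several possible choices:

```{r}
# Toy dots/dashes table again, so the chunk runs on its own
m = matrix(c(8, 18, 15, 17, 16, 17, 20, 13), nrow = 4,
           dimnames = list(c('msg1', 'msg2', 'msg3', 'msg4'), c('dots', 'dashes')))
# cluster on Manhattan distances; the linkage choice changes the tree shape
hc = hclust(dist(m, method = 'manhattan'), method = 'ward.D2')
plot(hc)
```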
Our feature frequency table typically has many more dimensions than we can imagine (I am completely unable to imagine more than 4 dimensions, and we typically have hundreds of them). The MDS method tries to deal with this problem by:
Reducing the dimensionality of our feature frequency table to 2 makes it possible to draw a nice plot. Looking at the plot, we can draw some conclusions about the similarities of the examined texts, but we should remember that this is extremely weak evidence.
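A minimal MDS sketch on the same toy data: base R's cmdscale() projects a distance matrix down to k dimensions (here k = 2) for plotting.

```{r}
# Toy dots/dashes table; in a real analysis this would be the full frequency table
m = matrix(c(8, 18, 15, 17, 16, 17, 20, 13), nrow = 4,
           dimnames = list(c('msg1', 'msg2', 'msg3', 'msg4'), c('dots', 'dashes')))
# classical MDS of the pairwise distances into 2 dimensions
mds = cmdscale(dist(m), k = 2)
plot(mds, xlab = 'dim 1', ylab = 'dim 2')
text(mds, labels = rownames(m), pos = 1)
```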
In classification methods we try to guess some trait of a text (e.g. authorship) by comparing it to texts with known trait values.
We call the set of texts with known trait values the training set and the set of unknown texts the test set.
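The idea can be sketched as a 1-nearest-neighbor classifier (the frequency vectors and author labels below are made up for illustration): the unknown text receives the trait value of the closest training text.

```{r}
# Hypothetical feature frequency vectors for two known authors (training set)
train = rbind(authorA = c(4.2, 2.1, 1.3),
              authorB = c(3.6, 2.8, 1.1))
# frequency vector of the text with unknown authorship (test set)
unknown = c(4.1, 2.2, 1.25)
# Euclidean distance from the unknown text to each training text
d = apply(train, 1, function(x) sqrt(sum((x - unknown)^2)))
names(which.min(d))  # predicted author: the nearest training text's label
```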