knitr::opts_chunk$set(echo = TRUE)

Introduction

When we are using statistical methods we should adhere to the rules of statistical inference.

Because of that we will now spend some time talking about the statistical methods used in stylometry (to be precise, in the stylo package).

Stylometry analysis

In stylometry we investigate (dis)similarities between texts using (mainly) frequency analysis. The process can be divided into the following steps:

  1. Data gathering
  2. Choosing and extracting features
  3. Performing statistical analysis
  4. Investigating results

Data gathering

The stylo package we will be using assumes that all texts are grouped by a primary and a secondary attribute, e.g. author and text title, or magazine title and year. It also assumes that texts are plain text files and that both attributes can be obtained by splitting the file name on the first occurrence of an underscore.
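For example, for a hypothetical file named dickens_bleak_house.txt the two attributes could be recovered like this (a minimal sketch of the naming convention, not the stylo internals):

# split a hypothetical "primary_secondary.txt" file name on the first underscore
fileName = 'dickens_bleak_house.txt'
primary = sub('_.*$', '', fileName)                           # "dickens"
secondary = sub('\\.txt$', '', sub('^[^_]*_', '', fileName))  # "bleak_house"
primary
secondary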

Because on the one hand we want to be able to analyze our texts in a flexible way and on the other we would prefer to avoid renaming files by hand, a special database and a small R package were provided.

Available texts

Gathering texts

See the "Important commands" document.

Features

Choosing and extracting features is quite a wide topic. It can be divided into:

  1. Choosing features
  2. Extracting features from texts (we will skip it as it is purely technical)
  3. Computing features frequencies
  4. Tuning up features frequency table

Choosing features

A feature is the primary unit of our analysis. Typically features are words or character n-grams, but to be honest you can choose any feature you want (and are able to extract from the text), e.g. "noun - adjective pairs" (if you have a POS-tagged corpus) or "sentence length".

When you are choosing a feature, you decide what defines "the text style" for you.
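As a toy illustration of the "sentence length" idea, such a feature can be extracted with a few lines of base R (a naive sketch that splits on sentence-ending punctuation; real tokenization is more subtle):

txt = 'This is a sentence. This is another one! And a third?'
# naive sentence splitting on ., ! and ?
sentences = unlist(strsplit(txt, '[.!?]+\\s*'))
nchar(sentences)                      # sentence length in characters
lengths(strsplit(sentences, '\\s+'))  # sentence length in words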

Let's think about it for a while - please group the figures below by a common style (there are no bad answers!):

# eight symbols differing in three traits: color, shape, and size
plot(
  c(1, 2, 3, 4, 1, 2, 3, 4),
  c(0, 0, 0, 0, 1, 1, 1, 1),
  type = 'p',
  col = c(1, 1, 1, 1, 2, 2, 2, 2),      # color: black or red
  pch = c(1, 1, 0, 0, 1, 1, 0, 0),      # shape: circle or square
  cex = c(2, 1, 2, 1, 2, 1, 2, 1) * 9,  # size: large or small
  xlim = c(0.5, 4.5),
  ylim = c(-0.5, 1.5),
  xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', bty = 'n'
)

Technical hints

Setting philosophical deliberations aside, the stylo package how-to gives some practical hints on choosing features.

Computing features frequencies

Once we have extracted feature lists from our texts, we can compute their frequencies. As a result we end up with a table containing the frequency of each feature in each text.
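A minimal sketch of how such (relative, in %) frequencies could be computed for a single text, assuming single words as features:

txt = 'to be or not to be'
words = unlist(strsplit(tolower(txt), '\\s+'))
round(100 * table(words) / length(words), 2)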

For a sample of 7 words and 3 texts such a table will look like this:

m = matrix(
  c(
    4.35, 4.89, 3.81, 2.22, 1.86, 2.32, 1.24, 1.5, 1.29, 2.87, 2.92,
    2.8, 1.27, 1.6, 1.25, 1.88, 2, 2.03, 1.7, 1.3, 1.4
  ),
  nrow = 3, ncol = 7
)
colnames(m) = c('the', 'to', 'and', 'a', 'of', 'he', 'was')
rownames(m) = c('text1', 'text2', 'text3')
m

Tuning up features frequency table

We might have various concerns regarding a frequency table that contains all the features (e.g. very rare features carry little information, while a few very frequent ones may dominate the rest). To address such concerns we can tune the table up, e.g. by dropping the rarest features or by restricting it to the N most frequent ones, as shown in the sketch below.

The resulting table will be the basis for the statistical analysis.
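For example, restricting our sample table m to the N most frequent features could look like this (a minimal sketch; features are ranked by their mean frequency across texts):

# keep only the N most frequent features (here N = 5)
N = 5
m[, order(colMeans(m), decreasing = TRUE)[1:N]]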

Statistical analysis

Most stylometric analyses are based on distances between texts, so first we should understand what "a distance between texts" means.

Computing distances

The most important thing to understand is that every feature creates a separate dimension in "the text style space".

To make it easier to understand, let's consider a 2-dimensional case.

Some sample data:

# frequencies of two features (dots and dashes) in four messages
m = matrix(
  c(8, 18, 15, 17, 16, 17, 20, 13),
  nrow = 4, ncol = 2,
  dimnames = list(c('msg1', 'msg2', 'msg3', 'msg4'), c('dots', 'dashes'))
)
m
plot(m, type = 'p', xlim = c(0, 20), ylim = c(0, 20))
text(m, labels = rownames(m), pos = 1)

As we can see, we didn't put all frequencies on one dimension; instead, there is a separate dimension for each feature.

Now the question is: how do we compute distances between texts?

Let's compare a few distance measures:

# all six pairs of messages, with their feature values side by side
base = cbind(expand.grid(m[, 1], m[, 1]), expand.grid(m[, 2], m[, 2]), expand.grid(rownames(m), rownames(m)))[c(2:4, 7:8, 12), ]
rownames(base) = paste(base[, 5], base[, 6], sep = '-')
base = base[, -(5:6)]
colnames(base) = c('dotsA', 'dotsB', 'dashesA', 'dashesB')
# standard deviations of both features, used for scaling in the Delta measure
dts = sd(m[, 'dots'])
dhs = sd(m[, 'dashes'])

# three distance measures computed for every pair
r = rbind(
  sqrt((base[, 1] - base[, 2])^2 + (base[, 3] - base[, 4])^2),                     # Euclidean
  abs(base[, 1] - base[, 2]) + abs(base[, 3] - base[, 4]),                         # Manhattan
  0.5 * (abs((base[, 1] - base[, 2]) / dts) + abs((base[, 3] - base[, 4]) / dhs))  # Classic (Burrows') Delta
)
colnames(r) = rownames(base)
rownames(r) = c('Euclidean', 'Manhattan', 'Cl. Delta')
round(r, 3)

The same values expressed as multiples of the shortest distance (separately for each method):

round(r / matrix(c(min(r[1, ]), min(r[2, ]), min(r[3, ])), nrow = 3, ncol = 6), 3)
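As a sanity check, the Euclidean and Manhattan values can also be obtained directly with base R's dist() function (the pairs come out in the same order):

dist(m, method = 'euclidean')
dist(m, method = 'manhattan')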

What we can see is that the methods not only return different values but can also rank the same pairs differently (compare, e.g., how far msg1-msg2 is according to each of them).

What is the right way to measure distance, then? There is no single right answer; choosing a distance measure is yet another modeling decision we have to make.

Analysis

Principal Component Analysis

Employing PCA on our test data set will give us:

par(mfrow = c(1, 2))
# base data
plot(m, type = 'p', xlim = c(0, 20), ylim = c(0, 20))
text(m, labels = rownames(m), pos = 1)

# PCA (neither centering nor scaling the data here)
pca = prcomp(m, scale. = FALSE, center = FALSE)
# project the data onto the principal components by hand
mpca = cbind(
  PC1 = m[, 'dots'] * pca$rotation['dots', 'PC1'] + m[, 'dashes'] * pca$rotation['dashes', 'PC1'],
  PC2 = m[, 'dots'] * pca$rotation['dots', 'PC2'] + m[, 'dashes'] * pca$rotation['dashes', 'PC2']
)
plot(mpca, type = 'p', xlim = c(-25, -15), ylim = c(-5, 5))
text(mpca, labels = rownames(mpca), pos = 4)

par(mfrow = c(1, 1))

summary(pca)
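As a side note, the manually computed mpca is exactly the matrix of component scores that prcomp() already returns as pca$x:

all.equal(unname(mpca), unname(pca$x))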

And now, my kingdom for a meaningful name for the first and the second principal component :-)

Cluster analysis

Having computed the distances between each pair of texts, we can try to build a tree (a dendrogram) in which similar texts end up close to each other.

Again, there are many different methods which may result in different tree structures, and there is no clear rule as to which one is best.

A good summary is available here.
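A minimal sketch on our toy data, using base R's dist() and hclust() (the Ward linkage is chosen arbitrarily):

# cluster the four toy messages by their Manhattan distances
hc = hclust(dist(m, method = 'manhattan'), method = 'ward.D2')
plot(hc)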

Multidimensional scaling

Our feature frequency table typically has many more dimensions than we can imagine (I am completely unable to imagine more than 4 dimensions, and typically we have hundreds of them). The MDS method tries to deal with that problem by placing the texts in a space with few dimensions in such a way that the distances between them match the original distances as closely as possible.

Reducing the dimensionality of our feature frequency table to 2 makes it possible to draw a nice plot. Looking at the plot we can draw some conclusions about the similarity of the examined texts, but we should remember that this is extremely weak evidence.
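In base R classical MDS is provided by cmdscale(); a minimal sketch on our toy data:

# reduce the toy data to 2 dimensions and plot it
mds = cmdscale(dist(m), k = 2)
plot(mds, type = 'p')
text(mds, labels = rownames(m), pos = 1)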

Classification

In classification methods we try to guess some trait of a text (e.g. its authorship) by comparing it to texts whose trait values are known.

We call the set of texts with known trait values a training set and the set of unknown texts a test set.
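A toy sketch of the idea, using 1-nearest-neighbour classification from the class package (the author labels below are made up for illustration):

library(class)
# hypothetical training set: three texts with known authors
train = m[1:3, ]
authors = c('A', 'B', 'A')
# assign the remaining text the author of its nearest neighbour
knn(train, m[4, , drop = FALSE], cl = authors, k = 1)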


