Introduction

This document provides a walkthrough for figures 7 through 14 of the article "Vector Semantics, William Empson, and the Study of Ambiguity," Critical Inquiry (2018). Here you can see how to access the data used in the article and recreate (see caveat below) the visualizations in the piece.

Caveat and Apology

Here's where I must apologize to any readers interested in recreating the exact images that appear in the article. I failed to preserve the scripts I used to generate the charts, so I'm not sure exactly which parameters I used. As a result, the commands below will produce slightly different layouts and slightly different most-similar word lists. These differences do not, I trust, affect the main points I'm hoping to make in the article, but they are worth noting: following the instructions below will closely, but not perfectly, replicate what appears in Critical Inquiry. Also, when creating the images in R, I exported them to PDF and tweaked them in Inkscape, adjusting the font and spacing for readability.

Where to find the data and functions

The functions required for the analysis are available in the empson R package, which can be installed from GitHub.

# If devtools isn't installed yet: install.packages("devtools")
devtools::install_github("michaelgavin/empson")
library(empson)
library(ggplot2)

The files you'll need are also available on GitHub, in the same folder where this supplemental tutorial is hosted. The URL is https://github.com/michaelgavin/empson/tree/master/ci. You'll need to download both of the following files: eebo_old.rda and milton.rda. Just navigate to that page in your browser, click on each file, and click "Download". Once downloaded, load them into R.

load("~/projects/empson/ci/eebo_old.rda")
load("~/projects/empson/ci/milton.rda")

Your R global environment should now include those two objects. The first is a matrix containing word-context data from EEBO and the second is a list containing all the words from the Milton passage analyzed in the article.
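
A quick check that both objects loaded as expected:

dim(eebo)    # dimensions of the word-context matrix
str(milton)  # structure of the list of words from the Milton passage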

The matrix called eebo has word counts normalized as probability ratios across the row vectors (each row sums to 1). In the visualizations in the article, I computed similarity over a slightly modified version of this matrix, replacing each value with its standard z-score. This had the effect of sharpening the word clouds, especially when measured over composite vectors. (I say a bit more about this in the final comments, below.)

n = eebo
avgs = apply(n, 1, mean)  # mean of each row vector
devs = apply(n, 1, sd)    # standard deviation of each row vector

# Replace each value with its z-score, working column by column
for (j in 1:ncol(n)) {
  n[,j] = (n[,j] - avgs) / devs
}
rm(avgs, devs, j)
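
The loop above can also be written in one line; this is just a more compact equivalent (R recycles the row means and standard deviations down each column):

n = (eebo - rowMeans(eebo)) / apply(eebo, 1, sd)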

To recreate Figure 7:

similarity_map(n, "foot")

And to recreate Figure 8:

vec = eebo["square",] + eebo["foot",]
similarity_map(n, vec, numResults = 60)

To recreate Figures 9 through 12, run the following for the terms "space," "evil," "abstracted," and "stood."

similarity_map(n, "space", numResults = 50, numGrps = 4)

To recreate Figure 13:

vec = eebo["space",] + eebo["evil",] + eebo["abstracted",] + eebo["stood",]
similarity_map(n, vec, numResults = 100)

And to recreate Figure 14:

passage = unlist(milton)
passage = passage[passage %in% rownames(eebo)]  # keep only words that appear in the EEBO matrix

vec = colSums(eebo[passage,])  # composite vector: sum of the row vectors for every word in the passage
similarity_map(n, vec, numResults = 125, numGrps = 6)

A few additional comments

Above, I mentioned that the data is normalized in ways that "sharpen" the visualizations. I didn't get into this in the article, because it was really beside the point (or, at least, it seemed so to me). But anybody who digs into the data and starts playing around with it themselves will find that even small adjustments in the parameters will often change the results, sometimes pretty drastically. For example, the composite vectors calculated above are added together using the matrix normalized over the rows, but the similarity measurements that generate the visualizations are then computed over the matrix normalized by z-score over the columns. This small adjustment resulted in graphs with greatly improved legibility, but whether or not it corresponds meaningfully to any qualitative difference -- either in the EEBO corpus particularly or over the language in general -- I don't know.
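
To see this sensitivity for yourself, compare the same composite vector mapped over the raw row-normalized matrix and over the z-scored matrix n. (This assumes similarity_map will accept the un-standardized eebo matrix just as it accepts n; the exact differences you see will depend on the parameters.)

# One composite vector, two normalizations of the underlying matrix
vec = eebo["square",] + eebo["foot",]
similarity_map(eebo, vec, numResults = 60)  # raw row-normalized counts
similarity_map(n, vec, numResults = 60)     # z-scored matrix, as in the figures above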

My point here is to say that any researchers pursuing quantitative study in the humanities will often find themselves facing similar issues. Seemingly minor differences in how data is gathered can result in big differences in one's "findings." From this I draw three lessons, which I apply throughout my quantitative work and which may be useful to others:

  1. I try to be careful that no aspect of my primary argument hinges in any substantive way on any single, precise detail in the analyses, because any such detail hinges on too many poorly understood and undertheorized factors;
  2. We need more research in digital humanities devoted to the study of how data-curation and analytical models correlate (and don't) with our qualitative understandings of literary phenomena, so that we can in more detail and with greater confidence explain what different measurements mean (in this case, how we should think differently about matrices normalized in different ways);
  3. Any pretense of empirical objectivity needs to be thrown out the window and burned to ashes (and the ashes scattered); any quantitative analysis provides no more and no less than one possible perspective on a text and any conclusions drawn are always irremediably those of the critic and are produced by the critic's theoretical presuppositions.

The last point is extremely important to keep in mind, especially for readers who see charts and numbers and assume the author is adopting a posture of scientism. As I hope is clear from the essay, what I love about computational analysis isn't that it's scientific -- whether it is or isn't depends on one's understanding of "science" -- but that it opens a space for creativity and play in much the spirit of Empson. If quantitative methods are to contribute to the empirical or theoretical rigor of literary history, they will be able to do so only insofar as our knowledge catches up to our practice. My goal in the essay was simply to nudge myself in that direction.


