This document provides a walkthrough of figures 7 through 14 in the article "Vector Semantics, William Empson, and the Study of Ambiguity," Critical Inquiry (2018). Here you can see how to access the data used in the article and recreate (see caveat below) the visualizations in the piece.
Caveat and Apology
Here's where I must apologize to any readers interested in recreating the exact images that appear in the article. I failed to preserve the scripts used when generating the charts and so I'm not sure exactly which parameters I used. As a result, following the commands below will result in slightly different layouts and slightly different most-similar word lists. These differences do not, I trust, affect the main points I'm hoping to make in the article, but they are worth noting. Following the instructions below will closely but not perfectly replicate what appears in Critical Inquiry. Also, it's worth noting that when creating the images in R, I exported them to PDF and tweaked them in Inkscape, adjusting the font and spacing for readability.
The functions required for the analysis are available in the empson R package, which can be downloaded via GitHub.
devtools::install_github("michaelgavin/empson")
library(empson)
library(ggplot2)
The files you'll need are also available on GitHub, in the same folder where this supplemental tutorial is hosted. The URL is https://github.com/michaelgavin/empson/tree/master/ci. You'll need to download both of the following files: eebo_old.rda and milton.rda. Just navigate to the webpage using your browser, click on each file, and click "Download". Once downloaded, load them into R.
load("~/projects/empson/ci/eebo_old.rda")
load("~/projects/empson/ci/milton.rda")
Your R global environment should now include those two objects. The first is a matrix containing word-context data from EEBO and the second is a list containing all the words from the Milton passage analyzed in the article.
The matrix called eebo has word counts normalized as probability ratios across the row vectors (the sum of each row totals 1). In the visualizations in the article, I computed similarity over a slightly modified version of this matrix, replacing each column with the standard z-score for each value. This had the effect of sharpening the word clouds, especially when measured over composite vectors. (I say a bit more about this in the final comments, below.)
# Standardize the values in eebo as z-scores
n = eebo
avgs = apply(n, 1, mean)  # mean of each row
devs = apply(n, 1, sd)    # standard deviation of each row
for (j in 1:ncol(n)) {
  n[,j] = (n[,j] - avgs) / devs
}
rm(avgs, devs, j)
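To see what this normalization step does, here is a minimal sketch on a toy matrix. The words, context names, and values are invented for illustration and are not taken from the EEBO data. As the loop is written, each value is centered and scaled by the mean and standard deviation of its own row, and the vectorized line is offered only as an equivalent that runs faster on a large matrix.

```r
# Toy stand-in for the eebo matrix; names and values are invented.
toy = matrix(c(0.5, 0.3, 0.2,
               0.1, 0.6, 0.3,
               0.4, 0.4, 0.2),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("foot", "square", "space"),
                             c("ctx1", "ctx2", "ctx3")))

# Loop version, mirroring the walkthrough's code
z_loop = toy
avgs = apply(toy, 1, mean)
devs = apply(toy, 1, sd)
for (j in 1:ncol(z_loop)) {
  z_loop[, j] = (z_loop[, j] - avgs) / devs
}

# Vectorized version: R recycles the row statistics down each column
z_vec = (toy - rowMeans(toy)) / devs

same_result = isTRUE(all.equal(z_loop, z_vec))
row_means = rowMeans(z_vec)      # each row now averages 0
row_sds = apply(z_vec, 1, sd)    # and has standard deviation 1
```

One thing to watch for: a row whose values are all identical has a standard deviation of zero, so the division would produce NaN for that row.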
To recreate Figure 7:
similarity_map(n, "foot")
And to recreate Figure 8:
vec = eebo["square",] + eebo["foot",]
similarity_map(n, vec, numResults = 60)
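For readers who want to see how a composite vector can be compared against the rest of the vocabulary, here is a rough sketch on a toy matrix. It assumes cosine similarity, a common choice in vector semantics; this walkthrough does not state what similarity_map computes internally, so treat that as my assumption. The helper cosine_to_rows is hypothetical and is not part of the empson package, and the words and counts are invented.

```r
# Hypothetical helper (not from the empson package): cosine similarity
# between one vector and every row of a matrix.
cosine_to_rows = function(mat, vec) {
  sims = as.vector(mat %*% vec) / (sqrt(rowSums(mat^2)) * sqrt(sum(vec^2)))
  names(sims) = rownames(mat)
  sims
}

# Toy word-context matrix; words and counts are invented.
toy = matrix(c(4, 0, 1,
               0, 3, 1,
               3, 1, 0,
               0, 0, 5),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("foot", "square", "inch", "sorrow"),
                             c("ctx1", "ctx2", "ctx3")))

# Composite vector built by adding two word vectors, as above
vec = toy["square", ] + toy["foot", ]

# Rank every word by its similarity to the composite
sims = cosine_to_rows(toy, vec)
ranked = sort(sims, decreasing = TRUE)
```

In this toy case "inch" ranks above both of the words that were added together, which is the kind of result that makes composite vectors interesting. Note that the sketch compares the composite against the same matrix it was built from, whereas the commands in this walkthrough build vec from eebo and measure similarity over n.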
To recreate Figures 9 through 12, run the following once for each of the terms "space," "evil," "abstracted," and "stood," substituting each term for "space" in turn.
similarity_map(n, "space", numResults = 50, numGrps = 4)
To recreate Figure 13:
vec = eebo["space",] + eebo["evil",] + eebo["abstracted",] + eebo["stood",]
similarity_map(n, vec, numResults = 100)
And to recreate Figure 14:
passage = unlist(milton)
passage = passage[passage %in% rownames(eebo)]
vec = colSums(eebo[passage,])
similarity_map(n, vec, numResults = 125, numGrps = 6)
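The passage vector is built in three steps: flatten the list of words, keep only the words that have a row in the matrix, and sum the matching rows. Here is a small sketch of those steps on invented data. The object toy_passage is a hypothetical stand-in for milton, and the matrix values are made up.

```r
# Toy row-normalized matrix (each row sums to 1); values are invented.
toy = matrix(c(0.5, 0.3, 0.2,
               0.1, 0.6, 0.3,
               0.4, 0.4, 0.2),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("space", "evil", "stood"),
                             c("ctx1", "ctx2", "ctx3")))

# Hypothetical stand-in for the milton list
toy_passage = list(c("evil", "stood"), c("from", "evil", "space"))

words = unlist(toy_passage)
words = words[words %in% rownames(toy)]  # drops "from", which has no row

# Indexing by name repeats a row for each repeated word, so "evil"
# counts twice. drop = FALSE keeps a matrix even if one word survives.
vec = colSums(toy[words, , drop = FALSE])
```

Because each row sums to 1, the composite sums to the number of words kept, and a word that occurs several times in the passage pulls the composite further toward its own contexts.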
Above, I mentioned that the data is normalized in ways that "sharpen" the visualizations. I didn't get into this in the article, because it was really beside the point (or, at least, it seemed so to me). But anybody who digs into the data and starts playing around with it themselves will find that even small adjustments in the parameters will often change the results, sometimes pretty drastically. For example, the composite vectors calculated above are added together using the matrix normalized over the rows, but the similarity measurements that generate the visualizations are then computed over the matrix normalized by z-score over the columns. This small adjustment resulted in graphs with greatly improved legibility, but whether or not it corresponds meaningfully to any qualitative difference -- either in the EEBO corpus particularly or over the language in general -- I don't know.
My point here is to say that any researchers pursuing quantitative study in the humanities will often find themselves facing similar issues. Seemingly minor differences in how data is gathered can result in big differences in one's "findings." From this I draw three lessons, which I apply throughout my quantitative work and which may be useful to others:
The last point is extremely important to keep in mind, especially for readers who see charts and numbers and assume the author is adopting a posture of scientism. As I hope is clear from the essay, what I love about computational analysis isn't that it's scientific -- whether it is or isn't depends on one's understanding of "science" -- but that it opens a space for creativity and play in much the spirit of Empson. If quantitative methods will contribute to the empirical or theoretical rigor of literary history, they will be able to do so only insofar as our knowledge catches up to our practice. My goal in the essay was simply to nudge myself in that direction.