In michaelgavin/empson: Vector-space modeling for historical documents

Introduction:

This tutorial will walk you through the basics of using the empson R package. The empson package builds on the basic functionality of tei2r to convert TEI-encoded and plain-text documents into word vectors. empson converts tei2r objects into vector-space models (word-count matrices). The tei2r and empson packages can build a variety of vector-space model types, but in this initial version, the functions are all designed to work with a specific data formats: word-context and term-document matrices.

The package includes a sample word-context matrix, eebo, drawn from the EEBO-TCP corpus, using all texts published between 1640 and 1699. A document matrix, called deebo, is drawn from the same collection. This tutorial will use the eebo model as its test case.

This tutorial covers the following topics:

Installing empson and tei2r
Sample dataset
Similarity measures: similarity()
Graphing concepts: similarity_map()
Analogies and Basic Semantic Relations: compose()
Document-term matrices
Building your own matrix: buildMatrix()

(NOTE: This tutorial assumes that you are using RStudio, which comes preloaded with the devtools package.)

Installing `empson` and `tei2r`:

Install tei2r and empson directly from Github, using the following commands:

# Install empson and tei2r from github
devtools::install_github("michaelgavin/tei2r", build_vignettes = T)
devtools::install_github("michaelgavin/empson", build_vignettes = T)

You must install tei2r in order to run empson. Because empson includes significantly large datasets, the download and installation may take two or three minutes. Once installed, add to your library:

library(tei2r)
library(empson)

Check out the help pages for empson and browse through a few of the functions to get a sense of the help files:

?empson

Because empson is still in development, there may be glitches. The dependencies of the two packages may cause problems: tei2r requires the XML package and suggests mallet. If you have any installation hiccups, please email me right away.

Sample dataset: eebo

Load the sample dataset:

data(eebo)

Explore the data by checking the dimensions:

dim(eebo)

You can just look it over by running:

View(eebo[1:50, 1:50])

eebo is a word-context matrix. This means that the rows are words and the columns are context keywords. Each of the column words was used as a keyword search for keyword-in-context analysis. The data in each column contains the results of that analysis, showing the frequency that each context word appeared during the search. The distribution of frequency counts for a given word is that word's signature. So, for example:

# To see the most frequent collocate words:
sig(mat = eebo, keyword = "food", margin = 2)

These numbers show how often each word appeared within a proximity of 5 words near "food." The word food is always present in its own KWICs, so it's the top hit, but words like god, good, life, man, body, bread, and eat were other words that most often appeared near food. Of the 32,000 words in the model, these are the words the most often showed up in the keyword-in-context analysis.

However, only about 2,000 keywords were measured directly. For the rest of the vocabulary, you search over the rows to see where those words were most often found.

# To see the most frequent collocate words:
sig(eebo, "oatmeal", margin = 1)

Similarity measures: `similarity()`

The key idea of vector-space models is to imagine these matrices as a large, high-dimensional space. By putting the signatures of each word together in a big matrix, it becomes possible to compare and contrast them using basic geometic operations.

The core of empson's capabilities can be found in the similarity() function. Begin by playing around with it a bit. Try the following:

similarity(eebo, "succession")

This used the default similarity measurement (taking the cosine of the angles between two vectors) to identify words most similar to succession.

If words have more than one common meaning, those different senses can sometimes appear more or less clearly in the distribution of the top words. For example, body often was used to refer to political bodies, to physical objects, to human bodies (including corpses), and, of course, to the body of Christ.

The numbers for each word show their similarity score to succession, but by themselves don't say how the words relate to each other. These are built by running similarity and then performing a very simple word-sense disambiguation operation. Basically, it finds the thirty most similar words then performs k-means clustering to see which similar words are semantically related to each other.

similarity_map(eebo, "body")

Basic Semantic Relations: `compose()`

As the word-sense disambiguation performed above suggests, one of the key functions of vector-space models is the ability, not just to model the historical uses of a term, but to differentiate among different possible senses of a word. One of the most common ways to do this is, perhaps strangely, to add and subtract words together. The operation below simply takes the vector of numbers that represents body and subtracts the vector of numbers that represents christ. Essentially, this takes all the uses of the word body and removes everything that overlaps with christ.

vec = compose(eebo, positive = "body", negative = "christ", operation = "+")

If you create a semantic map of the composite vector, you'll see that it retains much of the same elements as the original body vector, but that references to Christian mythology have mostly disappeared, political contexts have been somewhat de-emphasized, and a richer vocabulary of human anatomy is returned. This is very similar to how search engines work, if one entered "body NOT christ" into the search window.

# Notice that similarity() and similarity_map() can search for a keyword or a composite vector 
similarity_map(eebo, vec)

The comparison of body to christ works fairly well because they appear in roughly similar frequencies. Such comparisons can break down when comparing common words with uncommon words. You can deal with this problem by normalizing the vectors, such that the rows show proportional, rather than raw, frequency counts.

nmat = normalize(eebo)
sig(nmat, "body")

When you add together any two or more normalized vectors, searching for terms most similar to the composite vector reveals the words that sit in the semantic spaces between and around them.

similarity_map(nmat, compose(nmat, positive = c("body", "christ"), operation = "+"))

Notice here, too, that compose() commands can be embedded into similarity measurements, without creating a separate object representing the vector in your global environment.

Working with term-document matrices

In addition to eebo, empson comes with a sample dataset that is a term-document matrix. It uses the same list of 2,002 keywords and records their appearance over the 18,311 documents, dated 1640-1699, in the EEBO-TCP Phase I release. Simiarity measurements work over the deebo matrix just like they do over eebo.

data(deebo)
similarity(deebo, "body")

Keep in mind that these results are limited only to the vocabulary of keywords, so they'll tend to be limited. The purpose of the deebo matrix isn't really to measure word meaning, though. Instead, it measures relationships among the books, showing how documents in EEBO cluster in vector space.

The EEBO-TCP ID number for John Locke's Two treatises of government is A48901. Just as word similarity is measured above, we can find the books that use words in proportions most similar to Locke.

matches = similarity(deebo, "A48901", margin = 2)
ids = names(matches)
ids

As you can see, the TCP ID numbers by themselves are pretty opaque, but you can view the search results as a subset of the TCP index.

View(tcp[ids,])

This returns Locke's Treatises as the top hit, of course, because it's identical to itself. But it also returns books that are similar to Locke's, by writers like Robert Filmer and Thomas Hobbes. These results could, of course, be similarly sorted into semantic clusters (though it's hard to represent the titles in PCA graphs). Such clustering techniques are often used to differentiate among authors or genres.

In addition to finding books that are similar to each other, we can use the semantic model of the word-context matrix as a base for searching the term-document matrix. This works because they use the same set of keywords, and so the vector profiles for the words have the same mathematical shape as the vector profiles for each document. Rather than asking, which words are most similar to which, this asks, which documents use words in proportion most like the words that co-occur with a search term.

For example,

matches = similarity(deebo, eebo["body",], margin = 2)
ids = names(matches)
View(tcp[ids,])

You can also use vector composition to join and negate search terms.

# For "body" AND "christ"
vec = compose(eebo, positive = c("body", "christ"))
matches = similarity(deebo, vec, margin = 2)
ids = names(matches)
View(tcp[ids,])

# For "body" NOT "christ"
vec = compose(eebo, positive = "body", negative = "christ")
matches = similarity(deebo, vec, margin = 2)
ids = names(matches)
View(tcp[ids,])

Thus empson supports semantic searching over the EEBO-TCP corpus, with the goal of providing scholars with a more sophisticated way of searching over the document collection and analyzing its contents.

michaelgavin/empson documentation built on May 22, 2019, 9:50 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com