Begin by loading the libraries:

library(tei2r)
library(empson)

Create a working directory, then set it:

loc_xml = "~/MLH/EEBO-XML/EEBO-TCP XML"
setwd(loc_xml)

Now load the stopwords list and the TCP index into your R environment:

data(stopwords)
data(tcp)
tcp = tcp[which(tcp$Status == "Free"),]   # keep only the freely available texts

Then download the TCP documents. Careful: on most computers this command will take at least an hour or two.

tcpDownload(tcp)
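
If you'd like to test the pipeline before committing to the full download, it should be possible to pass a subset of the index instead. This is only a suggestion: it assumes tcpDownload() accepts any row-subset of the tcp data frame.

# Hypothetical test run: download only the first five free texts
tcpDownload(tcp[1:5, ])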

Now set other parameters. Establish a year range that you'd like to study and make a directory for holding your processed text files.

range = 1700:1799
loc_texts = "~/MLH/EEBO-DTs/"
setwd(loc_texts)

The empson package comes with some built-in functions for building word matrices, but those include a number of quality-control features that slow down processing. Because we'll be scrolling through thousands of documents, here are a couple of simplified functions.

# Simplified document-list builder: skips the quality-control checks
# used in tei2r's standard workflow
easyDocList = function(directory = "", stopwordsFile = "", indexFile = "") {
  source('~/projects/tei2r/R/docList.R')
  dl = docList()
  dl@directory = directory
  dl@indexFile = indexFile
  dl@index = read.csv(dl@indexFile, stringsAsFactors = FALSE)
  dl@filenames = paste(dl@index$TCP, ".xml", sep = "")
  dl@paths = paste(dl@directory, "/", dl@filenames, sep = "")
  dl@stopwords = stopwords   # uses the stopwords object loaded with data(stopwords)
  return(dl)
}

# Now define the special matrix-building function: for each keyword, it gathers
# keyword-in-context (KWIC) windows from every document and counts the
# vocabulary words that appear in those windows
buildYearlyMatrices <- function(dt, kw, voc, context = 5) {
  vocab = voc
  keywords = kw
  mat = matrix(0, length(vocab), length(keywords))
  rownames(mat) = vocab
  colnames(mat) = keywords
  for (i in 1:length(keywords)) {
    print(paste("Gathering KWICs for", keywords[i]))
    term_kwics = list()
    for (j in 1:length(dt@text)) {
      text = dt@text[[j]]
      hits = which(text == keywords[i])
      kwics = list()
      if (keywords[i] %in% text) {
        for (k in 1:length(hits)) {
          # take a window of 'context' words on either side of the hit,
          # clipped to the boundaries of the document
          start = max(hits[k] - context, 1)
          end = min(hits[k] + context, length(text))
          kwics[[k]] = text[start:end]
        }
        term_kwics[[j]] = kwics
      }
    }
    # count every word appearing in the keyword's contexts,
    # then keep only the words that belong to the vocabulary
    contexts = table(unlist(term_kwics))
    words = names(contexts)
    freqs = as.numeric(contexts)
    keep = which(words %in% vocab)
    words = words[keep]
    freqs = freqs[keep]
    mat[words, i] = freqs
  }
  return(mat)
}
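
To see what the KWIC window looks like in isolation, here is a toy illustration of the start/end logic on a plain token vector. This snippet is just for intuition and is not part of the pipeline.

# Toy example: a 2-word window around "fox"
text = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
hit = which(text == "fox")
context = 2
text[max(hit - context, 1):min(hit + context, length(text))]
# "quick" "brown" "fox"   "jumps" "over"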

Now import the texts for each year. This is a big for-loop that gathers the texts, normalizes long s characters, removes numbers, punctuation, and capitalization, and takes out all stopwords and any word containing a special character (the vast majority of which, in EEBO, are transcription errors or odd glyphs).

for (i in 1:length(range)) {
  year = range[i]
  print(year)
  # write a temporary index containing only this year's documents
  write.csv(tcp[which(tcp$Date == year),], "temp_index.csv")
  dl = easyDocList(directory = loc_xml, indexFile = "temp_index.csv")
  dt = importTexts(dl)
  # save the cleaned texts for the year as an .rda file
  save(dt, file = paste(as.character(year), ".rda", sep=""))
}
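
For intuition, here is a rough, hypothetical sketch of the kind of cleaning described above, applied to a toy string. The real cleaning is handled inside importTexts() in tei2r; this is only an illustration, not that implementation, and it assumes stopwords is a character vector.

# Illustrative only: approximate the cleaning importTexts() performs
raw = "The Goodneſs of GOD, in the Year 1688!"
txt = gsub("\u017f", "s", raw)                    # normalize long s
txt = tolower(gsub("[0-9[:punct:]]", " ", txt))   # drop numbers/punctuation, lowercase
tokens = strsplit(txt, "\\s+")[[1]]
tokens = tokens[tokens != ""]
tokens = tokens[grepl("^[a-z]+$", tokens)]        # drop words with special characters
tokens = tokens[!(tokens %in% stopwords)]         # drop stopwords
tokens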

Now let's build a vocabulary. This for-loop goes through every year, counts the word frequencies, and captures the words used that year in a vector called v. Each year's vector is then blended into the previous years' to create two smaller vectors: keywords will be the words chosen for Keywords-In-Context analysis; vocab will be the list of context words to be included in that analysis.

The purpose of this is to reduce the size of the matrix (otherwise it becomes very difficult to compute) and to filter out low-frequency words, the majority of which are irregularities in the transcription and markup that have little effect on meaning.

Notice that the keywords object is built using the intersect() function. The line of code that reads keywords = intersect(keywords, v[1:4891]) limits the keywords object to include only words that are among the 4,891 most frequent in every single year. Any word that is not among the 4,891 most frequent words, even in just one year, will be excluded. This ensures continuity over time, such that the keywords will be applicable to every year. The number 4,891 was chosen because it was the threshold at which the corpus returned exactly 2,000 keywords.

The vocab vector is built differently and results in about 32,000 words. Notice that it, too, uses an intersect() operation, though it intersects with the entire vocabulary rather than being limited to the 4,891 most frequent words. Still, the vocabulary of English in EEBO that is used every single year is quite limited: only about 11,000 words. The next line of code, vocab = c(vocab, v[1:20000]), uses the concatenate function, c(), rather than intersect(). This takes the top twenty thousand words in any given year and adds them to the vocabulary.

As a consequence, the vocabulary of the EEBO matrix includes two sets of words: words that are used every single year, regardless of frequency, and words that are among the 20,000 most frequent in any single year.
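
To make the difference concrete, here is a toy illustration of the two operations (the words are made up; only the behavior of intersect() and c() matters):

# intersect() keeps only words present in both vectors
intersect(c("liberty", "virtue", "trade"), c("virtue", "trade", "commerce"))
# "virtue" "trade"

# c() plus unique() keeps words present in either vector
unique(c(c("liberty", "virtue", "trade"), c("virtue", "trade", "commerce")))
# "liberty" "virtue" "trade" "commerce"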

keywords = c()
vocab = c()
for (i in 1:length(range)) {
  year = range[i]
  print(paste("Working on year", year))
  load(paste(loc_texts, as.character(year), ".rda", sep=""))
  # rank this year's words by frequency
  v = table(unlist(dt@text))
  v = sort(v, decreasing = T)
  v = names(v)
  if (i == 1) {
    keywords = v[1:4891]
    vocab = v
  } else {
    # keywords: words among the 4,891 most frequent in every year
    keywords = intersect(keywords, v[1:4891])
    # vocab: words carried over from previous years plus this year's top 20,000
    vocab = intersect(vocab, v)
    vocab = c(vocab, v[1:20000])
    vocab = sort(unique(vocab))
  }
}
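
A quick sanity check of the resulting sizes (the exact figures will vary with your download, but they should land near the counts mentioned above):

length(keywords)   # roughly 2,000
length(vocab)      # roughly 32,000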

Because it's difficult for desktop computers to handle this much data at a time (and because it'll be useful down the road to retain year-by-year data), I begin by compiling a matrix for every year.

First set the location:

loc_mats = "~/MLH/EEBO-MAT/"
setwd(loc_mats)

Now build a matrix for each year

for (i in 1:length(range)) {
  year = range[i]
  print(paste("Working on year", year))
  load(paste(loc_texts, as.character(year),".rda", sep=""))
  mat = buildYearlyMatrices(dt = dt, voc = vocab, kw = keywords)
  save(mat, file = paste(as.character(year),".rda", sep=""))
}

And add them together:

ecco = matrix(0, length(vocab), length(keywords))
colnames(ecco) = keywords
row.names(ecco) = vocab
for (i in 1:length(range)) {
  print(i)
  year = range[i]
  load(paste(loc_mats, as.character(year), ".rda", sep=""))
  ecco = ecco + mat
}
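
Finally, it may be worth saving the combined matrix and giving it a quick spot-check (the file name here is just a suggestion):

dim(ecco)                      # should be length(vocab) by length(keywords)
ecco[1:5, 1:5]                 # peek at one corner of the matrix
save(ecco, file = "ecco.rda")  # keep the combined matrix for later analysis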


