Introduction

The tei2r package is designed for humanities scholars and students interested in using R for text analysis. Its creation was prompted by the recent public release of thousands of documents by the Text Creation Partnership, including Early English Books Online, Evans, and Eighteenth-Century Collections Online. tei2r has built-in functions that make it easy to access and download TCP files from Github, https://github.com/textcreationpartnership, where they are freely available.

You don't need to be an experienced R programmer to use tei2r. Its most important functions require learning only a small handful of commands, which are explained below. tei2r makes it easy to measure word frequencies and train topic models. Each of these operations is explained in this introduction (which can be followed step-by-step in less than 20 minutes). For this reason, tei2r is ideal for DH-curious scholars with a strong interest in early modern literature and history.

Scholars already comfortable doing text analysis with R or Python may still find tei2r useful, because its method for organizing objects in the R environment makes it easy to work with small- to medium-sized collections. Two complementary packages -- htn for historical text network analysis and empson for vector-space modeling -- build on and extend the core tei2r functionalities with more complex analytical techniques.

Anyone using tei2r should feel free to email Michael Gavin at mgavin@mailbox.sc.edu with questions, complaints, critiques, requests, bug reports, suggestions, or offers to collaborate.


Installation

The initial development version of tei2r is released on Github. If you're using RStudio, you may already have the devtools package installed and ready to use. (If not, or if you're using R by itself, install devtools by running the command install.packages("devtools"), followed by library(devtools).)

To install tei2r and activate its library:

devtools::install_github("michaelgavin/tei2r", build_vignettes = T)
library(tei2r)

To browse the help files, look over the main page:

?tei2r

Getting started with the EEBO-TCP

(Readers most interested in learning about tei2r's data structure may want to skip to the next section.)

tei2r has two functions that are designed solely for working with the EEBO-TCP corpus. They are called tcpSearch and tcpDownload, and, as their names imply, they allow you to search the TCP archive and download files from the TCP.

If you're just exploring and browsing EEBO, the existing web interfaces will work better than RStudio, but if you have a good sense of the files you want to grab, tcpSearch will get you started. If you want all the books published in a given year, everything attributed to an individual author, or everything with a certain term in the title or subject headings, run a command like this:

tcpSearch(term = "Dryden", field = "Author")

This function does two things: it creates a special kind of table called a data frame in R, and it displays that data frame in your viewing window. The search results pop up in a format that looks very much like a spreadsheet and that contains all the EEBO metadata for every book that includes John Dryden as a contributing author.

Of course, you can search by any field. If searching by date, don't use the term argument; use range instead:

tcpSearch(range = 1580:1620, field = "Date")

(Notice that you should not use quotation marks around the date range. This tells R to treat the dates like numbers, rather than like words.)

To download the files, first you need to save the results of your search as a .csv index file. Create a folder to store the results! You don't want to fill your desktop with hundreds of XML documents. After creating the folder, set it as your working directory. (The working directory is the folder that R looks to first for reading and writing files.) In RStudio, you can navigate to the folder in the "Files" pane on the lower right, then click "More --> Set as working directory". Otherwise, you can use the setwd() command, as in the example here:

dir.create(path = "testFolder")
setwd("testFolder")

Now store the results:

results = tcpSearch(term = "Dryden", field = "Author", write = T)

In addition to performing the search, this command saves the results in two places. First, the data frame is stored as an object called results in your R environment. (You should see it in the upper right panel of RStudio.) Second, it saves a .csv file called "index.csv" in your working directory. (You should see it appear in the lower right.)
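
If you want a quick look at what came back, standard R commands work on the results data frame; for example:

dim(results)    # number of records and number of metadata columns
View(results)   # opens the table in the RStudio viewer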

Note that tcpSearch does not include advanced search functionality, nor does it include any special function for further filtering the results. If you want to drill down with some complex search and you're comfortable with R, just filter the results data frame however you like, then re-save it using write.csv(). If you're not yet comfortable with R's syntax, you can open the index in Microsoft Excel, delete the rows you don't want, re-save it (remember to keep it in .csv format), and then read it back into R like this:

results = read.csv("index.csv")

This overwrites the results object by reading the edited contents of the index back into it. (Reading and writing files using R is really easy to do, once you've done it a few times.)
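
If you'd rather stay in R, here is a minimal sketch of the filter-and-resave approach described above. (The Date column name is an assumption; run names(results) to check it against your own index.)

# Keep only entries from 1690 onward, then overwrite the index file
results = results[results$Date >= 1690, ]
write.csv(results, file = "index.csv", row.names = FALSE)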

Now, to download your files, run this:

tcpDownload(results)

Depending on how big your index is and how strong your internet connection is, this may take a while. If you're downloading just a few dozen or a few hundred documents, it usually takes only a minute or so.

At the end, you'll have everything you need to get started: a folder filled with EEBO-TCP XML files and a .csv index that holds the metadata for each.
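
A quick way to confirm that everything arrived, assuming you're still in the folder you created:

list.files(pattern = ".xml")   # the downloaded TEI files
file.exists("index.csv")       # the metadata index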

How to download the entire EEBO-TCP corpus

When working with students, you won't want them downloading the entire corpus, but for your own research you'll quickly find that it's easier to do the download once, then just change the index file based on your interests at any given moment. To do this, run:

data(tcp)
tcp = tcp[which(tcp$Status == "Free"),]
tcpDownload(tcp)

This will download more than 32,000 documents directly to your hard drive. It's several GB worth of data. Be sure you save it in a convenient place.
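
Once the full corpus is on disk, starting a new project usually just means writing a new, smaller index.csv for the files you care about. Here's a sketch that subsets the built-in catalogue; the Status column is used above, but the Date column name is an assumption (check names(tcp)):

# Subset the master catalogue to one year's freely available texts
data(tcp)
myIndex = tcp[which(tcp$Status == "Free" & tcp$Date == 1660), ]
write.csv(myIndex, file = "index.csv", row.names = FALSE)

You can then point buildDocList() at the folder holding the full download while supplying this smaller index, as noted at the end of this tutorial.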

Importing TEI documents into R

When I've taught coding to humanities professors and graduate students, the biggest challenge by far has been keeping track of all the steps. When you start defining variables and creating lists and tables of data, your workspace (your 'environment') quickly gets cluttered with objects. Really, the entire purpose of tei2r is to mitigate this problem by creating a structure in your R environment that mimics the results of a bibliographic search.

Your index file, created above, provides your list of documents, and it is the blueprint for everything that happens in tei2r. You can use your index to create a document list, or docList, that holds all the metadata about your texts, as well as all the information R will need to analyze those documents in more complex operations.

In this example, let's return to the works of Dryden. If you haven't already done so, create a new folder, set it as your working directory, and run the search and download like this:

dir.create("dryden")
setwd("dryden")
results = tcpSearch(term = "Dryden", field = "Author", write = T)
tcpDownload(results)

After the download completes (it'll take about a minute), you'll have a folder called "dryden" that has the index.csv file and a collection of 131 XML documents. Let's get them into R.

dl = buildDocList(directory = getwd(), indexFile = "index.csv", import = FALSE)

Now, look and see what's in the document list:

show(dl)

Each one of the @ symbols corresponds to a different slot in the document list, and each slot holds a specific kind of data. To view the index, for example, you could run View(dl@index). (Be sure to capitalize View().)

As you can see, the docList object stores several bits of data:

  1. The path to the source directory
  2. The filenames and the full paths to each file
  3. The path to the index .csv file
  4. The index itself, stored as a data frame in the R environment
  5. The path to your stopwords file, if you have one
  6. The stopwords themselves (tei2r has a default list), stored as a character vector
  7. An empty slot, named "texts." This is where the text data will be stored.

Notice that you still haven't imported the texts themselves. The document list is simply a structure that holds all the metadata for your collection in R, including the location of the folder (in case you change around your working directory), links and bibliographic data for each file, as well as the location and content of your stopwords list.
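
If you want to confirm exactly which slots your docList contains, the standard S4 tools work; the slot names used in this tutorial are @index and @texts:

slotNames(dl)    # lists every slot in the docList object
View(dl@index)   # the metadata data frame
dl@texts         # still empty at this point; filled by importTexts() below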

Everything in tei2r and its complementary packages, htn and empson, is built on the basic document list structure. This way, you only have to enter the filepath, the index, and the stopwords once, and R will remember what your settings are.

To analyze the documents, you'll need to import the texts into the @texts slot. tei2r doesn't normally require this as a separate step -- import defaults to TRUE when building a docList -- but importing texts can be time-consuming, especially on slower machines, so we set import = FALSE above and made it a separate action for this tutorial.

dl = importTexts(dl)

This operation will take a minute or so. The importTexts function automates a number of very common operations in R -- importing XML data, stripping the tags and converting to plain text, switching to lower-case, removing stopwords, and regularizing the long-S in EEBO data.

If you want to import the texts without normalizing them, adjust the normalize parameter:

dl = importTexts(dl = dl, normalize = FALSE)

The importTexts() function creates a big list that holds each text as a character vector, then stores that list in the last slot of the docList object. To get a sense of its contours, look at dl@texts:

dl@texts

This will print out the full character vector for each text, so it'll be pretty much unreadable. However, if you scroll up and down in your console you'll get a sense of what happens with each document. Notice that, by default, the texts are stored with their TCP number as the key. If you ever need to look a number up, remember that the index is stored in your environment. Just call View(dl@index).

To print out the character vector of any individual text, you can select it by number (that is, its place in the order of the collection) or by TCP number. For example, here are two different ways to access the word vectors that make up Dryden's Tyrannick Love, which has the TCP number 'A36708' and is the 72nd item in the collection:

dl@texts$A36708

dl@texts[[72]]
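
Since each text is stored as a character vector of words, ordinary vector operations apply:

head(dl@texts$A36708, 25)   # the first 25 words of Tyrannick Love
length(dl@texts[[72]])      # total word count for the same text, selected by position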

The cleanup() function

As mentioned above, the importTexts() operation performs several text-processing steps more or less automatically. These steps can be adjusted in a variety of ways, and it helps to have a sense of what's happening under the hood as a document is being transformed. Try using cleanup() by itself.

# First, import the stopwords to your global environment
data(stopwords)

# Then, run cleanup()
cleanup("A36708.xml", stopwords = stopwords)[1:500]

You'll see that the text has been reduced to an ordered list of terms, beginning with the title page and dedication. It's worth pausing to look briefly over cleanup() to see how this is done. Here's the text of the function:

# Cleanup has two parameters, stopwords (default is empty) and normalize (default is true)
cleanup <- function (filepath, stopwords = c(), normalize = TRUE) 
{
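    # Note: htmlTreeParse, getNodeSet, and xmlValue below come from the XML package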
    #First it checks to see if your file is plain text
    if (length(grep(".txt", filepath)) == 1) {
        text = scan(filepath, what = "character", sep = "\n", 
            fileEncoding = "UTF-8")
        text = paste(text, collapse = " ")
    }

    # if not, it parses the XML, taking the entirety of the <text> element in TEI
    else if (length(grep(".xml", filepath)) == 1) {
        parsedText = htmlTreeParse(filepath, useInternalNodes = TRUE)
        nodes = getNodeSet(parsedText, "//text")
        text = lapply(nodes, xmlValue)
    }

    # Here it deletes out phrases from the TEI encoding that get scooped up
    text = gsub("non-Latin alphabet", " ", text)
    text = gsub("1 page duplicate", " ", text)

    # Then, if normalize is true, it runs these steps
    if (normalize == TRUE) {

        # it deletes out two kinds of long-S encoding
        text = gsub("??", "s", text) 
        text = gsub("A,?", "s", text)

        # it removes all numbers
        text = gsub("[0-9]", "", text)

        # it converts vv to w
        text = gsub("vv", "w", text)

        # it converts 'd and 'ring to ed and ering, respectively
        text = gsub("'d ", "ed ", text)
        text = gsub("'ring ", "ering ", text)
    }

    # Then, it breaks the long string of text into individual words
    text = strsplit(text, "\\W")
    text = unlist(text)
    text = text[text != ""] # removing blanks


    if (normalize == TRUE) {

        # then converts everything to lower case
        text = tolower(text)

        # keeps only words that AREN'T in the stopwords
        text = text[text %in% stopwords == FALSE]

        # then removes all words with special characters
        if (any(grep("[^ -~]", text))) 
            text = text[-grep("[^ -~]", text)]
    }
    return(text)
}

As written, the normalization includes many steps that users may want to play around with. For example, removing numbers results in very tangible distortion when analyzing concepts that involve quantification or biblical citation. Removing special characters can distort results when working with non-English sources. If you want to adjust the normalization procedures, you can tweak the code directly by running fix(cleanup) and commenting out features you don't want or otherwise adjusting them. If you're new to R programming and don't feel comfortable adjusting the operations by hand, please email the author. I would be more than happy to hear about your project and provide what help I can. Seriously.
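
If editing the function itself feels risky, a lighter-touch alternative is to run cleanup() with normalize = FALSE and then apply by hand only the steps you want, reusing the operations shown in the function above. For instance, to lower-case and remove stopwords while keeping the numbers:

# Import without normalization, then normalize selectively
words = cleanup("A36708.xml", normalize = FALSE)
data(stopwords)                          # tei2r's default stopword list
words = tolower(words)                   # lower-case everything, but keep numbers
words = words[!(words %in% stopwords)]   # remove stopwords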

Topic modeling with mallet and tei2r

Topic modeling with the mallet package is already pretty easy, but I found that a number of steps needed to be performed every time I ran a model, so buildModel() bundles them into a single command. One constraint: using mallet within RStudio means that you're subject to R's memory limitations. This is fine for small- to medium-sized collections, but if you want to model more than a few hundred texts, you're likely to run into memory errors.
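
If the memory errors come from the Java side (the mallet package runs MALLET through rJava), one common workaround, which is a general rJava setting rather than a tei2r feature, is to raise the Java heap limit before mallet is loaded:

# Must be set before mallet (and rJava) are loaded in the session
options(java.parameters = "-Xmx4g")   # allow Java up to 4 GB; adjust to your machine
library(mallet)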

First make sure you have the mallet package installed, then run this:

library(mallet)
tmod = buildModel(dl = dl, tnum = 20)

The buildModel function runs the model and stores the results in several different slots for easy viewing:

  1. tmod@index: the index containing all your documents' metadata
  2. tmod@topics: a table showing the top words in each topic
  3. tmod@frequencies: a table showing how frequent each topic is in each document
  4. tmod@weights: a table showing how frequent each word is in each topic

The data in any of these slots can be viewed, written to csv, or plotted in any number of ways. The mallet package has a number of options, most of which can be controlled as parameters in buildModel(). Users interested in delving deeper into mallet are encouraged to consult its documentation and work with its functions directly.
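
As a sketch of what viewing, writing, and plotting those slots might look like in practice (treating rows of the frequencies table as documents, which is an assumption worth checking with dim(tmod@frequencies)):

View(tmod@topics)                                             # top words per topic
write.csv(tmod@frequencies, file = "topic-frequencies.csv")   # save the document-topic table
barplot(tmod@frequencies[, 1], main = "Topic 1 across the collection")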

How to build a topic model from EEBO in 4 commands

  1. Search the EEBO-TCP for the files you want:
results = tcpSearch(range = 1660, field = "Date", write = T)
  2. Download the results:
tcpDownload(results)
  3. Initialize your collection in R:
dl = buildDocList(directory = getwd(), indexFile = "index.csv")
  4. Train the model and store the results in R:
tmod = buildModel(dl = dl, tnum = 20)

Note that, if you've already downloaded the EEBO-TCP files to a folder, you can skip step 2 and set that folder as the directory in step 3.

Conclusion

The tei2r package was designed primarily to facilitate the specific kinds of analysis I was doing with social and semantic networks. It doesn't do any kind of grammatical analysis, part-of-speech tagging, or other such work. Readers interested in collaborating to add these or any other functions should contact the author.

Thank you for using the tei2r package.


