In johnmchambers/shakespeare: Tools for XML-based Analysis of Shakespeare's Plays

This vignette considers techniques for extending proxy classes in R that map to classes in another language, to make computing facilities in that language available in an R package. The proxy classes can be used as fields or superclasses for R classes that extend the capabilities from the interface. We'll look at some alternative approaches and the programming to implement them.

Interfaces and Extensions

Using an interface from one programming language or environment to another is a winning strategy for projects with challenging computing requirements. The "Extending R" book ([@extR], hereafter XR) ties this strategy to R through one of its three fundamental principles: "Interfaces are part of R".

When essential programming idioms are shared between the languages, the use of an interface can be made natural for programming by defining proxies. In particular, both R and many other powerful languages emphasize versions of functions and classes.

An interface from R to such a language can define proxy functions and proxy classes, which are used in a natural R way but are in fact automatically interfaced to the corresponding program constructs in the other language. In addition to using such proxies directly, the application is likely to extend the computations and incorporate them in functions, classes or other structures in R.

We'll look at some techniques for such extensions, particularly for embedding proxy classes in R classes. For a more-or-less realistic project in which these might be useful, we'll consider a package to apply data analysis to the plays of William Shakespeare.

Shakespeare Meets R, and XML, and Python

For over four centuries, Shakespeare's plays have been the focus of a sea of studies and debates, including some recent work applying statistical and natural-language techniques, focussed on questions of authorship in particular.

In a more exploratory spirit, there are a number of interesting questions for which R is suited (well, interesting to me, certainly). How do the plays change with time, or with subject matter? Are characters distinguished in speech by class, by gender or by other context?

Suppose we start on a project to construct tools in R to explore such questions. First question: the data. The plays have been digitized in several forms. For our purposes, one of these has the key advantage of preserving the most structural information in a way that can be useful for analysis. Thirty-seven plays were transcribed in HTML format and later converted to XML.

The great asset of XML is that it provides a hierarchical grammar for describing objects in a tree-like structure, using specific terms for the particular objects. For the plays, this organizes the data by acts and scenes. Within each scene speeches, stage directions and a few other items are explicitly distinguished. All this structure is available to us for analysis.

For example, the most important part of the plays will be the speeches. In XML parlance, a speech has a corresponding tag and structure:

<SPEECH>
<SPEAKER>OBERON</SPEAKER>
<LINE>Ill met by moonlight, proud Titania.</LINE>
</SPEECH>

All the text in the speech is available, separated by lines (only one here); in addition, the speaker is explicitly identified. From further up the tree, we know the scene, the act and the play.
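To make this concrete, here is a minimal sketch of reading that fragment with Python's standard `xml.etree.ElementTree` module. The variable names are ours, for illustration only; the use of `itertext()` assumes that a line might contain nested markup, which the fragment above does not show.

```python
import xml.etree.ElementTree as ET

# The speech fragment from the text, as a string.
speech_xml = """
<SPEECH>
<SPEAKER>OBERON</SPEAKER>
<LINE>Ill met by moonlight, proud Titania.</LINE>
</SPEECH>
""".strip()

speech = ET.fromstring(speech_xml)

# The speaker is an explicit child element of the speech.
speaker = speech.findtext("SPEAKER")

# Each LINE element holds one line of text; itertext() also collects
# text from any nested elements (an assumption about the real files).
lines = ["".join(line.itertext()) for line in speech.findall("LINE")]

print(speaker, lines)
```

The result is an ordinary string for the speaker and a list of strings for the lines, which is the kind of "linear" data both Python and R handle comfortably.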

What tools might be useful to analyse such data? Again, it's the speeches that really are the plays, those "words, words, words" as Hamlet says. Analysing the words can benefit from natural-language tools that break down the text into "tokens" and attempt to build structure on these. A particularly popular and widely used collection of such tools is NLTK, the Natural Language Toolkit implemented in Python.
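As a rough idea of what "tokens" means here, the following sketch splits a line into lowercase words and counts them. It deliberately does not use NLTK: the regular expression is a crude stand-in of our own, and a real tokenizer will treat punctuation, contractions and hyphenation differently.

```python
import re
from collections import Counter

def simple_tokens(text):
    """Split text into lowercase word tokens.

    A crude stand-in for a real tokenizer: it keeps runs of letters
    and apostrophes and drops everything else.
    """
    return re.findall(r"[a-z']+", text.lower())

line = "Ill met by moonlight, proud Titania."
tokens = simple_tokens(line)

# Token counts are a typical first summary to pass on for analysis.
counts = Counter(tokens)

print(tokens)
```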

So, this is starting to look like a strategy. Obtain the data in the form of XML files, make use of NLTK to ask questions about the speeches and apply R to explore the results. An approach to interfacing the three languages is needed next.

XML excels at representing structure, but it is very hierarchical. Both R and Python are happier with "linear" structures. In R, these are vectors and vectorized computations, plus other structures built on vectors. In Python, they are iterators and iterable structures, such as Python lists.

Both R and Python can parse XML files into corresponding internal forms (in several ways in R). The tools of the NLTK will usually be the first step in processing, so parsing the XML in Python is the obvious approach.

The specific strategy chosen is to parse the files in Python, which produces objects of a particular class ("ElementTree"). From these, we will construct other objects, most importantly lists of speeches. Specialized Python classes will describe the objects.
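A sketch of this step might look like the following. It is not the package's actual code: the `Speech` class, the `speeches()` function and the tiny sample play are hypothetical, and the tag names `PLAY`, `ACT`, `SCENE` and `TITLE` are assumptions about the files (only `SPEECH`, `SPEAKER` and `LINE` appear in the example above). For a real file, `ET.parse(path).getroot()` would supply the root element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# A tiny stand-in for one play file (tag names other than SPEECH,
# SPEAKER and LINE are assumed for this illustration).
PLAY_XML = """<PLAY>
<TITLE>A Midsummer Night's Dream</TITLE>
<ACT><TITLE>ACT II</TITLE>
<SCENE><TITLE>SCENE I</TITLE>
<SPEECH>
<SPEAKER>OBERON</SPEAKER>
<LINE>Ill met by moonlight, proud Titania.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>TITANIA</SPEAKER>
<LINE>What, jealous Oberon! Fairies, skip hence:</LINE>
<LINE>I have forsworn his bed and company.</LINE>
</SPEECH>
</SCENE>
</ACT>
</PLAY>"""

@dataclass
class Speech:
    """One speech, with the context inherited from further up the tree."""
    play: str
    act: str
    scene: str
    speaker: str
    lines: List[str]

def speeches(root):
    """Flatten the play/act/scene hierarchy into a list of Speech objects."""
    play = root.findtext("TITLE", default="")
    result = []
    for act in root.iter("ACT"):
        act_title = act.findtext("TITLE", default="")
        for scene in act.iter("SCENE"):
            scene_title = scene.findtext("TITLE", default="")
            for sp in scene.iter("SPEECH"):
                result.append(Speech(
                    play, act_title, scene_title,
                    sp.findtext("SPEAKER", default=""),
                    ["".join(ln.itertext()) for ln in sp.iter("LINE")],
                ))
    return result

all_speeches = speeches(ET.fromstring(PLAY_XML))
print(len(all_speeches), all_speeches[0].speaker)
```

Each speech carries its play, act and scene along with it, so the hierarchy is no longer needed to interpret any single element of the list.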

R's main role is to supply the range of analysis and visualization tools. In addition, two features of R set the basic approach to the project: R packages and the R session. The package is the natural way to organize a project at this scale. The software we'll look at will be part of the shakespeare package, which includes R and Python software along with the files for the original data and, optionally, for intermediate data forms.

R's package structure allows for essentially arbitrary files of source for whatever languages are used. In our project, one folder contains the XML files for the plays and another the Python source specially written for this package.

The R session provides a continuing computing environment, interacting with packages loaded into the session. We'll use it to organize the data further, so that accessing the plays and iterating over them is quick and simple. For example, the XML files and the interface to Python allow us to compute the data for each play once and cache it, providing rapid access throughout the session.

The XR Approach: Proxy Functions and Classes

The goals of the XR approach to interfaces include generality in the computations supported and simplicity in the user interface. Both goals benefit from the use of proxies: objects in R that are created automatically as references to analogous objects in the server language, here Python. Computations in the server return proxy objects to R.

The R package includes additional proxies for functions and classes in the server language. Users deal with the functions as if they were R functions, supplying arguments that may transparently be either ordinary R objects (which will be converted to server language analogues) or proxies from previous computations.
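The idea can be illustrated with a small conceptual model, written entirely in Python for convenience. It is not the XR implementation: the `Proxy` and `Server` classes and their methods are hypothetical, and the sketch simplifies by returning a proxy for every result. The client side holds only a key and a class name, while the server keeps the real objects and accepts either ordinary values or proxies as arguments.

```python
import itertools

class Proxy:
    """Client-side handle: just a key plus the name of the server class."""
    def __init__(self, key, server_class):
        self.key = key
        self.server_class = server_class

class Server:
    """Toy model of the server language: it owns the real objects."""
    def __init__(self):
        self._objects = {}
        self._ids = itertools.count(1)

    def _store(self, obj):
        # Keep the object on the server; hand back only a reference.
        key = f"obj_{next(self._ids)}"
        self._objects[key] = obj
        return Proxy(key, type(obj).__name__)

    def _resolve(self, arg):
        # Proxies are looked up; ordinary values are used as given.
        return self._objects[arg.key] if isinstance(arg, Proxy) else arg

    def call(self, fun, *args):
        result = fun(*[self._resolve(a) for a in args])
        return self._store(result)

    def get(self, proxy):
        """Convert a proxy back to the actual value."""
        return self._objects[proxy.key]

server = Server()
words = server.call(str.split, "words, words, words")  # ordinary argument
n_words = server.call(len, words)                      # proxy argument

print(words.server_class, server.get(n_words))
```

The second call shows the point of the design: the result of one computation is passed to the next without ever being converted, so arbitrary server-language objects can flow through a sequence of calls.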

johnmchambers/shakespeare documentation built on May 19, 2019, 5:16 p.m.
