This tutorial will walk you through the basics of using the htn
package. The Historical Text Network package builds on the tei2r
's meta-data management funcitonality, using the docList
structure to generate a social network for a given collection (typically, some subset of the EEBO-TCP). A typical use case would be for a scholar to use tei2r
to import all the documents published in a given year, then to use htn
to see how the authors, printers, and booksellers working in that year sorted into communities. Once communities are identified, the publications of different groups can be compared lexically.
The main purpose of historical text network analysis is to bridge the divide between text analysis and social network analysis by comparing lexical features of documents that appear in different communities.
This tutorial will cover the basic steps of building a social-network model using htn
. Keep in mind that htn
is designed to simplify some of the most common operations used with igraph
when interfacing with document collections. Users are certain to encounter situations where they want to take advantage of igraph
functionality that isn't explicitly written into htn
, so htn
data objects are designed not to interfere at all with igraph
's functions. This is good because it means that you can do anything with htn
you can do with really sophisticated software. The downside is that you'll probably want to learn igraph
's syntax, which is considerably more complicated than htn
's. (For a good introduction to igraph
in general, see Gábor Csárdi and Tamás Nepusz, http://igraph.org/c/doc/index.html, but for the R application it's easiest, I think, to consult the documentation directly).
library(igraph) ?igraph #open a help window
docList
object using sample dataWe'll start by building a docList
object using the buidDocList
function from tei2r
. In order to do this, we'll need to load the tei2r
package and call the function with the sample data.
If you haven't installed htn
or tei2r
, they can be downloaded and installed from Github.
devtools::install_github("michaelgavin/tei2r", build_vignettes=T) devtools::install_github("michaelgavin/htn", build_vignettes=T)
After installation, activate the libraries in import some sample data from tei2r
.
library(htn) library(tei2r) data(natlaw)
As you can see, these commands build a docList
of publications having to do with seventeenth-century natural law. All the metadata of the collection is now in a docList
named natlaw
.
Now we can generate a network object and graph with one command:
library(htn) dnet = buildNetwork(natlaw)
The buildNetwork()
function performs several operations at once. First, it creates a new data object, dnet
, that inherits the bibliography of the document collection. Then, using the TCP ID numbers from dl
's index, it matches those numbers against publication data stored in htn
, capturing all of the co-publication events associated with each document.
What is a "co-publication event"? Basically, htn
treats any time two people are involved in the same book as a connection. If I sell a book you wrote, that's a connection. If you print a book someone dedicated to me, that's a connection. In this way, historical text networks model print socialities. That is, they model the social relationships books perform to readers. Sometimes this has weird effects. Many connections occur throughout the network among people who likely never met. The most obvious examples appear when books are published posthumously: the authors get "connected" to printers and booksellers who lived years later, sometimes centuries later. This is why it's best not to think of historical text networks as "social networks" in the colloquial sense of the term. They are models of publication metadata -- of the EEBO-TCP catalogue. They show patterns of publication, not patterns of historical reality as such.
So what's in a docNetwork
object?
The dnet
object just created is an instance of an S4 class docNetwork
object. S4 classes are objects with formalized data structures. They're essentially list
s in R, but unlike regular lists they include rules that say what kinds of data can be stored where. Each item in the list is called a slot
, and each slot has a special function.
igraph
package. These graph
objects themselves have slots with different kinds of data.communities
object (that's igraph
lingo) that sorts the people into groups.Now we've built our docNetwork
object using the natlaw
dataset. htn
has already built the graph, which is accessable through dnet@graph
and detected the communities which are accessable through dnet@communities
.
The igraph
package includes a wide range of community-detection algorithms. By default htn
uses the "walktrap" method, which performs a random walk of several steps along the graph from each node, then groups nodes together that tend to cross each other.
The only downside to working with R or Python for network analysis, compared to Gephi or Pajek, is that it's often harder to browse one's data in command-line interfaces. To make it easier to read the communities
object, htn
includes a hidden function that prints a list of names and communities in R's viewer, rather than in the console. To look at which people are grouped into which communities, just enter this into the command line:
dnet@communities
From here, it's easy to draw a graph using htn
's drawGraph
function. By default, colors are assigned to the communities found when the network is built.
drawGraph(dnet)
This is a small docList
, so the communities are mostly disconnected, but you'll notice that there's one larger community where, if you look closely, you'll see that John Milton sits at a very central position.
To figure out which community Milton belongs in, you can browse through the nodes by pulling up the communities as above:
dnet@communities
Or by performing a search using R's grep
function:
hit = grep("Milton", dnet@communities$name) comm_number = dnet@communities$membership[hit]
These two little lines of code do two things. The first searches through the names to find one that includes the string "Milton" and assigns the row number to an object called hit
. Milton is the sixth person named in the community, so the value of hit
should be 6. The second line looks through the membership affliation and finds the 6th entry -- or, more precisely, it finds the membership number(s) for whatever the value(s) is/are of hit
.
Now we can build a subset of the graph that only includes Milton's community.
subnet = communitySubnetwork(dnet, community = comm_number)
This subnetwork can be plotted, just as before.
drawGraph(subnet)
This subnetwork also has its own index, which can be viewed just as above.
View(subnet@index)
The primary purpose of htn
is to allow this function. By identifying the publications common to a particular subgroup of a larger network, htn
uses the basic operations of social-network analysis to perform structured searches over the EEBO catalogue, and it allows you to use bibliographic metadata and textual data to operate as filters for the network. The larger purpose is to reconcile social histories of book publishing with "distant reading" approaches to the archive.
This introduction provided a brief overview of how to create and draw networks from EEBO metadata.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.