Creating cooccurence networks

library(printr)

In this howto we demonstrate two functions to calculate the co-occurence of words. The first is the coOccurenceNetwork function, which calculates the co-occurence of words within documents based on a document term matrix or tokenlist. The second is the windowedCoOccurenceNetwork, which calculates how often words co-occur within a given word distance based on a tokenlist.

Co-occurence within the same documents

library(semnet)
data(simple_dtm)
dtm

dtm is a document term matrix: the rows represent documents and the columns represent words. Values represent how often a word occured within a document. The co-occurence of words can then be calculated as the number of documents in which two words occur together. This is what the coOccurenceNetwork function does.

g = coOccurenceNetwork(dtm)
plot_semnet(g)

The output of coOccurenceNetwork is a graph from the igraph package. The semnet package offers the plot_semnet() function, which automatically sets various plotting parameters that work well for visualizing moderate sized semantic networks (i.e. below 200 nodes, and depending on the size of the plotting device). More detailed instructions for how the plot_semnet() function can be used are given below.

If a different format is preferred, the information for the nodes and edges can easily be extracted with the get.data.frame() function.

get.data.frame(g, 'vertices') # extract vertices with meta information 
get.data.frame(g, 'edges')    # extract edges with meta information

Finally, note that coOccurenceNetwork can also be given a tokenlist as input, in which case the tokenlist is transformed into a DTM using the tokenlistToDTM() function.

Co-occurence within a given word distance

To demonstrate the windowedCoOccurenceNetwork function we'll use a larger dataset, consisting of the state of the union speeches of Obama and Bush (1090 paragraphs). We'll filter the data on part-of-speech tags to contain only the nouns, names and adjectives.

data(sotu)
sotu.tokens = sotu.tokens[sotu.tokens$pos1 %in% c('N','M','A'),]
head(sotu.tokens)

We are interested in three columns in the sotu.tokens dataframe: The lemma column, which is the lemma of a term (the non-plural basic form of a word). We use this instead of the word because we are interested in the meaning of words, for which it is generally less relevant in what specific form it is used. Thus, we consider the words "responsibility" and "responsibilities" to represent the same meaning. The aid column, which is a unique id for the document, in this case for a paragraph in the SotU speeches. We refer to this as the context in which a word occurs. * The id column, which is the specific location of a term within a context. For example, the first row in sotu.tokens shows that in context 111541965, the term unfinished was the fourth term.

g = windowedCoOccurenceNetwork(location=sotu.tokens$id,
                    term=sotu.tokens$lemma, 
                    context=sotu.tokens$aid,
                    window.size=20)
class(g)
vcount(g)
ecount(g)

The output g is an igraph object---a popular format for representing and working with graph/network data. vcount(g) shows that the number of vertices (i.e. terms) is 3976. ecount(g) shows that the number of edges is 201792.

Naturally, this would not be an easy network to interpret. Therefore, we first filter on the most important vertices and edges. There are several methods to do so (see e.g., [Leydesdorff & Welbers, 2011]{http://arxiv.org/abs/1011.5209}). Here we use backbone extraction, which is a relatively new method (see [Kim & Kim, 2015]{http://jcom.sissa.it/archive/14/01/JCOM_1401_2015_A01}. Essentially, this method filters out edges that are not significant based on an alpha value, which can be interpreted similar to a p-value. To filter out vertices, we lower the alpha to a point where only the specified number of vertices remains.

g_backbone = getBackboneNetwork(g, alpha=0.0001, max.vertices=100)
vcount(g_backbone)
ecount(g_backbone)

Now there are only 100 vertices and 255 edge left. This is a network we can interpret. Let's plot!

plot(g_backbone)

Nice, but still a bit messy. We can take some additional steps to focus the analysis and add additional information. First, we can look only at the largest connected component, thus ignoring small islands of terms such as math and science.

# select only largest connected component
g_backbone = decompose.graph(g_backbone, max.comps=1)[[1]]
plot(g_backbone)

Next, it would be interesting to take into account how often each term occured. This can be visualized by using the frequency of terms to set the sizes of the vertices. Also, we can use colors to indicate different clusters.

The output of the (windowed)coOccurenceNetwork function by default contains the vertex attribute freq, which can be used to set the vertex sizes. To find clusters, several community detection algorithms are available. To use this information for visualization some basic understanding of plotting igraph objects is required, which is out of the scope of this tutorial. We do provide a function named setNetworkAttributes which deals with these and some other visualization attributes.

# add vertex cluster membership based on edge.betweenness.community clustering
V(g_backbone)$cluster = edge.betweenness.community(g_backbone)$membership

g_backbone = setNetworkAttributes(g_backbone, size_attribute=V(g_backbone)$freq, cluster_attribute=V(g_backbone)$cluster)

plot(g_backbone)

Now we have a more focused and informational visualization. We can for instance see several clusters that represent important talking points, such as the health care debate and the issue of nuclear weapons. Also, we see that America is at the center of discussions, in particular in context of economy and the job market.



kasperwelbers/semnet documentation built on May 20, 2019, 7:38 a.m.