Introduction to CorporaCoCo
In CorporaCoCo: Corpora Co-Occurrence Comparison

knitr::opts_chunk$set(
  comment = "#",
    error = FALSE,
     tidy = FALSE,
    cache = FALSE,
 collapse = TRUE)
options(width = 100)

Getting started

Load the library. This is available from CRAN and the latest development versions can be found on GitHub.

library(CorporaCoCo)

corp_text

A passage from Alice's Adventures in Wonderland:

a <- "`But do cats eat bats, I wonder?' And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, `Do cats eat bats? Do cats eat bats?' and sometimes, `Do bats eat cats?' for, you see, as she couldn't answer either question, it didn't much matter which way she put it."

Create a corp_text object:

a_text <- corp_text(a)

We used the default tokenization, so let's look at the type mappings:

corp_type_lookup(a_text)

You can combine corp_text objects:

aa <- "The cat sat on the mat."
bb <- "This dog ate this cat."
aaa <- corp_text(aa)
bbb <- corp_text(bb)
ccc <- list(aaa, bbb)
corp_text_rbindlist(ccc)

corp_concordance

You can look at concordance lines using corp_text objects:

corp_concordance(a_text, span = "4LR", nodes = c("eat"))

If you supply one or more collocates, they will be highlighted and act as a filter:

# filtering for one collocate
y <- corp_concordance(a_text, span = "4LR", nodes = c("eat"), collocates = c("but"))
y

# filtering for two collocates
y <- corp_concordance(a_text, span = "4LR", nodes = c("eat"), collocates = c("but", "and"))
y

If you do not want the collocates to act as a filter, but you just want to highlight them in the full concordance, you can add collocates to the print function:

# rerunning the original concordance without a collocate filter
o <- corp_concordance(a_text, span = "4LR", nodes = c("eat"))

# printing the concordance and adding collocate highlighting
print(o, collocates = c("but"))

Because a corp_concordance object is just a data.table you can sort it, filter it. So in the following example of a slightly longer concordance, we can demonstrate the sorting by 1. the node and 2. the first type to the right of the node ("R1_type" position):

s <- corp_concordance(a_text, span = "4LR", nodes = c("cats", "bats"))
s[order(N, R1_type)]

As with another data.table, you can also save the output to a csv file using write.csv, e.g. write.csv(s, "concordance.csv").

You can sort the concordance using the types or the tokens:

names(y)

and if you want to see the raw data.table:

print(y, as_data_table = TRUE)

corp_surface

This works just like the old surface function but now you must pass it corp_text objects:

a_cooccurs <- corp_surface(a_text, span = '1L1R', nodes = c("alice", "bats", "cats"))

By the way, if you do corp_concordance on a corp_surface object it will use the span, nodes and collocates values used to create the corp_surface object as the default values for corp_concordance:

corp_concordance(a_cooccurs)

corp_coco

The same passage after it has been translated into Finnish and back to English using Google translate:

b <- "`But cats eat bats, I wonder?' And here, Alice began to get pretty sleepy and went on to say to herself, in a dreamlike way: `Are cats eating bats? Are cats eating bats?' And sometimes, `Do cats have to eat cats?' since you see because he could not answer the questions, he did not really matter how he put it."
b_text <- corp_text(b)

b_cooccurs <- corp_surface(b_text, span = '1L1R', nodes = c("alice", "bats", "cats"))
b_cooccurs

cats_and_bats <- corp_coco(a_cooccurs, b_cooccurs, nodes = c("alice", "bats", "cats"), fdr = 1.0)

plot(cats_and_bats)

Complex types

Looks like their is a difference for cats and bats but the text is small so we have no power. Let's have a look at the tokenization:

corp_type_lookup(a_text)

Since cats and bats are both types of mammals maybe we can look at the co-occurences with mammals:

a_tokens <- corp_get_tokens(a_text)
b_tokens <- corp_get_tokens(b_text)

a_tokens$type <- sub("^(cats|bats)$", "MAMMALS", a_tokens$type)
b_tokens$type <- sub("^(cats|bats)$", "MAMMALS", b_tokens$type)

a_text_mammals <- corp_text(a, tokens = a_tokens)
b_text_mammals <- corp_text(b, tokens = b_tokens)

So now we have a new type to token mapping:

corp_type_lookup(a_text_mammals)

and so:

a_cooccurs_mammals <- corp_surface(a_text_mammals, span = '1L1R', nodes = c("alice", "MAMMALS"))
b_cooccurs_mammals <- corp_surface(b_text_mammals, span = '1L1R', nodes = c("alice", "MAMMALS"))

mammals <- corp_coco(a_cooccurs_mammals, b_cooccurs_mammals, nodes = c("alice", "MAMMALS"), fdr = 1.0)