knitr::opts_chunk$set(comment = "#", error = FALSE, tidy = FALSE, cache = FALSE, collapse = TRUE)
options(width = 100)
Load the library. This is available from CRAN and the latest development versions can be found on GitHub.
library(CorporaCoCo)
A passage from Alice's Adventures in Wonderland:
a <- "`But do cats eat bats, I wonder?' And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, `Do cats eat bats? Do cats eat bats?' and sometimes, `Do bats eat cats?' for, you see, as she couldn't answer either question, it didn't much matter which way she put it."
Create a corp_text object:
a_text <- corp_text(a)
We used the default tokenization, so let's look at the type mappings:
corp_type_lookup(a_text)
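To make the distinction between tokens (surface forms) and types (normalized forms) concrete, here is a minimal base R sketch. It is not the package's tokenizer: the regular expression and the lowercasing rule are assumptions chosen only for illustration, and the names txt, tokens, types and lookup are hypothetical.

```r
# Illustrative only: a naive tokenizer, NOT the one used by corp_text().
txt <- "Do cats eat bats? Do bats eat cats?"

# Tokens: runs of letters and apostrophes, kept exactly as they appear.
tokens <- regmatches(txt, gregexpr("[A-Za-z']+", txt))[[1]]

# Types: here assumed to be simply the lowercased tokens.
types <- tolower(tokens)

# One row per distinct (type, token) pair, similar in spirit to a type lookup.
lookup <- unique(data.frame(type = types, token = tokens, stringsAsFactors = FALSE))
lookup
```

Under this assumption "Do" and "do" would share the type "do", which is why the nodes later in this document are given in lower case (for example "alice").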
You can combine corp_text objects:
aa <- "The cat sat on the mat."
bb <- "This dog ate this cat."
aaa <- corp_text(aa)
bbb <- corp_text(bb)
ccc <- list(aaa, bbb)
corp_text_rbindlist(ccc)
You can look at concordance lines using corp_text objects:
corp_concordance(a_text, span = "4LR", nodes = c("eat"))
If you supply one or more collocates, they will be highlighted and act as a filter:
# filtering for one collocate
y <- corp_concordance(a_text, span = "4LR", nodes = c("eat"), collocates = c("but"))
y
# filtering for two collocates
y <- corp_concordance(a_text, span = "4LR", nodes = c("eat"), collocates = c("but", "and"))
y
If you do not want the collocates to act as a filter, but just want to highlight them in the full concordance, you can pass collocates to the print function:
# rerunning the original concordance without a collocate filter
o <- corp_concordance(a_text, span = "4LR", nodes = c("eat"))
# printing the concordance and adding collocate highlighting
print(o, collocates = c("but"))
Because a corp_concordance object is just a data.table, you can sort and filter it. The following example builds a slightly longer concordance and sorts it first by the node and then by the first type to the right of the node (the "R1_type" position):
s <- corp_concordance(a_text, span = "4LR", nodes = c("cats", "bats"))
s[order(N, R1_type)]
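The same sorting and filtering idiom can be tried on a toy table without building a concordance. This is only a sketch: the table toy and its values are hypothetical, and it contains just the two columns, N and R1_type, that the sort above refers to.

```r
library(data.table)

# Hypothetical stand-in for a concordance: only the two columns used for sorting.
toy <- data.table(
  N       = c("cats", "bats", "cats", "bats"),
  R1_type = c("eat", "i", "eat", "do")
)

# Sort by node, then by the first type to the right of the node.
toy_sorted <- toy[order(N, R1_type)]

# Filter with the same bracket syntax, e.g. keep only lines whose node is "cats".
toy_cats <- toy[N == "cats"]

toy_sorted
```

Sorting returns a new table and leaves the original untouched, so the result has to be assigned if you want to keep it.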
As with any other data.table, you can also save the output to a CSV file using write.csv, e.g. write.csv(s, "concordance.csv").
You can sort the concordance using the types or the tokens:
names(y)
and if you want to see the raw data.table:
print(y, as_data_table = TRUE)
This works just like the old surface function, but now you must pass it corp_text objects:
a_cooccurs <- corp_surface(a_text, span = '1L1R', nodes = c("alice", "bats", "cats"))
By the way, if you call corp_concordance on a corp_surface object, it will use the span, nodes and collocates values that were used to create the corp_surface object as the default values for corp_concordance:
corp_concordance(a_cooccurs)
The same passage after it has been translated into Finnish and back into English using Google Translate:
b <- "`But cats eat bats, I wonder?' And here, Alice began to get pretty sleepy and went on to say to herself, in a dreamlike way: `Are cats eating bats? Are cats eating bats?' And sometimes, `Do cats have to eat cats?' since you see because he could not answer the questions, he did not really matter how he put it."
b_text <- corp_text(b)
b_cooccurs <- corp_surface(b_text, span = '1L1R', nodes = c("alice", "bats", "cats"))
b_cooccurs
cats_and_bats <- corp_coco(a_cooccurs, b_cooccurs, nodes = c("alice", "bats", "cats"), fdr = 1.0)
plot(cats_and_bats)
It looks like there is a difference for cats and bats, but the text is so small that we have very little statistical power. Let's have a look at the tokenization:
corp_type_lookup(a_text)
Since cats and bats are both mammals, maybe we can look at the co-occurrences with mammals:
a_tokens <- corp_get_tokens(a_text)
b_tokens <- corp_get_tokens(b_text)
a_tokens$type <- sub("^(cats|bats)$", "MAMMALS", a_tokens$type)
b_tokens$type <- sub("^(cats|bats)$", "MAMMALS", b_tokens$type)
a_text_mammals <- corp_text(a, tokens = a_tokens)
b_text_mammals <- corp_text(b, tokens = b_tokens)
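The ^ and $ anchors in the pattern above matter, because they restrict the replacement to types that are exactly "cats" or "bats". A small base R sketch on a hypothetical vector of types (the vector and variable names are made up for illustration) shows what happens without them:

```r
# Hypothetical types, including one that merely contains "bats".
types <- c("cats", "bats", "eat", "wombats", "cat")

# Anchored: only whole-string matches are replaced.
anchored <- sub("^(cats|bats)$", "MAMMALS", types)

# Unanchored: any type containing "cats" or "bats" is altered.
unanchored <- sub("(cats|bats)", "MAMMALS", types)

anchored
unanchored
```

With the anchored pattern "wombats" is left alone, whereas the unanchored pattern rewrites part of it.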
So now we have a new type to token mapping:
corp_type_lookup(a_text_mammals)
and so:
a_cooccurs_mammals <- corp_surface(a_text_mammals, span = '1L1R', nodes = c("alice", "MAMMALS"))
b_cooccurs_mammals <- corp_surface(b_text_mammals, span = '1L1R', nodes = c("alice", "MAMMALS"))
mammals <- corp_coco(a_cooccurs_mammals, b_cooccurs_mammals, nodes = c("alice", "MAMMALS"), fdr = 1.0)
plot(mammals)