knitr::opts_chunk$set( collapse = FALSE, comment = "##" )
quanteda has the functionality to select, remove or compound multi-word expressions, such as phrasal verbs ("try on", "wake up" etc.) and place names ("New York", "South Korea" etc.).
library(quanteda)
toks <- tokens(data_corpus_inaugural)
Functions for tokens objects take a character vector, a dictionary or collocations as pattern. All three can be used for multi-word expressions, but you have to be aware of their differences.
The most basic way to define multi-word expressions is to separate the words with whitespace and wrap the character vector in phrase().
multiword <- c("United States", "New York")
kwic() is useful for finding multi-word expressions in tokens. If you are not sure whether "United" and "States" are separate tokens, check their positions (e.g. "434:435").
head(kwic(toks, pattern = phrase(multiword)))
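To see why phrase() matters, here is a minimal sketch on a made-up sentence (toks_demo, plain_hits and phrase_hits are illustrative names, not part of the original example). It assumes that an unwrapped pattern containing whitespace is compared against individual tokens and therefore finds nothing, while the phrase() version is matched as a sequence of adjacent tokens.

```r
library(quanteda)

# Toy sentence, made up for illustration
toks_demo <- tokens("The United States is south of Canada.")

# Without phrase(): "United States" is compared against single tokens,
# so it is not expected to match anything
plain_hits <- kwic(toks_demo, pattern = "United States")

# With phrase(): the pattern is split on whitespace and matched as a
# sequence of two adjacent tokens
phrase_hits <- kwic(toks_demo, pattern = phrase("United States"))

nrow(plain_hits)
nrow(phrase_hits)
```

Comparing the two row counts is a quick way to check whether a pattern is being treated as a sequence.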
Similarly, you can select or remove multi-word expressions using tokens_select().
head(tokens_select(toks, pattern = phrase(multiword)))
tokens_compound() joins the elements of multi-word expressions with an underscore, so they become "United_States" and "New_York".
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))
head(tokens_select(comp_toks, pattern = c("United_States", "New_York")))
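The effect of compounding is easier to inspect on a short text. The sketch below uses a made-up sentence and illustrative object names (toks_demo, comp_default, comp_plus). It also shows the concatenator argument of tokens_compound(), which the text above does not mention, as one way to join the elements with a character other than the underscore.

```r
library(quanteda)

# Toy sentence, made up for illustration
toks_demo <- tokens("She moved from New York to South Korea.")
places <- phrase(c("New York", "South Korea"))

# Default behaviour: elements are joined by an underscore
comp_default <- tokens_compound(toks_demo, pattern = places)

# Same compounding, joined by "+" instead
comp_plus <- tokens_compound(toks_demo, pattern = places, concatenator = "+")

as.character(comp_default)
as.character(comp_plus)
```

Whichever concatenator you choose, later patterns have to use the same character (for example "New+York" rather than "New_York").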
In a dictionary, the elements of multi-word expressions should also be separated by whitespace, but you do not use phrase() here.
dict_multiword <- dictionary(list(country = "United States", city = "New York"))
head(tokens_lookup(toks, dictionary = dict_multiword))
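As a small illustration of the dictionary approach, the sketch below applies the same kind of dictionary to a made-up sentence (toks_demo, dict_demo and looked_up are illustrative names). It assumes the default behaviour of tokens_lookup(), in which matched sequences are replaced by their dictionary keys and unmatched tokens are dropped.

```r
library(quanteda)

# Toy sentence, made up for illustration
toks_demo <- tokens("New York is the largest city in the United States.")

# Multi-word values are plain whitespace-separated strings, no phrase()
dict_demo <- dictionary(list(country = "United States", city = "New York"))

# Each matched sequence is replaced by its dictionary key
looked_up <- tokens_lookup(toks_demo, dictionary = dict_demo)
as.character(looked_up)
```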
With textstat_collocations(), it is possible to discover multi-word expressions through statistical scoring of the associations between adjacent words.
If textstat_collocations() is applied to a tokens object composed only of capitalized words, it usually returns multi-word proper names.
library("quanteda.textstats")
col <- toks |>
  tokens_remove(stopwords("en")) |>
  tokens_select(pattern = "^[A-Z]", valuetype = "regex", case_insensitive = FALSE, padding = TRUE) |>
  textstat_collocations(min_count = 5, tolower = FALSE)
head(col)
Collocations are automatically recognized as multi-word expressions by tokens_compound() in case-sensitive fixed pattern matching. This is the fastest way to compound large numbers of multi-word expressions, but make sure to set tolower = FALSE in textstat_collocations() for this to work.
comp_toks2 <- tokens_compound(toks, pattern = col)
head(kwic(comp_toks2, pattern = c("United_States", "New_York")))
You can use phrase() on collocations if more flexibility is needed. This is usually the case when you compound tokens from a different corpus.
comp_toks3 <- tokens_compound(toks, pattern = phrase(col$collocation))
head(kwic(comp_toks3, pattern = c("United_States", "New_York")))
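One kind of flexibility that phrase() allows is choosing which collocations to compound. The sketch below keeps only collocations whose z statistic exceeds a cutoff before compounding. The cutoff of 3 and the names col_demo, strong and comp_strong are assumptions made for illustration, not recommendations from the original text, and the pipeline is written without the pipe operator only to keep the block self-contained.

```r
library(quanteda)
library(quanteda.textstats)

toks_demo <- tokens(data_corpus_inaugural)

# Keep capitalized non-stopwords, padding removed positions so that
# only originally adjacent words are scored together
caps <- tokens_remove(toks_demo, stopwords("en"))
caps <- tokens_select(caps, pattern = "^[A-Z]", valuetype = "regex",
                      case_insensitive = FALSE, padding = TRUE)
col_demo <- textstat_collocations(caps, min_count = 5, tolower = FALSE)

# Hypothetical cutoff: keep collocations with a z statistic above 3
strong <- col_demo$collocation[col_demo$z > 3]

# phrase() turns the strings into sequence patterns; matching is kept
# case-sensitive here to mirror the collocations-based behaviour
comp_strong <- tokens_compound(toks_demo, pattern = phrase(strong),
                               case_insensitive = FALSE)
head(kwic(comp_strong, pattern = "United_States"))
```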