View source: R/textstat_collocations.R
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
textstat_collocations(
  x,
  method = "lambda",
  size = 2,
  min_count = 2,
  smoothing = 0.5,
  tolower = TRUE,
  ...
)

is.collocations(x)
x | a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with
method | association measure for detecting collocations. Currently this is limited to
size | integer; the length of the collocations to be scored
min_count | numeric; minimum frequency of collocations that will be scored
smoothing | numeric; a smoothing parameter added to the observed counts (default is 0.5)
tolower | logical; if
... | additional arguments passed to
Documents are grouped for the purposes of scoring, but collocations will not span sentences.
If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE) so that counts will still be accurate, but the pads will prevent those collocations from being scored.
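The effect of padding can be illustrated with a minimal base R sketch. It does not call quanteda; the helper adjacent_bigrams() and the two-word stopword list are hypothetical names used only for illustration, and an empty string stands in for a pad.

```r
# Base R sketch (no quanteda calls): why padding matters when tokens are removed.
toks <- c("the", "united", "states", "of", "america")
stopw <- c("the", "of")  # hypothetical stopword list for illustration

# Hypothetical helper: form adjacent bigrams, skipping any pair that touches a pad ("")
adjacent_bigrams <- function(tokens) {
  n <- length(tokens)
  if (n < 2) return(character(0))
  first <- tokens[-n]
  second <- tokens[-1]
  keep <- first != "" & second != ""
  paste(first[keep], second[keep])
}

# Without padding: removed tokens vanish, so "states" and "america" look adjacent
no_pad <- toks[!toks %in% stopw]
bigrams_no_pad <- adjacent_bigrams(no_pad)

# With padding: each removed token leaves an empty placeholder that blocks the pair
with_pad <- ifelse(toks %in% stopw, "", toks)
bigrams_with_pad <- adjacent_bigrams(with_pad)

print(bigrams_no_pad)
print(bigrams_with_pad)
```

Without pads, the pair "states america" would be counted as a candidate collocation even though the two words were never adjacent in the original text; with pads, only "united states" remains.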
The lambda computed for a size = K-word target multi-word expression is the coefficient for the K-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), except that all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald z-statistic, computed as the quotient of lambda and its estimated standard error, as described below.
In detail:
Consider a K-word target expression x, and let z be any K-word expression. Define a comparison function c(x,z)=(j_{1}, …, j_{K})=c such that the kth element of c is 1 if the kth word in z is equal to the kth word in x, and 0 otherwise. Let c_{i}=(j_{i1}, …, j_{iK}), i=1, …, 2^{K}=M, be the possible values of c(x,z), with c_{M}=(1,1, …, 1). Consider the set of c(x,z_{r}) across all expressions z_{r} in a corpus of text, and let n_{i}, for i=1,…,M, denote the number of the c(x,z_{r}) which equal c_{i}, plus the smoothing constant smoothing. The n_{i} are the counts in a 2^{K} contingency table whose dimensions are defined by the c_{i}.
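For K = 2, the construction of the n_{i} can be sketched in base R as follows. This is an illustrative sketch only, not the package's implementation: the toy token sequence and all variable names are assumptions, and sentence boundaries, padding, and min_count are ignored.

```r
# Base R sketch of the comparison function c(x, z) for K = 2.
# All names here are illustrative and not part of quanteda.
tokens <- c("a", "b", "c", "a", "b", "d", "a", "c")
x <- c("a", "b")   # target expression
smoothing <- 0.5   # smoothing constant added to every cell

# All adjacent 2-word expressions z_r in the token sequence
n_tok <- length(tokens)
z1 <- tokens[-n_tok]  # first word of each z_r
z2 <- tokens[-1]      # second word of each z_r

# c(x, z_r): element k is 1 if the kth word of z_r equals the kth word of x
j1 <- as.integer(z1 == x[1])
j2 <- as.integer(z2 == x[2])

# Tabulate the 2^K = 4 match patterns and add the smoothing constant
counts <- table(factor(j1, levels = 0:1), factor(j2, levels = 0:1)) + smoothing
print(counts)
```

Rows index whether the first word matches, columns whether the second word matches, so the cell for c_{M} = (1, 1) is counts["1", "1"].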
λ: The K-way interaction parameter in the saturated log-linear model fitted to the n_{i}. It can be calculated as
\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}
where b_{i} is the number of the elements of c_{i} which are equal to 1.
z: The Wald test z-statistic, calculated as
z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}
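The two formulas can be checked numerically with a short base R sketch. The function name lambda_z() and the example counts are assumptions made for illustration; the function expects a K-dimensional array of smoothed counts in which index 1 of each dimension means "no match" and index 2 means "match".

```r
# Base R sketch: lambda and the Wald z-statistic from smoothed counts n_i.
# lambda_z() is a hypothetical helper, not a quanteda function.
lambda_z <- function(n) {
  K <- length(dim(n))
  # One row per cell; subtracting 1 turns array indices (1/2) into match flags (0/1)
  idx <- arrayInd(seq_along(n), dim(n)) - 1L
  b <- rowSums(idx)                 # b_i: number of matching positions in cell i
  n_vec <- as.vector(n)
  lambda <- sum((-1)^(K - b) * log(n_vec))
  z <- lambda / sqrt(sum(1 / n_vec))
  c(lambda = lambda, z = z)
}

# Example 2 x 2 table (K = 2), filled column-wise:
# n[1,1] = 4.5 (no match, no match), n[2,1] = 1.5, n[1,2] = 0.5, n[2,2] = 2.5 (both match)
n <- array(c(4.5, 1.5, 0.5, 2.5), dim = c(2, 2))
res <- lambda_z(n)
print(res)
```

For K = 2, lambda reduces to the log odds ratio of the table, here log((4.5 * 2.5) / (1.5 * 0.5)) = log(15).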
textstat_collocations returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and λ and z statistics. When size is a vector, count_nested counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).
is.collocations returns TRUE if the object is of class collocations, FALSE otherwise.
This function is under active development, with more measures to be added in the next release of quanteda.
Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe
Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
case_insensitive = FALSE, padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)
# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
"a b . . a b . . a b . . a b . a b",
"b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)