
textstat_collocations | R Documentation |

Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.

```
textstat_collocations(
  x,
  method = "lambda",
  size = 2,
  min_count = 2,
  smoothing = 0.5,
  tolower = TRUE,
  ...
)
```

`x` |
a character, corpus, or tokens object whose collocations will be
scored. The tokens object should include punctuation, and if any words
have been removed, these should have been removed with |

`method` |
association measure for detecting collocations. Currently this
is limited to |

`size` |
integer; the length of the collocations to be scored |

`min_count` |
numeric; minimum frequency of collocations that will be scored |

`smoothing` |
numeric; a smoothing parameter added to the observed counts (default is 0.5) |

`tolower` |
logical; if |

`...` |
additional arguments passed to |

Documents are grouped for the purposes of scoring, but collocations will not
span sentences. If `x` is a tokens object and some tokens have been
removed, this should be done using `tokens_remove(x, pattern, padding = TRUE)`
so that counts will still be accurate, but the pads will prevent those
collocations from being scored.

The `lambda` computed for a size = `K`-word target multi-word expression
is the coefficient for the `K`-way interaction parameter in the saturated
log-linear model fitted to the counts of the terms forming the set of
eligible multi-word expressions. This is the same as the "lambda" computed in
Blaheta and Johnson (2001), except that here all multi-word expressions are
considered (rather than just verbs, as in that paper). The `z` is the Wald
`z`-statistic, computed as the quotient of `lambda` and its estimated
standard error, as described below.

In detail:

Consider a `K`-word target expression `x`, and let `z` be any
`K`-word expression. Define a comparison function
`c(x,z)=(j_{1}, \dots, j_{K})=c` such that the `k`th element of `c` is 1 if the
`k`th word in `z` is equal to the `k`th word in `x`, and 0
otherwise. Let `c_{i}=(j_{i1}, \dots, j_{iK})`, `i=1, \dots, 2^{K}=M`,
be the possible values of `c(x,z)`, with `c_{M}=(1,1, \dots, 1)`.
Consider the set of `c(x,z_{r})` across all expressions
`z_{r}` in a corpus of text, and let `n_{i}`, for `i=1,\dots,M`,
denote the number of the `c(x,z_{r})` which equal `c_{i}`, plus the
smoothing constant `smoothing`. The `n_{i}` are the counts in a
`2^{K}` contingency table whose dimensions are defined by the
`c_{i}`.

`\lambda`: The `K`-way interaction parameter in the saturated
log-linear model fitted to the `n_{i}`. It can be calculated as

`\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}`

where `b_{i}` is the number of the elements of `c_{i}` which are
equal to 1.
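As a worked special case (a sketch that follows directly from the formula for `\lambda`, with cell labels introduced here only for illustration), take `K = 2`, so that `M = 4`. Writing `n_{j_1 j_2}` for the smoothed count of the comparison pattern `(j_1, j_2)`, the values of `b_{i}` are 0, 1, 1, and 2, and the sum reduces to the log odds ratio of the 2 x 2 table:

```latex
\lambda = \log n_{00} - \log n_{01} - \log n_{10} + \log n_{11}
        = \log \frac{n_{11}\, n_{00}}{n_{10}\, n_{01}}
```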

Wald test `z`-statistic: `z` is calculated as:

`z = \frac{\lambda}{[\sum_{i=1}^{M} n_{i}^{-1}]^{1/2}}`
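The two formulas can be checked by hand in base R. The following is a minimal sketch, not the package's implementation: `lambda_z` is a hypothetical helper, the counts are made up for illustration, and the table is assumed to be stored as a `2^K` array in which index 1 means "no match" and index 2 means "match" in each position.

```r
# Hypothetical helper (not part of quanteda): computes lambda and z from a
# 2^K array of raw counts. In each of the K dimensions, index 1 = "no match"
# and index 2 = "match" with the corresponding word of the target expression.
lambda_z <- function(counts, smoothing = 0.5) {
  K <- length(dim(counts))
  stopifnot(K >= 1, all(dim(counts) == 2))
  n <- counts + smoothing                      # smoothed counts n_i
  idx <- arrayInd(seq_along(n), dim(counts))   # one row per cell, entries in {1, 2}
  b <- rowSums(idx == 2)                       # b_i: number of matching positions
  lambda <- sum((-1)^(K - b) * log(n))         # K-way interaction parameter
  z <- lambda / sqrt(sum(1 / n))               # Wald z-statistic
  c(lambda = lambda, z = z)
}

# Made-up counts for a two-word target expression "a b" (illustration only)
counts <- array(c(90, 4, 3, 10), dim = c(2, 2),
                dimnames = list(first = c("other", "a"),
                                second = c("other", "b")))
res <- lambda_z(counts)
print(res)
```

Because the helper only relies on the array's dimensions, the same call should work for a `2 x 2 x 2` array when `K = 3`.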

`textstat_collocations` returns a data.frame of collocations and
their scores and statistics. This consists of the collocations, their
counts, length, and `\lambda` and `z` statistics. When `size` is a
vector, then `count_nested` counts the lower-order collocations that occur
within a higher-order collocation (but this does not affect the
statistics).

Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe

Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

```
library("quanteda")
corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)
# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
"a b . . a b . . a b . . a b . a b",
"b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)
# compounding tokens from collocations
toks <- tokens("This is the European Union.")
colls <- tokens("The new European Union is not the old European Union.") %>%
    textstat_collocations(size = 2, min_count = 1, tolower = FALSE)
colls
tokens_compound(toks, colls, case_insensitive = FALSE)
# from a collocations object
(coll <- textstat_collocations(tokens("a b c a b d e b d a b")))
phrase(coll)
```
