This extracts words occurring in the neighbourhood of one another, within a certain window range.
The default setting provides the biterms used when fitting
BTM with the default window parameter.
## S3 method for class 'data.frame'
terms(x, type = c("tokens", "biterms"), window = 15, ...)
x: a tokenised data frame containing one row per token with 2 columns (in the example, doc_id and lemma)
type: a character string, either 'tokens' or 'biterms'. Defaults to 'tokens'.
window: integer with the window size for biterm extraction. Defaults to 15.
Depending on whether type is set to 'tokens' or 'biterms', the following is returned:
type='tokens': a list containing 2 elements:
n which indicates the number of tokens
tokens which is a data.frame with columns id, token and freq,
indicating for all tokens found in the data the frequency of occurrence
type='biterms': a list containing 2 elements:
n which indicates the number of biterms used to train the model
biterms which is a data.frame with columns term1 and term2,
indicating all biterms found in the data. The same biterm combination can occur several times.
Note that a biterm is unordered; in the output of
type='biterms', term1 is always smaller than or equal to term2.
If x is a data.frame which has an attribute called 'terms', that attribute is simply returned.
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
biterms <- terms(x, window = 15, type = "biterms")
str(biterms)
tokens <- terms(x, type = "tokens")
str(tokens)
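To make the window and the unordered-biterm convention more concrete, here is a minimal base R sketch of window-based biterm extraction. It is not the package's implementation: the helper name window_biterms and the toy data are hypothetical, the rule that two tokens form a biterm when their positions in the same document are less than window apart is an assumption, and term1 and term2 are ordered alphabetically here only to illustrate that term1 is never larger than term2.

```r
# Hypothetical helper, not part of the BTM package.
# Assumption: tokens i and j (i < j) of the same document form a biterm
# when j - i < window.
window_biterms <- function(x, window = 15) {
  stopifnot(is.data.frame(x), ncol(x) >= 2, window >= 1)
  # Column 1 is taken as the document identifier, column 2 as the token.
  docs <- split(as.character(x[[2]]), x[[1]])
  pairs <- lapply(docs, function(tokens) {
    n <- length(tokens)
    if (n < 2 || window < 2) return(NULL)
    out <- vector("list", n - 1)
    for (i in seq_len(n - 1)) {
      last <- min(i + window - 1, n)
      partner <- tokens[(i + 1):last]
      # A biterm is unordered: store the smaller term first.
      out[[i]] <- data.frame(term1 = pmin(tokens[i], partner),
                             term2 = pmax(tokens[i], partner),
                             stringsAsFactors = FALSE)
    }
    do.call(rbind, out)
  })
  pairs <- Filter(Negate(is.null), pairs)
  biterms <- if (length(pairs) > 0) {
    do.call(rbind, unname(pairs))
  } else {
    data.frame(term1 = character(0), term2 = character(0),
               stringsAsFactors = FALSE)
  }
  rownames(biterms) <- NULL
  list(n = nrow(biterms), biterms = biterms)
}

# Toy input with the same two-column layout as the example above.
toy <- data.frame(
  doc_id = c("d1", "d1", "d1", "d1", "d2", "d2"),
  lemma  = c("hotel", "room", "clean", "room", "metro", "close"),
  stringsAsFactors = FALSE
)
res <- window_biterms(toy, window = 3)
str(res)
```

As in the value described above, the same combination can appear more than once in this sketch, and pairs never cross document boundaries.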