terms.data.frame | R Documentation |
Extracts words that occur near one another, within a given window of tokens.
With type = 'biterms' and the default window, this returns the biterms used when fitting BTM
with its default window parameter.
## S3 method for class 'data.frame'
terms(x, type = c("tokens", "biterms"), window = 15, ...)
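x | a tokenised data frame containing one row per token with 2 columns: an identifier of the context (for example a document id) and the token, as with the doc_id and lemma columns in the example |
type | a character string, either 'tokens' or 'biterms'. Defaults to 'tokens'. |
window | integer with the window size for biterm extraction. Defaults to 15. |
... | not used |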
Depending on whether type is set to 'tokens' or 'biterms', the following is returned:
If type='tokens': a list containing 2 elements:
n, which indicates the number of tokens
tokens, which is a data.frame with columns id, token and freq, giving the frequency of occurrence of every token found in the data
If type='biterms': a list containing 2 elements:
n, which indicates the number of biterms used to train the model
biterms, which is a data.frame with columns term1 and term2, listing all biterms found in the data. The same biterm combination can occur several times.
Note that a biterm is unordered: in the output of type='biterms', term1 is always smaller than or equal to term2.
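To make the structure of the 'biterms' result concrete, here is a minimal base R sketch of windowed biterm extraction. It is not the package's implementation: `sketch_biterms` is a hypothetical helper, and both the window rule (two tokens are paired when they are fewer than `window` positions apart within the same document) and the use of alphabetical order to decide which term is "smaller" are assumptions made for illustration.

```r
# Hypothetical helper, not part of BTM: pairs every token with the tokens
# that follow it in the same document, fewer than `window` positions away.
sketch_biterms <- function(x, window = 15) {
  term1 <- character(0)
  term2 <- character(0)
  # First column of x identifies the document, second column holds the token.
  for (tok in split(as.character(x[[2]]), x[[1]])) {
    n <- length(tok)
    for (i in seq_len(n)) {
      last <- min(i + window - 1, n)
      if (last > i) {
        j <- (i + 1):last
        # A biterm is unordered, so store the smaller term first
        # (alphabetical order is an assumption of this sketch).
        term1 <- c(term1, pmin(tok[i], tok[j]))
        term2 <- c(term2, pmax(tok[i], tok[j]))
      }
    }
  }
  biterms <- data.frame(term1 = term1, term2 = term2, stringsAsFactors = FALSE)
  list(n = nrow(biterms), biterms = biterms)
}

toy <- data.frame(doc_id = c("d1", "d1", "d1", "d2", "d2"),
                  token  = c("b", "a", "c", "a", "b"),
                  stringsAsFactors = FALSE)
res <- sketch_biterms(toy, window = 2)
str(res)
```

With a window of 2 only adjacent tokens are paired, so the toy data gives three biterms, and the pair (a, b) appears twice because it occurs in both documents.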
If x is a data.frame which has an attribute called 'terms', that 'terms' attribute is returned as is.
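The attribute shortcut can be pictured with base R's `attr()`. The snippet below is only a sketch: the precomputed list is made up for illustration, and `terms_or_attribute` is a hypothetical stand-in for the check described above, not the package's code.

```r
# Made-up precomputed result attached to a tokenised data frame.
x <- data.frame(doc_id = c("d1", "d1"),
                token  = c("a", "b"),
                stringsAsFactors = FALSE)
precomputed <- list(n = 1L,
                    biterms = data.frame(term1 = "a", term2 = "b",
                                         stringsAsFactors = FALSE))
attr(x, "terms") <- precomputed

# Hypothetical stand-in: return the 'terms' attribute when it is present,
# otherwise signal that the terms would have to be computed.
terms_or_attribute <- function(x) {
  cached <- attr(x, "terms")
  if (!is.null(cached)) return(cached)
  stop("no 'terms' attribute; terms would have to be computed from x")
}

out <- terms_or_attribute(x)
```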
See also: BTM, predict.BTM, logLik.BTM
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
biterms <- terms(x, window = 15, type = "biterms")
str(biterms)
tokens <- terms(x, type = "tokens")
str(tokens)