terms.data.frame: Get the set of Biterms from a tokenised data frame

terms.data.frame    R Documentation

Get the set of Biterms from a tokenised data frame

Description

This extracts words occurring in the neighbourhood of one another, within a certain window range. With type = 'biterms' and the default window, it provides the biterms used when fitting BTM with the default window parameter.

Usage

## S3 method for class 'data.frame'
terms(x, type = c("tokens", "biterms"), window = 15, ...)

Arguments

x

a tokenised data frame containing one row per token with 2 columns

  • the first column is a context identifier (e.g. a tweet id, a document id, a sentence id, an identifier of a survey answer, an identifier of a part of a text)

  • the second column is a column of type character containing the sequence of words occurring within the context identifier
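As an illustration of the expected input shape, the following base R sketch builds such a data frame from two made-up sentences. The column names doc_id and token, the example texts, and the naive whitespace tokenisation are assumptions for illustration only; any tokeniser that yields one row per token should serve equally well.

```r
# Two made-up documents (illustrative data, not taken from the package)
texts <- c(d1 = "cats chase mice", d2 = "dogs chase cats")

# Naive whitespace tokenisation; a real pipeline would typically use a proper tokeniser
tokens <- strsplit(texts, split = " ", fixed = TRUE)

# One row per token: first column = context identifier, second column = word
x <- data.frame(
  doc_id = rep(names(tokens), times = lengths(tokens)),
  token  = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)
str(x)
```

A data frame of this shape is what the x argument is described as expecting.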

type

a character string, either 'tokens' or 'biterms'. Defaults to 'tokens'.

window

integer with the window size for biterm extraction. Defaults to 15.

...

not used

Value

Depending on whether type is set to 'tokens' or 'biterms', the following is returned:

  • If type='tokens': a list containing 2 elements:

    • n which indicates the number of tokens

    • tokens which is a data.frame with columns id, token and freq, indicating for all tokens found in the data the frequency of occurrence

  • If type='biterms': a list containing 2 elements:

    • n which indicates the number of biterms used to train the model

    • biterms which is a data.frame with columns term1 and term2, indicating all biterms found in the data. The same biterm combination can occur several times.

    Note that a biterm is unordered: in the output of type='biterms', term1 is always smaller than or equal to term2.
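To make the shape of the 'biterms' result more concrete, here is a minimal base R sketch of window-based pair extraction. It is not the package's implementation: biterms_sketch is a hypothetical helper, the window boundary (pairing words whose positions differ by less than window) is an assumption, and ordering term1 and term2 alphabetically is an illustrative choice that may differ from how the package orders them.

```r
# Toy tokenised data frame: context identifier first, word second
x <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("b", "a", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, NOT the BTM implementation
biterms_sketch <- function(x, window = 15L) {
  term1 <- character(0)
  term2 <- character(0)
  # Pairs are only formed within one context identifier
  for (words in split(x[[2]], x[[1]])) {
    n <- length(words)
    for (i in seq_len(n)) {
      # Assumed boundary: partner positions i+1 .. i+window-1
      upper <- min(i + window - 1L, n)
      if (upper <= i) next
      for (j in (i + 1L):upper) {
        # A biterm is unordered, so store the pair in sorted order
        pair <- sort(c(words[i], words[j]))
        term1 <- c(term1, pair[1])
        term2 <- c(term2, pair[2])
      }
    }
  }
  biterms <- data.frame(term1 = term1, term2 = term2, stringsAsFactors = FALSE)
  list(n = nrow(biterms), biterms = biterms)
}

narrow <- biterms_sketch(x, window = 2)   # adjacent words only, under the assumed boundary
wide   <- biterms_sketch(x, window = 15)  # every pair within these short documents
str(narrow)
str(wide)
```

In this toy data the pair of "a" and "b" is produced once per document, which illustrates why the same biterm combination can occur several times in the result.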

Note

If x is a data.frame which has an attribute called 'terms', the function simply returns that 'terms' attribute.
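The behaviour described in this note can be pictured with a small base R sketch. The function terms_or_cached below is a hypothetical stand-in written only to illustrate the idea of returning a precomputed attribute; it is not the package's code, and the toy data and precomputed list are made up.

```r
# Toy tokenised data frame (made-up data)
x <- data.frame(doc_id = c("d1", "d1"), token = c("a", "b"),
                stringsAsFactors = FALSE)

# A precomputed result attached as the 'terms' attribute (illustrative values)
precomputed <- list(
  n = 1L,
  biterms = data.frame(term1 = "a", term2 = "b", stringsAsFactors = FALSE)
)
attr(x, "terms") <- precomputed

# Hypothetical stand-in for the check described in the note
terms_or_cached <- function(x) {
  cached <- attr(x, "terms", exact = TRUE)
  if (!is.null(cached)) return(cached)
  stop("no 'terms' attribute; the real method would compute the result here")
}

result <- terms_or_cached(x)
str(result)
```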

See Also

BTM, predict.BTM, logLik.BTM

Examples


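library(BTM)     # provides the terms method for tokenised data frames
library(udpipe)  # provides the annotated example data
data("brussels_reviews_anno", package = "udpipe")
## Keep the Dutch reviews and restrict to nouns
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
## First column: context identifier, second column: token
x <- x[, c("doc_id", "lemma")]
## Biterms within a window of 15
biterms <- terms(x, window = 15, type = "biterms")
str(biterms)
## Token frequencies
tokens <- terms(x, type = "tokens")
str(tokens)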


BTM documentation built on Feb. 16, 2023, 10:14 p.m.