strsplit.data.frame: Obtain a tokenised data frame by splitting text alongside a...

View source: R/utils.R

strsplit.data.frame    R Documentation

Obtain a tokenised data frame by splitting text on a regular expression

Description

Obtain a tokenised data frame by splitting the text in a column on a regular expression. This is the inverse operation of paste.data.frame.

Usage

strsplit.data.frame(
  data,
  term,
  group,
  split = "[[:space:][:punct:][:digit:]]+",
  ...
)

Arguments

data

a data.frame or data.table

term

a string with the name of the column in data whose text you want to split into tokens

group

a string with a column name or a character vector of column names from data indicating identifiers of groups. The text in term will be split into tokens by group.

split

a regular expression indicating how to split the term column. Defaults to splitting by spaces, punctuation symbols or digits. This will be passed on to strsplit.

...

further arguments passed on to strsplit
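
Because split and the further arguments are handed to strsplit, their effect can be previewed with base R alone. The sketch below is not part of the package; the sample strings are made up for illustration. It shows which characters the default pattern treats as delimiters, and how fixed = TRUE changes the interpretation of split.

```r
# Default pattern from the Usage section: runs of whitespace,
# punctuation or digits act as a single delimiter.
default_split <- "[[:space:][:punct:][:digit:]]+"

# Hypothetical sample text; the digit 5 is a delimiter, so it is dropped.
text <- "Great flat, 5 minutes from the station!"
tokens <- strsplit(text, split = default_split)[[1]]
print(tokens)

# With fixed = TRUE the split value is matched literally, not as a regex.
version <- "1.2.3"
literal_pieces <- strsplit(version, split = ".", fixed = TRUE)[[1]]
print(literal_pieces)

# Without fixed = TRUE, "." is a regex matching any character,
# so every character is consumed and only empty strings remain.
regex_pieces <- strsplit(version, split = ".")[[1]]
print(regex_pieces)
```

Note that digits are delimiters under the default pattern, so numeric tokens disappear; supply a different split if numbers should be kept.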

Value

A tokenised data frame containing one row per token.
The data.frame has the columns named in group and term; the text in column term is split into tokens by the provided regular expression.
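
To make the shape of the result concrete, here is a minimal base R sketch of a one-row-per-token layout. It is an illustration under assumptions, not the package's implementation: the reviews data frame is made up, and the result is built by hand with strsplit, rep and unlist.

```r
# Hypothetical input: one row per document.
reviews <- data.frame(
  id = c(1L, 2L),
  feedback = c("Nice place", "Very clean room"),
  stringsAsFactors = FALSE
)

# Split each text into tokens; the result is a list with one element per row.
pieces <- strsplit(reviews$feedback, split = "[[:space:][:punct:][:digit:]]+")

# Repeat each group identifier once per token and flatten the tokens,
# giving one row per token with the group and term columns.
tokenised <- data.frame(
  id = rep(reviews$id, times = lengths(pieces)),
  feedback = unlist(pieces),
  stringsAsFactors = FALSE
)
print(tokenised)
```

The term column keeps its original name (feedback here) but now holds a single token per row.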

See Also

paste.data.frame, strsplit

Examples

library(udpipe)
data(brussels_reviews, package = "udpipe")
# Split on the default pattern, grouped by a single column
x <- strsplit.data.frame(brussels_reviews, term = "feedback", group = "id")
head(x)
# Group by several columns
x <- strsplit.data.frame(brussels_reviews,
                         term = "feedback",
                         group = c("listing_id", "language"))
head(x)
# Split on a literal space; fixed = TRUE is passed on to strsplit
x <- strsplit.data.frame(brussels_reviews, term = "feedback", group = "id",
                         split = " ", fixed = TRUE)
head(x)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.