View source: R/nlp_tokenize_text.R
nlp_tokenize_text    R Documentation
Description:

This function tokenizes text data from a data frame using the 'tokenizers' package, preserving original text features such as capitalization and punctuation.
Usage:

nlp_tokenize_text(
  tif,
  text_hierarchy = c("doc_id", "paragraph_id", "sentence_id")
)
Arguments:

tif: A data frame containing the text to be tokenized and a document identifier column 'doc_id'.

text_hierarchy: A character vector specifying the grouping columns that define the text hierarchy, ordered from highest to lowest level (by default: doc_id, paragraph_id, sentence_id).
Value:

A named list of tokens, where each list item corresponds to a document.
Examples:

tif <- data.frame(
  doc_id = c('1', '1', '2'),
  sentence_id = c('1', '2', '1'),
  text = c("Hello world.",
           "This is an example.",
           "This is a party!")
)

tokens <- nlp_tokenize_text(tif, text_hierarchy = c('doc_id', 'sentence_id'))
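To illustrate what the example above produces, here is a minimal base-R sketch that mimics the function's behavior without the 'tokenizers' dependency. The helper name `sketch_tokenize`, the exact splitting regex, and the assumption that list names are built by pasting the hierarchy ids with "." are illustrative assumptions, not the package's actual implementation; the real function may split tokens slightly differently.

```r
# Illustrative sketch only: approximates nlp_tokenize_text() with base R.
# Assumption: list names join the hierarchy ids with "." (e.g. "1.2").
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"))

sketch_tokenize <- function(tif, text_hierarchy) {
  # Build one name per row from the hierarchy columns
  key <- apply(tif[, text_hierarchy, drop = FALSE], 1, paste, collapse = ".")
  # Keep capitalization and treat punctuation as separate tokens
  toks <- regmatches(tif$text, gregexpr("[[:alnum:]']+|[[:punct:]]", tif$text))
  names(toks) <- key
  toks
}

tokens <- sketch_tokenize(tif, c("doc_id", "sentence_id"))
tokens[["1.1"]]  # c("Hello", "world", ".")
```

Note how punctuation survives as its own token and the original casing is untouched, which is the behavior the description above attributes to nlp_tokenize_text().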