tknz_sent: Sentence tokenizer

View source: R/tokenize_sentences.R

tknz_sent    R Documentation

Sentence tokenizer

Description

Extract sentences from a batch of text lines.

Usage

tknz_sent(input, EOS = "[.?!:;]+", keep_first = FALSE)

Arguments

input

a character vector.

EOS

a regular expression matching an End-Of-Sentence delimiter.

keep_first

TRUE or FALSE. Should the first character of each match be appended to the corresponding returned sentence (preceded by a space)?

Details

tknz_sent() splits text into sentences, where sentence delimiters are specified by a regular expression through the EOS argument. Specifically, when an EOS token is found, the next sentence begins at the first position in the input string not containing any of the EOS tokens or white space (so that entries like "Hi there!!!" or "Hello . . ." are both recognized as a single sentence).

If keep_first is FALSE, the delimiters are stripped from the returned sentences. Otherwise, the first character of each substring matching the EOS regular expression is appended to the corresponding sentence, preceded by a white space.
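As a brief illustration of the two arguments described above, the calls below first use the default delimiters, then restrict EOS to full stops, question marks and exclamation marks, and finally keep the first delimiter character. The input string is made up for this example, and outputs are deliberately not shown, since results may differ slightly between platforms (see the Note in this section).

```r
library(kgrams)

# Illustrative input (not taken from the package documentation).
txt <- "Wait... what? Yes: it works!"

# Default EOS: any run of '.', '?', '!', ':' or ';' ends a sentence.
default_split <- tknz_sent(txt)

# Custom EOS: ':' and ';' no longer end a sentence.
custom_split <- tknz_sent(txt, EOS = "[.?!]+")

# keep_first = TRUE: append the first matched delimiter character,
# preceded by a space, instead of stripping the delimiters.
kept_split <- tknz_sent(txt, keep_first = TRUE)

print(default_split)
print(custom_split)
print(kept_split)
```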

In the absence of any EOS delimiter, tknz_sent() returns the input as is, since parts of text corresponding to different entries of the input vector are understood as parts of separate sentences.
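A small sketch of this behaviour with a multi-entry input follows. The vector below is an assumption made for illustration: per the description above, its first entry contains no delimiter and should come back as a sentence of its own, while the second entry is split at its full stops.

```r
library(kgrams)

# Illustrative input: one entry without any EOS delimiter, one with two sentences.
lines <- c("no delimiter on this line", "First sentence. Second sentence.")

# Entries are tokenized separately; a sentence never spans two entries.
sents <- tknz_sent(lines)
print(sents)
```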

Note. This function, like preprocess(), is included in the library for illustrative purposes only and is not optimized for performance. Furthermore (for performance reasons) the function has separate implementations for Windows and UNIX OS types, so results obtained on the two platforms may differ slightly. In contexts that require full reproducibility, users are encouraged to define their own custom preprocessing and tokenization functions, or to work with externally processed data.
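For readers who want a platform-independent alternative, a custom sentence tokenizer can be sketched in base R alone. The helper my_tknz_sent() below is hypothetical and is not part of kgrams; it only approximates the behaviour described above (delimiters stripped, repeated delimiters and surrounding white space collapsed) and may differ from tknz_sent() on edge cases.

```r
# Hypothetical helper, NOT part of kgrams: a base-R sentence splitter.
my_tknz_sent <- function(input, EOS = "[.?!:;]+") {
  # Split every entry of the input at each run of EOS delimiters.
  pieces <- unlist(strsplit(input, EOS))
  # Drop leading and trailing white space left around the delimiters.
  pieces <- trimws(pieces)
  # Discard empty fragments, e.g. those produced by "Hello . . .".
  pieces[nzchar(pieces)]
}

res_basic <- my_tknz_sent("Hi there! I'm using kgrams.")
res_dots <- my_tknz_sent("Hello . . .")
print(res_basic)
print(res_dots)
```

Because it relies only on strsplit() and trimws(), this sketch runs the same code path on every operating system, which is the property the Note asks for when reproducibility matters.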

Value

a character vector, each entry of which corresponds to a single sentence.

Author(s)

Valerio Gherardi

Examples

tknz_sent("Hi there! I'm using kgrams.")

vgherard/kgrams documentation built on Nov. 17, 2024, 8:56 p.m.