tknz_sent: Sentence tokenizer

Description

Extract sentences from a batch of text lines.

Usage

tknz_sent(input, EOS = "[.?!:;]+", keep_first = FALSE)

Arguments

input

a character vector.

EOS

a regular expression matching an End-Of-Sentence delimiter.

keep_first

TRUE or FALSE. Should the first character of each EOS match be appended to the returned sentence (separated by a space)?

Details

tknz_sent() splits text into sentences at the End-Of-Sentence delimiters matched by the regular expression EOS (by default, a set of single-character delimiters). Specifically, when an EOS match is found, the next sentence begins at the first subsequent position in the input string that contains neither an EOS delimiter nor white space (so that entries like "Hi there!!!" or "Hello . . ." are each recognized as a single sentence).

If keep_first is FALSE, the delimiters are stripped from the returned sentences, which means that all delimiters are treated symmetrically.

In the absence of any EOS delimiter, tknz_sent() returns the input as is, since text belonging to different entries of the vector input is understood to belong to separate sentences.
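
The splitting rule described above can be approximated in base R. The sketch below is an illustration only: split_sentences_sketch() is a hypothetical helper, not part of kgrams, and the package's compiled implementation may behave differently in edge cases (for example, how empty sentences or leading delimiters are handled, which is an assumption here).

```r
# Hypothetical helper (not part of kgrams): a base-R approximation of the
# splitting rule described in Details.
split_sentences_sketch <- function(input, EOS = "[.?!:;]+", keep_first = FALSE) {
  # One EOS match, followed by any mix of white space and further EOS matches,
  # so that "!!!" or ". . ." count as a single sentence boundary.
  run <- paste0("(?:", EOS, ")(?:\\s|(?:", EOS, "))*")
  sentences <- character(0)
  for (line in input) {
    hits <- gregexpr(run, line, perl = TRUE)
    delims <- regmatches(line, hits)[[1]]
    # Text between boundaries, with surrounding white space removed
    pieces <- trimws(regmatches(line, hits, invert = TRUE)[[1]])
    # Assumption: empty pieces are not reported as sentences
    keep <- nzchar(pieces)
    if (keep_first && length(delims) > 0) {
      # Append the first character of each delimiter run, after a space
      idx <- seq_along(delims)
      pieces[idx] <- paste(pieces[idx], substr(delims, 1, 1))
    }
    sentences <- c(sentences, pieces[keep])
  }
  sentences
}

split_sentences_sketch("Hello . . .")
split_sentences_sketch("Hi there!!!", keep_first = TRUE)
```

Each entry of the input is processed separately, so a sentence never spans two entries, and an entry without any delimiter is returned unchanged.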

Value

a character vector, each entry of which corresponds to a single sentence.

Author(s)

Valerio Gherardi

Examples

tknz_sent("Hi there! I'm using `sbo`.")
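
A few further calls, exercising the EOS and keep_first arguments documented above, may help when exploring the function. This assumes the kgrams package is installed; the input strings are made up for illustration, and no output is shown because it depends on the installed version.

```r
library(kgrams)

# Illustrative input: two text lines, treated as separate pieces of text
lines <- c("Hi there!!! How are you?", "Fine: thanks")

# Default delimiters, stripped from the returned sentences
default_split <- tknz_sent(lines)

# Keep the first character of each delimiter match
kept_split <- tknz_sent(lines, keep_first = TRUE)

# Only full stops, question marks and exclamation marks end a sentence
custom_split <- tknz_sent(lines, EOS = "[.?!]+")

print(default_split)
print(kept_split)
print(custom_split)
```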

kgrams documentation built on Nov. 16, 2021, 9:22 a.m.