tknz_sent: Sentence tokenizer
In vgherard/kgrams: Classical k-gram Language Models

tknz_sent

R Documentation

Sentence tokenizer

Description

Extract sentences from a batch of text lines.

Usage

tknz_sent(input, EOS = "[.?!:;]+", keep_first = FALSE)

Arguments

`input`	a character vector.
`EOS`	a regular expression matching an End-Of-Sentence delimiter.
`keep_first`	TRUE or FALSE. Should the first character of each match be appended to the returned sentence (preceded by a space)?

Details

tknz_sent() splits text into sentences, where sentence delimiters are specified by a regular expression through the EOS argument. Specifically, when an EOS token is found, the next sentence begins at the first position in the input string not containing any of the EOS tokens or white space (so that entries like "Hi there!!!" or "Hello . . ." are both recognized as a single sentence).

If keep_first is FALSE, the delimiters are stripped from the returned sentences. Otherwise, the first character of each substring matching the EOS regular expression is appended to the corresponding sentence, preceded by a white space.

In the absence of any EOS delimiter, tknz_sent() returns the input as is, since text belonging to different entries of the input vector is understood to belong to separate sentences.
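The splitting rule described above can be approximated in base R. The following is a minimal sketch, not the kgrams implementation: the helper name split_sentences_sketch is hypothetical, and the way it trims white space and drops empty pieces is an assumption, so its output may differ from tknz_sent() in edge cases.

```r
# Hypothetical helper (not part of kgrams): approximates the documented rule.
split_sentences_sketch <- function(input, EOS = "[.?!:;]+", keep_first = FALSE) {
  # One delimiter = an EOS match followed by any run of white space or further
  # EOS matches, so "!!!" and ". . ." each close a single sentence.
  delim <- paste0("(?:", EOS, ")(?:\\s|(?:", EOS, "))*")
  split_line <- function(line) {
    m <- gregexpr(delim, line, perl = TRUE)
    pieces <- regmatches(line, m, invert = TRUE)[[1]]  # text between delimiters
    delims <- regmatches(line, m)[[1]]                 # the delimiters themselves
    sentences <- trimws(pieces)
    keep <- nzchar(sentences)                          # drop empty pieces (assumption)
    if (keep_first) {
      # The last piece has no closing delimiter, hence the trailing "".
      first <- c(substr(delims, 1, 1), "")
      sentences <- trimws(paste(sentences, first))
    }
    sentences[keep]
  }
  # Each input entry is split on its own, so sentences never span entries.
  as.character(unlist(lapply(input, split_line), use.names = FALSE))
}

split_sentences_sketch("Hi there! I'm using kgrams.")
# "Hi there" "I'm using kgrams"
split_sentences_sketch("Hi there! I'm using kgrams.", keep_first = TRUE)
# "Hi there !" "I'm using kgrams ."
split_sentences_sketch(c("Hi there!!!", "Hello . . ."))
# "Hi there" "Hello"
```

An entry without any delimiter is returned unchanged by the sketch (apart from trimming surrounding white space), in line with the behaviour described above.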

Note. This function, like preprocess, is included in the library for illustrative purposes only and is not optimized for performance. Furthermore (for performance reasons) the function has separate implementations for Windows and UNIX OS types, so results obtained on the two platforms may differ slightly. In contexts that require full reproducibility, users are encouraged to define their own custom preprocessing and tokenization functions, or to work with externally processed data.
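As an illustration of the suggestion above, a user-defined preprocessing step can be written in a few lines of base R. This is only a sketch under stated assumptions: the function name my_preprocess and the particular set of characters it retains are hypothetical choices, not the behaviour of the package's own preprocess function.

```r
# Hypothetical user-defined preprocessing function (not part of kgrams).
# It lower-cases the text and erases every character that is not
# alphanumeric, white space, an apostrophe, or one of . ? ! : ;
my_preprocess <- function(input) {
  stopifnot(is.character(input))
  lowered <- tolower(input)
  # perl = TRUE selects the PCRE engine explicitly rather than the default one.
  gsub("[^.?!:;'[:alnum:][:space:]]", "", lowered, perl = TRUE)
}

my_preprocess("Hi there! I'm #using kgrams.")
# "hi there! i'm using kgrams."
```

Keeping the end-of-sentence characters in the retained set means the output can still be split into sentences afterwards.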

Value

a character vector, each entry of which corresponds to a single sentence.

Author(s)

Valerio Gherardi

Examples

tknz_sent("Hi there! I'm using kgrams.")

vgherard/kgrams documentation built on Nov. 17, 2024, 8:56 p.m.
