tokenise_sentence: Split a corpus by sentence-boundary
In gederajeg/corplingr: Tidy Concordances, Collocates, and Wordlist


View source: R/corplingr_sent_tokeniser.R

An internal helper in the collocational framework that splits an input corpus into a vector of sentences using stri_split_boundaries from the stringi package. Each sentence is padded, at the beginning and at the end, with the "ZSENTENCEZ" marker, repeated as many times as the collocational window span requires. The marker makes it possible to tell when the collocates of a word would cross the boundary of the sentence in which the word occurs. "ZSENTENCEZ" is automatically detected and removed whenever it appears among the identified collocates.
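The padding step described above can be sketched roughly as follows. This is not the package's own source: `pad_sentences` is a hypothetical name, and the order of the steps (split, trim, lowercase, pad) is an assumption made for illustration. Only `stri_split_boundaries`, `stri_trim_both`, and `stri_trans_tolower` are real stringi functions.

```r
library(stringi)

# Hypothetical helper, NOT the implementation of tokenise_sentence():
# split into sentences, then pad each one with the boundary marker.
pad_sentences <- function(strings, to_lower = TRUE, window_span = 3) {
  # stri_split_boundaries() returns a list (one element per input string),
  # so flatten it into a single character vector of sentences
  sents <- unlist(stri_split_boundaries(strings, type = "sentence"))
  # drop the trailing whitespace left on non-final sentences
  sents <- stri_trim_both(sents)
  if (to_lower) sents <- stri_trans_tolower(sents)
  # repeat the marker once per position in the collocational window
  marker <- paste(rep("ZSENTENCEZ", window_span), collapse = " ")
  paste(marker, sents, marker)
}

padded <- pad_sentences(c("It is one sentence. It is another sentence!",
                          "Of course!"),
                        window_span = 2)
```

Padding with exactly `window_span` markers on each side means a window centred on the first or last word of a sentence can only reach markers, never words of a neighbouring sentence.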

tokenise_sentence(strings = NULL, to_lower = TRUE, window_span = NULL)

`strings`	character vector of a corpus text.
`to_lower`	logical; turn the corpus into lowercase when `TRUE` (the default).
`window_span`	integer; taken from the value of the `span` argument in the higher-level collocational function call. It sets how many times the `"ZSENTENCEZ"` marker is added at the beginning and at the end of each sentence.
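To illustrate why the number of markers should match the window span, here is a minimal sketch of a collocate window taken around a sentence-final word. The padded string, the whitespace tokenisation, and the variable names are assumptions for illustration only, not how the package's collocational functions are written.

```r
# Hypothetical padded sentence, as it might look with window_span = 2
padded <- "ZSENTENCEZ ZSENTENCEZ it is another sentence! ZSENTENCEZ ZSENTENCEZ"
span <- 2

# Naive whitespace tokenisation (an assumption; the package may differ)
tokens <- strsplit(padded, " ", fixed = TRUE)[[1]]

# Position of the node word, here the last real word of the sentence
node <- which(tokens == "sentence!")

# Tokens within `span` positions to the left and right of the node
window <- tokens[c((node - span):(node - 1), (node + 1):(node + span))]

# Dropping the markers leaves only collocates from the same sentence
collocates <- window[window != "ZSENTENCEZ"]
```

Here the right-hand side of the window contains only markers, so after filtering the node keeps just its two left-hand collocates, "is" and "another".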

A character vector with one element per sentence of the input corpus, as identified by stri_split_boundaries.

txt <- c("It is one sentence. It is another sentence! There are TWO sentences.",
         "The second sentence and the second element of 'txt' corpus.",
         "Can we add another one (sentence) here as the third element?",
         "Of course!")
sent <- tokenise_sentence(strings = txt,
                          to_lower = TRUE,
                          window_span = 3)
sent