tokenise_sentence: Split a corpus by sentence-boundary

View source: R/corplingr_sent_tokeniser.R

Description

A helper function used within the collocational framework to split an input corpus into a vector of sentences using stri_split_boundaries from the stringi package. Each sentence is padded, at the beginning and at the end, with the "ZSENTENCEZ" marker, repeated as many times as the collocational window span requires. The marker makes it possible to identify collocates of a word that would cross the boundary of the sentence in which the word occurs. The function automatically detects and removes "ZSENTENCEZ" when it appears among the identified collocates.
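The splitting and padding described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual source: pad_sentences is a hypothetical helper name, and the choices to trim whitespace, drop empty strings, and lowercase before padding (so the marker stays in uppercase) are assumptions made for the sketch.

```r
library(stringi)

# Hypothetical helper illustrating the idea; NOT the corplingr implementation.
pad_sentences <- function(strings, to_lower = TRUE, window_span = 1L) {
  # Split every corpus element at sentence boundaries and flatten the result
  sents <- unlist(stri_split_boundaries(strings, type = "sentence"))

  # Assumption: trim surrounding whitespace and drop empty pieces
  sents <- stri_trim_both(sents)
  sents <- sents[nzchar(sents)]

  # Assumption: lowercase before padding so the marker keeps its case
  if (to_lower) sents <- stri_trans_tolower(sents)

  # Repeat the marker once per position in the collocational window
  pad <- paste(rep("ZSENTENCEZ", window_span), collapse = " ")
  paste(pad, sents, pad)
}

padded <- pad_sentences(c("It is one sentence. It is another sentence!",
                          "Of course!"),
                        window_span = 2)
padded
```

Repeating the marker window_span times on each side means a window centred on the first or last word of a sentence is filled with markers rather than with words from a neighbouring sentence.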

Usage

tokenise_sentence(strings = NULL, to_lower = TRUE, window_span = NULL)

Arguments

strings

character vector of a corpus text.

to_lower

logical; turn the corpus into lowercase when TRUE (the default).

window_span

integer; supplied from the value of the span argument in the higher-level collocational function call. It determines how many times the "ZSENTENCEZ" marker is added at the beginning and at the end of each sentence.
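To see why the padding length is tied to the span, consider a window around a word at the end of a sentence. The sketch below uses base R only; how the higher-level collocational function actually builds its windows is not shown on this page, so the token vector and the filtering step are illustrative assumptions.

```r
# One padded sentence, already split into tokens (assumed window_span = 2)
tokens <- strsplit(
  "ZSENTENCEZ ZSENTENCEZ it is one sentence ZSENTENCEZ ZSENTENCEZ",
  " "
)[[1]]
span <- 2

# Position of the node word whose collocates we want
node <- which(tokens == "sentence")

# Window of `span` tokens to the left and right of the node
window <- tokens[(node - span):(node + span)]

# Drop the node itself (the middle element), then discard marker tokens
collocates <- window[-(span + 1)]
collocates <- collocates[collocates != "ZSENTENCEZ"]
collocates
```

Because the right-hand side of the window is filled entirely with markers, no word from a following sentence can be picked up as a collocate of "sentence".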

Value

A character vector of sentences, with one element for each sentence that stri_split_boundaries identifies in the input corpus.

Examples

txt <- c("It is one sentence. It is another sentence! There are TWO sentences.",
         "The second sentence and the second element of 'txt' corpus.",
         "Can we add another one (sentence) here as the third element?",
         "Of course!")
sent <- tokenise_sentence(strings = txt,
                          to_lower = TRUE,
                          window_span = 3)
sent

gederajeg/corplingr documentation built on Dec. 20, 2021, 9:50 a.m.