View source: R/corplingr_sent_tokeniser.R
An internal function of the collocational framework that splits the input corpus into a vector of sentences using stri_split_boundaries from the stringi package.
Each sentence is padded, at the beginning and at the end, with the "ZSENTENCEZ" marker, repeated as many times as the collocational window span requires.
The marker makes it possible to tell when the collocates of a word would cross the boundary of the sentence in which the word occurs.
The function automatically detects and removes "ZSENTENCEZ" whenever it appears among the identified collocates.
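The padding idea described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: pad_sentences() is a hypothetical helper, the input is assumed to be already split into sentences, and the simple regex tokenisation stands in for whatever the framework actually does.

```r
# Hypothetical helper (an assumption, not the package's code): pad each
# sentence with `window_span` copies of the boundary marker on both sides.
pad_sentences <- function(sentences, window_span,
                          marker = "ZSENTENCEZ", to_lower = TRUE) {
  # Lowercase the text before padding so the marker itself stays uppercase
  if (to_lower) sentences <- tolower(sentences)
  pad <- paste(rep(marker, window_span), collapse = " ")
  paste(pad, trimws(sentences), pad)
}

span <- 2
sentences <- c("It is one sentence.", "It is another sentence!")
padded <- pad_sentences(sentences, window_span = span)

# Flatten the padded sentences into a single stream of word tokens
tokens <- unlist(strsplit(padded, "[^[:alnum:]]+"))
tokens <- tokens[nzchar(tokens)]

# Take the window around the first occurrence of the node word "sentence"
node <- which(tokens == "sentence")[1]
window <- tokens[(node - span):(node + span)]

# Drop the node itself, then drop any markers that fell inside the window
collocates <- window[-(span + 1)]
collocates <- collocates[collocates != "ZSENTENCEZ"]
collocates
```

Without the padding, the right-hand window of "sentence" would reach into the next sentence and pick up "it" and "is" from there; with the padding, those slots are filled by markers, which are then filtered out.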
tokenise_sentence(strings = NULL, to_lower = TRUE, window_span = NULL)
strings | character vector of a corpus text.
to_lower | logical; turn the corpus into lowercase when
window_span | integer; it is supplied from the value of the
A character vector with as many sentences as there are in the input corpus, as identified by stri_split_boundaries.
txt <- c("It is one sentence. It is another sentence! There are TWO sentences.",
         "The second sentence and the second element of 'txt' corpus.",
         "Can we add another one (sentence) here as the third element?",
         "Of course!")
sent <- tokenise_sentence(strings = txt,
                          to_lower = TRUE,
                          window_span = 3)
sent