Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text.
Examples of the boundary analysis process include:
Locating positions to word-wrap text to fit within specific margins while displaying or printing,
Counting characters, words, sentences, or paragraphs,
Making a list of the unique words in a document, cf. stri_extract_all_words followed by removal of duplicates,
Capitalizing the first letter of each word or sentence,
Locating a particular unit of the text (for example, finding the third word in the document).
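A few of the tasks listed above can be sketched in R as follows. This is only an illustration under stated assumptions: the sample string is made up, and apart from stri_extract_all_words the function names (stri_count_words, stri_count_boundaries, stri_unique, stri_trans_tolower, stri_trans_totitle, stri_locate_all_words, stri_sub, stri_wrap) are taken from the stringi package rather than from the text above.

```r
library(stringi)

# Hypothetical sample text, used for illustration only
x <- "The quick brown fox. It jumps over the lazy dog."

# Counting words and sentences
n_words <- stri_count_words(x)                              # 10
n_sentences <- stri_count_boundaries(x, type = "sentence")  # 2

# Making a list of the unique words (case-insensitive here by lowercasing first)
words <- stri_extract_all_words(x)[[1]]
unique_words <- stri_unique(stri_trans_tolower(words))

# Capitalizing the first letter of each word
titled <- stri_trans_totitle(x)

# Locating the third word: each row of the matrix holds start and end positions
loc <- stri_locate_all_words(x)[[1]]
third_word <- stri_sub(x, loc[3, 1], loc[3, 2])             # "brown"

# Word-wrapping to a target width; returns one element per output line
wrapped <- stri_wrap(x, width = 20)
```

Lowercasing before deduplication is a design choice of this sketch, so that "The" and "the" count as the same word; drop it if case should be preserved.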
Generally, text boundary analysis is a locale-dependent operation. For example, in Japanese and Chinese, words are not separated by spaces, so a line break can occur even in the middle of a word. These languages also have punctuation and diacritical marks that cannot start or end a line, which must be taken into account as well.
stringi uses ICU's BreakIterator to locate specific text boundaries. Note that the behavior of the break iterator may be controlled in some cases.
The character boundary iterator tries to match what a user would think of as a “character” – a basic unit of a writing system for a language – which may be more than just a single Unicode code point.
The word boundary iterator locates the boundaries of words, for purposes such as “Find whole words” operations.
The line_break iterator locates positions that would be appropriate to wrap lines when displaying the text.
The sentence break iterator locates sentence boundaries.
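The four kinds of iterators described above can be compared on a short string. The following is a minimal sketch, assuming the stringi functions stri_split_boundaries, stri_count_boundaries, and stri_length, and the break iterator options type and skip_word_none; none of these names appear in the text above, and the sample strings are made up.

```r
library(stringi)

# Hypothetical sample text, used for illustration only
x <- "Hi there. How are you?"

# Word boundaries; skip_word_none = TRUE drops spaces and punctuation
words <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)[[1]]

# Line-break opportunities: each chunk ends where a line could be wrapped
chunks <- stri_split_boundaries(x, type = "line_break")[[1]]

# Sentence boundaries
sentences <- stri_split_boundaries(x, type = "sentence")[[1]]

# Character boundaries: "e" followed by a combining acute accent is
# two Unicode code points but a single user-perceived character
accented <- "e\u0301"
n_code_points <- stri_length(accented)
n_characters <- stri_count_boundaries(accented, type = "character")
```

Here the word iterator should yield five words, the line_break iterator five chunks (each keeping its trailing space), and the sentence iterator two sentences, while the accented letter counts as two code points but one character.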
For technical details on the different classes of text boundaries, refer to the ICU User Guide listed below.
Marek Gagolewski and other contributors
Boundary Analysis – ICU User Guide, https://unicode-org.github.io/icu/userguide/boundaryanalysis/
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi: 10.18637/jss.v103.i02