| stri_split_boundaries | R Documentation |
This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.
stri_split_boundaries(
str,
n = -1L,
tokens_only = FALSE,
simplify = FALSE,
...,
opts_brkiter = NULL
)
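str |
character vector or an object coercible to one |
n |
integer vector, maximal number of strings to return |
tokens_only |
single logical value; may affect the result if n is positive |
simplify |
single logical value; if TRUE or NA, a character matrix is returned; otherwise (the default), a list of character vectors is given |
... |
additional settings for opts_brkiter |
opts_brkiter |
a named list with ICU BreakIterator's settings,
see stri_opts_brkiter() |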
Vectorized over str and n.
If n is negative (the default), then all text pieces are extracted.
Otherwise, if tokens_only is FALSE (which is the default),
then n-1 tokens are extracted (if possible) and the n-th string
gives the (non-split) remainder (see Examples).
On the other hand, if tokens_only is TRUE,
then only full tokens (up to n pieces) are extracted.
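The interaction between n and tokens_only can be sketched as follows. This is a minimal illustration, assuming word boundaries with skip_word_none=TRUE; the input string and the variable names are made up for the example and are not part of stringi.

```r
library(stringi)

# Illustrative input; the names below are hypothetical, not stringi objects.
x <- "alpha beta gamma delta"

# n < 0 (the default): all text pieces are extracted
all_pieces <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# n = 2, tokens_only = FALSE: n-1 tokens, then the non-split remainder
with_rest <- stri_split_boundaries(x, n = 2, type = "word",
                                   skip_word_none = TRUE)

# n = 2, tokens_only = TRUE: only full tokens, at most n of them
tokens <- stri_split_boundaries(x, n = 2, tokens_only = TRUE, type = "word",
                                skip_word_none = TRUE)

print(all_pieces)
print(with_rest)
print(tokens)
```

Comparing the last two results shows whether the trailing text is kept as a remainder or dropped.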
For more information on text boundary analysis
performed by ICU's BreakIterator, see
stringi-search-boundaries.
If simplify=FALSE (the default),
then the function returns a list of character vectors.
Otherwise, stri_list2matrix with byrow=TRUE
and n_min=n arguments is called on the resulting object.
In such a case, a character matrix with length(str) rows
is returned. Note that the fill argument of stri_list2matrix
is set to an empty string when simplify is TRUE
and to NA when simplify is NA.
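The effect of simplify on the return type can be sketched like this. It is a small illustration under the same assumptions as the examples for this function (word boundaries, skip_word_none=TRUE); the input vector and variable names are hypothetical.

```r
library(stringi)

# Two inputs with different numbers of words (illustrative data)
x <- c("a b", "c d e")

# simplify = FALSE (the default): a list with one character vector per input
as_list <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# simplify = TRUE: a character matrix; shorter rows are padded with ""
m_empty <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE,
                                 simplify = TRUE)

# simplify = NA: a character matrix; shorter rows are padded with NA
m_na <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE,
                              simplify = NA)

print(as_list)
print(m_empty)
print(m_na)
```

The matrix form is convenient when each row is later bound to a data frame column; the NA padding makes missing tokens easy to distinguish from genuinely empty strings.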
Marek Gagolewski and other contributors
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02
Other search_split:
about_search,
stri_split_lines(),
stri_split()
Other locale_sensitive:
%s<%(),
about_locale,
about_search_boundaries,
about_search_coll,
stri_compare(),
stri_count_boundaries(),
stri_duplicated(),
stri_enc_detect2(),
stri_extract_all_boundaries(),
stri_locate_all_boundaries(),
stri_opts_collator(),
stri_order(),
stri_rank(),
stri_sort_key(),
stri_sort(),
stri_trans_tolower(),
stri_unique(),
stri_wrap()
Other text_boundaries:
about_search_boundaries,
about_search,
stri_count_boundaries(),
stri_extract_all_boundaries(),
stri_locate_all_boundaries(),
stri_opts_brkiter(),
stri_split_lines(),
stri_trans_tolower(),
stri_wrap()
library(stringi)  # provides stri_split_boundaries() and the %s+% operator
test <- 'The\u00a0above-mentioned features are very useful. ' %s+%
'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
stri_split_boundaries(test, type='word')
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(test, type='sentence')
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
stri_split_boundaries(test, type='character')
# a filtered break iterator with the new ICU:
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only