stri_split_boundaries: Split a String at Text Boundaries




Description

This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.


Usage

stri_split_boundaries(
  str,
  n = -1L,
  tokens_only = FALSE,
  simplify = FALSE,
  ...,
  opts_brkiter = NULL
)



Arguments

str: character vector or an object coercible to one

n: integer vector, maximal number of strings to return

tokens_only: single logical value; may affect the result if n is positive, see Details

simplify: single logical value; if TRUE or NA, then a character matrix is returned; otherwise (the default), a list of character vectors is given, see Value

...: additional settings for opts_brkiter

opts_brkiter: a named list with ICU BreakIterator's settings, see stri_opts_brkiter; NULL for the default break iterator, i.e., line_break
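
A minimal sketch of the two ways break iterator settings can be supplied: directly as additional arguments, or bundled into a list with stri_opts_brkiter. The input string is made up for illustration; both calls are expected to give the same result.

```r
library(stringi)

# Illustrative input, not taken from the package examples
x <- "Spam, spam, eggs."

# Break iterator settings passed as additional arguments
via_dots <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# The same settings bundled into a named list via stri_opts_brkiter()
via_opts <- stri_split_boundaries(
  x,
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)

# Compare the two results
identical(via_dots, via_opts)
```

Bundling the settings may be convenient when the same break iterator configuration is reused across several calls.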


Details

Vectorized over str and n.

If n is negative (the default), then all text pieces are extracted.

Otherwise, if tokens_only is FALSE (which is the default), then n-1 tokens are extracted (if possible) and the n-th string gives the (non-split) remainder (see Examples). On the other hand, if tokens_only is TRUE, then only full tokens (up to n pieces) are extracted.

For more information on text boundary analysis performed by ICU's BreakIterator, see stringi-search-boundaries.
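
The effect of n and tokens_only described above can be explored with a short sketch. The input string is an assumption chosen for illustration, and the default break iterator (line_break) is used; run the code to inspect the actual pieces.

```r
library(stringi)

# Illustrative input, not taken from the package examples
x <- "alpha beta gamma delta"

# n = -1L (the default): all text pieces are extracted
all_pieces <- stri_split_boundaries(x)

# n = 2, tokens_only = FALSE: one token, then the non-split remainder
with_rest <- stri_split_boundaries(x, n = 2L)

# n = 2, tokens_only = TRUE: only full tokens, at most two of them
tokens <- stri_split_boundaries(x, n = 2L, tokens_only = TRUE)

print(all_pieces)
print(with_rest)
print(tokens)
```

Because no skip rules are set here, pasting the pieces of all_pieces back together should reproduce the input string.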


Value

If simplify=FALSE (the default), then the function returns a list of character vectors.

Otherwise, stri_list2matrix with byrow=TRUE and n_min=n arguments is called on the resulting object. In such a case, a character matrix with length(str) rows is returned. Note that stri_list2matrix's fill argument is set to an empty string and NA, for simplify equal to TRUE and NA, respectively.
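
A brief sketch of the three return shapes, assuming two made-up input strings that split into different numbers of words, so that the matrix forms need padding:

```r
library(stringi)

# Illustrative inputs with three and two words, respectively
x <- c("one two three", "four five")

# simplify = FALSE (the default): a list of character vectors
as_list <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# simplify = TRUE: a character matrix, short rows padded with empty strings
as_matrix <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE,
                                   simplify = TRUE)

# simplify = NA: a character matrix, short rows padded with NA
na_filled <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE,
                                   simplify = NA)

print(as_list)
print(as_matrix)
print(na_filled)
```

The NA-padded form may be preferable when an empty string could be confused with a genuine, empty token.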


Author(s)

Marek Gagolewski and other contributors

See Also

The official online manual of stringi at

Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02

Other search_split: about_search, stri_split_lines(), stri_split()

Other locale_sensitive: %s<%(), about_locale, about_search_boundaries, about_search_coll, stri_compare(), stri_count_boundaries(), stri_duplicated(), stri_enc_detect2(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_collator(), stri_order(), stri_rank(), stri_sort_key(), stri_sort(), stri_trans_tolower(), stri_unique(), stri_wrap()

Other text_boundaries: about_search_boundaries, about_search, stri_count_boundaries(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_brkiter(), stri_split_lines(), stri_trans_tolower(), stri_wrap()


Examples

library(stringi)

test <- 'The\u00a0above-mentioned    features are very useful. ' %s+%
   'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
stri_split_boundaries(test, type='word')
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(test, type='sentence')
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
stri_split_boundaries(test, type='character')

# a filtered break iterator with the new ICU:
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only

stringi documentation built on Nov. 23, 2023, 5:07 p.m.