strsplit: Split Strings into Tokens

Description

Splits each string into chunks delimited by occurrences of a given pattern.

Usage

strsplit(
  x,
  pattern = split,
  ...,
  ignore_case = ignore.case,
  fixed = FALSE,
  perl = FALSE,
  useBytes = FALSE,
  ignore.case = FALSE,
  split
)

Arguments

| Argument | Description |
|----|----|
| x | character vector whose elements are to be examined |
| pattern | character vector of nonempty search patterns |
| ... | further arguments to stri_split, e.g., omit_empty, locale, dotall |
| ignore_case | single logical value; indicates whether matching should be case-insensitive |
| fixed | single logical value; FALSE for matching with regular expressions (see about_search_regex); TRUE for fixed pattern matching (about_search_fixed); NA for the Unicode collation algorithm (about_search_coll) |
| perl, useBytes | not used (with a warning if attempting to do so) [DEPRECATED] |
| ignore.case | alias to the ignore_case argument [DEPRECATED] |
| split | alias to the pattern argument [DEPRECATED] |
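
As a rough illustration of the arguments above, the following sketch contrasts fixed and case-insensitive matching and shows omit_empty being forwarded through `...`. It assumes the stringx package is installed; the input strings and variable names are made up for this example.

```r
# fixed=TRUE treats "." as a literal dot, not as the regex "any character"
by_dot <- stringx::strsplit("a.b.c", ".", fixed=TRUE)[[1]]

# ignore_case=TRUE lets the pattern "x" match both "x" and "X"
by_x <- stringx::strsplit("1x2X3", "x", ignore_case=TRUE)[[1]]

# omit_empty is passed on to stri_split via `...`
with_empty <- stringx::strsplit("a,,b", ",")[[1]]
without_empty <- stringx::strsplit("a,,b", ",", omit_empty=TRUE)[[1]]

print(by_dot)
print(by_x)
print(with_empty)
print(without_empty)
```

Without omit_empty=TRUE, the two adjacent commas yield an empty token between them.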

Details

This function is fully vectorised with respect to both x and pattern.
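
A minimal sketch of what vectorisation over both arguments looks like in practice, assuming stringx is installed (the inputs are illustrative only): each string is paired with the corresponding pattern, and the shorter argument is recycled.

```r
# element-wise: the i-th string is split by the i-th pattern
pairwise <- stringx::strsplit(c("a,b", "c;d"), c(",", ";"))

# recycling: a single string is split once per pattern
recycled <- stringx::strsplit("a,b;c", c(",", ";"))

print(pairwise)
print(recycled)
```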

For splitting text into 'characters' (grapheme clusters), words, or sentences, use stri_split_boundaries instead.

Value

Returns a list of character vectors representing the identified tokens.

Differences from Base R

This function is a replacement for base strsplit, implemented with stri_split.
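
One behavioural difference that can be checked directly is the handling of a delimiter at the very end of a string. The sketch below is an illustration rather than an exhaustive list of differences; it assumes stringx is installed, and the input string is made up for this example.

```r
s <- "a,b,"

# base R drops the empty token after a trailing delimiter
base_tokens <- base::strsplit(s, ",")[[1]]

# the stri_split-based version keeps it (unless omit_empty=TRUE is given)
stringx_tokens <- stringx::strsplit(s, ",")[[1]]

print(base_tokens)
print(stringx_tokens)
```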

Author(s)

Marek Gagolewski

See Also

The official online manual of stringx at https://stringx.gagolewski.com/

Related function(s): paste, nchar, grepl, gsub, substr

Examples

stringx::strsplit(c(x="a, b", y="c,d,  e"), ",\\s*")
## $x
## [1] "a" "b"
## 
## $y
## [1] "c" "d" "e"
x <- strcat(c(
    "abc", "123", ",!.", "\U0001F4A9",
    "\U0001F64D\U0001F3FC\U0000200D\U00002642\U0000FE0F",
    "\U000026F9\U0001F3FF\U0000200D\U00002640\U0000FE0F",
    "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"
))
# be careful when splitting into individual code points:
base::strsplit(x, "")  # stringx does not support an empty pattern
## [[1]]
##  [1] "a"  "b"  "c"  "1"  "2"  "3"  ","  "!"  "."  "💩" "🙍" "🏼" "‍"   "♂"  "️"  
## [16] "⛹"  "🏿" "‍"   "♀"  "️"   "🏴" "󠁧"   "󠁢"   "󠁳"   "󠁣"   "󠁴"   "󠁿"
stringx::strsplit(x, "(?s)(?=.)", omit_empty=TRUE)  # look-ahead for any char with dot-all
## [[1]]
##  [1] "a"  "b"  "c"  "1"  "2"  "3"  ","  "!"  "."  "💩" "🙍" "🏼" "‍"   "♂"  "️"  
## [16] "⛹"  "🏿" "‍"   "♀"  "️"   "🏴" "󠁧"   "󠁢"   "󠁳"   "󠁣"   "󠁴"   "󠁿"
stringi::stri_split_boundaries(x, type="character")  # grapheme clusters
## [[1]]
##  [1] "a"     "b"     "c"     "1"     "2"     "3"     ","     "!"     "."    
## [10] "💩"    "🙍🏼‍♂️" "⛹🏿‍♀️"  "🏴󠁧󠁢󠁳󠁣󠁴󠁿"


gagolews/stringx documentation built on Jan. 15, 2025, 9:46 p.m.