stringx: Replacements for Base String Functions Powered by 'stringi'

strsplit: Split Strings into Tokens

Splits each string into chunks delimited by occurrences of a given pattern.

strsplit(
  x,
  pattern = split,
  ...,
  ignore_case = ignore.case,
  fixed = FALSE,
  perl = FALSE,
  useBytes = FALSE,
  ignore.case = FALSE,
  split
)

| | |
|----|----|
| x | character vector whose elements are to be examined |
| pattern | character vector of nonempty search patterns |
| ... | further arguments to stri_split, e.g., omit_empty, locale, dotall |
| ignore_case | single logical value; indicates whether matching should be case-insensitive |
| fixed | single logical value; FALSE for matching with regular expressions (see about_search_regex); TRUE for fixed pattern matching (about_search_fixed); NA for the Unicode collation algorithm (about_search_coll) |
| perl, useBytes | not used (with a warning if attempting to do so) [DEPRECATED] |
| ignore.case | alias to the ignore_case argument [DEPRECATED] |
| split | alias to the pattern argument [DEPRECATED] |
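
As a rough illustration of how the fixed and ignore_case arguments change matching, consider the sketch below. It assumes the stringx package is installed; the input strings are made up for this example.

```r
# Literal matching: with fixed=TRUE the dot is an ordinary character,
# not the regex "any character" wildcard.
by_dot <- stringx::strsplit("a.b.c", ".", fixed=TRUE)
print(by_dot)

# Case-insensitive regex matching: both "x" and "X" act as delimiters.
by_x <- stringx::strsplit("aXbxc", "x", ignore_case=TRUE)
print(by_x)
```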

This function is fully vectorised with respect to both arguments.
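
A minimal sketch of what vectorisation over both arguments means in practice, assuming stringx is installed (the inputs are illustrative only):

```r
# Element-wise pairing: x[1] is split by pattern[1], x[2] by pattern[2].
pairwise <- stringx::strsplit(c("a,b", "c;d"), c(",", ";"))
print(pairwise)

# Recycling: a single string is split once per pattern,
# so the result has as many elements as there are patterns.
recycled <- stringx::strsplit("a,b;c", c(",", ";"))
print(recycled)
```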

For splitting text into 'characters' (grapheme clusters), words, or sentences, use stri_split_boundaries instead.
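
For instance, word and sentence tokens might be extracted along these lines. This is a sketch that calls stringi directly (assumed to be installed) on a made-up sentence pair; the exact boundaries are determined by ICU's break iterators.

```r
text <- "Hello world. How are you?"

# Sentence boundaries: each sentence is one token
# (whitespace following a sentence stays attached to it).
sentences <- stringi::stri_split_boundaries(text, type="sentence")
print(sentences)

# Word boundaries: skip_word_none=TRUE discards spaces and punctuation.
words <- stringi::stri_split_boundaries(text, type="word", skip_word_none=TRUE)
print(words)
```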

Returns a list of character vectors representing the identified tokens.

Replacements for base strsplit implemented with stri_split.

- the base R implementation is not portable as it is based on the system PCRE or TRE library (e.g., some Unicode classes may not be available, or matching thereof can depend on the current LC_CTYPE category) [fixed here];
- not suitable for natural language processing [fixed here -- use fixed=NA];
- two different regular expression libraries are used (and historically, ERE was used in place of TRE) [here, only the ICU Java-like regular expression engine is available, hence the perl argument has no meaning];
- there are inconsistencies between the argument order and naming in grepl, strsplit, and startsWith (amongst others); e.g., where the needle can precede the haystack, the use of the forward pipe operator, |>, is less convenient [fixed here];
- grepl also features the ignore.case argument [added here];
- if split is a zero-length vector, it is treated as "", which extracts individual code points (which is not the best idea for natural language processing tasks) [empty search patterns are not supported here, zero-length vectors are propagated correctly];
- the last empty token is removed from the output, but the first is not [fixed here -- see also the omit_empty argument];
- missing values in split are not propagated correctly [fixed here];
- partial recycling without the usual warning, not fully vectorised w.r.t. the split argument [fixed here];
- only the names attribute of x is preserved [fixed here].
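
Two of the differences listed above, the handling of empty tokens and of missing patterns, can be sketched as follows. The snippet assumes stringx is installed and uses an illustrative input string.

```r
# Base R drops the trailing empty token but keeps the leading one.
base_tokens <- base::strsplit(",a,", ",")[[1]]
print(base_tokens)

# stringx keeps both empty tokens by default...
all_tokens <- stringx::strsplit(",a,", ",")[[1]]
print(all_tokens)

# ...and omit_empty=TRUE (passed on to stri_split) drops both.
nonempty_tokens <- stringx::strsplit(",a,", ",", omit_empty=TRUE)[[1]]
print(nonempty_tokens)

# A missing pattern yields a missing result.
na_result <- stringx::strsplit("a,b", NA_character_)[[1]]
print(na_result)
```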

Marek Gagolewski

The official online manual of stringx at https://stringx.gagolewski.com/

Related function(s): paste, nchar, grepl, gsub, substr

stringx::strsplit(c(x="a, b", y="c,d,  e"), ",\\s*")

## $x
## [1] "a" "b"
## 
## $y
## [1] "c" "d" "e"

x <- stringx::strcat(c(
    "abc", "123", ",!.", "\U0001F4A9",
    "\U0001F64D\U0001F3FC\U0000200D\U00002642\U0000FE0F",
    "\U000026F9\U0001F3FF\U0000200D\U00002640\U0000FE0F",
    "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"
))
# be careful when splitting into individual code points:
base::strsplit(x, "")  # stringx does not support this

## [[1]]
##  [1] "a"  "b"  "c"  "1"  "2"  "3"  ","  "!"  "."  "💩" "🙍" "🏼" "‍"   "♂"  "️"  
## [16] "⛹"  "🏿" "‍"   "♀"  "️"   "🏴" "󠁧"   "󠁢"   "󠁳"   "󠁣"   "󠁴"   "󠁿"

stringx::strsplit(x, "(?s)(?=.)", omit_empty=TRUE)  # look-ahead for any char with dot-all

## [[1]]
##  [1] "a"  "b"  "c"  "1"  "2"  "3"  ","  "!"  "."  "💩" "🙍" "🏼" "‍"   "♂"  "️"  
## [16] "⛹"  "🏿" "‍"   "♀"  "️"   "🏴" "󠁧"   "󠁢"   "󠁳"   "󠁣"   "󠁴"   "󠁿"

stringi::stri_split_boundaries(x, type="character")  # grapheme clusters

## [[1]]
##  [1] "a"     "b"     "c"     "1"     "2"     "3"     ","     "!"     "."    
## [10] "💩"    "🙍🏼‍♂️" "⛹🏿‍♀️"  "🏴󠁧󠁢󠁳󠁣󠁴󠁿"

gagolews/stringx documentation built on Jan. 15, 2025, 9:46 p.m.
