string_split2df: Splits a character vector into a data frame
In stringmagic: Character String Operations and Interpolation, Magic Edition

Description

Splits a character vector and formats the resulting substrings into a data.frame.

Usage

string_split2df(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame(),
  dt = FALSE,
  ...
)

string_split2dt(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE
)

Arguments

`x`	A character vector or a two-sided formula. If a two-sided formula, the argument `data` must be provided because the variables are fetched from it. A formula is of the form `char_var ~ id1 + id2`, where `char_var` on the left is a character variable and `id1` and `id2` on the right are identifiers which will be included in the resulting table. Alternatively, you can provide identifiers via the argument `id`.
`data`	Optional, only used if the argument `x` is a formula. It should contain the variables of the formula.
`split`	A character scalar used to split the character vectors. By default this is a regular expression. You can use flags in the pattern in the form `flag1, flag2/pattern`. Available flags are `ignore` (case), `fixed` (no regex), `word` (add word boundaries), and `magic` (add interpolation with `"{}"`). Example: with `"ignore/hello"`, a text containing "Hello" will be split at "Hello". Shortcut: use the first letters of the flags. Ex: `"iw/one"` will split at the word "one" (flags `ignore` + `word`).
`id`	Optional. A character vector or a list of vectors. If provided, the values of `id` are considered as identifiers that will be included in the resulting table.
`add.pos`	Logical, default is `FALSE`. Whether to include the position of each split element.
`id_unik`	Logical, default is `TRUE`. When identifiers are provided, whether to trigger a message if they are not unique. If the identifiers are not unique, the original texts cannot be reconstructed from the result.
`fixed`	Logical, default is `FALSE`. Whether to consider the argument `split` as fixed (and not as a regular expression).
`ignore.case`	Logical scalar, default is `FALSE`. If `TRUE`, then case insensitive search is triggered.
`word`	Logical scalar, default is `FALSE`. If `TRUE` then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma; they are then combined with a logical OR. Example: if `word = TRUE`, then the pattern "The, mountain" will match either the word 'The' or the word 'mountain'.
`envir`	Environment in which to evaluate the interpolations if the flag `"magic"` is provided. Default is `parent.frame()`.
`dt`	Logical, default is `FALSE`. Whether to return a `data.table`. See also the function `string_split2dt`.
`...`	Not currently used.
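
To make the flags of `split` more concrete, here is a rough base-R analogue of two of them. This is only a sketch of the behavior described above, not how stringmagic implements the flags; the input strings and the variable names `txt`, `parts` and `parts_fixed` are made up for the illustration.

```r
# Rough analogue of split = "iw/one" (an assumption about the semantics):
#   i -> ignore case       => inline modifier (?i)
#   w -> word boundaries   => \b around the pattern
txt <- "I ate One apple and one pear for someone"
parts <- strsplit(txt, "(?i)\\bone\\b", perl = TRUE)[[1]]
print(parts)  # "someone" is left intact: its "one" is not a whole word

# Rough analogue of split = "fixed/.": the dot is taken literally,
# not as the regex wildcard
parts_fixed <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
print(parts_fixed)
```

With the package itself, the same intent would be written by putting the flags in front of the pattern, as described for `split`.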

Value

It returns a data.frame or a data.table which will contain: i) `obs`: the observation index, ii) `pos`: the position of the text element in the initial string (optional, via `add.pos`), iii) the text element, iv) the identifier(s) (optional, only if `id` was provided).
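
The layout of the returned table can be pictured with a small base-R sketch. It does not call stringmagic and is not the package's implementation; in particular the column name `text` is an assumption, since the help page does not state how the text column is named.

```r
# Base-R sketch of the table layout described above
x  <- c("Nor rain, wind", "When my information changes")
id <- c("ws", "jmk")

# split every string at runs of punctuation or spaces
pieces <- strsplit(x, "[[:punct:] ]+")
n <- lengths(pieces)                 # number of pieces per original string

res <- data.frame(
  obs  = rep(seq_along(x), n),       # index of the original string
  pos  = sequence(n),                # position of the piece within its string
  text = unlist(pieces),             # the piece itself (column name assumed)
  id   = rep(id, n),                 # identifier carried over to each piece
  stringsAsFactors = FALSE
)
print(res)
```

Each original string contributes as many rows as it has pieces, and `obs` (or a unique identifier) is what ties the rows back to their source string.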

Functions

string_split2dt(): Splits a string vector and returns a data.table

Examples


x = c("Nor rain, wind, thunder, fire are my daughters.",
      "When my information changes, I alter my conclusions.")

id = c("ws", "jmk")

# we split at each word
string_split2df(x, "[[:punct:] ]+")

# we add the 'id'
string_split2df(x, "[[:punct:] ]+", id = id)

# TO NOTE:
# - the second argument is `data`
# - when it is missing, the argument `split` becomes implicitly the second
# - ex: above we did not use `split = "[[:punct:] ]+"`

#
# using the formula

base = data.frame(text = x, my_id = id)
string_split2df(text ~ my_id, base, "[[:punct:] ]+")

#
# with 2+ identifiers

base = within(mtcars, carname <- rownames(mtcars))

# we have a message because the identifiers are not unique
string_split2df(carname ~ am + gear + carb, base, " +")

# adding the position of the words & removing the message
string_split2df(carname ~ am + gear + carb, base, " +", id_unik = FALSE, add.pos = TRUE)
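
The message controlled by `id_unik` exists because non-unique identifiers make the original texts unrecoverable. The base-R sketch below illustrates this with made-up values; the helper `rebuild` and the identifier vectors are hypothetical and not part of stringmagic.

```r
# Pieces of two car names after splitting at spaces
pieces <- c("Mazda", "RX4", "Hornet", "Sportabout")

id_unique    <- c("car1", "car1", "car2", "car2")  # one id per original text
id_duplicate <- c("auto", "auto", "auto", "auto")  # two texts share one id

# hypothetical helper: paste the pieces back together within each identifier
rebuild <- function(pieces, id) {
  vapply(split(pieces, id), paste, character(1), collapse = " ")
}

rebuilt_unique <- rebuild(pieces, id_unique)
rebuilt_dup    <- rebuild(pieces, id_duplicate)

print(rebuilt_unique)  # two strings, one per original text
print(rebuilt_dup)     # a single merged string: the texts cannot be told apart
```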

stringmagic documentation built on June 8, 2025, 12:41 p.m.