View source: R/separate-wider.R
separate_wider_delim | R Documentation |
Each of these functions takes a string column and splits it into multiple new columns:
separate_wider_delim() splits by delimiter.
separate_wider_position() splits at fixed widths.
separate_wider_regex() splits with regular expression matches.
These functions are equivalent to separate() and extract(), but use stringr as the underlying string manipulation engine, and their interfaces reflect what we've learned from unnest_wider() and unnest_longer().
separate_wider_delim(
  data,
  cols,
  delim,
  ...,
  names = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start", "align_end"),
  too_many = c("error", "debug", "drop", "merge"),
  cols_remove = TRUE
)

separate_wider_position(
  data,
  cols,
  widths,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  too_many = c("error", "debug", "drop"),
  cols_remove = TRUE
)

separate_wider_regex(
  data,
  cols,
  patterns,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  cols_remove = TRUE
)
data |
A data frame. |
cols |
<tidy-select> Columns to separate. |
delim |
For separate_wider_delim(), a string giving the delimiter between values. |
... |
These dots are for future extensions and must be empty. |
names |
For separate_wider_delim(), a character vector of output column names. |
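names_sep |
If supplied, output names will be composed of the input column name followed by the separator followed by the new column name. Required when cols selects multiple columns. For separate_wider_delim() you can specify it instead of names, in which case the names will be generated from the source column name, names_sep, and a numeric suffix. |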
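names_repair |
Used to check that the output data frame has valid names. Must be one of the following options: "minimal", "unique", "check_unique" (the default), or "universal". See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them. |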
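too_few |
What should happen if a value separates into too few pieces? "error", the default, throws an error. "debug" adds additional columns to the output to help you locate and resolve the underlying problem. "align_start" aligns the starts of short matches, adding NA at the end to pad to the correct length. "align_end" (separate_wider_delim() only) aligns the ends of short matches, adding NA at the start to pad to the correct length. |
too_many |
What should happen if a value separates into too many pieces? "error", the default, throws an error. "debug" adds additional columns to the output to help you locate and resolve the underlying problem. "drop" silently drops any extra pieces. "merge" (separate_wider_delim() only) merges any additional pieces into the last column. |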
cols_remove |
Should the input cols be removed from the output? Always FALSE if too_few or too_many is set to "debug". |
widths |
A named numeric vector where the names become column names, and the values specify the column width. Unnamed components will match, but not be included in the output. |
patterns |
A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output. |
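As a rough illustration of how the too_few and too_many options interact, the sketch below runs separate_wider_delim() on a small made-up tibble (the data and the output names p and q are hypothetical, chosen only for this example) and compares padding at the end with padding at the start, and dropping with merging.

```r
library(tidyr)

# Hypothetical data: one, two, and three pieces when split on "-"
df <- tibble(x = c("a", "a-b", "a-b-c"))

# Pad short values at the end with NA, silently drop surplus pieces
start_drop <- df %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("p", "q"),
    too_few = "align_start",
    too_many = "drop"
  )

# Pad short values at the start with NA, merge surplus pieces into the last column
end_merge <- df %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("p", "q"),
    too_few = "align_end",
    too_many = "merge"
  )

start_drop
end_merge
```

With "align_end" the single-piece value lands in the last column, and with "merge" the delimiter is kept inside the merged piece.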
A data frame based on data. It has the same rows, but different columns:
The primary purpose of the functions is to create new columns from components of the string.
For separate_wider_delim() the names of new columns come from names.
For separate_wider_position() the names come from the names of widths.
For separate_wider_regex() the names come from the names of patterns.
If too_few or too_many is "debug", the output will contain additional columns useful for debugging:
{col}_ok: a logical vector which tells you if the input was ok or not. Use to quickly find the problematic rows.
{col}_remainder: any text remaining after separation.
{col}_pieces, {col}_width, {col}_matches: number of pieces, number of characters, and number of matches for separate_wider_delim(), separate_wider_position() and separate_wider_regex() respectively.
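One way to use the debugging columns is to keep only the rows flagged as problematic. The sketch below is a minimal example on made-up data (the tibble and the output names a and b are hypothetical); it relies on the {col}_ok column described above and uses base subsetting rather than any additional package. In current tidyr releases, debug mode also emits a warning announcing the added columns.

```r
library(tidyr)

# Hypothetical data: only the second value splits into exactly two pieces
df <- tibble(id = 1:3, x = c("x", "x y", "x y z"))

dbg <- df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )

# Keep only the rows where separation did not produce exactly two pieces
bad <- dbg[!dbg$x_ok, ]
bad
```

The x_pieces and x_remainder columns in `bad` then show how many pieces each value produced and what text was left over.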
If cols_remove = TRUE (the default), the input cols will be removed from the output.
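To keep the original column alongside the new ones, cols_remove can be set to FALSE. A minimal sketch, reusing the kind of data shown in the examples:

```r
library(tidyr)

df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))

# cols_remove = FALSE keeps the input column `x` next to the new columns
kept <- df %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("gender", "unit"),
    cols_remove = FALSE
  )

kept
```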
library(tidyr)

df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))
# There are three basic ways to split up a string into pieces:
# 1. with a delimiter
df %>% separate_wider_delim(x, delim = "-", names = c("gender", "unit"))
# 2. by length
df %>% separate_wider_position(x, c(gender = 1, 1, unit = 3))
# 3. defining each component with a regular expression
df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))
# Sometimes you split on the "last" delimiter
df <- tibble(var = c("race_1", "race_2", "age_bucket_1", "age_bucket_2"))
# _delim won't help because it always splits on the first delimiter
try(df %>% separate_wider_delim(var, "_", names = c("var1", "var2")))
df %>% separate_wider_delim(var, "_", names = c("var1", "var2"), too_many = "merge")
# Instead, you can use _regex
df %>% separate_wider_regex(var, c(var1 = ".*", "_", var2 = ".*"))
# this works because * is greedy; you can mimic the _delim behaviour with .*?
df %>% separate_wider_regex(var, c(var1 = ".*?", "_", var2 = ".*"))
# If the number of components varies, it's most natural to split into rows
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")
# But separate_wider_delim() provides some tools to deal with the problem
# The default behaviour tells you that there's a problem
try(df %>% separate_wider_delim(x, delim = " ", names = c("a", "b")))
# You can get additional insight by using the debug options
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
# But you can suppress the warnings
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "merge"
  )
# Or choose to automatically name the columns, producing as many as needed
df %>% separate_wider_delim(x, delim = " ", names_sep = "", too_few = "align_start")
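names_sep is also useful when several columns are separated in one call, since each output name is then built from the input column name, the separator, and the new name. The sketch below uses a made-up two-column tibble (the data and the names a and b are hypothetical) and assumes that names_sep must be supplied whenever cols selects more than one column.

```r
library(tidyr)

# Hypothetical data with two delimited columns
df <- tibble(
  id = 1:2,
  x = c("1-2", "3-4"),
  y = c("5-6", "7-8")
)

# Each input column contributes its own prefixed pair of output columns
wide <- df %>%
  separate_wider_delim(
    c(x, y),
    delim = "-",
    names = c("a", "b"),
    names_sep = "_"
  )

names(wide)
```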