In easystats/datawizard: Easy Data Wrangling and Statistical Transformations

library(knitr)
options(knitr.kable.NA = "")
knitr::opts_chunk$set(
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  dpi = 300
)

pkgs <- c(
  "dplyr",
  "tidyr"
)
all_deps_available <- all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)))

if (all_deps_available) {
  library(datawizard)
  library(dplyr)
  library(tidyr)
}

# Since we explicitly put `eval = TRUE` for some chunks, we can't rely on
# `knitr::opts_chunk$set(eval = FALSE)` at the beginning of the script.
# Therefore, we introduce a logical that is `FALSE` only if all suggested
# dependencies are not installed (cf easystats/easystats#317)
evaluate_chunk <- all_deps_available && getRversion() >= "4.1.0"

This vignette can be referred to by citing the following:

Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. Journal of Open Source Software, 7(78), 4684, https://doi.org/10.21105/joss.04684

```{css, echo=FALSE, eval = TRUE} .datawizard, .datawizard > .sourceCode { background-color: #e6e6ff; } .tidyverse, .tidyverse > .sourceCode { background-color: #d9f2e5; } .custom_note { border-left: solid 5px hsl(220, 100%, 30%); background-color: hsl(220, 100%, 95%); padding: 5px; margin-bottom: 10px }

# Introduction

`{datawizard}` package aims to make basic data wrangling easier than
with base R. The data wrangling workflow it supports is similar to the one
supported by the tidyverse package combination of `{dplyr}` and `{tidyr}`. However,
one of its main features is that it has a very few dependencies: `{stats}` and `{utils}`
(included in base R) and `{insight}`, which is the core package of the _easystats_
ecosystem. This package grew organically to simultaneously satisfy the
"0 non-base hard dependency" principle of _easystats_ and the data wrangling needs
of the constituent packages in this ecosystem. It is also
important to note that `{datawizard}` was designed to avoid namespace collisions
with `{tidyverse}` packages.

In this article, we will see how to go through basic data wrangling steps with
`{datawizard}`. We will also compare it to the `{tidyverse}` syntax for achieving the same.
This way, if you decide to make the switch, you can easily find the translations here.
This vignette is largely inspired from `{dplyr}`'s [Getting started vignette](https://dplyr.tidyverse.org/articles/dplyr.html).

<!-- NOTE: use raw HTML so that vignette can compile even if `evaluate_chunk` is FALSE. -->
<!-- See e.g. #527 -->

<div class="custom_note">
  <p>
    Note: In this vignette, we use the native pipe-operator, `|>`, which was 
    introduced in R 4.1. Users of R version 3.6 or 4.0 should replace the native 
    pipe by magrittr's one (`%>%`) so that examples work.
  </p>
</div>

```r
library(dplyr)
library(tidyr)
library(datawizard)

data(efc)
efc <- head(efc)

Workhorses

Before we look at their tidyverse equivalents, we can first have a look at {datawizard}'s key functions for data wrangling:

Note that there are a few functions in {datawizard} that have no strict equivalent in {dplyr} or {tidyr} (e.g data_rotate()), and so we won't discuss them in the next section.

Equivalence with `{dplyr}` / `{tidyr}`

Before we look at them individually, let's first have a look at the summary table of this equivalence.

Filtering {#filtering}

data_filter() is a wrapper around subset(). However, if you want to have several filtering conditions, you can either use & (as in subset()) or , (as in dplyr::filter()).

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_filter(
    skin_color == "light",
    eye_color == "brown"
  )

# or
starwars |>
  data_filter(
    skin_color == "light" &
      eye_color == "brown"
  )

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  filter(
    skin_color == "light",
    eye_color == "brown"
  )

:::

::::

starwars <- head(starwars)

Selecting {#selecting}

data_select() is the equivalent of dplyr::select(). The main difference between these two functions is that data_select() uses two arguments (select and exclude) and requires quoted column names if we want to select several variables, while dplyr::select() accepts any unquoted column names.

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_select(select = c("hair_color", "skin_color", "eye_color"))

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  select(hair_color, skin_color, eye_color)

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_select(select = -ends_with("color"))

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  select(-ends_with("color"))

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_select(select = -(hair_color:eye_color))

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  select(!(hair_color:eye_color))

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_select(exclude = regex("color$"))

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  select(-contains("color$"))

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_select(select = is.numeric)

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  select(where(is.numeric))

:::

::::

You can find a list of all the select helpers with ?data_select.

Modifying {#modifying}

data_modify() is a wrapper around base::transform() but has several additional benefits:

it allows us to use newly created variables in the following expressions;
it works with grouped data;
it preserves variable attributes such as labels;
it accepts expressions as character vectors so that it is easy to program with it

This last point is also the main difference between data_modify() and dplyr::mutate().

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
efc |>
  data_modify(
    c12hour_c = center(c12hour),
    c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
    c12hour_z2 = standardize(c12hour)
  )

:::

::: {}

# ---------- tidyverse -----------
efc |>
  mutate(
    c12hour_c = center(c12hour),
    c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
    c12hour_z2 = standardize(c12hour)
  )

:::

::::

data_modify() accepts expressions as strings:

new_exp <- c(
  "c12hour_c = center(c12hour)",
  "c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE)"
)
data_modify(efc, new_exp)

This makes it easy to use it in custom functions:

miles_to_km <- function(data, var) {
  data_modify(
    data,
    paste0("km = ", var, "* 1.609344")
  )
}

distance <- data.frame(miles = c(1, 8, 233, 88, 9))
distance

miles_to_km(distance, "miles")

Sorting {#sorting}

data_arrange() is the equivalent of dplyr::arrange(). It takes two arguments: a data frame, and a vector of column names used to sort the rows. Note that contrary to most other functions in {datawizard}, it is not possible to use select helpers such as starts_with() in data_arrange().

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars |>
  data_arrange(c("hair_color", "height"))

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  arrange(hair_color, height)

:::

::::

You can also sort variables in descending order by putting a "-" in front of their name, like below:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars |>
  data_arrange(c("-hair_color", "-height"))

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  arrange(desc(hair_color), -height)

:::

::::

Extracting {#extracting}

Although we mostly work on data frames, it is sometimes useful to extract a single column as a vector. This can be done with data_extract(), which reproduces the behavior of dplyr::pull():

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars |>
  data_extract(gender)

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  pull(gender)

:::

::::

We can also specify several variables in select. In this case, data_extract() is equivalent to data_select():

starwars |>
  data_extract(select = contains("color"))

Renaming {#renaming}

data_rename() is the equivalent of dplyr::rename() but the syntax between the two is different. While dplyr::rename() takes new-old pairs of column names, data_rename() requires a vector of column names to rename, and then a vector of new names for these columns that must be of the same length.

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_rename(
    select = c("sex", "hair_color"),
    replacement = c("Sex", "Hair Color")
  )

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  rename(
    Sex = sex,
    "Hair Color" = hair_color
  )

:::

::::

The way data_rename() is designed makes it easy to apply the same modifications to a vector of column names. For example, we can remove underscores and use TitleCase with the following code:

to_rename <- names(starwars)

starwars |>
  data_rename(
    select = to_rename,
    replacement = tools::toTitleCase(gsub("_", " ", to_rename, fixed = TRUE))
  )

It is also possible to add a prefix or a suffix to all or a subset of variables with data_addprefix() and data_addsuffix(). The argument select accepts all select helpers that we saw above with data_select():

starwars |>
  data_addprefix(
    pattern = "OLD.",
    select = contains("color")
  ) |>
  data_addsuffix(
    pattern = ".NEW",
    select = -contains("color")
  )

Relocating {#relocating}

Sometimes, we want to relocate one or a small subset of columns in the dataset. Rather than typing many names in data_select(), we can use data_relocate(), which is the equivalent of dplyr::relocate(). Just like data_select(), we can specify a list of variables we want to relocate with select and exclude. Then, the arguments before and after^[Note that we use before and after whereas dplyr::relocate() uses .before and .after.] specify where the selected columns should be relocated:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars |>
  data_relocate(sex:homeworld, before = "height")

:::

::: {}

# ---------- tidyverse -----------
starwars |>
  relocate(sex:homeworld, .before = height)

:::

::::

In addition to column names, before and after accept column indices. Finally, one can use before = -1 to relocate the selected columns just before the last column, or after = -1 to relocate them after the last column.

# ---------- datawizard -----------
starwars |>
  data_relocate(sex:homeworld, after = -1)

Reshaping {#reshaping}

Longer

Reshaping data from wide to long or from long to wide format can be done with data_to_long() and data_to_wide(). These functions were designed to match tidyr::pivot_longer() and tidyr::pivot_wider() arguments, so that the only thing to do is to change the function name. However, not all of tidyr::pivot_longer() and tidyr::pivot_wider() features are available yet.

We will use the relig_income dataset, as in the {tidyr} vignette.

relig_income

We would like to reshape this dataset to have 3 columns: religion, count, and income. The column "religion" doesn't need to change, so we exclude it with -religion. Then, each remaining column corresponds to an income category. Therefore, we want to move all these column names to a single column called "income". Finally, the values corresponding to each of these columns will be reshaped to be in a single new column, called "count".

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
relig_income |>
  data_to_long(
    -religion,
    names_to = "income",
    values_to = "count"
  )

:::

::: {}

# ---------- tidyverse -----------
relig_income |>
  pivot_longer(
    !religion,
    names_to = "income",
    values_to = "count"
  )

:::

::::

To explore a bit more the arguments of data_to_long(), we will use another dataset: the billboard dataset.

billboard

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
billboard |>
  data_to_long(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

:::

::: {}

# ---------- tidyverse -----------
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

:::

::::

Wider

Once again, we use an example in the {tidyr} vignette to show how close data_to_wide() and pivot_wider() are:

fish_encounters

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
fish_encounters |>
  data_to_wide(
    names_from = "station",
    values_from = "seen",
    values_fill = 0
  )

:::

::: {}

# ---------- tidyverse -----------
fish_encounters |>
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )

:::

::::

Joining {#joining}

In {datawizard}, joining datasets is done with data_join() (or its alias data_merge()). Contrary to {dplyr}, this unique function takes care of all types of join, which are then specified inside the function with the argument join (by default, join = "left").

Below, we show how to perform the four most common joins: full, left, right and inner. We will use the datasets band_membersand band_instruments provided by {dplyr}:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

band_members

:::

::: {}

band_instruments

:::

::::

Full join

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members |>
  data_join(band_instruments, join = "full")

:::

::: {}

# ---------- tidyverse -----------
band_members |>
  full_join(band_instruments)

:::

::::

Left and right joins

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members |>
  data_join(band_instruments, join = "left")

:::

::: {}

# ---------- tidyverse -----------
band_members |>
  left_join(band_instruments)

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members |>
  data_join(band_instruments, join = "right")

:::

::: {}

# ---------- tidyverse -----------
band_members |>
  right_join(band_instruments)

:::

::::

Inner join

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members |>
  data_join(band_instruments, join = "inner")

:::

::: {}

# ---------- tidyverse -----------
band_members |>
  inner_join(band_instruments)

:::

::::

Uniting {#uniting}

Uniting variables is useful e.g to create unique indices by combining several variables or to gather years, months, and days into a single date. data_unite() offers an interface very close to tidyr::unite():

test <- data.frame(
  year = 2002:2004,
  month = c("02", "03", "09"),
  day = c("11", "22", "28"),
  stringsAsFactors = FALSE
)
test

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test |>
  data_unite(
    new_column = "date",
    select = c("year", "month", "day"),
    separator = "-"
  )

:::

::: {}

# ---------- tidyverse -----------
test |>
  unite(
    col = "date",
    year, month, day,
    sep = "-"
  )

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test |>
  data_unite(
    new_column = "date",
    select = c("year", "month", "day"),
    separator = "-",
    append = TRUE
  )

:::

::: {}

# ---------- tidyverse -----------
test |>
  unite(
    col = "date",
    year, month, day,
    sep = "-",
    remove = FALSE
  )

:::

::::

Separating {#separating}

Separating variables is the counterpart to uniting variables and is useful to split values into multiple columns, e.g. when splitting a date into values for years, months and days. data_separate() offers an interface very close to tidyr::separate():

test <- data.frame(
  date_arrival = c("2002-02-11", "2003-03-22", "2004-09-28"),
  date_departure = c("2002-03-15", "2003-03-28", "2004-09-30"),
  stringsAsFactors = FALSE
)
test

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test |>
  data_separate(
    select = "date_arrival",
    new_columns = c("Year", "Month", "Day")
  )

:::

::: {}

# ---------- tidyverse -----------
test |>
  separate(
    date_arrival,
    into = c("Year", "Month", "Day")
  )

:::

::::

Unlike tidyr::separate(), you can separate multiple columns in one step with data_separate().

test |>
  data_separate(
    new_columns = list(
      date_arrival = c("Arr_Year", "Arr_Month", "Arr_Day"),
      date_departure = c("Dep_Year", "Dep_Month", "Dep_Day")
    )
  )

Other useful functions

{datawizard} contains other functions that are not necessarily included in {dplyr} or {tidyr} or do not directly modify the data. Some of them are inspired from the package janitor.

Work with rownames

We can convert a column in rownames and move rownames to a new column with rownames_as_column() and column_as_rownames():

mtcars <- head(mtcars)
mtcars

mtcars2 <- mtcars |>
  rownames_as_column(var = "model")

mtcars2

mtcars2 |>
  column_as_rownames(var = "model")

Work with row ids

rowid_as_column() is close but not identical to tibble::rowid_to_column(). The main difference is when we use it with grouped data. While tibble::rowid_to_column() uses one distinct rowid for every row in the dataset, rowid_as_column() creates one id for every row in each group. Therefore, two rows in different groups can have the same row id.

This means that rowid_as_column() is closer to using n() in mutate(), like the following:

test <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(3, 5, 8, 1),
  stringsAsFactors = FALSE
)
test

test |>
  data_group(group) |>
  tibble::rowid_to_column()

test |>
  data_group(group) |>
  rowid_as_column()

test |>
  data_group(group) |>
  mutate(id = seq_len(n()))

Work with column names

When dealing with messy data, it is sometimes useful to use a row as column names, and vice versa. This can be done with row_to_colnames() and colnames_to_row().

x <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6)
)
x
x2 <- x |>
  row_to_colnames(row = 2)
x2

x2 |>
  colnames_to_row()

Take a quick look at the data

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
data_peek(iris)

:::

::: {}

# ---------- tidyverse -----------
glimpse(iris)

:::

::::

easystats/datawizard documentation built on April 17, 2025, 5:32 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

easystats/datawizard
Easy Data Wrangling and Statistical Transformations

In easystats/datawizard: Easy Data Wrangling and Statistical Transformations

Workhorses

Equivalence with `{dplyr}` / `{tidyr}`

Filtering {#filtering}

Selecting {#selecting}

Modifying {#modifying}

Sorting {#sorting}

Extracting {#extracting}

Renaming {#renaming}

Relocating {#relocating}

Reshaping {#reshaping}

Longer

Wider

Joining {#joining}

Full join

Left and right joins

Inner join

Uniting {#uniting}

Separating {#separating}

Other useful functions

Work with rownames

Work with row ids

Work with column names

Take a quick look at the data

R Package Documentation

Browse R Packages

We want your feedback!

easystats/datawizard Easy Data Wrangling and Statistical Transformations

In easystats/datawizard: Easy Data Wrangling and Statistical Transformations

Workhorses

Equivalence with {dplyr} / {tidyr}

Filtering {#filtering}

Selecting {#selecting}

Modifying {#modifying}

Sorting {#sorting}

Extracting {#extracting}

Renaming {#renaming}

Relocating {#relocating}

Reshaping {#reshaping}

Longer

Wider

Joining {#joining}

Full join

Left and right joins

Inner join

Uniting {#uniting}

Separating {#separating}

Other useful functions

Work with rownames

Work with row ids

Work with column names

Take a quick look at the data

R Package Documentation

Browse R Packages

We want your feedback!

easystats/datawizard
Easy Data Wrangling and Statistical Transformations

Equivalence with `{dplyr}` / `{tidyr}`