Coming from 'tidyverse
In datawizard: Easy Data Wrangling and Statistical Transformations

library(knitr)
options(knitr.kable.NA = "")
knitr::opts_chunk$set(
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  dpi = 300
)

pkgs <- c(
  "dplyr",
  "datawizard",
  "tidyr"
)

# since we explicitely put eval = TRUE for some chunks, we can't rely on
# knitr::opts_chunk$set(eval = FALSE) at the beginning of the script. So we make
# a logical that is FALSE only if deps are not installed (cf easystats/easystats#317)
evaluate_chunk <- TRUE

if (!all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)))) {
  evaluate_chunk <- FALSE
}

This vignette can be referred to by citing the following:

Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. Journal of Open Source Software, 7(78), 4684, https://doi.org/10.21105/joss.04684

```{css, echo=FALSE, eval = evaluate_chunk} .datawizard, .datawizard > .sourceCode { background-color: #e6e6ff; } .tidyverse, .tidyverse > .sourceCode { background-color: #d9f2e5; }

# Introduction

`{datawizard}` package aims to make basic data wrangling easier than 
with base R. The data wrangling workflow it supports is similar to the one
supported by the tidyverse package combination of `{dplyr}` and `{tidyr}`. However,
one of its main features is that it has a very few dependencies: `{stats}` and `{utils}`
(included in base R) and `{insight}`, which is the core package of the _easystats_ 
ecosystem. This package grew organically to simultaneously satisfy the 
"0 non-base hard dependency" principle of _easystats_ and the data wrangling needs
of the constituent packages in this ecosystem.

One drawback of this genesis is that not all features of the `{tidyverse}` 
packages are supported since only features that were necessary for _easystats_ 
ecosystem have been implemented. Some of these missing features (such as `summarize`
or the pipe operator `%>%`) are made available in other dependency-free packages, 
such as [`{poorman}`](https://github.com/nathaneastwood/poorman/). It is also 
important to note that `{datawizard}` was designed to avoid namespace collisions 
with `{tidyverse}` packages.

In this article, we will see how to go through basic data wrangling steps with 
`{datawizard}`. We will also compare it to the `{tidyverse}` syntax for achieving the same. 
This way, if you decide to make the switch, you can easily find the translations here.
This vignette is largely inspired from `{dplyr}`'s [Getting started vignette](https://dplyr.tidyverse.org/articles/dplyr.html).

```r
library(dplyr)
library(tidyr)
library(datawizard)

data(efc)
efc <- head(efc)

Workhorses

Before we look at their tidyverse equivalents, we can first have a look at {datawizard}'s key functions for data wrangling:

Note that there are a few functions in {datawizard} that have no strict equivalent in {dplyr} or {tidyr} (e.g data_rotate()), and so we won't discuss them in the next section.

Equivalence with `{dplyr}` / `{tidyr}`

Before we look at them individually, let's first have a look at the summary table of this equivalence.

Filtering {#filtering}

data_filter() is a wrapper around subset(). However, if you want to have several filtering conditions, you can either use & (as in subset()) or , (as in dplyr::filter()).

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_filter(
    skin_color == "light",
    eye_color == "brown"
  )

# or
starwars %>%
  data_filter(
    skin_color == "light" &
      eye_color == "brown"
  )

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  filter(
    skin_color == "light",
    eye_color == "brown"
  )

:::

::::

starwars <- head(starwars)

Selecting {#selecting}

data_select() is the equivalent of dplyr::select(). The main difference between these two functions is that data_select() uses two arguments (select and exclude) and requires quoted column names if we want to select several variables, while dplyr::select() accepts any unquoted column names.

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = c("hair_color", "skin_color", "eye_color"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(hair_color, skin_color, eye_color)

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = -ends_with("color"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(-ends_with("color"))

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = -(hair_color:eye_color))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(!(hair_color:eye_color))

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(exclude = regex("color$"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(-contains("color$"))

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = is.numeric)

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(where(is.numeric))

:::

::::

You can find a list of all the select helpers with ?data_select.

Modifying {#modifying}

data_modify() is a wrapper around base::transform() but has several additional benefits:

it allows us to use newly created variables in the following expressions;
it works with grouped data;
it preserves variable attributes such as labels;
it accepts expressions as character vectors so that it is easy to program with it

This last point is also the main difference between data_modify() and dplyr::mutate().

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
efc %>%
  data_modify(
    c12hour_c = center(c12hour),
    c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
    c12hour_z2 = standardize(c12hour)
  )

:::

::: {}

# ---------- tidyverse -----------
efc %>%
  mutate(
    c12hour_c = center(c12hour),
    c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
    c12hour_z2 = standardize(c12hour)
  )

:::

::::

data_modify() accepts expressions as strings:

new_exp <- c(
  "c12hour_c = center(c12hour)",
  "c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE)"
)
data_modify(efc, new_exp)

This makes it easy to use it in custom functions:

miles_to_km <- function(data, var) {
  data_modify(
    data,
    paste0("km = ", var, "* 1.609344")
  )
}

distance <- data.frame(miles = c(1, 8, 233, 88, 9))
distance

miles_to_km(distance, "miles")

Sorting {#sorting}

data_arrange() is the equivalent of dplyr::arrange(). It takes two arguments: a data frame, and a vector of column names used to sort the rows. Note that contrary to most other functions in {datawizard}, it is not possible to use select helpers such as starts_with() in data_arrange().

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars %>%
  data_arrange(c("hair_color", "height"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  arrange(hair_color, height)

:::

::::

You can also sort variables in descending order by putting a "-" in front of their name, like below:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars %>%
  data_arrange(c("-hair_color", "-height"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  arrange(desc(hair_color), -height)

:::

::::

Extracting {#extracting}

Although we mostly work on data frames, it is sometimes useful to extract a single column as a vector. This can be done with data_extract(), which reproduces the behavior of dplyr::pull():

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars %>%
  data_extract(gender)

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  pull(gender)

:::

::::

We can also specify several variables in select. In this case, data_extract() is equivalent to data_select():

starwars %>%
  data_extract(select = contains("color"))

Renaming {#renaming}

data_rename() is the equivalent of dplyr::rename() but the syntax between the two is different. While dplyr::rename() takes new-old pairs of column names, data_rename() requires a vector of column names to rename, and then a vector of new names for these columns that must be of the same length.

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_rename(
    pattern = c("sex", "hair_color"),
    replacement = c("Sex", "Hair Color")
  )

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  rename(
    Sex = sex,
    "Hair Color" = hair_color
  )

:::

::::

The way data_rename() is designed makes it easy to apply the same modifications to a vector of column names. For example, we can remove underscores and use TitleCase with the following code:

to_rename <- names(starwars)

starwars %>%
  data_rename(
    pattern = to_rename,
    replacement = tools::toTitleCase(gsub("_", " ", to_rename, fixed = TRUE))
  )

It is also possible to add a prefix or a suffix to all or a subset of variables with data_addprefix() and data_addsuffix(). The argument select accepts all select helpers that we saw above with data_select():

starwars %>%
  data_addprefix(
    pattern = "OLD.",
    select = contains("color")
  ) %>%
  data_addsuffix(
    pattern = ".NEW",
    select = -contains("color")
  )

Relocating {#relocating}

Sometimes, we want to relocate one or a small subset of columns in the dataset. Rather than typing many names in data_select(), we can use data_relocate(), which is the equivalent of dplyr::relocate(). Just like data_select(), we can specify a list of variables we want to relocate with select and exclude. Then, the arguments before and after^[Note that we use before and after whereas dplyr::relocate() uses .before and .after.] specify where the selected columns should be relocated:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_relocate(sex:homeworld, before = "height")

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  relocate(sex:homeworld, .before = height)

:::

::::

In addition to column names, before and after accept column indices. Finally, one can use before = -1 to relocate the selected columns just before the last column, or after = -1 to relocate them after the last column.

# ---------- datawizard -----------
starwars %>%
  data_relocate(sex:homeworld, after = -1)

Reshaping {#reshaping}

Longer

Reshaping data from wide to long or from long to wide format can be done with data_to_long() and data_to_wide(). These functions were designed to match tidyr::pivot_longer() and tidyr::pivot_wider() arguments, so that the only thing to do is to change the function name. However, not all of tidyr::pivot_longer() and tidyr::pivot_wider() features are available yet.

We will use the relig_income dataset, as in the {tidyr} vignette.

relig_income

We would like to reshape this dataset to have 3 columns: religion, count, and income. The column "religion" doesn't need to change, so we exclude it with -religion. Then, each remaining column corresponds to an income category. Therefore, we want to move all these column names to a single column called "income". Finally, the values corresponding to each of these columns will be reshaped to be in a single new column, called "count".

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
relig_income %>%
  data_to_long(
    -religion,
    names_to = "income",
    values_to = "count"
  )

:::

::: {}

# ---------- tidyverse -----------
relig_income %>%
  pivot_longer(
    !religion,
    names_to = "income",
    values_to = "count"
  )

:::

::::

To explore a bit more the arguments of data_to_long(), we will use another dataset: the billboard dataset.

billboard

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
billboard %>%
  data_to_long(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

:::

::: {}

# ---------- tidyverse -----------
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

:::

::::

Wider

Once again, we use an example in the {tidyr} vignette to show how close data_to_wide() and pivot_wider() are:

fish_encounters

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
fish_encounters %>%
  data_to_wide(
    names_from = "station",
    values_from = "seen",
    values_fill = 0
  )

:::

::: {}

# ---------- tidyverse -----------
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )

:::

::::

Joining {#joining}

In {datawizard}, joining datasets is done with data_join() (or its alias data_merge()). Contrary to {dplyr}, this unique function takes care of all types of join, which are then specified inside the function with the argument join (by default, join = "left").

Below, we show how to perform the four most common joins: full, left, right and inner. We will use the datasets band_membersand band_instruments provided by {dplyr}:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

band_members

:::

::: {}

band_instruments

:::

::::

Full join

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "full")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  full_join(band_instruments)

:::

::::

Left and right joins

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "left")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  left_join(band_instruments)

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "right")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  right_join(band_instruments)

:::

::::

Inner join

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "inner")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  inner_join(band_instruments)

:::

::::

Uniting {#uniting}

Uniting variables is useful e.g to create unique indices by combining several variables or to gather years, months, and days into a single date. data_unite() offers an interface very close to tidyr::unite():

test <- data.frame(
  year = 2002:2004,
  month = c("02", "03", "09"),
  day = c("11", "22", "28"),
  stringsAsFactors = FALSE
)
test

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test %>%
  data_unite(
    new_column = "date",
    select = c("year", "month", "day"),
    separator = "-"
  )

:::

::: {}

# ---------- tidyverse -----------
test %>%
  unite(
    col = "date",
    year, month, day,
    sep = "-"
  )

:::

::::

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test %>%
  data_unite(
    new_column = "date",
    select = c("year", "month", "day"),
    separator = "-",
    append = TRUE
  )

:::

::: {}

# ---------- tidyverse -----------
test %>%
  unite(
    col = "date",
    year, month, day,
    sep = "-",
    remove = FALSE
  )

:::

::::

Separating {#separating}

Separating variables is the counterpart to uniting variables and is useful to split values into multiple columns, e.g. when splitting a date into values for years, months and days. data_separate() offers an interface very close to tidyr::separate():

test <- data.frame(
  date_arrival = c("2002-02-11", "2003-03-22", "2004-09-28"),
  date_departure = c("2002-03-15", "2003-03-28", "2004-09-30"),
  stringsAsFactors = FALSE
)
test

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test %>%
  data_separate(
    select = "date_arrival",
    new_columns = c("Year", "Month", "Day")
  )

:::

::: {}

# ---------- tidyverse -----------
test %>%
  separate(
    date_arrival,
    into = c("Year", "Month", "Day")
  )

:::

::::

Unlike tidyr::separate(), you can separate multiple columns in one step with data_separate().

test %>%
  data_separate(
    new_columns = list(
      date_arrival = c("Arr_Year", "Arr_Month", "Arr_Day"),
      date_departure = c("Dep_Year", "Dep_Month", "Dep_Day")
    )
  )

Other useful functions

{datawizard} contains other functions that are not necessarily included in {dplyr} or {tidyr} or do not directly modify the data. Some of them are inspired from the package janitor.

Work with rownames

We can convert a column in rownames and move rownames to a new column with rownames_as_column() and column_as_rownames():

mtcars <- head(mtcars)
mtcars

mtcars2 <- mtcars %>%
  rownames_as_column(var = "model")

mtcars2

mtcars2 %>%
  column_as_rownames(var = "model")

Work with row ids

rowid_as_column() is close but not identical to tibble::rowid_to_column(). The main difference is when we use it with grouped data. While tibble::rowid_to_column() uses one distinct rowid for every row in the dataset, rowid_as_column() creates one id for every row in each group. Therefore, two rows in different groups can have the same row id.

This means that rowid_as_column() is closer to using n() in mutate(), like the following:

test <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(3, 5, 8, 1),
  stringsAsFactors = FALSE
)
test

test %>%
  data_group(group) %>%
  tibble::rowid_to_column()

test %>%
  data_group(group) %>%
  rowid_as_column()

test %>%
  data_group(group) %>%
  mutate(id = seq_len(n()))

Work with column names

When dealing with messy data, it is sometimes useful to use a row as column names, and vice versa. This can be done with row_to_colnames() and colnames_to_row().

x <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6)
)
x
x2 <- x %>%
  row_to_colnames(row = 2)
x2

x2 %>%
  colnames_to_row()

Take a quick look at the data

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
data_peek(iris)

:::

::: {}

# ---------- tidyverse -----------
glimpse(iris)

:::

::::

Any scripts or data that you put into this service are public.

datawizard documentation built on Sept. 15, 2023, 9:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

datawizard
Easy Data Wrangling and Statistical Transformations

Coming from 'tidyverse
In datawizard: Easy Data Wrangling and Statistical Transformations

Workhorses

Equivalence with `{dplyr}` / `{tidyr}`

Filtering {#filtering}

Selecting {#selecting}

Modifying {#modifying}

Sorting {#sorting}

Extracting {#extracting}

Renaming {#renaming}

Relocating {#relocating}

Reshaping {#reshaping}

Longer

Wider

Joining {#joining}

Full join

Left and right joins

Inner join

Uniting {#uniting}

Separating {#separating}

Other useful functions

Work with rownames

Work with row ids

Work with column names

Take a quick look at the data

Try the datawizard package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

datawizard Easy Data Wrangling and Statistical Transformations

Coming from 'tidyverse In datawizard: Easy Data Wrangling and Statistical Transformations

Workhorses

Equivalence with {dplyr} / {tidyr}

Filtering {#filtering}

Selecting {#selecting}

Modifying {#modifying}

Sorting {#sorting}

Extracting {#extracting}

Renaming {#renaming}

Relocating {#relocating}

Reshaping {#reshaping}

Longer

Wider

Joining {#joining}

Full join

Left and right joins

Inner join

Uniting {#uniting}

Separating {#separating}

Other useful functions

Work with rownames

Work with row ids

Work with column names

Take a quick look at the data

Try the datawizard package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

datawizard
Easy Data Wrangling and Statistical Transformations

Coming from 'tidyverse
In datawizard: Easy Data Wrangling and Statistical Transformations

Equivalence with `{dplyr}` / `{tidyr}`