Coming from 'tidyverse

library(knitr)
options(knitr.kable.NA = "")
knitr::opts_chunk$set(
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  dpi = 300
)

pkgs <- c(
  "dplyr",
  "datawizard",
  "tidyr"
)

# since we explicitely put eval = TRUE for some chunks, we can't rely on
# knitr::opts_chunk$set(eval = FALSE) at the beginning of the script. So we make
# a logical that is FALSE only if deps are not installed (cf easystats/easystats#317)
evaluate_chunk <- TRUE

if (!all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)))) {
  evaluate_chunk <- FALSE
}

This vignette can be referred to by citing the following:

Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. Journal of Open Source Software, 7(78), 4684, https://doi.org/10.21105/joss.04684

```{css, echo=FALSE, eval = evaluate_chunk} .datawizard, .datawizard > .sourceCode { background-color: #e6e6ff; } .tidyverse, .tidyverse > .sourceCode { background-color: #d9f2e5; }

# Introduction

`{datawizard}` package aims to make basic data wrangling easier than 
with base R. The data wrangling workflow it supports is similar to the one
supported by the tidyverse package combination of `{dplyr}` and `{tidyr}`. However,
one of its main features is that it has a very few dependencies: `{stats}` and `{utils}`
(included in base R) and `{insight}`, which is the core package of the _easystats_ 
ecosystem. This package grew organically to simultaneously satisfy the 
"0 non-base hard dependency" principle of _easystats_ and the data wrangling needs
of the constituent packages in this ecosystem.

One drawback of this genesis is that not all features of the `{tidyverse}` 
packages are supported since only features that were necessary for _easystats_ 
ecosystem have been implemented. Some of these missing features (such as `summarize`
or the pipe operator `%>%`) are made available in other dependency-free packages, 
such as [`{poorman}`](https://github.com/nathaneastwood/poorman/). It is also 
important to note that `{datawizard}` was designed to avoid namespace collisions 
with `{tidyverse}` packages.

In this article, we will see how to go through basic data wrangling steps with 
`{datawizard}`. We will also compare it to the `{tidyverse}` syntax for achieving the same. 
This way, if you decide to make the switch, you can easily find the translations here.
This vignette is largely inspired from `{dplyr}`'s [Getting started vignette](https://dplyr.tidyverse.org/articles/dplyr.html).

```r
library(dplyr)
library(tidyr)
library(datawizard)

data(efc)
efc <- head(efc)

Workhorses

Before we look at their tidyverse equivalents, we can first have a look at {datawizard}'s key functions for data wrangling:

| Function | Operation | | :---------------- | :------------------------------------------------ | | data_filter() | to select only certain observations | | data_select() | to select only a few variables | | data_modify() | to create variables or modify existing ones | | data_arrange() | to sort observations | | data_extract() | to extract a single variable | | data_rename() | to rename variables | | data_relocate() | to reorder a data frame | | data_to_long() | to convert data from wide to long | | data_to_wide() | to convert data from long to wide | | data_join() | to join two data frames | | data_unite() | to concatenate several columns into a single one | | data_separate() | to separate a single column into multiple columns |

Note that there are a few functions in {datawizard} that have no strict equivalent in {dplyr} or {tidyr} (e.g data_rotate()), and so we won't discuss them in the next section.

Equivalence with {dplyr} / {tidyr}

Before we look at them individually, let's first have a look at the summary table of this equivalence.

| Function | Tidyverse equivalent(s) | | :---------------- | :------------------------------------------------------------------ | | data_filter() | dplyr::filter(), dplyr::slice() | | data_select() | dplyr::select() | | data_modify() | dplyr::mutate() | | data_arrange() | dplyr::arrange() | | data_extract() | dplyr::pull() | | data_rename() | dplyr::rename() | | data_relocate() | dplyr::relocate() | | data_to_long() | tidyr::pivot_longer() | | data_to_wide() | tidyr::pivot_wider() | | data_join() | dplyr::inner_join(), dplyr::left_join(), dplyr::right_join(), | | | dplyr::full_join(), dplyr::anti_join(), dplyr::semi_join() | | data_peek() | dplyr::glimpse() | | data_unite() | tidyr::unite() | | data_separate() | tidyr::separate() |

Filtering {#filtering}

data_filter() is a wrapper around subset(). However, if you want to have several filtering conditions, you can either use & (as in subset()) or , (as in dplyr::filter()).

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_filter(
    skin_color == "light",
    eye_color == "brown"
  )

# or
starwars %>%
  data_filter(
    skin_color == "light" &
      eye_color == "brown"
  )

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  filter(
    skin_color == "light",
    eye_color == "brown"
  )

:::

::::


starwars <- head(starwars)

Selecting {#selecting}

data_select() is the equivalent of dplyr::select(). The main difference between these two functions is that data_select() uses two arguments (select and exclude) and requires quoted column names if we want to select several variables, while dplyr::select() accepts any unquoted column names.

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = c("hair_color", "skin_color", "eye_color"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(hair_color, skin_color, eye_color)

:::

::::


:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = -ends_with("color"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(-ends_with("color"))

:::

::::


:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = -(hair_color:eye_color))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(!(hair_color:eye_color))

:::

::::


:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(exclude = regex("color$"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(-contains("color$"))

:::

::::


:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_select(select = is.numeric)

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  select(where(is.numeric))

:::

::::


You can find a list of all the select helpers with ?data_select.

Modifying {#modifying}

data_modify() is a wrapper around base::transform() but has several additional benefits:

This last point is also the main difference between data_modify() and dplyr::mutate().

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
efc %>%
  data_modify(
    c12hour_c = center(c12hour),
    c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
    c12hour_z2 = standardize(c12hour)
  )

:::

::: {}

# ---------- tidyverse -----------
efc %>%
  mutate(
    c12hour_c = center(c12hour),
    c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
    c12hour_z2 = standardize(c12hour)
  )

:::

::::


data_modify() accepts expressions as strings:

new_exp <- c(
  "c12hour_c = center(c12hour)",
  "c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE)"
)
data_modify(efc, new_exp)

This makes it easy to use it in custom functions:

miles_to_km <- function(data, var) {
  data_modify(
    data,
    paste0("km = ", var, "* 1.609344")
  )
}

distance <- data.frame(miles = c(1, 8, 233, 88, 9))
distance

miles_to_km(distance, "miles")

Sorting {#sorting}

data_arrange() is the equivalent of dplyr::arrange(). It takes two arguments: a data frame, and a vector of column names used to sort the rows. Note that contrary to most other functions in {datawizard}, it is not possible to use select helpers such as starts_with() in data_arrange().

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars %>%
  data_arrange(c("hair_color", "height"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  arrange(hair_color, height)

:::

::::


You can also sort variables in descending order by putting a "-" in front of their name, like below:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars %>%
  data_arrange(c("-hair_color", "-height"))

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  arrange(desc(hair_color), -height)

:::

::::


Extracting {#extracting}

Although we mostly work on data frames, it is sometimes useful to extract a single column as a vector. This can be done with data_extract(), which reproduces the behavior of dplyr::pull():

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"} :::{}

# ---------- datawizard -----------
starwars %>%
  data_extract(gender)

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  pull(gender)

:::

::::


We can also specify several variables in select. In this case, data_extract() is equivalent to data_select():

starwars %>%
  data_extract(select = contains("color"))

Renaming {#renaming}

data_rename() is the equivalent of dplyr::rename() but the syntax between the two is different. While dplyr::rename() takes new-old pairs of column names, data_rename() requires a vector of column names to rename, and then a vector of new names for these columns that must be of the same length.

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_rename(
    pattern = c("sex", "hair_color"),
    replacement = c("Sex", "Hair Color")
  )

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  rename(
    Sex = sex,
    "Hair Color" = hair_color
  )

:::

::::


The way data_rename() is designed makes it easy to apply the same modifications to a vector of column names. For example, we can remove underscores and use TitleCase with the following code:

to_rename <- names(starwars)

starwars %>%
  data_rename(
    pattern = to_rename,
    replacement = tools::toTitleCase(gsub("_", " ", to_rename, fixed = TRUE))
  )

It is also possible to add a prefix or a suffix to all or a subset of variables with data_addprefix() and data_addsuffix(). The argument select accepts all select helpers that we saw above with data_select():

starwars %>%
  data_addprefix(
    pattern = "OLD.",
    select = contains("color")
  ) %>%
  data_addsuffix(
    pattern = ".NEW",
    select = -contains("color")
  )

Relocating {#relocating}

Sometimes, we want to relocate one or a small subset of columns in the dataset. Rather than typing many names in data_select(), we can use data_relocate(), which is the equivalent of dplyr::relocate(). Just like data_select(), we can specify a list of variables we want to relocate with select and exclude. Then, the arguments before and after^[Note that we use before and after whereas dplyr::relocate() uses .before and .after.] specify where the selected columns should be relocated:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
starwars %>%
  data_relocate(sex:homeworld, before = "height")

:::

::: {}

# ---------- tidyverse -----------
starwars %>%
  relocate(sex:homeworld, .before = height)

:::

::::


In addition to column names, before and after accept column indices. Finally, one can use before = -1 to relocate the selected columns just before the last column, or after = -1 to relocate them after the last column.

# ---------- datawizard -----------
starwars %>%
  data_relocate(sex:homeworld, after = -1)

Reshaping {#reshaping}

Longer

Reshaping data from wide to long or from long to wide format can be done with data_to_long() and data_to_wide(). These functions were designed to match tidyr::pivot_longer() and tidyr::pivot_wider() arguments, so that the only thing to do is to change the function name. However, not all of tidyr::pivot_longer() and tidyr::pivot_wider() features are available yet.

We will use the relig_income dataset, as in the {tidyr} vignette.

relig_income

We would like to reshape this dataset to have 3 columns: religion, count, and income. The column "religion" doesn't need to change, so we exclude it with -religion. Then, each remaining column corresponds to an income category. Therefore, we want to move all these column names to a single column called "income". Finally, the values corresponding to each of these columns will be reshaped to be in a single new column, called "count".

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
relig_income %>%
  data_to_long(
    -religion,
    names_to = "income",
    values_to = "count"
  )

:::

::: {}

# ---------- tidyverse -----------
relig_income %>%
  pivot_longer(
    !religion,
    names_to = "income",
    values_to = "count"
  )

:::

::::


To explore a bit more the arguments of data_to_long(), we will use another dataset: the billboard dataset.

billboard

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
billboard %>%
  data_to_long(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

:::

::: {}

# ---------- tidyverse -----------
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

:::

::::


Wider

Once again, we use an example in the {tidyr} vignette to show how close data_to_wide() and pivot_wider() are:

fish_encounters

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
fish_encounters %>%
  data_to_wide(
    names_from = "station",
    values_from = "seen",
    values_fill = 0
  )

:::

::: {}

# ---------- tidyverse -----------
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )

:::

::::


Joining {#joining}

In {datawizard}, joining datasets is done with data_join() (or its alias data_merge()). Contrary to {dplyr}, this unique function takes care of all types of join, which are then specified inside the function with the argument join (by default, join = "left").

Below, we show how to perform the four most common joins: full, left, right and inner. We will use the datasets band_membersand band_instruments provided by {dplyr}:

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

band_members

:::

::: {}

band_instruments

:::

::::

Full join

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "full")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  full_join(band_instruments)

:::

::::


Left and right joins

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "left")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  left_join(band_instruments)

:::

::::


:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "right")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  right_join(band_instruments)

:::

::::


Inner join

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "inner")

:::

::: {}

# ---------- tidyverse -----------
band_members %>%
  inner_join(band_instruments)

:::

::::


Uniting {#uniting}

Uniting variables is useful e.g to create unique indices by combining several variables or to gather years, months, and days into a single date. data_unite() offers an interface very close to tidyr::unite():

test <- data.frame(
  year = 2002:2004,
  month = c("02", "03", "09"),
  day = c("11", "22", "28"),
  stringsAsFactors = FALSE
)
test

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test %>%
  data_unite(
    new_column = "date",
    select = c("year", "month", "day"),
    separator = "-"
  )

:::

::: {}

# ---------- tidyverse -----------
test %>%
  unite(
    col = "date",
    year, month, day,
    sep = "-"
  )

:::

::::


:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test %>%
  data_unite(
    new_column = "date",
    select = c("year", "month", "day"),
    separator = "-",
    append = TRUE
  )

:::

::: {}

# ---------- tidyverse -----------
test %>%
  unite(
    col = "date",
    year, month, day,
    sep = "-",
    remove = FALSE
  )

:::

::::


Separating {#separating}

Separating variables is the counterpart to uniting variables and is useful to split values into multiple columns, e.g. when splitting a date into values for years, months and days. data_separate() offers an interface very close to tidyr::separate():

test <- data.frame(
  date_arrival = c("2002-02-11", "2003-03-22", "2004-09-28"),
  date_departure = c("2002-03-15", "2003-03-28", "2004-09-30"),
  stringsAsFactors = FALSE
)
test

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
test %>%
  data_separate(
    select = "date_arrival",
    new_columns = c("Year", "Month", "Day")
  )

:::

::: {}

# ---------- tidyverse -----------
test %>%
  separate(
    date_arrival,
    into = c("Year", "Month", "Day")
  )

:::

::::


Unlike tidyr::separate(), you can separate multiple columns in one step with data_separate().

test %>%
  data_separate(
    new_columns = list(
      date_arrival = c("Arr_Year", "Arr_Month", "Arr_Day"),
      date_departure = c("Dep_Year", "Dep_Month", "Dep_Day")
    )
  )

Other useful functions

{datawizard} contains other functions that are not necessarily included in {dplyr} or {tidyr} or do not directly modify the data. Some of them are inspired from the package janitor.

Work with rownames

We can convert a column in rownames and move rownames to a new column with rownames_as_column() and column_as_rownames():

mtcars <- head(mtcars)
mtcars

mtcars2 <- mtcars %>%
  rownames_as_column(var = "model")

mtcars2

mtcars2 %>%
  column_as_rownames(var = "model")

Work with row ids

rowid_as_column() is close but not identical to tibble::rowid_to_column(). The main difference is when we use it with grouped data. While tibble::rowid_to_column() uses one distinct rowid for every row in the dataset, rowid_as_column() creates one id for every row in each group. Therefore, two rows in different groups can have the same row id.

This means that rowid_as_column() is closer to using n() in mutate(), like the following:

test <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(3, 5, 8, 1),
  stringsAsFactors = FALSE
)
test

test %>%
  data_group(group) %>%
  tibble::rowid_to_column()

test %>%
  data_group(group) %>%
  rowid_as_column()

test %>%
  data_group(group) %>%
  mutate(id = seq_len(n()))

Work with column names

When dealing with messy data, it is sometimes useful to use a row as column names, and vice versa. This can be done with row_to_colnames() and colnames_to_row().

x <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6)
)
x
x2 <- x %>%
  row_to_colnames(row = 2)
x2

x2 %>%
  colnames_to_row()

Take a quick look at the data

:::: {style="display: grid; grid-template-columns: 50% 50%; grid-column-gap: 10px;"}

::: {}

# ---------- datawizard -----------
data_peek(iris)

:::

::: {}

# ---------- tidyverse -----------
glimpse(iris)

:::

::::




Try the datawizard package in your browser

Any scripts or data that you put into this service are public.

datawizard documentation built on Sept. 15, 2023, 9:06 a.m.