widyr: Widen, process, and re-tidy a dataset

This package wraps the pattern of un-tidying data into a wide matrix, performing some processing, then turning it back into a tidy form. This is useful for several mathematical operations such as co-occurrence counts, correlations, or clustering that are best done on a wide matrix.

Towards a precise definition of "wide" data

The term "wide data" has gone out of fashion as being "imprecise" (Wickham 2014)), but I think with a proper definition the term could be entirely meaningful and useful.

A wide dataset is one or more matrices where:

When would you want data to be wide rather than tidy? Notable examples include classification, clustering, correlation, factorization, or other operations that can take advantage of a matrix structure. In general, when you want to compare between items rather than compare between variables, this is a useful structure.

The widyr package is based on the observation that during a tidy data analysis, you often want data to be wide only temporarily, before returning to a tidy structure for visualization and further analysis. widyr makes this easy through a set of pairwise_ functions.

Example: gapminder

Consider the gapminder dataset in the gapminder package.

library(dplyr)
library(gapminder)

gapminder

This tidy format (one-row-per-country-per-year) is very useful for grouping, summarizing, and filtering operations. But if we want to compare pairs of countries (for example, to find countries that are similar to each other), we would have to reshape this dataset. Note that here, country is the item, while year is the feature column.

Pairwise operations

The widyr package offers pairwise_ functions that operate on pairs of items within data. An example is pairwise_dist:

library(widyr)

gapminder %>%
  pairwise_dist(country, year, lifeExp)

In a single step, this finds the Euclidean distance between the lifeExp value in each pair of countries, matching pairs based on year. We could find the closest pairs of countries overall with arrange():

gapminder %>%
  pairwise_dist(country, year, lifeExp) %>%
  arrange(distance)

Notice that this includes duplicates (Germany/Belgium and Belgium/Germany). To avoid those (the upper triangle of the distance matrix), use upper = FALSE:

gapminder %>%
  pairwise_dist(country, year, lifeExp, upper = FALSE) %>%
  arrange(distance)

In some analyses, we may be interested in correlation rather than distance of pairs. For this we would use pairwise_cor:

gapminder %>%
  pairwise_cor(country, year, lifeExp, upper = FALSE, sort = TRUE)


Try the widyr package in your browser

Any scripts or data that you put into this service are public.

widyr documentation built on Sept. 13, 2022, 9:05 a.m.