In nacnudus/unpivotr: Unpivot Complex and Irregular Data Layouts

unpivotr

unpivotr deals with non-tabular data, especially from spreadsheets. Use unpivotr when your source data has any of these 'features':

Multi-headered hydra
Meaningful formatting
Headers anywhere but at the top of each column
Non-text headers e.g. dates
Other stuff around the table
Several similar tables in one sheet
Sentinel values
Superscript symbols
Meaningful comments
Nested HTML tables

If that list makes your blood boil, you'll enjoy the function names.

behead() deals with multi-headered hydra tables one layer of headers at a time, working from the edge of the table inwards. It's a bit like using header = TRUE in read.csv(), but because it's a function, you can apply it to as many layers of headers as you need. You end up with all the headers in columns.
spatter() is like tidyr::spread() but preserves mixed data types. You get into a mixed-data-type situation by delaying type coercion until after the table is tidy (rather than before, like read.csv() et al.). And yes, it usually follows behead().
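
As a rough sketch of how behead() and spatter() fit together, the example below melts a small data frame, strips its single header row, and spreads the result back out with one data type per column. The toy data (names and ages) is made up for illustration, and the "up" direction assumes a version of unpivotr that accepts the same direction names used elsewhere in this README.

```r
library(unpivotr)

# Hypothetical toy data: one character column and one integer column.
x <- data.frame(
  name = c("Matilda", "Nicholas"),
  age = c(14L, 10L),
  stringsAsFactors = FALSE
)

# One row per cell, with the column names included as a header row.
cells <- as_cells(x, col_names = TRUE)

# Treat the top row as a header, storing its values in a column called 'header'.
tidy <- behead(cells, "up", "header")

# Drop 'col' so that cells from the same original row share one output row.
tidy$col <- NULL

# Spread the headers back into columns, taking each value from the column
# named by its 'data_type', so 'name' stays character and 'age' stays integer.
result <- spatter(tidy, header)
result
```

Unlike tidyr::spread(), no single value column has to be chosen up front, which is why the integer column is not coerced to character.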

More positive, corrective functions:

justify() aligns column headers before behead()ing, and has deliberate moral overtones.
enhead() attaches a header to the body of the data, a la Frankenstein. The effect is the same as behead(), but is more powerful because you can choose exactly which header cells you want, paying attention to formatting (which behead() doesn't understand).
isolate_sentinels() separates meaningful symbols like "N/A" or "confidential" from the rest of the data, giving them some time alone to think about what they've done.
partition() takes a sheet with several tables on it, and slashes it into pieces that each contain one table. You can then unpivot each table in turn with purrr::map() or similar.
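
To make the sentinel idea concrete, here is a minimal sketch of isolate_sentinels() on a made-up data frame. The names and scores are illustrative only and are not part of the package data.

```r
library(unpivotr)

# Hypothetical scores where two entries are sentinel values, not real data.
x <- data.frame(
  name = c("Matilda", "Nicholas", "Olivia", "Paul"),
  score = c("10", "confidential", "N/A", "12"),
  stringsAsFactors = FALSE
)

# Move the sentinel values out of 'score' and into a separate column
# (called 'sentinel' by default), leaving NA in their place.
result <- isolate_sentinels(x, score, c("confidential", "N/A"))
result
```

Once the sentinels are out of the way, the remaining values in the score column can be converted to a numeric type without losing the information that some values were withheld.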

Make cells tidy

unpivotr uses data where each cell is represented by one row in a data frame, like this.

Gif of tidyxl converting cells into a tidy representation of one row per cell

What can you do with tidy cells? The best places to start are:

Spreadsheet Munging Strategies, a free, online cookbook using tidyxl and unpivotr
Screencasts on YouTube.
Worked examples on GitHub.

Otherwise the basic idea is:

1. Read the data with a specialist tool.
   - For spreadsheets, use tidyxl.
   - For plain text files, you might soon be able to use readr, but for now you'll have to install a pull-request on that package with devtools::install_github("tidyverse/readr#760").
   - For tables in HTML pages, use unpivotr::tidy_html().
   - For data frames, use unpivotr::as_cells(). This should be a last resort, because by the time the data is in a conventional data frame it is often too late: formatting has been lost, and most data types have been coerced to strings.
2. Either behead() straight away, or dplyr::filter() separately for the header cells and the data cells, and then recombine them with enhead().
3. spatter() so that each column has one data type.

library(unpivotr)
library(tidyverse)
x <- purpose$`up-left left-up`
x # A pivot table in a conventional data frame.  Four levels of headers, in two
  # rows and two columns.

y <- as_cells(x) # 'Tokenize' or 'melt' the data frame into one row per cell
y

rectify(y) # useful for reviewing the melted form as though in a spreadsheet

y %>%
  behead("up-left", "sex") %>%                    # Strip headers
  behead("up", "life-satisfaction") %>%           # one
  behead("left-up", "qualification") %>%          # by
  behead("left", "age-band") %>%                  # one.
  select(-row, -col, -data_type, count = chr) %>% # cleanup
  mutate(count = as.integer(count))

Note the direction arguments in the code above ("up-left", "up", and so on), which tell behead() where to find the header cell for each data cell.

"up-left" means the header (Female, Male) is positioned up and to the left of the columns of data cells it describes.
"up" means the header (0 - 6, 7 - 10) is positioned directly above the columns of data cells it describes.
"left-up" means the header (Bachelor's degree, Certificate, etc.) is positioned to the left and upwards of the rows of data cells it describes.
"left" means the header (15 - 24, 25 - 44, etc.) is positioned directly to the left of the rows of data cells it describes.
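
The same directions apply to enhead(), the filter-then-recombine route mentioned in the basic idea above. The sketch below is an assumption-laden illustration rather than the canonical method: it relies on the layout described earlier (two header rows and two header columns, so data cells start at row 3 and column 3), and the column names gender, satisfaction, qualification and age are chosen here for illustration.

```r
library(unpivotr)
library(dplyr)

# Melt the same pivot table and drop the empty cells.
cells <- as_cells(purpose$`up-left left-up`)
cells <- cells[!is.na(cells$chr), ]

# Data cells: assumed to sit below the two header rows and to the right of
# the two header columns.
data_cells <- cells %>%
  filter(row >= 3, col >= 3) %>%
  transmute(row, col, count = as.integer(chr))

# Header cells, one set per level of header.
gender        <- cells %>% filter(row == 1) %>% select(row, col, gender = chr)
satisfaction  <- cells %>% filter(row == 2) %>% select(row, col, satisfaction = chr)
qualification <- cells %>% filter(col == 1) %>% select(row, col, qualification = chr)
age           <- cells %>% filter(col == 2) %>% select(row, col, age = chr)

# Attach each set of headers to the data cells, using the same directions
# that behead() used above.
result <- data_cells %>%
  enhead(gender, "up-left") %>%
  enhead(satisfaction, "up") %>%
  enhead(qualification, "left-up") %>%
  enhead(age, "left") %>%
  select(-row, -col)
result
```

This takes more code than the behead() chain, but because the header cells are selected with dplyr::filter(), the selection could equally be based on formatting or any other property of the cells.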

Installation

# install.packages("devtools") # If you don't already have devtools
devtools::install_github("nacnudus/unpivotr", build_vignettes = TRUE)

The version 0.4.0 release had some breaking changes. See NEWS.md for details. The previous version can be installed as follows:

devtools::install_version("unpivotr", version = "0.3.1", repos = "http://cran.us.r-project.org")

Similar projects

unpivotr is inspired by Databaker, a collaboration between the United Kingdom Office for National Statistics and The Sensible Code Company.

jailbreakr attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms. unpivotr differs by being less magic, and by equipping you to express what you want to do.

nacnudus/unpivotr documentation built on Feb. 6, 2023, 4:55 a.m.
