knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

Warning: tidyABS is in early development. Please use for experimentation only.

Overview

The tidyABS package converts ABS excel tables to tidy data frames. It uses information on cells' formatting (indenting, bolding, italics) and some rules-of-thumb to detect the structure of excel tables. It sometimes needs pointers from the user.

Note: tidyABS works with .xlsx files only.

Installation

The tidyABS package is not available on CRAN. It can be installed from github with the following script:

# install.packages("devtools")
devtools::install_github("ianmoran11/tidyABS")

Usage

options(tibble.print_min = 3, tibble.print_max = 3)
library(magick)
library(tidyABS)
library(tidyverse)

Below is a short demonstration that tidies a table from the Australian Industry publication (Cat No. 8155.0).

tidy_aus_industry_df <-
  tidyABS_example("australian-industry.xlsx") %>%
  process_sheet(sheets = "Table_1") %>%
  assemble_table_components()

tidy_aus_industry_df %>% str()

Examples

The tidyABS package contains several example files. Use the helper tidyABS_example() function with no arguments to list these files:

tidyABS_example()

Example 1: Australian Industry

magick::image_read("australian-industry_table-1.png")

Above is the first sheet of an excel workbook in the Australian Industry publication. We can retrieve the path of this file using the tidyABS_example function:

ai_path <- tidyABS_example("australian-industry.xlsx")

To process the sheet above, we pass the workbook file path to the process_sheet function and identify the sheet we'd like to tidy.

ai_processed <- process_sheet(path = ai_path, sheets = "Table_1")

ai_processed %>% str(1)

This produces a list of three data frames. They store the location and format information of row names (row_groups), column names (col_groups) and table data (tabledata).

We can inspect the row and column names in ai_processed using the inspect_table_components function.

inspect_table_components(ai_processed)

We can use plot_table_components to visualize how these groups are spatially layed out in the spreadsheet and see how tidyABS will relate them to table values. Row names directly to the left of their data points should be labelled "W", and column names directly above should be labelled "N". (See the unpivotr package for more information.)

plot_table_components(ai_processed) +
  ylim(-30, 0)

Finally, we can assemble the components into a tidy data frame using assemble_table_components.

assemble_table_components(ai_processed) %>%
  glimpse()

Example 2: Environmental-Economic Accounts

magick::image_read("environmental-economic-accounts_table-6-01.png")

Here's an example that requires some manual work, the Environmental-Economic Accounts. Let's retrieve the path of our example workbook and proces Table 6.1:

eea_path <- tidyABS_example("environmental-economic-accounts.xlsx")

eea_processed <- process_sheet(path = eea_path, sheets = "Table 6.1")

On visual inspection, we can see row_group_01 has been given a "W" orientation to the data, not "WNW".

plot_table_components(eea_processed)

Luckily, we can fix this with the change_direction function.

eea_processed <-
  eea_processed %>%
  change_direction("row_group_01", "WNW")

Plotting the table confirms the direction has been corrected.

plot_table_components(eea_processed)

Finally, we can assemble the components into a tidy dataframe using assemble_table_components.

assemble_table_components(eea_processed) %>%
  glimpse()

Example 3: Consumer Price Index (time series)

magick::image_read("consumer-price-index_data01.png")

Time series data require the user to manually identify the inner table cells. This is because some of the column names are numeric --- for example, collection month.

I recommend using the readABS package for this. It was created for importing ABS time series data and does not require manual identifcation of table cells.

That said, here's how you would process this table with tidyABS.

cpi_path <- tidyABS_example("consumer-price-index.xlsx")

We need to identify the inner table cells using the manual_value_references argument.This argument takes a vector of addresses, identifying the inner corners of the table.

cpi_processed <-
  process_sheet(
    path = cpi_path, sheets = "Data1",
    manual_value_references = "B11:AB292"
  )

Here is the resulting data frame.

assemble_table_components(cpi_processed) %>%
  glimpse()


ianmoran11/tidyABS documentation built on May 30, 2019, 4:03 a.m.