extract_GVA_data: extract GVA data from ONS working file spreadsheet


View source: R/extract_GVA_data.R

Description

The data which underlie the Economic Sectors for DCMS sectors data are typically provided to DCMS as a spreadsheet by the Office for National Statistics. This function extracts the GVA data from that spreadsheet and saves them to .Rds format. These data are spread over multiple worksheets (usually named 2010 Use, etc.), so this function iterates over a user-supplied (or default) list of worksheets to extract the data.

IT IS HIGHLY ADVISABLE TO ENSURE THAT THE DATA CREATED BY THIS FUNCTION ARE NOT STORED IN A FOLDER WHICH IS A GITHUB REPOSITORY, TO MITIGATE THE RISK OF ACCIDENTALLY COMMITTING OFFICIAL DATA TO GITHUB. TOOLS TO FURTHER HELP MITIGATE THIS RISK ARE AVAILABLE AT https://github.com/ukgovdatascience/dotfiles.

Usage

extract_GVA_data(x, sheet_range = paste(paste(1997:2015), "Use"),
  col_names = FALSE, header_rows = 8:9, slice = 3:122)

Arguments

x

Location of the input spreadsheet file. Named something like "working_file_dcms_VXX.xlsm".

sheet_range

The range of sheets over which to iterate and extract GVA data. Defaults to paste(paste(1997:2015), 'Use').
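For reference, the default value of sheet_range is simply a character vector of worksheet names built with base paste:

```r
# Build the default sheet_range: one worksheet name per year
sheet_range <- paste(paste(1997:2015), "Use")

sheet_range[1:2]
#> [1] "1997 Use" "1998 Use"
```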

col_names

Passed to readxl::read_excel. In most cases the default FALSE will be appropriate, unless major changes occur to the underlying spreadsheet.

header_rows

The rows in the original spreadsheet from which the headers (SIC and description) will be taken. Defaults to 8:9.

slice

The rows of interest, as described in step 4 of Details below. Defaults to 3:122, which captures all the SIC codes for which GVA data are supplied in the underlying spreadsheets.

Details

The best way to understand what happens when you run this function is to look at the source code, which is available at https://github.com/ukgovdatascience/eesectors/blob/master/R/. The code is relatively transparent and well documented. A brief explanation of what the function does follows:

1. The function calls readxl::read_excel to load the first worksheet given by the argument sheet_range; using the defaults, this will be 1997 Use.

2. The rows denoted by header_rows are extracted into a new dataframe, and the SIC codes are formatted to prepend a zero onto values which are numeric and one character long (integers between 0 and 9).

3. The SIC code descriptions are then cleaned to make them a bit more friendly to work with.

4. The rows that we are interested in are subset using the slice argument. It may be necessary to look at the target spreadsheet to establish which rows are of interest. Note that these appear as columns in the spreadsheet, and are transposed into rows in step 2 above.

5. The purrr::map_df function is run on the sheet_range vector. This creates a master dataframe containing the GVA values from each of the worksheets. If the default values are used, this will be sheets 1997 Use through to 2015 Use.

6. The .id column is renamed to year for each of the corresponding worksheets from which data have been extracted.

7. The SIC codes extracted into y in previous steps are now applied to the data, matching up SIC codes with the relevant GVA values.

8. Finally, tidyr::gather is run to collect the data into a long dataframe, which makes subsetting easier. The data are then saved out to a file based on whether test = TRUE. Note that the default behaviour is to prepend OFFICIAL onto the filename to prevent it from accidentally being committed to a public repository. Note that if you use appropriate tools (https://github.com/ukgovdatascience/dotfiles), checking for OFFICIAL files becomes an automated process.
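The padding, combination, and reshaping steps above (2, 5, and 8) can be sketched on toy data. The sheet names and values here are illustrative stand-ins, not the real spreadsheet contents:

```r
library(purrr)
library(tidyr)

# Step 2 (sketch): prepend a zero to single-digit numeric SIC codes ("1" -> "01")
sic <- c("1", "2", "10.1")
sic <- ifelse(grepl("^[0-9]$", sic), paste0("0", sic), sic)

# Step 5 (sketch): bind one dataframe per worksheet into a master dataframe,
# recording the source sheet in an .id column (named "year" directly, as in step 6)
sheets <- list(
  `1997 Use` = data.frame(`01` = 10, `02` = 20, `10.1` = 30, check.names = FALSE),
  `1998 Use` = data.frame(`01` = 11, `02` = 21, `10.1` = 31, check.names = FALSE)
)
combined <- map_df(sheets, identity, .id = "year")

# Step 8 (sketch): gather into a long dataframe, one row per year/SIC pair
long <- gather(combined, key = "SIC", value = "GVA", -year)
```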

Value

The function returns nothing, but saves the extracted dataset to file.path(output_path, 'OFFICIAL_GVA.Rds'). This is an R data object, which retains the column types that would be lost if converted to a flat format like CSV.
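As a sketch of why .Rds is preferred over a flat format (the data here are illustrative): a saveRDS/readRDS round trip preserves column types that writing to CSV would flatten to text.

```r
# Toy dataframe with a factor column and a numeric column
df <- data.frame(year = factor("1997"), GVA = 10.5)

tmp <- tempfile(fileext = ".Rds")
saveRDS(df, tmp)

# The factor and numeric types survive the round trip
identical(readRDS(tmp), df)
```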

Examples

## Not run: 
library(eesectors)
extract_GVA_data(
  x = 'OFFICIAL_working_file_dcms_V13.xlsm'
)

## End(Not run)

ukgovdatascience/eesectors documentation built on Sept. 11, 2020, 12:19 p.m.