extract_ABS_data: extract ABS data from ONS working file spreadsheet

Description Usage Arguments Details Value Examples

View source: R/extract_ABS_data.R

Description

The data which underlies the Economic Sectors for DCMS sectors data is typically provided to DCMS as a spreadsheet from the Office for National Statistics. This function extracts the ABS data from that spreadsheet, and saves it to .Rds format.

IT IS HIGHLY ADVISEABLE TO ENSURE THAT THE DATA WHICH ARE CREATED BY THIS FUNCTION ARE NOT STORED IN A FOLDER WHICH IS A GITHUB REPOSITORY TO MITIGATE AGAINST ACCIDENTAL COMMITTING OF OFFICIAL DATA TO GITHUB. TOOLS TO FURTHER HELP MITIGATE THIS RISK ARE AVAILABLE AT https://github.com/ukgovdatascience/dotfiles.

Usage

1
extract_ABS_data(x, sheet_name = "New ABS Data", ...)

Arguments

x

Location of the input spreadsheet file. Named something like "working_file_dcms_VXX.xlsm".

sheet_name

The name of the spreadsheet in which the data are stored. Defaults to New ABS Data.

...

passes arguments to readxl::read_excel() which is the basis of this function.

Details

The best way to understand what happens when you run this function is to look at the source code, which is available at https://github.com/ukgovdatascience/eesectors/blob/master/R/. The code is relatively transparent and well documented. A brief explanation of what the function does here:

1. The function calls readxl::read_excel to load the appropriate page from the underlying spreadsheet.

2. In the 2016 version of the working file spreadhseet, there were a number of replicated columns. These are removed, however it is at present a relatively dumb exercise. The function simply looks for columns with teh same name, and retains only the first one.

3. The column names are cleaned to make selection of variables easier using make.names.

4. The data are pivoted into long form using tidyr::gather_. This converts the data from a wide dataframe with year as column headers, into a long dataframe with year included in a year column. This makes the data much easier to subset.

5. All the ABS values are combined into a column called ABS. In the 2016 spreadsheet there were a number of full stops (.) in the ABS column, which will be coerced to NA when the the column is converted to numeric using as.numeric (the next step). The internal function eesectors::integrity_check runs a quick check to make sure that the only NAs creeping into the ABS column are from full stops in the original data. The full stops are then converted to zeros.

6. The internal function eesectors::clean_sic is run on the DOMVAL column to ensure that all 3 and 4 digit SIC codes are formatted properly.

7. The data are printed to console, and can be saved out using the normal methods, for instance saveRDS, or write.csv.

Value

The function returns nothing, but saves the extracted dataset to file.path(output_path, 'OFFICIAL_ABS.Rds'). This is an R data object, which retains the column types which would be lost if converted to a flat format like CSV.

Examples

1
2
3
4
5
6
7
8
## Not run: 
library(eesectors)
extract_ABS_data(
x = 'OFFICIAL_working_file_dcms_V13.xlsm',
sheet_name = 'New ABS Data'
)

## End(Not run)

ukgovdatascience/eesectors documentation built on Sept. 11, 2020, 12:19 p.m.