Description Usage Arguments Details Value Examples
View source: R/extract_GVA_data.R
The data which underlies the Economic Sectors for DCMS sectors
data is typically provided to DCMS as a spreadsheet from the Office for
National Statistics. This function extracts the GVA data from that
spreadsheet, and saves it to .Rds format. These data are spread over
multiple worksheets (usually named 2010 Use
, etc), so this function
iteratres over a user supplied (or default) list of worksheets to extract
the data.
IT IS HIGHLY ADVISEABLE TO ENSURE THAT THE DATA WHICH ARE CREATED BY THIS FUNCTION ARE NOT STORED IN A FOLDER WHICH IS A GITHUB REPOSITORY TO MITIGATE AGAINST ACCIDENTAL COMMITTING OF OFFICIAL DATA TO GITHUB. TOOLS TO FURTHER HELP MITIGATE THIS RISK ARE AVAILABLE AT https://github.com/ukgovdatascience/dotfiles.
1 2 | extract_GVA_data(x, sheet_range = paste(paste(1997:2015), "Use"),
col_names = FALSE, header_rows = 8:9, slice = 3:122)
|
x |
Location of the input spreadsheet file. Named something like "working_file_dcms_VXX.xlsm". |
sheet_range |
The range of sheets over which to iterate and extract GVA
data. Defaults to |
col_names |
Passes to |
header_rows |
The rows in the original spreadsheet from which the
headers (SIC and description) will be taken. Defaults to |
slice |
The rows of interest as desribed below in 4. Defaults to
|
The best way to understand what happens when you run this function is to look at the source code, which is available at https://github.com/ukgovdatascience/eesectors/blob/master/R/. The code is relatively transparent and well documented. A brief explanation of what the function does here:
1. The function calls readxl::read_excel
to load the first worksheet
given by the argument sheet_range
, using the defaults this will be
1997 Use
.
2. The rows denoted by header_rows
are extracted into a new
dataframe, and the SIC codes formatted to prepend zeros onto values which
are numeric, and have a character length of one (integers between 0 and 9).
3. The SIC code descriptions are then cleaned to make them a bit more friendly to work with.
4. The rows that we are interested in are subset using the slice
argument. It may be necessary to look into the target spreadsheet to
establish which rows are of interest. Note that these will appear as
columns in the spreadsheet, and are transposed into rows in the above step
2.
5. The purrr::map_df
function is run on the sheet_range
vector. This creates a master dataframe containing teh GVA values from each
of the worksheets. If default values are used, this will be sheets
1997 Use
through to 2015 Use
.
6. The .id
column is renamed to year for each of the corresponding
worksheets from which data have been extracted.
7. The SIC codes extracted into y
in previous steps are now applied
to the data, matching up SIC codes with the relevant GVA values.
8. Finally tidyr::gather
is run to collect the data into a long
dataframe to make subsetting easier. The data are then saved out to a file
based on whether test = TRUE
. Note that the default behaviour is to
prepend OFFICIAL onto the filename to prevent it from accidentally being
committed to a public repository. Note that if you use appropriate tools
(https://github.com/ukgovdatascience/dotfiles), checking for OFFICIAl
files becomes an automated process.
The function returns nothing, but saves the extracted dataset to
file.path(output_path, 'OFFICIAL_ABS.Rds')
. This is an R data
object, which retains the column types which would be lost if converted to
a flat format like CSV.
1 2 3 4 5 6 7 | ## Not run:
library(eesectors)
extract_GVA_data(
x = 'OFFICIAL_working_file_dcms_V13.xlsm'
)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.