extract_DCMS_sectors: extract list of DCMS sectors from ONS working file...
In DCMSstats/eesectors: Functions to create DCMS Economic Estimates statistics

Description Usage Arguments Details Value Examples

The data which underlies the Economic Sectors for DCMS sectors data is typically provided to DCMS as a spreadsheet from the Office for National Statistics. This function extracts the list of sectors that DCMS are responsible for from the working_file_dcms_V13.xlsm spreadsheet. This information is also recorded in the methodology note which accompanies the publication at https://www.gov.uk/government/publications/dcms-sectors-economic-estimates-methodology and a version correct at the time of the 2016 release is included in the package as eesectors::DCMS_sectors. Hence, it is not necessary to run this function every time - only if changes to the DCMS sectors are made.

1
2
3

extract_DCMS_sectors(x, sheet_name = "Working File", skip = 7,
  sectors = c("creative", "digital", "culture", "telecoms", "gambling",
  "sport", "tourism", "all_dcms"))

`x`	Location of the input spreadsheet file. Named something like "working_file_dcms_VXX.xlsm".
`sheet_name`	The name of the spreadsheet in which the data are stored. Defaults to `New ABS Data`.
`skip`	Number of lines to skip when reading the worksheet, inherits from `readxl::read_excel`.
`sectors`	A character vector of the sectors for which DCMS is responsible, currently: `c('creative', 'digital', 'culture', 'telecoms', 'gambling', 'sport', 'tourism', 'all_dcms')`.

The best way to understand what happens when you run this function is to look at the source code, which is available at https://github.com/ukgovdatascience/eesectors/blob/master/R/. A brief explanation of what the function does here:

1. The function calls readxl::read_excel to load the appropriate page from the underlying spreadsheet.

2. The column names are sanitised and cleaned to remove extraneous characters, and are made all lower case

3. The dataframe is limited to the columns: 'sic', 'description', and those contained in the sector argument: c('creative', 'digital', 'culture', 'telecoms', 'gambling', 'sport', 'tourism', 'all_dcms').

4. The data are pivoted into long form using tidyr::gather_. This converts the data from a wide dataframe with 'sector' as the key column, and present as the value column (i.e. present in DCMS sector?). The result is a much longer dataframe which is much easier to subset.

5. For consistency with later steps, the 'sic' column is renamed to 'SIC'.

6. The asterisks used in the spreadsheet to denote presence in a DCMS sector are replaced by a binary variable with TRUE and FALSE in place of * and NA.

7. NA values created by step 6 are removed from the dataframe.

8. The tourism entry which is formatted differently in the 'working file' worksheet in working_file_dcms_V13.xlsm is fixed to ensure that it has both a description and a SIC (where previously it just had a SIC), and 'tourism' is labelled as a DCMS sector under tourism and all_dcms.

9. The data are saved out to a .Rds file, and a check run to ensure that the file exists. The size of the new file is reported in bytes.

The function returns nothing, but saves the extracted dataset to file.path(output_path, 'DCMS_sectors.Rds'). This is an R data object, which retains the column types which would be lost if converted to a flat format like CSV.

## Not run: 
library(eesectors)
extract_DCMS_sectors(
x = 'OFFICIAL_working_file_dcms_V13.xlsm',
sheet_name = 'Working File',
output_path = '../OFFICIAL/'
)

## End(Not run)