Description Usage Arguments Details Value Examples
The data which underlies the Economic Sectors for DCMS sectors
data is typically provided to DCMS as a spreadsheet from the Office for
National Statistics. This function extracts the list of sectors that DCMS
are responsible for from the working_file_dcms_V13.xlsm spreadsheet. This
information is also recorded in the methodology note which accompanies the
publication at
https://www.gov.uk/government/publications/dcms-sectors-economic-estimates-methodology
and a version correct at the time of the 2016 release is included in the
package as eesectors::DCMS_sectors
. Hence, it is not necessary to
run this function every time - only if changes to the DCMS sectors are
made.
1 2 3 | extract_DCMS_sectors(x, sheet_name = "Working File", skip = 7,
sectors = c("creative", "digital", "culture", "telecoms", "gambling",
"sport", "tourism", "all_dcms"))
|
x |
Location of the input spreadsheet file. Named something like "working_file_dcms_VXX.xlsm". |
sheet_name |
The name of the spreadsheet in which the data are stored.
Defaults to |
skip |
Number of lines to skip when reading the worksheet, inherits from
|
sectors |
A character vector of the sectors for which DCMS is
responsible, currently: |
The best way to understand what happens when you run this function is to look at the source code, which is available at https://github.com/ukgovdatascience/eesectors/blob/master/R/. A brief explanation of what the function does here:
1. The function calls readxl::read_excel
to load the appropriate
page from the underlying spreadsheet.
2. The column names are sanitised and cleaned to remove extraneous characters, and are made all lower case
3. The dataframe is limited to the columns: 'sic'
,
'description'
, and those contained in the sector
argument:
c('creative', 'digital', 'culture', 'telecoms', 'gambling', 'sport',
'tourism', 'all_dcms')
.
4. The data are pivoted into long form using tidyr::gather_
. This
converts the data from a wide dataframe with 'sector'
as the key
column, and present
as the value column (i.e. present in DCMS
sector?). The result is a much longer dataframe which is much easier to
subset.
5. For consistency with later steps, the 'sic'
column is renamed to
'SIC'
.
6. The asterisks used in the spreadsheet to denote presence in a DCMS
sector are replaced by a binary variable with TRUE
and FALSE
in place of *
and NA
.
7. NA
values created by step 6 are removed from the dataframe.
8. The tourism entry which is formatted differently in the 'working file' worksheet in working_file_dcms_V13.xlsm is fixed to ensure that it has both a description and a SIC (where previously it just had a SIC), and 'tourism' is labelled as a DCMS sector under tourism and all_dcms.
9. The data are saved out to a .Rds file, and a check run to ensure that the file exists. The size of the new file is reported in bytes.
The function returns nothing, but saves the extracted dataset to
file.path(output_path, 'DCMS_sectors.Rds')
. This is an R data
object, which retains the column types which would be lost if converted to
a flat format like CSV.
1 2 3 4 5 6 7 8 9 | ## Not run:
library(eesectors)
extract_DCMS_sectors(
x = 'OFFICIAL_working_file_dcms_V13.xlsm',
sheet_name = 'Working File',
output_path = '../OFFICIAL/'
)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.