knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(cepumd) library(tidyverse) library(readxl) library(gt)
In this workflow I’m going to calculate estimated mean utilities expenditures for 2015 using integrated data by CU composition using the FAM_TYPE variable. In this case I’m going to start by looking at the codes for that variable to show how one might run into an inconsistency in code definitions across survey instruments.
I'll need the following data and metadata sources:
Data files can be downloaded from the CE PUMD Data Files page (it's easiest to use CSV) and hierarchical grouping files can be downloaded from the CE PUMD Documentation page
To get the files I'll first create a temporary directory and download all of the files that I need from the BLS website into this directory. You might choose to store your files differently, but this convention will keep the files organized, it will keep the code simple, and everything will be in a folder that will be easy to clean up after. Since the BLS blocks third party applications I'll add a user-agent to identify myself in the download function that is stored in a variable called cepumd_ua
(not shown).
cepumd_ua <- "Arcenis Rojas (arcenis.rojas@gmail.com)"
ce_data_dir <- tempdir() download.file( "https://www.bls.gov/cex/pumd/data/comma/intrvw15.zip", fs::path(ce_data_dir, "intrvw15.zip"), mode = "wb", headers = list( "User-Agent" = cepumd_ua ) ) download.file( "https://www.bls.gov/cex/pumd/data/comma/diary15.zip", fs::path(ce_data_dir, "diary15.zip"), mode = "wb", headers = list( "User-Agent" = cepumd_ua ) ) download.file( "https://www.bls.gov/cex/pumd/stubs.zip", fs::path(ce_data_dir, "stubs.zip"), mode = "wb", headers = list( "User-Agent" = cepumd_ua ) ) download.file( "https://www.bls.gov/cex/pumd/ce-pumd-interview-diary-dictionary.xlsx", fs::path(ce_data_dir, "ce-data-dict.xlsx"), mode = "wb", headers = list( "User-Agent" = cepumd_ua ) )
First, I’ll load the "Codes " worksheet from the CE Data Dictionary, which can be downloaded from the CE PUMD Documentation page. Then I'll look at code descriptions for the “FAM_TYPE” variable in the dictionary and I’m going to focus on the code values of 3, 5, and 7.
ce_codes <- read_excel( file.path(ce_data_dir, "ce-data-dict.xlsx"), sheet = "Codes " ) ce_codes |> janitor::clean_names() |> filter( variable %in% "FAM_TYPE", first_year <= 2015, (last_year >= 2015 | is.na(last_year)), code_value %in% c("3", "5", "7") ) |> select(survey, code_value, code_description) |> arrange(code_value, survey) |> gt()
The code descriptions for these 3 code values are different across instruments. To resolve this I’m going to create a table containing only codes from the Interview survey.
fam_type_codes <- ce_codes |> janitor::clean_names() |> filter( variable %in% "FAM_TYPE", first_year <= 2015, (last_year >= 2015 | is.na(last_year)) ) codes2keep <- fam_type_codes |> filter(survey %in% "INTERVIEW") |> select(code_value, code_description) fam_type_codes <- fam_type_codes |> select(-code_description) |> left_join(codes2keep, by = "code_value") |> relocate(code_description, .after = code_value) fam_type_codes |> filter(code_value %in% c("3", "5", "7")) |> select(survey, code_value, code_description) |> arrange(code_value, survey) |> gt()
Now the codes are consistent across survey instruments and I can use this code-book in my call to ce_prepdata()
using the own_code_book
argument.
Next I’ll load the integrated hierarchical grouping file for 2015 and find the title for the utilities category.
integ15_hg <- ce_hg( 2015, integrated, hg_zip_path = file.path(ce_data_dir, "stubs.zip") ) integ15_hg |> filter(str_detect(str_to_lower(title), "utilities")) |> gt()
The expenditure category associated with utilities is “Utilities, fuels, and public services”. I’ll store that title to work with later and narrow down the section of the stub file that includes only these expenditures. I'll use this title to filter out the UCC data from the stub file that make up the utilities category to check which survey these expenditures come from to calculate estimates.
utilities_title <- integ15_hg |> filter(str_detect(str_to_lower(title), "utilities")) |> pull(title) utilities_hg <- ce_uccs( integ15_hg, expenditure = utilities_title, uccs_only = FALSE ) gt(utilities_hg)
To check what survey instruments the expenditures are collected through for published estimates I'll check what surveys are listed in the "survey" column of the hierarchical grouping data for utilities.
utilities_hg |> distinct(survey) |> gt()
It seems utilities expenditures are collected only through the Interview survey (the “G” stands for “UCC group”), so I’ll only need to use Interview data files to calculate estimates.
fam_type_utilities <- ce_prepdata( 2015, interview, utilities_hg, uccs = ce_uccs(utilities_hg, expenditure = utilities_title), fam_type, int_zp = file.path(ce_data_dir, "intrvw15.zip"), recode_variables = TRUE, dict_path = file.path(ce_data_dir, "ce-data-dict.xlsx") ) |> group_nest(fam_type) |> mutate(ce_mean_df = map(data, ce_mean)) |> select(-data) |> unnest(ce_mean_df) fam_type_utilities |> arrange(fam_type) |> gt()
And finally, a quick lollipop plot to visualize the data.
fam_type_utilities |> mutate(fam_type = fct_reorder(fam_type, mean_exp)) |> ggplot(aes(x = mean_exp, y = fam_type, mean_exp)) + geom_segment(aes(x = 0, xend = mean_exp, yend = fam_type)) + geom_point(color = "red", size = 5) + scale_y_discrete(labels = function(x) str_wrap(x, width = 25)) + scale_x_continuous(labels = scales::dollar) + labs( y = "CU composition (FAM_TYPE)", x = "Estimated, weighted, annual mean expenditure", title = "Estimated annual mean utilities expenditures by CU composition" ) + theme_bw()
Finally, now that the analysis is done, I'll delete the temporary directory that contains all the CE data.
unlink(ce_data_dir, recursive = TRUE, force = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.