In arcenis-r/cepumd: Calculate Consumer Expenditure Survey (CE) Annual Estimates

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(cepumd)
library(tidyverse)
library(readxl)
library(gt)

Introduction

In this workflow I’m going to calculate estimated mean utilities expenditures for 2015 using integrated data by CU composition using the FAM_TYPE variable. In this case I’m going to start by looking at the codes for that variable to show how one might run into an inconsistency in code definitions across survey instruments.

I'll need the following data and metadata sources:

Interview survey data zip files for 2015
Diary survey data zip files for 2015
Hierarchical grouping zip files
CE Data Dictionary

Data gathering

Data files can be downloaded from the CE PUMD Data Files page (it's easiest to use CSV) and hierarchical grouping files can be downloaded from the CE PUMD Documentation page

To get the files I'll first create a temporary directory and download all of the files that I need from the BLS website into this directory. You might choose to store your files differently, but this convention will keep the files organized, it will keep the code simple, and everything will be in a folder that will be easy to clean up after. Since the BLS blocks third party applications I'll add a user-agent to identify myself in the download function that is stored in a variable called cepumd_ua (not shown).

cepumd_ua <- "Arcenis Rojas (arcenis.rojas@gmail.com)"

ce_data_dir <- tempdir()

download.file(
  "https://www.bls.gov/cex/pumd/data/comma/intrvw15.zip",
  fs::path(ce_data_dir, "intrvw15.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/data/comma/diary15.zip",
  fs::path(ce_data_dir, "diary15.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/stubs.zip",
  fs::path(ce_data_dir, "stubs.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/ce-pumd-interview-diary-dictionary.xlsx",
  fs::path(ce_data_dir, "ce-data-dict.xlsx"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

Exploring grouping variable code definitions

First, I’ll load the "Codes " worksheet from the CE Data Dictionary, which can be downloaded from the CE PUMD Documentation page. Then I'll look at code descriptions for the “FAM_TYPE” variable in the dictionary and I’m going to focus on the code values of 3, 5, and 7.

ce_codes <- read_excel(
  file.path(ce_data_dir, "ce-data-dict.xlsx"),
  sheet = "Codes "
)

ce_codes |>
  janitor::clean_names() |>
  filter(
    variable %in% "FAM_TYPE",
  first_year <= 2015,
  (last_year >= 2015 | is.na(last_year)),
  code_value %in% c("3", "5", "7")
  ) |>
  select(survey, code_value, code_description) |>
  arrange(code_value, survey) |>
  gt()

The code descriptions for these 3 code values are different across instruments. To resolve this I’m going to create a table containing only codes from the Interview survey.

fam_type_codes <- ce_codes |>
  janitor::clean_names() |>
  filter(
    variable %in% "FAM_TYPE",
    first_year <= 2015,
    (last_year >= 2015 | is.na(last_year))
  )

codes2keep <- fam_type_codes |>
  filter(survey %in% "INTERVIEW") |>
  select(code_value, code_description)

fam_type_codes <- fam_type_codes |>
  select(-code_description) |>
  left_join(codes2keep, by = "code_value") |>
  relocate(code_description, .after = code_value)

fam_type_codes |>
  filter(code_value %in% c("3", "5", "7")) |>
  select(survey, code_value, code_description) |>
  arrange(code_value, survey) |>
  gt()

Now the codes are consistent across survey instruments and I can use this code-book in my call to ce_prepdata() using the own_code_book argument.

Calculating CE weighted estimate means by CU composition

Next I’ll load the integrated hierarchical grouping file for 2015 and find the title for the utilities category.

integ15_hg <- ce_hg(
  2015,
  integrated,
  hg_zip_path = file.path(ce_data_dir, "stubs.zip")
)

integ15_hg |>
  filter(str_detect(str_to_lower(title), "utilities")) |>
  gt()

The expenditure category associated with utilities is “Utilities, fuels, and public services”. I’ll store that title to work with later and narrow down the section of the stub file that includes only these expenditures. I'll use this title to filter out the UCC data from the stub file that make up the utilities category to check which survey these expenditures come from to calculate estimates.

utilities_title <- integ15_hg |>
  filter(str_detect(str_to_lower(title), "utilities")) |>
  pull(title)

utilities_hg <- ce_uccs(
  integ15_hg,
  expenditure = utilities_title,
  uccs_only = FALSE
)

gt(utilities_hg)

To check what survey instruments the expenditures are collected through for published estimates I'll check what surveys are listed in the "survey" column of the hierarchical grouping data for utilities.

utilities_hg |> distinct(survey) |> gt()

It seems utilities expenditures are collected only through the Interview survey (the “G” stands for “UCC group”), so I’ll only need to use Interview data files to calculate estimates.

fam_type_utilities <- ce_prepdata(
  2015,
  interview,
  utilities_hg,
  uccs = ce_uccs(utilities_hg, expenditure = utilities_title),
  fam_type, 
  int_zp = file.path(ce_data_dir, "intrvw15.zip"),
  recode_variables = TRUE,
  dict_path = file.path(ce_data_dir, "ce-data-dict.xlsx")
) |>
  group_nest(fam_type) |>
  mutate(ce_mean_df = map(data, ce_mean)) |>
  select(-data) |>
  unnest(ce_mean_df)

fam_type_utilities |>
  arrange(fam_type) |>
  gt()

And finally, a quick lollipop plot to visualize the data.

fam_type_utilities |>
  mutate(fam_type = fct_reorder(fam_type, mean_exp)) |>
  ggplot(aes(x = mean_exp, y = fam_type, mean_exp)) +
  geom_segment(aes(x = 0, xend = mean_exp, yend = fam_type)) +
  geom_point(color = "red", size = 5) +
  scale_y_discrete(labels = function(x) str_wrap(x, width = 25)) +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    y = "CU composition (FAM_TYPE)",
    x = "Estimated, weighted, annual mean expenditure",
    title =
      "Estimated annual mean utilities expenditures by CU composition"
  ) +
  theme_bw()

Clean-up

Finally, now that the analysis is done, I'll delete the temporary directory that contains all the CE data.

unlink(ce_data_dir, recursive = TRUE, force = TRUE)

arcenis-r/cepumd documentation built on April 10, 2024, 9:25 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

arcenis-r/cepumd
Calculate Consumer Expenditure Survey (CE) Annual Estimates

In arcenis-r/cepumd: Calculate Consumer Expenditure Survey (CE) Annual Estimates

Introduction

Data gathering

Exploring grouping variable code definitions

Calculating CE weighted estimate means by CU composition

Clean-up

R Package Documentation

Browse R Packages

We want your feedback!

arcenis-r/cepumd Calculate Consumer Expenditure Survey (CE) Annual Estimates

In arcenis-r/cepumd: Calculate Consumer Expenditure Survey (CE) Annual Estimates

Introduction

Data gathering

Exploring grouping variable code definitions

Calculating CE weighted estimate means by CU composition

Clean-up

R Package Documentation

Browse R Packages

We want your feedback!

arcenis-r/cepumd
Calculate Consumer Expenditure Survey (CE) Annual Estimates