knitr::opts_chunk$set( fig.height = 3, fig.width = 6, message = TRUE, warning = FALSE, collapse = TRUE, comment = "#>", dpi = 150 )
library(dplyr) library(tidyr) library(purrr) library(ggplot2) library(camiller) library(cwi)
The tables I'm working with are B01003
, total population; B03002
, race and Latino ethnicity; and B25003
, housing tenure. It's easiest to save these in a named list, then map over the list calling multi_geo_acs()
for each table number.
yr <- 2020 table_nums <- list( total_pop = "B01003", race = "B03002", tenure = "B25003" )
I'm pulling out the entries in the cwi
dataset cwi::regions
(a list) to only include the Greater New Haven-area ones. Then I fetch the ACS tables for those regions, their towns, and New Haven County.
gnh_regions <- regions[c("Greater New Haven", "New Haven Inner Ring", "New Haven Outer Ring")] gnh_data <- map(table_nums, multi_geo_acs, year = yr, towns = regions$`Greater New Haven`, regions = gnh_regions, counties = "New Haven", state = "09", sleep = 1 ) gnh_data$total_pop
Neighborhoods with corresponding tracts or block groups are included for 4 cities (see neighborhood_tracts
). Pass those to get neighborhood-level aggregates.
multi_geo_acs("B01003", towns = "New Haven", counties = "New Haven", neighborhoods = new_haven_tracts, nhood_geoid = "geoid", year = yr )
The total population data is very straightforward, as it only has one variable, B01003_001
. The tibble returned has the GEOID, except for custom geographies like regions; the name of each geography, including the names of each region; the variable codes; estimates; margins of error at the default 90\% confidence level; the geographic level, numbered in order of decreasing size; and the counties of the towns.
The race and ethnicity table will require some calculations, using the brilliantly-titled camiller
package:
label_acs()
, join the race
tibble with the cwi::acs_vars
dataset to get variable labels. Oftentimes, these labels need to be separated by their "!!"
delimeter. camiller::add_grps()
with a list of racial groups and their labels' positions in the label
column. This gives estimates and, optionally, margins of error for aggregatescamiller::calc_shares()
then gives shares of each group's estimate over the "total"
denominator.gnh_data$race |> label_acs(year = yr) |> group_by(level, county, name) |> add_grps(list(total = 1, white = 3, black = 4, latino = 12, other = 5:9), group = label) |> calc_shares(group = label, denom = "total")
With the tenure table, it's easiest to separate the labels by "!!"
. Here the table can be wrangled into shares of households that are owner-occupied.
homeownership <- gnh_data$tenure |> label_acs(year = yr) |> separate(label, into = c("total", "tenure"), sep = "!!", fill = "left") |> select(level, name, tenure, estimate) |> group_by(level, name) |> calc_shares(group = tenure, denom = "Total") |> filter(tenure == "Owner occupied") homeownership
geo_level_plot()
gives a quick visual overview of the homeownership rates, highlighting town-level values.
homeownership |> geo_level_plot(value = share, hilite = "darkslateblue", type = "point")
acs_quick_map()
gives a quick map sketch of the rates. This function uses the Jenks algorithm for making breaks with jenks()
. This algorithm is well suited for visually displaying larger inequalities, but the number of breaks you give it won't necessarily be the number of breaks returned.This function lets us see whether there's a geographic distribution of this data with minimal work.
tenure_map <- homeownership |> filter(level == "4_town") |> quick_map( value = share, level = "town", color = "black", linewidth = 0.4, title = "Homeownership, Greater New Haven", palette = "BuPu" ) tenure_map
Since this returns a ggplot
object with sf
data, we can add additional ggplot
functions, such as labeling, themes, or additional scales or geoms.
tenure_map + labs(subtitle = stringr::str_glue("By town, {yr}")) + geom_sf(data = . %>% filter(name == "New Haven"), fill = NA, color = "black", linewidth = 1.5)
Say as part of a pipeline, you need to do some calculations, write different sections of a data frame to CSV files to pass along to a colleague or refer to later, and then continue on to some more calculations. batch_csv_dump()
takes either a list of data frames or a data frame plus a column to split by, and writes out a set of CSV files, then lets you move along to the next step in your pipeline.
For example, I need to pull a table of populations by age group for several regions of Connecticut. I don't need to split populations by gender, so I'll add up male and female populations for each age group. I don't actually need to more detailed age groups now, but I need to stash them in files for later, so I'll aggregate, write a bunch of files, and then aggregate into broader age groups that I need for my current work.
new_haven_regions <- regions[c( "Greater New Haven", "New Haven Inner Ring", "New Haven Outer Ring", "Lower Naugatuck Valley", "Greater Waterbury" )] age <- multi_geo_acs( table = "B01001", year = yr, towns = NULL, regions = new_haven_regions, counties = c("New Haven County", "Fairfield County") ) |> label_acs(year = yr) |> # shortcut around tidyr::separate separate_acs(into = c("sex", "age"), drop_total = TRUE) |> filter(!is.na(age)) |> mutate(age = forcats::as_factor(age)) |> group_by(name, level, age) |> summarise(estimate = sum(estimate)) |> ungroup() age |> split(~name) |> batch_csv_dump(base_name = "pop_by_age", bind = TRUE, verbose = TRUE) |> group_by(level, name) |> camiller::add_grps(list(ages00_04 = 1, ages05_17 = 2:4, ages00_17 = 1:4), group = age, value = estimate ) |> arrange(level, name, age)
I'm also interested in learning about employment by industry over the past several years. qwi_industry()
fetches county-level data by industry over time, either quarterly or annually. Here I'll look at annual averages of all industries for South Central COG and Connecticut over the past 16 years. I'm filtering out the industry code "00", which is the counts for all industries.
scc_employment <- qwi_industry(2002:2018, counties = "170", annual = TRUE) |> mutate(location = "South Central COG") ct_employment <- qwi_industry(2002:2018, annual = T) |> mutate(location = "Connecticut") employment <- bind_rows(scc_employment, ct_employment) |> filter(industry != "00") |> inner_join(naics_codes |> select(-ind_level), by = "industry") employment
Next, say I want to look at the industries that were largest in the South Central COG in 2018, and see how those have changed both for the COG and statewide over this time period. I'll filter employment
, get the industries with the largest numbers of employees, then filter employment
for just those industries and plot it.
top2018 <- employment |> filter(year == 2018, county == "170") |> top_n(8, emp) |> pull(industry) top2018 employment |> filter(industry %in% top2018) |> mutate(label = stringr::str_sub(label, 1, 25)) |> mutate(Emp_1k = emp / 1000) |> ggplot(aes(x = year, y = Emp_1k, color = label)) + geom_line() + labs( x = "Year", y = "Employees (thousands)", title = "Employment by industry", subtitle = "Connecticut and South Central COG, 2002-2018", color = "Industry" ) + theme_minimal() + facet_wrap(vars(location), scales = "free_y")
Update 11/2021: The QWI API was broken for a little while. It's up again, but all the payroll data is missing. This code should still be valid if it ever gets returned.
If I'm interested in changes in wages over this period, I can use the adj_inflation()
function. This takes a data frame, the name of the column containing dollar amounts, and a base year, then adds two columns for the inflation adjustment factor and the adjusted value.
employment |> filter(industry %in% top2018) |> mutate(label = stringr::str_sub(label, 1, 25)) |> mutate(avg_wage = Payroll / Emp) |> adj_inflation(value = avg_wage, base_year = 2018, year = year) |> mutate(adj_wage_1k = round(adj_avg_wage / 1000)) |> ggplot(aes(x = year, y = adj_wage_1k, color = label)) + geom_line() + scale_y_continuous(labels = scales::dollar) + labs( x = "Year", y = "Average annual wages (thousands)", title = "Average annual wages by industry (adjusted to 2018 dollars)", subtitle = "Connecticut and New Haven County, 2002-2018", color = "Industry" ) + theme_minimal() + facet_wrap(vars(location), scales = "free_y")
Now we have a visual that shows that in a few industries, wages have climbed over the past several years, but in many industries, wages haven't increased except by inflation.
To look at unemployment rates over time, I can use laus_trend()
. The LAUS covers smaller geographies than the QWI, so laus_trend()
is set up to find data by a combination of state, counties, or towns. The LAUS API returns monthly data on labor force counts, employment counts, unemployed counts, and unemployment rate; laus_trend()
lets you specify which of these measures to fetch.
```r unemployment <- laus_trend(c("New Haven", "New Haven County", "Connecticut"), startyear = 2000, endyear = 2018, measures = "unemployment rate" ) |> mutate(unemployment_rate = unemployment_rate / 100) |> select(area, date, value = unemployment_rate)
unemployment
unemp_plot <- ggplot(unemployment, aes(x = date, y = value, group = area, color = area)) + geom_line() + geom_smooth(se = FALSE, method = "loess", linewidth = 0.8) unemp_plot
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.