knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
library(dplyr)
library(PUMSutils)

ACS PUMS Data

The American Community Survey (ACS) is an annual survey conducted by the U.S. Census Bureau, covering many of the same topics as the decennial census. It is a 1% sample, and it comes in 2 parts: a housing survey of household units and a population survey of people. ACS data is available in several forms. Online, you can get custom tables from American FactFinder. Microdata is available through the Public Use Microdata Sample, or PUMS. This contains actual responses to the ACS, one per row. This allows essentially any query to be formed, not just those provided by AFF. However, the data is harder to work with, since every query requires doing tabulation. The PUMSutils package reduces this part to function calls.

The most important thing to know about the ACS PUMS data is that it is a weighted sample. Each row has an expansion weight indicating the number of households or people represented by this row. In addition to the main expansion weight, there are 80 weight replicates, which are used in a bootstrap-like procedure for computing standard errors.

In the PUMSutils package, a partial 2016 ACS PUMS housing record for Washington State is included as wa.house16 and Census Bureau-provided ground truth estimates for some Washington State housing characteristics are provided as wa.gold16. Below we show part of the wa.gold16 data.

# Ground truth for Washington State
wa.gold16[c(1, 3, 4, 9),]

Here there are 4 characteristics shown, total housing units, the number of owned units, the number of rented units, and the number of vacant units. The column pums_est_16 has point estimates of these characteristics. The third column, called pums_se_16, has standard errors.

If we were not using PUMSutils, we would compute a point estimate of the housing unit count like this:

# Sum the weights to estimate a count
sum(wa.house16$WGTP)

Proportions, various statistics and standard errors can also be computed, but the calculations are more involved. The PUMSutils package simplifies these calculations.

PUMSutils

The PUMSUtils package has functions for computing estimates and standard errors for counts, proportions, means and quantiles as well as computing tables of any statistic that it can compute. It also contains utility functions for loading PUMS datasets, renaming the numerous integer-coded categorical variables in the PUMS data, and querying metadata from the data dictionary.

Using PUMSutils the above example looks like this:

estimate(wa.house16)

In this case, there is no real difference. Computing a standard error using the weight replicates is more involved, and requires using the 80 replicate weights. See the PUMS technical documentation for details.

Using PUMSutils the standard errors are computed with a function acs.se.

acs.se(wa.house16, estimate)

The value given above in wa.gold16 for pums_se_16 was 675.

The ACS PUMS data contains lots of integer-coded categorical values. The PUMSUtils package contains functions for translating those to a readable form. It also contains functions for computing statistics over groups of data. Here we compute totals for owned, rented and vacant housing in Washington State, using the 'TEN' field. Compare these results to the selection from wa.gold16 above.

wa.house16$Own.Rent <- acs.translate(wa.house16, 
                                   'TEN', 
                                   c(1:4, NA), 
                                   c('own', 'own', 'rent', 'rent', 'vacant'))   
group.count(wa.house16, 'Own.Rent')

The values of the 'TEN' field are:

  1. owned with mortgage
  2. owned free and clear
  3. rented for cash
  4. occupied w/o payment

For the wa.house16 data, you can use ?wa.house16 to get help for this data set. In general, you will have to look at the data dictionary for the definitions. It is on the Census Bureau's technical documentation page.

For the 2016 1-year data, the data dictionary is included in the package as data.dict16, and it can be queried with acs.describe.

acs.describe(data.dict16, 'TEN')

If you are working with other ACS data, you should use the appropriate data dictionary, although the provided one will usually work at least for recent 1-year data. The Census bureau supplies the data dictionaries as machine-generated, human-readable text files. This package works with the data dictionary as a string, so you can just load them with readr::read_file(path).

There are a large number of integer-coded categorical fields in the ACS data. These can be rewritten automatically using the data dictionary and acs.recode. Here we compute counts of Washington residents by employment status from wa.pop16, which contains partial ACS PUMS population data for Washington State for 2016.

wa.pop16$employment <- acs.recode(wa.pop16, 'ESR', data.dict16, 5)
group.count(wa.pop16, 'employment')

If you want to tabulate over multiple variables, you can provide data already grouped using dplyr::group_by. In this case, the grouping variable is added as a new level, and you pass NULL if you do not want to add any new levels.

# Clip number of persons to 5
wa.house16$Size <- clip.column(wa.house16, 'NP', 5)
gp <- group_by(wa.house16, Own.Rent, Size)
group.count(gp, NULL)

Be sure to check the standard errors when tabulating on multiple variables. The sample counts can get small, resulting in large margins-of-error.

Check the PUMSutils documentation for the remaining functions.



davidthaler/PUMSutils documentation built on July 13, 2019, 9:58 a.m.