In luceydav/irsSOI: Tools to load and explore IRS SOI Zip Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE,
  fig.align="center",
  out.width = "100%"
)

library(irsSOI)

The NBER's Individual Income Tax Statistics - ZIP Code data set is a cleaned version of the [IRS SOI data](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi), with the years 2005-2016 available for download. The {irsSOI} package makes the selected data available as a data.table through the complete workflow shown below.

# Not run but to show if not using built-in data

# Download data from NBER IRS SOI website
path <- "./data/"
download_nber_data(path = path, start_year = 2015, end_year = 2016)

# Load data from specified folder
irs_app_data <- irsSOI::load_soi(path = path)

# Clean and standardize income bands
irs_app_data <- irsSOI::clean_soi(irs_app_data)

# Prepare data for use in Shiny app
irs_app_data <- irsSOI::prepare_app_data(irs_app_data)

The download_nber_data() function downloads the specified years and saves them to a specified folder. Each year takes about 1 minute and 75MB of data. If no path is specified, the function creates a folder called "data" in the working directory.

Once the chosen years have been downloaded, the load_soi() function reads all the data in the specified folder and row binds all years to return a single data.table. Because of changes over time, annual data sometimes has different variable names and different thresholds for income bands.
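
As a rough illustration of the row-binding step, here is a minimal sketch using data.table::rbindlist() on two toy tables. It is not the package's actual implementation: the tables, the column names (zipcode, agi_stub, n1, a00100) and their values are made up for the example, and load_soi() may handle mismatched columns differently.

```r
library(data.table)

# Toy stand-ins for two annual files; names and values are hypothetical
soi_2015 <- data.table(zipcode = c("06001", "06002"),
                       agi_stub = c(1L, 2L),
                       n1 = c(120, 340))
soi_2016 <- data.table(zipcode = c("06001", "06002"),
                       agi_stub = c(1L, 2L),
                       n1 = c(130, 350),
                       a00100 = c(5000, 9000))

# Name the list elements by year so that idcol records the source year
years <- list(`2015` = soi_2015, `2016` = soi_2016)

# use.names matches columns by name; fill = TRUE pads columns that are
# missing from a given year with NA instead of raising an error
combined <- rbindlist(years, use.names = TRUE, fill = TRUE, idcol = "year")
```

In this sketch the 2015 rows carry NA in a00100, which makes year-to-year differences in the available fields easy to spot before cleaning.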

In clean_soi(), we perform a number of cleaning steps. For example, zip codes with a leading zero have only 4 digits, and the years 2007-2008 are rounded by 1000 when other years are not. We have also aggregated income bands where possible. For example, early years had separate bands for <$10k and <$25k, while the later years have only <$25k, so we aggregated the totals of the two bands into one. We unfortunately also had to combine <$100k and <$250k, although keeping them separate would have made a more meaningful high-income category. We have also sometimes coalesced variables which were the same but had different names over time.
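
Two of these steps, restoring the leading zero and folding two income bands into one, might look like the following sketch. It only illustrates the idea on toy data; the column names (zipcode, band, returns) and band labels are assumptions and not necessarily what clean_soi() uses internally.

```r
library(data.table)

# Toy data; column names, band labels and counts are hypothetical
soi <- data.table(
  year    = c(2005L, 2005L, 2016L),
  zipcode = c("6001", "6001", "6001"),
  band    = c("<$10k", "<$25k", "<$25k"),
  returns = c(100, 150, 260)
)

# Restore the leading zero dropped from 4-digit zip codes
soi[, zipcode := sprintf("%05d", as.integer(zipcode))]

# Relabel the early "<$10k" band so that all years share one "<$25k" band,
# then sum the totals within each year, zip code and band
soi[band == "<$10k", band := "<$25k"]
soi_agg <- soi[, .(returns = sum(returns)), by = .(year, zipcode, band)]
```

After aggregation, 2005 has a single <$25k row for the zip code, directly comparable with the 2016 row.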

Lastly, prepare_app_data() selects columns which are consistent over the period or where consistent fields could be derived. It filters out any zip codes with fewer than 100 returns in any given period of the specified data, and adds population data by zip code. Although there are approximately 42,000 zip codes in the country, our data has 24,143. Many zip codes are actually P.O. Boxes with no people, and others have very few people. We estimate that the lost income in any given year is between $100-150 billion.
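
The filtering rule and the population join can be sketched as below. This is an assumption-laden illustration rather than the body of prepare_app_data(): the tables, column names and population figures are invented, and the rule is interpreted as "drop a zip code if any year in the data has fewer than 100 returns".

```r
library(data.table)

# Toy data: total returns per zip code and year (hypothetical values)
zip_year <- data.table(
  zipcode = c("06001", "06001", "06002", "06002"),
  year    = c(2015L, 2016L, 2015L, 2016L),
  returns = c(500, 520, 90, 110)
)

# Keep a zip code only if every year in the data has at least 100 returns
keep <- zip_year[, .(min_returns = min(returns)), by = zipcode][
  min_returns >= 100, zipcode]
filtered <- zip_year[zipcode %in% keep]

# Attach a hypothetical population table by zip code (left join)
population <- data.table(zipcode = c("06001", "06002"),
                         population = c(18000, 250))
filtered <- merge(filtered, population, by = "zipcode", all.x = TRUE)
```

Here zip code 06002 is dropped entirely because its 2015 total falls below the threshold, even though its 2016 total does not.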

data("data_dict")

# Show some descriptions from 2016
d <- data_dict[year == "2016"]
DT::datatable(d[, .(name, desc)], 
              colnames = c("Name", "Description"))

The NBER has coded fields according to its own identifiers. Use field_lookup() to find the description of a field.

field_lookup(select_year = "2016", select_name = "a00100")

Load the included sample data.table, called irs_app_data, for the State of CT from 2005-2018.

# Load built-in data
data("irs_app_data")

Make a Summary datatable for the State of CT with the make_summary_DT() function.

make_summary_DT(irs_app_data)

Make a graph of the aggregated AGI of the State of CT by income band with the make_agi_graph() function.

make_agi_graph(irs_app_data)

Make a graph of the average Federal tax rate for the State of CT by income band with the make_tax_graph() function.

make_tax_graph(irs_app_data)
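
For readers who want the underlying arithmetic, one plausible definition of an average tax rate by income band is total tax divided by total AGI within each year and band. The sketch below assumes that definition and uses made-up column names (agi, tax, band) and values; make_tax_graph() may compute the rate differently.

```r
library(data.table)

# Toy zip-level rows; names, bands and amounts are hypothetical
toy <- data.table(
  year = c(2016L, 2016L, 2016L),
  band = c("<$25k", "<$25k", "<$50k"),
  agi  = c(1000, 3000, 10000),
  tax  = c(50, 150, 2500)
)

# Aggregate first, then divide, so larger zip codes carry more weight
rates <- toy[, .(avg_tax_rate = sum(tax) / sum(agi)), by = .(year, band)]
```

Dividing the sums, rather than averaging per-zip rates, keeps small zip codes from distorting the band-level figure.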

# Not run, but to view Shiny app in Package
irsApp(irs_app_data)

We have made a public Shiny IRS Tax Dashboard, which can be filtered by State, County, Post Office City and Zipcode. Once the geography is chosen, it is also possible to filter among the five income bands. Currently, the App has two tabs, one for the Summary Table and another for the AGI and tax rate graphs. The data set is rich with more fields, such as mortgage and SALT deductions, dependents, and Earned Income Credit by zip code, but including them made it too large for the first iteration. We will be adding more in a future version, which will use our own package as well as the {golem} package.

luceydav/irsSOI documentation built on May 3, 2024, 3:25 a.m.
