```r
library(knitr)
opts_chunk$set(
  eval = TRUE,
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.path = "../plots/",
  fig.width = 10,
  dpi = 300
)
if (!interactive()) {
  options(width = 120)
  set.seed(5)
}
```
```r
doc_dir <- fs::dir_create(here::here("state", "{st}", "{type}", "docs"))
```
The Accountability Project is an effort to cut across data silos and give journalists, policy professionals, activists, and the public at large a simple way to search across huge volumes of public data about people and organizations.
Our goal is to standardize public data on a few key fields by thinking of each dataset row as a transaction. For each transaction there should be (at least) 3 variables:

1. All parties to a transaction.
2. The date of the transaction.
3. The amount of money involved.
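For instance, a single standardized row might carry values like the ones below; these are hypothetical values used only to illustrate the shape of the data.

```r
# a hypothetical standardized transaction row (illustrative only)
tibble::tribble(
  ~date,        ~last_name, ~committee_name,          ~amount,
  "2020-10-01", "SMITH",    "COMMITTEE TO ELECT DOE", 250
)
```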
This document describes the process used to complete the following objectives:
1. How many records are in the database?
2. Check for entirely duplicated records.
3. Check ranges of continuous variables.
4. Is there anything blank or missing?
5. Check for consistency issues.
6. Create a five-digit ZIP Code called `zip`.
7. Create a `year` field from the transaction date.
8. Make sure there is data on both parties to a transaction.

The following packages are needed to collect, manipulate, visualize, analyze,
and communicate these results. The `pacman` package will facilitate their installation and attachment.
```r
if (!require("pacman")) {
  install.packages("pacman")
}
pacman::p_load(
  tidyverse, # data manipulation
  lubridate, # datetime strings
  gluedown,  # printing markdown
  jsonlite,  # read json files
  janitor,   # clean data frames
  campfin,   # custom irw tools
  aws.s3,    # aws cloud storage
  readxl,    # read excel files
  refinr,    # cluster & merge
  scales,    # format strings
  knitr,     # knit documents
  rvest,     # scrape html
  glue,      # code strings
  here,      # project paths
  httr,      # http requests
  fs         # local storage
)
```
This diary was run using `campfin` version `r packageVersion("campfin")`.
```r
packageVersion("campfin")
```
```r
options(knitr.kable.NA = "")
```
This document should be run as part of the `R_tap` project, which lives as a sub-directory of the more general, language-agnostic `irworkshop/accountability_datacleaning` GitHub repository.
The `R_tap` project uses the RStudio projects feature and should be run as such. The project also uses the dynamic `here::here()` tool for file paths relative to your machine.
```r
# where does this document knit?
here::i_am("state/{st}/{type}/docs/{st}_{type}_diary.Rmd")
```
```r
raw_url <- "https://example.com/source_file.csv"
raw_dir <- dir_create(here("state", "{st}", "{type}", "data", "raw"))
raw_csv <- path(raw_dir, basename(raw_url))
```
```r
if (!file_exists(raw_csv)) {
  download.file(raw_url, raw_csv)
}
```
```r
{stt} <- read_delim(
  file = raw_csv,
  delim = ",",
  escape_backslash = FALSE,
  escape_double = FALSE,
  col_types = cols(
    .default = col_character(),
    date = col_date_mdy(),
    amount = col_double()
  )
)
```
```r
{stt} <- clean_names({stt}, case = "snake")
```
There are `r comma(nrow({stt}))` rows of `r ncol({stt})` columns. Each record represents a single {Type}...
```r
glimpse({stt})
tail({stt})
```
Columns vary in their degree of missing values.
```r
col_stats({stt}, count_na)
```
We can flag any record missing a key variable needed to identify a transaction.
```r
key_vars <- c("date", "last_name", "amount", "committee_name")
{stt} <- flag_na({stt}, all_of(key_vars))
sum({stt}$na_flag)
```
```r
{stt} %>% 
  filter(na_flag) %>% 
  select(all_of(key_vars))
```
We can also flag any record completely duplicated across every column.
```r
{stt} <- flag_dupes({stt}, -id)
sum({stt}$dupe_flag)
```
```r
{stt} %>% 
  filter(dupe_flag) %>% 
  select(all_of(key_vars)) %>% 
  arrange(date)
```
```r
col_stats({stt}, n_distinct)
```
```r
explore_plot({stt}, type)
```
```r
# fix floating point precision
{stt}$amount <- round({stt}$amount, digits = 2)
```
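The rounding above guards against binary floating-point artifacts in the source data. A quick, generic illustration (not taken from the data):

```r
# binary floating point cannot represent 0.1 or 0.2 exactly,
# so small artifacts can appear in sums and comparisons
0.1 + 0.2 == 0.3
#> [1] FALSE
round(0.1 + 0.2, digits = 2) == 0.3
#> [1] TRUE
```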
```r
summary({stt}$amount)
mean({stt}$amount <= 0)
```
These are the records with the minimum and maximum amounts.
```r
glimpse({stt}[c(which.max({stt}$amount), which.min({stt}$amount)), ])
```
The distribution of `amount` values is typically log-normal.
```r
{stt} %>% 
  ggplot(aes(amount)) +
  geom_histogram(fill = dark2["purple"]) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(
    breaks = c(1 %o% 10^(0:6)),
    labels = dollar,
    trans = "log10"
  ) +
  labs(
    title = "{State} {Type} Amount Distribution",
    caption = "Source: {source}",
    x = "Amount",
    y = "Count"
  )
```
We can add the calendar year from `date` with `lubridate::year()`.
```r
{stt} <- mutate({stt}, year = year(date))
```
```r
min({stt}$date)
sum({stt}$year < 2000)
max({stt}$date)
sum({stt}$date > today())
```
It's common to see an increase in the number of contributions in election years.
```r
{stt} %>% 
  count(year) %>% 
  mutate(even = is_even(year)) %>% 
  ggplot(aes(x = year, y = n)) +
  geom_col(aes(fill = even)) + 
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = seq(2000, 2020, by = 2)) +
  theme(legend.position = "bottom") +
  labs(
    title = "{State} {Type} by Year",
    caption = "Source: {source}",
    fill = "Election Year",
    x = "Year Made",
    y = "Count"
  )
```
To improve the searchability of the database, we will perform some consistent, confident string normalization. For geographic variables like city names and ZIP codes, the corresponding `campfin::normal_*()` functions are tailor-made to facilitate this process.
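As a rough illustration of what these normalizers do, here is a minimal sketch on made-up values; the exact outputs depend on the installed `campfin` version and its reference tables, so the comments show typical rather than guaranteed results.

```r
# toy values only; comments show typical output
normal_address("123 Main Street", abbs = usps_street, na_rep = TRUE)
#> typically "123 MAIN ST"
normal_zip("05602-1234", na_rep = TRUE)
#> typically "05602"
normal_state("Vermont", abbreviate = TRUE, na_rep = TRUE, valid = valid_state)
#> typically "VT"
normal_city("ST JOHNSBURY", abbs = usps_city, na = invalid_city, na_rep = TRUE)
#> typically "SAINT JOHNSBURY"
```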
For the street address variable, the `campfin::normal_address()` function will force consistent case, remove punctuation, and abbreviate official USPS suffixes.
```r
addr_norm <- {stt} %>% 
  distinct(address1, address2) %>% 
  unite(
    col = address_full,
    starts_with("address"),
    sep = " ",
    remove = FALSE,
    na.rm = TRUE
  ) %>% 
  mutate(
    address_norm = normal_address(
      address = address_full,
      abbs = usps_street,
      na_rep = TRUE
    )
  ) %>% 
  select(-address_full)
```
```r
addr_norm
```
```r
{stt} <- left_join({stt}, addr_norm, by = c("address1", "address2"))
```
For ZIP codes, the `campfin::normal_zip()` function will attempt to create valid five-digit codes by removing the ZIP+4 suffix and returning leading zeroes dropped by other programs like Microsoft Excel.
```r
{stt} <- {stt} %>% 
  mutate(
    zip_norm = normal_zip(
      zip = zip,
      na_rep = TRUE
    )
  )
```
```r
progress_table(
  {stt}$zip,
  {stt}$zip_norm,
  compare = valid_zip
)
```
Valid two-letter state abbreviations can be made using the `campfin::normal_state()` function.
```r
{stt} <- {stt} %>% 
  mutate(
    state_norm = normal_state(
      state = state,
      abbreviate = TRUE,
      na_rep = TRUE,
      valid = valid_state
    )
  )
```
```r
{stt} %>% 
  filter(state != state_norm) %>% 
  count(state, state_norm, sort = TRUE)
```
```r
progress_table(
  {stt}$state,
  {stt}$state_norm,
  compare = valid_state
)
```
Cities are the most difficult geographic variable to normalize, simply due to the wide variety of valid cities and formats.
The `campfin::normal_city()` function is a good start, again converting case, removing punctuation, and expanding USPS abbreviations. We can also remove `invalid_city` values.
```r
norm_city <- {stt} %>% 
  distinct(city, state_norm, zip_norm) %>% 
  mutate(
    city_norm = normal_city(
      city = city, 
      abbs = usps_city,
      states = c("{ST}", "DC", "{STATE}"),
      na = invalid_city,
      na_rep = TRUE
    )
  )
```
We can further improve normalization by comparing our normalized value against the expected value for that record's state abbreviation and ZIP code. If the normalized value is either an abbreviation for or very similar to the expected value, we can confidently swap those two.
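The two helper checks used in the step below can be illustrated on made-up values; this is only a sketch, since the real comparison runs against the `zipcodes` lookup table.

```r
# toy comparison of a normalized city against an expected match
is_abbrev("NYC", "NEW YORK CITY")   # could the first be an abbreviation of the second?
str_dist("BURLINGTN", "BURLINGTON") # string edit distance between the two (one here)
```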
```r
norm_city <- norm_city %>% 
  rename(city_raw = city) %>% 
  left_join(
    y = zipcodes,
    by = c(
      "state_norm" = "state",
      "zip_norm" = "zip"
    )
  ) %>% 
  rename(city_match = city) %>% 
  mutate(
    match_abb = is_abbrev(city_norm, city_match),
    match_dist = str_dist(city_norm, city_match),
    city_swap = if_else(
      condition = !is.na(match_dist) & (match_abb | match_dist == 1),
      true = city_match,
      false = city_norm
    )
  ) %>% 
  select(
    -city_match,
    -match_dist,
    -match_abb
  )
```
```r
{stt} <- left_join(
  x = {stt},
  y = norm_city,
  by = c(
    "city" = "city_raw",
    "state_norm",
    "zip_norm"
  )
)
```
The OpenRefine algorithms can be used to group similar strings and replace the less common versions with their most common counterpart. This can greatly reduce inconsistency, but with low confidence; we will only keep any refined strings that have a valid city/state/zip combination.
```r
good_refine <- {stt} %>% 
  mutate(
    city_refine = city_swap %>% 
      key_collision_merge() %>% 
      n_gram_merge(numgram = 1)
  ) %>% 
  filter(city_refine != city_swap) %>% 
  inner_join(
    y = zipcodes,
    by = c(
      "city_refine" = "city",
      "state_norm" = "state",
      "zip_norm" = "zip"
    )
  )
```
```r
good_refine %>% 
  count(
    state_norm,
    zip_norm,
    city_swap,
    city_refine,
    sort = TRUE
  )
```
Then we can join the refined values back to the database.
```r
{stt} <- {stt} %>% 
  left_join(good_refine, by = names(.)) %>% 
  mutate(city_refine = coalesce(city_refine, city_swap))
```
Our goal for normalization was to increase the proportion of city values known to be valid and reduce the total distinct values by correcting misspellings.
```r
many_city <- c(valid_city, extra_city)
progress <- progress_table(
  str_to_upper({stt}$city),
  {stt}$city_norm,
  {stt}$city_swap,
  {stt}$city_refine,
  compare = many_city
) %>% 
  mutate(stage = as_factor(stage))
progress %>% 
  mutate(across(stage, md_code)) %>% 
  kable(digits = 3)
```
You can see how the percentage of valid values increased with each stage.
```r
raw_in <- percent(prop_in({stt}$city, valid_city))
progress %>% 
  ggplot(aes(x = stage, y = prop_in)) +
  geom_hline(yintercept = 0.99) +
  geom_col(fill = dark2["purple"]) +
  coord_cartesian(ylim = c(0.75, 1)) +
  scale_y_continuous(labels = percent) +
  labs(
    title = "{State} City Normalization Progress",
    subtitle = glue("Raw at {raw_in} before conversion to uppercase"),
    x = "Stage",
    y = "Percent Valid"
  )
```
More importantly, the number of distinct values decreased with each stage. We were able to confidently change many distinct invalid values to their valid equivalents.
```r
progress %>% 
  select(
    stage,
    all = n_distinct,
    bad = n_diff
  ) %>% 
  mutate(good = all - bad) %>% 
  pivot_longer(c("good", "bad")) %>% 
  mutate(name = name == "good") %>% 
  ggplot(aes(x = stage, y = value)) +
  geom_col(aes(fill = name)) +
  scale_fill_brewer(palette = "Dark2", direction = -1) +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "bottom") +
  labs(
    title = "{State} City Normalization Progress",
    subtitle = "Distinct values, valid and invalid",
    x = "Stage",
    y = "Distinct Values",
    fill = "Valid"
  )
```
Before exporting, we can remove the intermediary normalization columns and rename all added variables with the `_clean` suffix.
```r
{stt} <- {stt} %>% 
  select(
    -city_norm,
    -city_swap,
    city_clean = city_refine
  ) %>% 
  rename_all(~str_replace(., "_norm", "_clean")) %>% 
  rename_all(~str_remove(., "_raw")) %>% 
  relocate(address_clean, city_clean, state_clean, .before = zip_clean)
```
```r
glimpse(sample_n({stt}, 1000))
```
1. There are `r comma(nrow({stt}))` records in the database.
2. There are `r comma(sum({stt}$dupe_flag))` duplicate records in the database.
3. The range and distribution of `amount` and `date` seem reasonable.
4. There are `r comma(sum({stt}$na_flag))` records missing key variables.
5. Consistency in geographic data has been improved with `campfin::normal_*()`.
6. The 4-digit `year` variable has been created with `lubridate::year()`.

Now the file can be saved on disk for upload to the Accountability server. We will name the object using a date range of the records included.
```r
min_dt <- str_remove_all(min({stt}$date), "-")
max_dt <- str_remove_all(max({stt}$date), "-")
csv_ts <- paste(min_dt, max_dt, sep = "-")
```
```r
clean_dir <- dir_create(here("state", "{st}", "{type}", "data", "clean"))
clean_csv <- path(clean_dir, glue("{st}_{type}_{csv_ts}.csv"))
clean_rds <- path_ext_set(clean_csv, "rds")
basename(clean_csv)
```
```r
write_csv({stt}, clean_csv, na = "")
write_rds({stt}, clean_rds, compress = "xz")
(clean_size <- file_size(clean_csv))
```
We can use `aws.s3::put_object()` to upload the text file to the IRW server.
```r
aws_key <- path("csv", basename(clean_csv))
if (!object_exists(aws_key, "publicaccountability")) {
  put_object(
    file = clean_csv,
    object = aws_key,
    bucket = "publicaccountability",
    acl = "public-read",
    show_progress = TRUE,
    multipart = TRUE
  )
}
aws_head <- head_object(aws_key, "publicaccountability")
(aws_size <- as_fs_bytes(attr(aws_head, "content-length")))
unname(aws_size == clean_size)
```