library(knitr)
opts_chunk$set(
  eval = TRUE,
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.path = "../plots/",
  fig.width = 10,
  dpi = 300
)
if (!interactive()) {
  options(width = 120)
  set.seed(5)
}
doc_dir <- fs::dir_create(here::here("state", "{st}", "{type}", "docs"))

Project

The Accountability Project is an effort to cut across data silos and give journalists, policy professionals, activists, and the public at large a simple way to search across huge volumes of public data about people and organizations.

Our goal is to standardize public data on a few key fields by thinking of each dataset row as a transaction. For each transaction there should be (at least) 3 variables:

  1. All parties to a transaction.
  2. The date of the transaction.
  3. The amount of money involved.
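
As a hypothetical illustration (the parties, date, and amount below are invented, not drawn from any source data), a single standardized transaction row might look like this:

# hypothetical example row containing the three key fields
tibble::tibble(
  giver = "JANE DOE",               # one party to the transaction
  receiver = "COMMITTEE TO ELECT",  # the other party
  date = as.Date("2020-10-01"),     # the date of the transaction
  amount = 100                      # the amount of money involved
)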

Objectives

This document describes the process used to complete the following objectives:

  1. How many records are in the database?
  2. Check for entirely duplicated records.
  3. Check ranges of continuous variables.
  4. Is there anything blank or missing?
  5. Check for consistency issues.
  6. Create a five-digit ZIP Code called zip.
  7. Create a year field from the transaction date.
  8. Make sure there is data on both parties to a transaction.

Packages

The following packages are needed to collect, manipulate, visualize, analyze, and communicate these results. The pacman package will facilitate their installation and attachment.

if (!require("pacman")) {
  install.packages("pacman")
}
pacman::p_load(
  tidyverse, # data manipulation
  lubridate, # datetime strings
  gluedown, # printing markdown
  jsonlite, # read json files
  janitor, # clean data frames
  campfin, # custom irw tools
  aws.s3, # aws cloud storage
  readxl, # read excel files
  refinr, # cluster & merge
  scales, # format strings
  knitr, # knit documents
  rvest, # scrape html
  glue, # code strings
  here, # project paths
  httr, # http requests
  fs # local storage 
)

This diary was run using campfin version r packageVersion("campfin").

packageVersion("campfin")
options(knitr.kable.NA = "")

This document should be run as part of the R_tap project, which lives as a sub-directory of the more general, language-agnostic irworkshop/accountability_datacleaning GitHub repository.

The R_tap project uses the RStudio projects feature and should be run as such. The project also uses the dynamic here::here() tool for file paths relative to your machine.

# where does this document knit?
here::i_am("state/{st}/{type}/docs/{st}_{type}_diary.Rmd")

Source

Download

raw_url <- "https://example.com/source_file.csv"
raw_dir <- dir_create(here("state", "{st}", "{type}", "data", "raw"))
raw_csv <- path(raw_dir, basename(raw_url))
if (!file_exists(raw_csv)) {
  download.file(raw_url, raw_csv)
}

Read

{stt} <- read_delim(
  file = raw_csv,
  delim = ",",
  escape_backslash = FALSE,
  escape_double = FALSE,
  col_types = cols(
    .default = col_character(),
    date = col_date_mdy(),
    amount = col_double()
  )
)
{stt} <- clean_names({stt}, case = "snake")

Explore

There are r comma(nrow({stt})) rows of r ncol({stt}) columns. Each record represents a single {Type}...

glimpse({stt})
tail({stt})

Missing

Columns vary in their degree of missing values.

col_stats({stt}, count_na)

We can flag any record missing a key variable needed to identify a transaction.

key_vars <- c("date", "last_name", "amount", "committee_name")
{stt} <- flag_na({stt}, all_of(key_vars))
sum({stt}$na_flag)
{stt} %>% 
  filter(na_flag) %>% 
  select(all_of(key_vars))

Duplicates

We can also flag any record completely duplicated across every column.

{stt} <- flag_dupes({stt}, -id)
sum({stt}$dupe_flag)
{stt} %>% 
  filter(dupe_flag) %>% 
  select(all_of(key_vars)) %>% 
  arrange(date)

Categorical

col_stats({stt}, n_distinct)
explore_plot({stt}, type)

Amounts

# fix floating point precision
{stt}$amount <- round({stt}$amount, digits = 2)
summary({stt}$amount)
mean({stt}$amount <= 0)

These are the records with the minimum and maximum amounts.

glimpse({stt}[c(which.max({stt}$amount), which.min({stt}$amount)), ])

The distribution of amount values is typically log-normal.

{stt} %>%
  ggplot(aes(amount)) +
  geom_histogram(fill = dark2["purple"]) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(
    breaks = c(1 %o% 10^(0:6)),
    labels = dollar,
    trans = "log10"
  ) +
  labs(
    title = "{State} {Type} Amount Distribution",
    caption = "Source: {source}",
    x = "Amount",
    y = "Count"
  )

Dates

We can add the calendar year from date with lubridate::year()

{stt} <- mutate({stt}, year = year(date))
min({stt}$date)
sum({stt}$year < 2000)
max({stt}$date)
sum({stt}$date > today())

It's common to see an increase in the number of contributions in election years.

{stt} %>% 
  count(year) %>% 
  mutate(even = is_even(year)) %>% 
  ggplot(aes(x = year, y = n)) +
  geom_col(aes(fill = even)) + 
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = seq(2000, 2020, by = 2)) +
  theme(legend.position = "bottom") +
  labs(
    title = "{State} {Type} by Year",
    caption = "Source: {source}",
    fill = "Election Year",
    x = "Year Made",
    y = "Count"
  )

Wrangle

To improve the searchability of the database, we will perform some consistent, confident string normalization. For geographic variables like city names and ZIP codes, the corresponding campfin::normal_*() functions are tailor-made to facilitate this process.

Address

For the street address variable, the campfin::normal_address() function will force consistent case, remove punctuation, and abbreviate official USPS suffixes.

addr_norm <- {stt} %>% 
  distinct(address1, address2) %>% 
  unite(
    col = address_full,
    starts_with("address"),
    sep = " ",
    remove = FALSE,
    na.rm = TRUE
  ) %>% 
  mutate(
    address_norm = normal_address(
      address = address_full,
      abbs = usps_street,
      na_rep = TRUE
    )
  ) %>% 
  select(-address_full)
addr_norm
{stt} <- left_join({stt}, addr_norm, by = c("address1", "address2"))
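
As a quick sanity check, normal_address() can be run on a toy string (hypothetical, not taken from the data) to confirm the behavior described above.

# toy example: the result should be upper case, without punctuation, and
# with "STREET" abbreviated to the USPS suffix "ST"
normal_address("123 Main Street, Suite 4", abbs = usps_street, na_rep = TRUE)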

ZIP

For ZIP codes, the campfin::normal_zip() function will attempt to create valid five-digit codes by removing the ZIP+4 suffix and restoring leading zeroes dropped by other programs like Microsoft Excel.

{stt} <- {stt} %>% 
  mutate(
    zip_norm = normal_zip(
      zip = zip,
      na_rep = TRUE
    )
  )
progress_table(
  {stt}$zip,
  {stt}$zip_norm,
  compare = valid_zip
)
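
A toy value (not from the data) shows the intended effect of removing the ZIP+4 suffix.

# toy example: the ZIP+4 suffix should be dropped, leaving five digits
normal_zip("05501-1234", na_rep = TRUE)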

State

Valid two-letter state abbreviations can be made using the campfin::normal_state() function.

{stt} <- {stt} %>% 
  mutate(
    state_norm = normal_state(
      state = state,
      abbreviate = TRUE,
      na_rep = TRUE,
      valid = valid_state
    )
  )
{stt} %>% 
  filter(state != state_norm) %>% 
  count(state, state_norm, sort = TRUE)
progress_table(
  {stt}$state,
  {stt}$state_norm,
  compare = valid_state
)
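
A toy value (hypothetical) can confirm that a spelled-out state name is reduced to a valid two-letter abbreviation.

# toy example: a full state name should abbreviate to its two-letter code
normal_state("Vermont", abbreviate = TRUE, valid = valid_state)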

City

Cities are the most difficult geographic variable to normalize, simply due to the wide variety of valid cities and formats.

Normal

The campfin::normal_city() function is a good start, again converting case and removing punctuation, but expanding USPS abbreviations rather than creating them. We can also remove invalid_city values.

norm_city <- {stt} %>% 
  distinct(city, state_norm, zip_norm) %>% 
  mutate(
    city_norm = normal_city(
      city = city, 
      abbs = usps_city,
      states = c("{ST}", "DC", "{STATE}"),
      na = invalid_city,
      na_rep = TRUE
    )
  )

Swap

We can further improve normalization by comparing our normalized value against the expected value for that record's state abbreviation and ZIP code. If the normalized value is either an abbreviation for or very similar to the expected value, we can confidently swap those two.

norm_city <- norm_city %>% 
  rename(city_raw = city) %>% 
  left_join(
    y = zipcodes,
    by = c(
      "state_norm" = "state",
      "zip_norm" = "zip"
    )
  ) %>% 
  rename(city_match = city) %>% 
  mutate(
    match_abb = is_abbrev(city_norm, city_match),
    match_dist = str_dist(city_norm, city_match),
    city_swap = if_else(
      condition = !is.na(match_dist) & (match_abb | match_dist == 1),
      true = city_match,
      false = city_norm
    )
  ) %>% 
  select(
    -city_match,
    -match_dist,
    -match_abb
  )
{stt} <- left_join(
  x = {stt},
  y = norm_city,
  by = c(
    "city" = "city_raw", 
    "state_norm", 
    "zip_norm"
  )
)
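
The two swap criteria used above can be illustrated on toy strings (hypothetical values, not from the data): is_abbrev() checks whether one string could abbreviate another, and str_dist() counts how many single-character edits separate two strings.

# toy examples of the swap criteria
is_abbrev("ANCH", "ANCHORAGE") # is the first a possible abbreviation?
str_dist("ANCHORGE", "ANCHORAGE") # edit distance to the expected city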

Refine

The OpenRefine algorithms can be used to group similar strings and replace the less common versions with their most common counterpart. This can greatly reduce inconsistency, but with low confidence, so we will only keep refined strings that have a valid city/state/ZIP combination.

good_refine <- {stt} %>% 
  mutate(
    city_refine = city_swap %>% 
      key_collision_merge() %>% 
      n_gram_merge(numgram = 1)
  ) %>% 
  filter(city_refine != city_swap) %>% 
  inner_join(
    y = zipcodes,
    by = c(
      "city_refine" = "city",
      "state_norm" = "state",
      "zip_norm" = "zip"
    )
  )
good_refine %>%
  count(
    state_norm, 
    zip_norm, 
    city_swap, 
    city_refine,
    sort = TRUE
  )

Then we can join the refined values back to the database.

{stt} <- {stt} %>% 
  left_join(good_refine, by = names(.)) %>% 
  mutate(city_refine = coalesce(city_refine, city_swap))

Progress

Our goal for normalization was to increase the proportion of city values known to be valid and reduce the total distinct values by correcting misspellings.

many_city <- c(valid_city, extra_city)
progress <- progress_table(
  str_to_upper({stt}$city),
  {stt}$city_norm,
  {stt}$city_swap,
  {stt}$city_refine,
  compare = many_city
) %>% mutate(stage = as_factor(stage))
progress %>% 
    mutate(across(stage, md_code)) %>% 
    kable(digits = 3)

You can see how the percentage of valid values increased with each stage.

raw_in <- percent(prop_in({stt}$city, valid_city))
progress %>% 
  ggplot(aes(x = stage, y = prop_in)) +
  geom_hline(yintercept = 0.99) +
  geom_col(fill = dark2["purple"]) +
  coord_cartesian(ylim = c(0.75, 1)) +
  scale_y_continuous(labels = percent) +
  labs(
    title = "{State} City Normalization Progress",
    subtitle = glue("Raw at {raw_in} before conversion to uppercase"),
    x = "Stage",
    y = "Percent Valid"
  )

More importantly, the number of distinct values decreased with each stage. We were able to confidently change many distinct invalid values to their valid equivalents.

progress %>% 
  select(
    stage, 
    all = n_distinct,
    bad = n_diff
  ) %>% 
  mutate(good = all - bad) %>% 
  pivot_longer(c("good", "bad")) %>% 
  mutate(name = name == "good") %>% 
  ggplot(aes(x = stage, y = value)) +
  geom_col(aes(fill = name)) +
  scale_fill_brewer(palette = "Dark2", direction = -1) +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "bottom") +
  labs(
    title = "{State} City Normalization Progress",
    subtitle = "Distinct values, valid and invalid",
    x = "Stage",
    y = "Distinct Values",
    fill = "Valid"
  )

Before exporting, we can remove the intermediary normalization columns and rename all added variables with the _clean suffix.

{stt} <- {stt} %>% 
  select(
    -city_norm,
    -city_swap,
    city_clean = city_refine
  ) %>% 
  rename_all(~str_replace(., "_norm", "_clean")) %>% 
  rename_all(~str_remove(., "_raw")) %>% 
  relocate(address_clean, city_clean, state_clean, .before = zip_clean)

Conclude

glimpse(sample_n({stt}, 1000))
  1. There are r comma(nrow({stt})) records in the database.
  2. There are r comma(sum({stt}$dupe_flag)) duplicate records in the database.
  3. The range and distribution of amount and date seem reasonable.
  4. There are r comma(sum({stt}$na_flag)) records missing key variables.
  5. Consistency in geographic data has been improved with campfin::normal_*().
  6. The 4-digit year variable has been created with lubridate::year().

Export

Now the file can be saved on disk for upload to the Accountability server. We will name the object using a date range of the records included.

min_dt <- str_remove_all(min({stt}$date), "-")
max_dt <- str_remove_all(max({stt}$date), "-")
csv_ts <- paste(min_dt, max_dt, sep = "-")
clean_dir <- dir_create(here("state", "{st}", "{type}", "data", "clean"))
clean_csv <- path(clean_dir, glue("{st}_{type}_{csv_ts}.csv"))
clean_rds <- path_ext_set(clean_csv, "rds")
basename(clean_csv)
write_csv({stt}, clean_csv, na = "")
write_rds({stt}, clean_rds, compress = "xz")
(clean_size <- file_size(clean_csv))

Upload

We can use the aws.s3::put_object() function to upload the text file to the IRW server.

aws_key <- path("csv", basename(clean_csv))
if (!object_exists(aws_key, "publicaccountability")) {
  put_object(
    file = clean_csv,
    object = aws_key, 
    bucket = "publicaccountability",
    acl = "public-read",
    show_progress = TRUE,
    multipart = TRUE
  )
}
aws_head <- head_object(aws_key, "publicaccountability")
(aws_size <- as_fs_bytes(attr(aws_head, "content-length")))
unname(aws_size == clean_size)

