knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(knitr)
library(dplyr) library(readr) library(stringr)
This vignette contains an example of the workflow used by the wranglers on The Accountability Project, a tool created by The Investigative Reporting Workshop in Washington, DC. The Accountability Project curates, cleans, and indexes public data to give journalists, researchers, and the public a simple way to search across otherwise siloed records. The data focuses on people, organizations and locations.
Since state-level campaign finance data is typically reported by the campaigns and provided to them by the contributors themselves, there is often great disparity in data quality. The campfin package was created to reduce this disparity in a consistent, confident, and programmatic way.
In this vignette, we will clean some exaggerated fake messy contribution data
from a fictitious campaign in Vermont. This data is found in the
vt_contribs.csv file included with our package.
ex_file <- system.file("extdata", "vt_contribs.csv", package = "campfin")
kable(read_csv(ex_file, col_types = cols(.default = col_character())))
What are some of the potential problems we can see in this data?
datecolumn is not parsed as an R date, making it impossible to perform mathematical calculations like
amountvalue and another that's zero. Not necessarily uncommon, but worth noting.
citywe see many of the same problems, plus:
statewe see a mix of full and abbreviated state names.
While this data is obviously much smaller and more full of errors than real campaign finance data, these errors are not uncommon and need to be addressed. The campfin package contains many of the tools we need to first find and then fix these common problems.
In most cases, the first step is to download and read the file from a state
agency. When reading the data with the popular
col_date_mdy() function can be used as a quick shortcut for
readr::col_date(format = "%m/%d/%Y"), the format most commonly found in U.S.
campaign finance data.
vt <- read_csv( file = ex_file, trim_ws = FALSE, na = c("", "NA", "N/A"), col_types = cols( amount = col_number(), date = col_date_mdy() ) )
We can see how the new
date column is an actual date object, allowing for
Next, we should try to normalize our data as much as possible. We can use some simple counting functions and built in vectors to check the cleanliness of our raw data.
prop_in(vt$city, str_to_lower(valid_city)) prop_in(vt$state, valid_state) prop_in(vt$zip, valid_zip)
col_stats(vt, n_distinct) col_stats(vt, count_na)
A typical benchmark is to reach greater than 95% valid. That is, we want to normalize our data enough that less than 5% of our data can not be easily confirmed as valid using a fairly comprehensive list of cities, states, and ZIP codes.
We will first try to reach this threshold by normalizing our data. This process
involves reducing inconsistencies through string manipulation. There are
normal_*() functions for each of the 4 types of geographic variables.
Typically we use
dplyr::mutate() to create new, normalized versions of our
messy columns, preserving the old data for transparency. Here, we will just
overwrite our example data for simplicity.
vt <- vt %>% mutate( address = normal_address( address = address, abbs = usps_street, na = invalid_city, na_rep = TRUE ), city = normal_city( city = city, abbs = usps_city, states = "VT", na = invalid_city ), state = normal_state( state = state, abbreviate = TRUE, na_rep = TRUE, valid = valid_state ), zip = normal_zip( zip = zip, na_rep = TRUE ) )
We can see how these functions and our built in data was used to normalize the geographic contributor data and remove anything that didn't present real information. This format is much more easily explored and search.
vt %>% select(address, city, state, zip)
However, the problem has not been solved. City names are the most troublesome; There are so many city names and such great variety (compared to states and ZIP codes), that it can be difficult to normalize and difficult to assess.
valid_city vector contains many city names, but far less than exist in the
country, especially when you account for neighborhoods that aren't really
cities, but shouldn't be changed (some of these are contained in our curated
length(valid_city) sample(valid_city, 6) sample(extra_city, 6) # combine both vectors many_city <- c(valid_city, extra_city)
Still, checking against this list is a good way to check for values that need additional attention.
(bad <- vt %>% select(1, 7:9) %>% filter(!is.na(city)) %>% mutate(valid = city %in% many_city) %>% filter(!valid))
It might not be clear what's actually wrong with these values. A common way to check is by comparing them against expected city for a given ZIP code.
bad <- left_join( x = bad, y = zipcodes, by = c("zip", "state"), suffix = c("_raw", "_match") )
Now the problems become clear. Two city names are misspelled and the third is an obvious abbreviation. When dealing with millions of city names, we need a way to check each raw value against it's expected ZIP code match.
is_abbrev() functions can be used to compared the value
we have with the value we expect. By only checking against the corresponding
city to that record's ZIP code, we are making extremely confident changes
(compared to the incredibly useful clustering algorithms like those provided by
First, we can use
str_dist() to check the distance between the two strings;
distance is defined as the number of changes we'd need to make to our normalized
value to get our expected matched value. If that distance is small (usually 1 or
2), we can pretty confidently use the matched value.
But the string distance does not catch colloquial city abbreviations (e.g., NYC,
BOS, LA, CHI, ABQ, BRNX, DFW, OKC). Many residents get so used to writing their
city's name they use abbreviations and assume them to be universally understood.
is_abbrev() function can be used to check to one string might be an
abbreviation for another. Every abbreviation generated by the
function satisfied the requirements of
is_abbrev(abb = "NYC", full = "New York City") is_abbrev(abb = "DC", full = "Washington")
bad <- bad %>% mutate( match_dist = str_dist(city_raw, city_match), match_abb = is_abbrev(city_raw, city_match) )
Here's what this process would look like when employed on an entire data frame. It's important to ensure that the number of rows in our campaign finance data is kept consistent throughout the wrangling process and that original columns are left unchanged.
vt <- vt %>% rename(city_raw = city) %>% # match city by ZIP left_join(zipcodes) %>% rename(city_match = city) %>% mutate( # check against match match_dist = str_dist(city_raw, city_match), match_abb = is_abbrev(city_raw, city_match), city = ifelse(match_abb | match_dist == 1, city_match, city_raw) ) %>% # remove intermediary columns select(-city_raw, -city_match, -match_dist, -match_abb)
zip value is contained in our list of valid
vt %>% select(1, 7:9) %>% filter(!is.na(city)) %>% mutate( all_valid = all( city %in% valid_city, state %in% valid_state, zip %in% valid_zip ) ) %>% kable()
Once our data is as normal as we can confidently make it, we can begin to
explore. First, we'll explore the data for missing values with
which takes a tidyselect input of columns (or something like
(vt <- flag_na(vt, name))
Next, we'll want to check for duplicate rows using
flag_dupes(), which takes
the same kind of arguments. Here, we can ignore the supposedly unique
variable. It's possible for a person to make the same contribution on the same
date, but we should flag them nonetheless.
(vt <- flag_dupes(vt, -id, .both = TRUE))
This normalized data is now ready to be uploaded to the Accountability Project and searched alongside 1 billion other records! These cleaned names and addresses might bring up search results alongside one of our other sets of public data: campaign expenditures, registered voters, nonprofit organizations, stimulus spending, government contracts, lobbyist registrations, etc.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.