#! rmarkdown::render('workflow.Rmd') knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
This is a typical workflow for processing election returns:
The sections that follow describe package functionality for each step and offer some guidance.
Column type problems often arise when reading returns from delimited text
(e.g., CSV) files. We also may encounter column names that are duplicated,
missing, or contain special characters. Base R's read.csv
doesn't handle these
situations as safely as fread
from data.table
and read_csv
from readr
.
fread
and read_csv
produce warnings on parser errors that we can resolve by
setting column types explicitly via the colClasses
argument for fread
or
col_types
for read_*
. Other than the two functions' return types,
data.table
and tibble
, the most relevant difference in implementation for us
is that the read_
family infers column types from the first guess_max
rows,
whereas fread
uses a sample.
We'd ideally read returns from a delimited-text format only once, and for any intermediate writes to the disk during processing use a binary format that preserves types, e.g. an R data format. But we haven't always done this, and in the 2016 returns there are a lot of intermediate CSVs.
Read these intermediate CSVs with the read_legacy_csv
function. It promotes
parser warnings to errors, and can read expected columns with the appropriate types
if use_types = TRUE
.
We deal with quite a few Excel files. Some can be handled with the readxl package. It works well when spreadsheets already look like sane tabular data, with a single header row followed by observations whose identifiers are in columns.
This isn't always so. Consider an excerpt:
The top of the sheet has three headers, which give:
1:3
4
5
But there's also an interstitial header spanning A6:I6
("ADA"). This is a
jurisdiction, and describes all the rows below it until the next such header.
(Which happens to be "ADA (Continued)" in A31:I31
). These headers appear about
every 25 rows. Otherwise, A
gives precinct numbers.
At a high level, we want to do the following:
office
, party
and candidate
in B:I
with the votes
in its column;jurisdiction
in A
with all the votes below it, until the
next interstitial header;precinct
in A
with all the votes in its row;The desired result would start like this:
office party candidate jurisdiction precinct votes 1: UNITED STATES PRESIDENT IND Darrell L. Castle ADA 1401 4 2: UNITED STATES PRESIDENT DEM Hillary Rodham Clinton ADA 1401 206 3: UNITED STATES PRESIDENT CON Scott Copeland ADA 1401 0 4: UNITED STATES PRESIDENT IND Rocky De La Fuente ADA 1401 0 5: UNITED STATES PRESIDENT LIB Gary Johnson ADA 1401 35 6: UNITED STATES PRESIDENT IND Evan McMullin ADA 1401 63
The package includes functionality for arriving at such a result with reasonable speed and safely. See the vignette on Excel-file processing for more details.
Each row in released precinct data gives the votes
cast via some mode
(e.g.,
by mail) for a candidate
running for an office
either on a party
line or
as a writein
, in a particular precinct
within an election jurisdiction
.
So, observations should be unique by jurisdiction
, precinct
, candidate
,
office
, party
, writein
and mode
.
We wrangle the raw data into these columns. For most states this requires at
minimum splitting a single raw column between jurisdiction
and precinct
, and
another between office
and district
.
For readable and maintainable code, we recommend using the stringr
package for
this, in combination with either dplyr
or data.table
. There is some
syntactic sugar in the package for text-value wrangling with regular
expressions.
These infix operators are vectorized over a left-hand-side character vector and
apply a pattern from their right-hand side (ignoring case). They correspond with
functions in stringr
:
| function | stringr verb |
|:--------:|:------------ |
| %=%
| detect |
| %-%
| remove_all |
| %e%
| extract |
These functions operate on specified columns of a data.table
:
| function | operation |
|:--------:|:------------------- |
| move
| extract then remove |
We standardize certain values of office
and candidate
. For example, in MEDSL
data, all presidential candidates run for the office
of US President
; to
achieve this consistency we replace alternative references to the presidency.
The standardized values of office
are US President
, US Senate
, US House
,
State Senate
, State House
, and Governor
. Other values are left as they
appear in the raw data. In the future, this might change. See the
normalize_office
function.
library(medslcleaner) d = data.frame( office = 'President and Vice President of the United States', candidate = 'Write-ins') normalize_office(d)
We standardize presidential candidate names. Typically this is a one-liner using
normalize_presidential_candidates
, which searches for known candidate last
names where office
is US President
and assigns the standardized form.
Generally, we want to standardize candidates' names across all the jurisdictions in which they appear on the ballot. In practice, candidate names appear consistently at least within jurisdictions, and usually within states.
d = data.frame( office = c('US President'), candidate = c('TRUMP', 'CLINTON')) normalize_presidential_candidates(d)
We standardize the names of parties where it seems useful. Exactly how could
change in the future, but we use expand_party_abbr()
.
d = data.frame(party = c('DEM', 'REP')) expand_party_abbr(d$party)
The dataverse
variable takes one of these values: president
, senate
,
house
, state
, local
, and all
. These values determine the dataverse into
which we'll release the data.
To assign standardized offices to their dataverses use assign_fixed
. If
office
is US President
, dataverse
will be set to president
, and so forth
for US Senate
, US House
, State Senate
, and State House
.
Then use the dvna
function to show the offices not yet assigned to a
dataverse. The output also gives the number of jurisdictions in which they
appeared on the ballot. Where it's unclear whether an office is state or local,
this can be a useful heuristic.
The ogrep
function is similar but only shows offices that match a
(regular-expression, case-insensitive) pattern
. Develop an ogrep
pattern
that matches the desired offices, then use this pattern with the assign_match
function to assign all rows with matching office
values to a given
dataverse
.
We iterate over these steps until all offices have been assigned to dataverses,
and dvna
no longer produces an error.
First assign standardized offices:
d = data.frame(office = c('US House', 'County Commissioner'), jurisdiction = c('Appleton', 'Racine')) d d = assign_fixed(d) d
dvna
would now show unassigned offices and stop. To assign them to dataverses:
# We test out a pattern that matches some number of the `office` values ogrep(d, 'County') # Then assign matching rows a given `dataverse` value, in this case `local` d = assign_match(d, 'County', 'local') d # `dvna` no longer produces an error because all rows have a `dataverse`; # Leave it in the script here, as confirmation this remains true dvna(d)
Before saving the result, we attach state and county identifiers from package
datasets. In most states, jurisdiction
values are county names, and in some,
municipalities (notable exceptions are Alaska and DC). We map jurisdiction
values to county or county-equivalent FIPS codes using the EAVS dataset
(primarily), the tabular data included with the 2018 Census Partnership
Shapefiles of voting districts, and 2017 Census gazetteers.
The function add_jurisdiction_fips()
will try to match jurisdictions by name
with EAVS data, and if all jurisdictions can be found in the EAVS data, will
merge in the corresponding county FIPS codes. In some states, more work is
required; see for example the scripts for Vermont or Wisconsin.
Given county FIPS codes, we use add_geo_ids
to join other county identifiers
from the county_ids
dataset in the elections
package.
The medslcleaner
package includes a candidates
dataset with various
identifiers for (at this point only) federal candidates, pulled from the FEC and
the @unitedstates project. Linking ballot names to these datasets is already
done, so merging in the identifiers only requires a call to add_candidate_ids
.
If this fails, then the candidates
dataset needs to be rebuilt to include new
candidates; see the data-ext/candidates
directory. In the future this should
be more robust.
We use write_precincts
to write the final result to disk. The important
changes that write_precincts
makes before writing to disk are to keep only the
expected columns and normalize whitespace in character columns. If certain
columns with default values are missing (e.g., stage = "gen"
), they'll be
created.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.