github_vignettes/workflow-overview.md

Workflow Overview

This document outlines the workflow for preparing license data (particularly for dashboards). The analysis is performed using the Southwick Data Server, which keeps all sensitive data confined to a single location. See Data Server Setup & Rules for more information.

R packages

This document outlines an R-based workflow that relies heavily on custom R packages developed by Southwick:

Two additional packages provide functionality that will only be used by specific analysts:

Server Resources

The dashboard analysis files are separated from data files:

| File Path | Function |
| --- | --- |
| E:/SA/Projects/Data-Dashboards/ | Analysis (code, etc.) by state |
| E:/SA/Data-sensitive/Data-Dashboards/ | License data with personally identifiable info |
| E:/SA/Data-production/Data-Dashboards/ | Anonymized license data for producing dashboards |

Software

A number of software applications are available on the Data Server in E:/SA/Program Files/. These don't (currently) have shortcuts or file associations, so you may need to add them manually by locating the relevant .exe file (e.g., right-click > Open with > Browse for program). Notable applications:

Analysis Steps

A suggested data processing sequence is outlined below.

  1. Initialize New State
  2. Load Raw Data into SQLite
  3. Standardize Data
  4. Prep License Type Categories
  5. Initial Validation
  6. Finalize Production Data
  7. Final Validation

After data processing, a dashboard summary data table can be built using provided functions. See License History & Summary Data for more info.

0-Initialize new state

From an R console on the server, run lictemplate::new_project("state-abbreviation", "period") to populate template files and folders for dashboard production. Then open the corresponding ".Rproj" file created and run renv::restore() to set up the packages needed for the template workflow.
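The initialization can be sketched as below; the state abbreviation and period shown are placeholder values, not real project names:

```r
# From an R console on the Data Server (arguments are placeholders)
lictemplate::new_project("XX", "2020-q4")

# Then open the generated .Rproj file and restore the package library
renv::restore()
```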

1-Load Raw data into SQLite

The purpose of this step is to get the raw license data into a format that is easily usable in analysis. SQLite databases are useful because they are easily queried from R. The raw data should be pulled in mostly as-is, to provide an accurate (and complete) representation of the raw data in a database.

Check: At this stage you should also check that none of the columns (variables) specified in the Data Request are missing from the raw data.

Note about row IDs: Adding a unique row ID ensures a means of joining production data back to raw data in the future.
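A minimal sketch of loading one raw file with a row ID added, using the DBI and RSQLite packages; the file path, table name, and column name here are assumptions for illustration:

```r
# Sketch: load a raw CSV into SQLite mostly as-is (paths/names are hypothetical)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "raw-data.sqlite3")

cust <- read.csv("cust.csv", colClasses = "character")  # keep raw values as text
cust$raw_row_id <- seq_len(nrow(cust))  # unique row ID for joining back later

dbWriteTable(con, "cust", cust)
dbDisconnect(con)
```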

Typical Data Tables:

2-Standardize Data

Saving a standardized intermediate database simplifies downstream validation and final preparation. The Database Schema provides details about variables to include and how categorical variables should be coded.

Check: This step naturally involves a certain amount of data validation, and recoding summaries for relevant variables (gender, residency, etc.) should be documented for future reference. Of particular interest are values that get stored as NA (missing) in the standardized tables. These may be junk data (e.g., data entry errors) or stand-ins for missing values (e.g., "U" for unknown gender).

Standardized data rules can be checked with a function: salicprep::data_check_standard()
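A recoding summary might be produced as sketched below with dplyr; the raw column name and codes are assumptions, not the actual schema:

```r
# Sketch: recode a raw gender column and document the mapping
# (column names and codes are hypothetical)
library(dplyr)

cust <- cust %>%
  mutate(sex = case_when(
    sex_raw == "M" ~ 1L,
    sex_raw == "F" ~ 2L,
    TRUE ~ NA_integer_  # "U" (unknown) and junk values become NA
  ))

# cross-tabulate raw vs. standardized values for future reference
count(cust, sex_raw, sex)
```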

3-Prep License Type Categories

The license type table from the state may require some manual editing. We will need to create two variables (type, duration).

Type

The "type" variable provides logic for identifying hunters and anglers based on license purchases:

Duration

The "duration" variable is used to identify multi-year or lifetime licenses. Lifetimes are given the value of 99, multi-years are given values according to their duration (e.g., 3-year = 3), and all others are given the value of 1.
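The duration coding described above can be sketched with dplyr::case_when(); the license description column and patterns are assumptions about the state's license type table:

```r
# Sketch: code the "duration" variable (column/pattern names are hypothetical)
library(dplyr)

lic <- lic %>%
  mutate(duration = case_when(
    grepl("lifetime", description, ignore.case = TRUE) ~ 99L,
    grepl("3.year",   description, ignore.case = TRUE) ~ 3L,
    TRUE ~ 1L  # all other licenses default to 1
  ))
```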

4-Initial Validation

This step is intended to catch any obvious data problems early. Most data issues can be revealed by looking at how counts change year-to-year (the customer counts in particular don’t usually change much). Several checks are useful:
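For example, a year-over-year customer count check might look like the sketch below, assuming a standardized sale table with customer ID and license year columns (names are hypothetical):

```r
# Sketch: compare customer and sale counts across years
library(dplyr)

sale %>%
  group_by(year) %>%
  summarise(customers = n_distinct(cust_id), sales = n()) %>%
  mutate(pct_change = customers / lag(customers) - 1)  # flag large swings
```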

5-Finalize Production Data

The anonymized production data is created at this stage.

Customer Deduplication: No duplicates should be included in the production customer table, and we will want to check the state-supplied customer ID at this stage by using first name, last name, and date of birth (if provided by the agency). See Customer Deduplication for more info.
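One simple deduplication check can be sketched as below, collapsing records that share first name, last name, and date of birth to a single ID; column names are assumptions, and the full procedure is described in Customer Deduplication:

```r
# Sketch: flag duplicate customers by name and date of birth
# (column names are hypothetical)
library(dplyr)

cust <- cust %>%
  group_by(first_name, last_name, dob) %>%
  mutate(cust_id_new = min(cust_id)) %>%  # collapse duplicates to one ID
  ungroup()
```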

Residency Identification: State agencies may or may not provide a transaction-level residency variable. If not, a sequence of steps will need to be performed to identify residency. See Residency Identification for more info.

6-Final Validation

This step involves summarizing the data in various ways to gain confidence in the trends it presents (overall, by demographic, etc.), and potentially identify any problems in the data (which may require discussion with state agency folks to sort out). To some degree it repeats the initial validation, but summaries may look different for production data due to customer deduplication, etc.



southwick-associates/salicprep documentation built on Oct. 6, 2020, 12:03 p.m.