The migbirdHIP
package was created by the U.S. Fish and Wildlife Service (USFWS) to process, clean, and visualize Harvest Information Program (HIP) registration data.
Every year, migratory game bird hunters must register for HIP in the state(s) they plan to hunt by providing their name, address, date of birth, and hunting success from the previous year. All 49 contiguous states are responsible for collecting this information and submitting hunter registrations to the USFWS. Raw data submitted by the states are processed in this package.
The USFWS has been using HIP registrations since 1999 to select hunters for the Migratory Bird Harvest Survey (Diary Survey) and the Parts Collection Survey (Wing Survey). Only a small proportion of HIP registrants are chosen to complete these surveys each year. The USFWS ultimately uses these data to generate annual harvest estimates. To read more about HIP and the surveys, visit: https://www.fws.gov/harvestsurvey/ and https://www.fws.gov/library/collections/migratory-bird-hunting-activity-and-harvest-reports
The package can be installed from the USFWS GitHub repository using:
devtools::install_github( "USFWS/migbirdHIP", quiet = T, upgrade = F, build_vignettes = T )
To install a past release, use the example below and substitute the appropriate version number. Releases are documented in the README.
devtools::install_github("USFWS/migbirdHIP@v1.2.8", build_vignettes = T)
Load migbirdHIP
after installation. The package will return a startup message with the version you have installed and what season of HIP data the package version was intended for.
library(migbirdHIP)
The flowchart below is a visual guide to the order in which functions are used. Some functions are only used situationally and some issues with the data cannot be solved using a function at all. The general process of handling HIP data is demonstrated in the flowchart; every exported function in the migbirdHIP
package is included in it and the vignette.
knitr::include_graphics( paste0(here::here(), "/man/figures/migbirdHIP_flowchart.svg"))
The fileRename()
function renames non-standard file names to the standard format. We expect a file name to be a 2-letter capitalized state abbreviation followed by YYYYMMDD (indicating the date data were submitted).
Some states submit files using a 5-digit file name format, containing 2 letters for the state abbreviation followed by a 3-digit Julian date. To convert these 5-digit filenames to the standard format (a requirement to read data properly with read_hip()), supply fileRename()
with the directory containing HIP data. File names will be automatically overwritten with the YYYYMMDD format date corresponding to the submitted Julian date.
This function also converts lowercase state abbreviations to uppercase.
The current hunting season year must be supplied to the year
parameter to accurately convert dates.
fileRename(path = "C:/HIP/raw_data/DL0901", year = 2024)
Check if any files in the input folder have already been written to the processed folder using fileCheck()
.
fileCheck( raw_path = "C:/HIP/raw_data/DL0901/", processed_path = "C:/HIP/corrected_data/" )
Read HIP data from fixed-width .txt
files using read_hip()
. Files must adhere to a 10-character naming convention to successfully be read in (2-letter capitalized state abbreviation followed by YYYYMMDD
); if files were submitted with a 5-digit format or lowercase state abbreviation, run fileRename first.
Note: In the read_hip()
example below, we use an internal directory to read in fake HIP data files, which allows the function to demonstrate its errors and messages. (The testing data were generated with data-raw/create_fake_HIP_data.R
and are stored under inst/extdata/
.) All other examples in this vignette will use C:/
drive examples for clarity.
raw_data <- read_hip(paste0(here::here(), "/inst/extdata/DL0901/"))
read_hip()
function allows data to be read in for:state = NA
, the default)state = "DE"
)season = FALSE
, the default; path
must be to the download's subdirectory)season = TRUE
, and path
is to the directory containing all download subdirectories)Use unique = TRUE
to read in a frame without exact duplicates, or unique = FALSE
to read in all registrations including exact duplicates.
Data are read in an expected format beginning with the title
field and ending with email
field. Additional fields are created by read_hip()
: dl_state
, dl_date
, source_file
, dl_cycle
, dl_key
, record_key
.
read_hip()
function returns a message if:dl_state
not found in the list of 49 continental US statesMMDDYYYY
or DDMMYYYY
format suspected in dl_date
NA
values detected in one or more ID fields (firstname
, lastname
, state
, birth_date
) for >10% of a file and/or >100 registrations0
in every bag fieldNA
in every bag fieldNA
in dl_state
NA
in dl_date
hunt_mig_birds
does not equal 2
when non-permit bags are 0
and one of band_tailed_pigeon
, brant
, and/or seaduck
is 2
.As an example, we will use the default read_hip()
settings to read in all of the states from download 0901, using fake HIP data files located in the extdata/DL0901/
directory of the R package.
During pre-processing, R may throw an error that says something like "invalid UTF-8 byte sequence detected". The error usually includes a field name but no other helpful information. The glyphCheck()
function identifies values containing non-UTF-8 glyphs/characters and prints them with the source file in the console so they can be edited.
glyphCheck(raw_data)
Find and print any rows that have a line shift error with shiftCheck()
.
shiftCheck(raw_data)
The clean()
function performs data cleaning and filters out bad registrations.
clean_data <- clean(raw_data)
NA
or 0
for every bag fieldfirstname
, lastname
, state
, or birth_date
is NA
address
AND email
are NA
, city
zip
AND email
are NA
firstname
lastname
suffix
zip
hunt_mig_birds
HuntY
is 0
and if band_tailed_pigeon
, brant
, or seaduck
is 2
, change hunt_mig_birds
field from 0
to 2
band_tailed_pigeon
2
for crane, change the 2
to a 0
cranes
2
for band_tailed_pigeon, change the 2
to a 0
In addition to the changes listed above, the internal zipCheck()
function returns a message from clean()
if zip codes are detected that do not correspond to provided state of residence for >10% of a file and/or >100 registrations.
The issueCheck()
function assesses the validity of the issue_date
field based on the current hunting season's HIP issuance start and end dates (not season open and close dates) for every state except Mississippi. A plot is automatically returned for past and future registrations; to skip plotting, specify plot = F
.
current_data <- issueCheck(clean_data, year = 2024, plot = F)
current_data <- clean_data
issue_date
falls between the start and end dates of HIP issuance for its state; if not already, change registration_yr
to current season.issue_date
is before the start date of HIP issuance for its state; these records are filtered out.issue_date
after the last day of hunting for its state; the registration_yr
is changed to registration_yr + 1
.issue_date
: a record's issue_date
cannot be evaluated, likely because it's formatted incorrectly or equals "00/00/0000"
; these records are filtered out.issue_date
is after the file was submitted.The duplicateFinder()
function finds hunters that have more than one registration. Registrations are grouped by firstname
, lastname
, state
, birth_date
, dl_state
, and registration_yr
to identify unique hunters. If the same hunter has 2 or more registrations, the fields that are not identical are counted and summarized.
Plot the duplicates with duplicatePlot.
duplicateFinder(current_data)
We sometimes receive multiple HIP registrations per person which must be resolved by duplicateFix()
. Duplicates are identified when more than one registration has the same firstname
, lastname
, state
, birth_date
, dl_state
, and registration_yr
. Only 1 HIP registration per hunter can be kept. For in-line permit states (r unique(migbirdHIP:::REF_PMT_INLINE$dl_state)
), permit records are submitted separately from HIP registrations. Multiple permits are allowed. We differentiate HIP
and PMT
records from in-line permit states, we check the values in the non-permit species fields (ducks_bag
, geese_bag
, dove_bag
, woodcock_bag
, coots_snipe
, rails_gallinules
). HIP registrations contain non-zero values in those columns, but permit records always have 0
values.
deduplicated_data <- duplicateFix(current_data)
To decide which HIP registration to keep from a group, we follow a series of logic.
These states include: r paste0(migbirdHIP:::REF_STATES_SD_BR, collapse = ", ")
These states include: r paste0(migbirdHIP:::REF_ABBR_49_STATES[!migbirdHIP:::REF_ABBR_49_STATES %in% c(migbirdHIP:::REF_STATES_SD_BR, migbirdHIP:::REF_STATES_SD_ONLY)], collapse = ", ")
A new field called record_type
is added to the data after the above deduplicating process. Every HIP registration is labeled HIP
. In-line permit states WA and OR send HIP and permit records separately, which are labeled HIP
and PMT
respectively.
Running bagCheck()
ensures species "bag" values are in order. This function searches for values in species group columns that are not typical or expected by the US Fish and Wildlife Service. If a value outside of the normal range is detected, an output tibble is created. Each row in the output contains the state, species, unusual stratum value, and a list of the normal values we would expect.
If a value for a species group is given in the HIP data that doesn't match anything in our records, the species reported in the output will have NA
values in the normal_strata
column. These species are not hunted in the reported states.
bagCheck(deduplicated_data)
After data are cleaned and checked in the steps above, we run proof()
to check for errors. Note that no actual corrections or data changes take place as a result of this function. The year of the Harvest Information Program must be supplied to the year
parameter, which aids in checking registration_yr
, issue_date
and birth_date
.
Values that are considered irregular are flagged in a new field called errors
. For each existing field (r paste(migbirdHIP:::REF_FIELDS_ALL, collapse = ", ")
), values are compared to standard expected formats. If they do not conform, the field name is pasted as a string in the errors
column. Each row can have anywhere from zero errors (NA
) to all field names listed. Multiple errors per registration are hyphen delimited. A typical errors
value would look like: "suffix-address-zip"
, "zip"
, or "middle-email"
.
proofed_data <- proof(deduplicated_data, year = 2024)
errors
field and why:"test_record"
firstname
and lastname
are "TEST"
lastname
is "INAUDIBLE"
firstname
is one of: "INAUDIBLE", "BLANK", "USER", "TEST", "RESIDENT"
"title"
if title
is not one of: NA
, 0
, 1
or 2
."firstname"
firstname
contains anything other than letters, apostrophe(s), space(s), and/or hyphen(s)firstname
contains less than 2 lettersfirstname
contains "AKA"
"middle"
if middle
is not exactly 1 letter or NA
"lastname"
lastname
contains anything except letters, apostrophe(s), space(s), hyphen(s), and/or period(s)lastname
contains less than 2 letters"suffix"
suffix
should be one of: I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XIX, XX, 1ST, 2ND, 3RD, 4TH, 5TH, 6TH, 7TH, 8TH, 9TH, 10TH, 11TH, 12TH, 13TH, 14TH, 15TH, 16TH, 17TH, 18TH, 19TH, 20TH, JR, SR
.XVIII
is excluded (exceeds 4 character limit)"address"
if address
contains a |
, tab, or non-UTF8 character"city"
city
contains anything other than letters, space(s), hyphen(s), apostrophe(s), and/or period(s)city
contains less than 3 letters"state"
if state
is not contained in the following list of abbreviations for US states and territories and Canadian provinces and territories:r paste0(c(migbirdHIP:::REF_ABBR_49_STATES, migbirdHIP:::REF_ABBR_USA, migbirdHIP:::REF_ABBR_CANADA), collapse = ", ")
"zip"
address
doesn't have a zip
that should be in their reported address state
of residence"birth_date"
if the year in birth_date
is more than 100
or less than 0
years ago"hunt_mig_birds"
if not equal to 1
or 2
"registration_yr"
if not equal to the HIP data collection year"email"
email
does not match universally accepted email regex (see REGEX_EMAIL
)email
is obfuscativenone@gmail.com, nottoday@gmail.co``m, fake.fake@gmail.com``, n@a.com, x@y.com, brian@na.com
, etc (see REGEX_EMAIL_OBFUSCATIVE_LOCALPART
, REGEX_EMAIL_OBFUSCATIVE_DOMAIN
, and REGEX_EMAIL_OBFUSCATIVE_ADDRESS
)aaa@a.com
(see REGEX_EMAIL_REPEATED_CHAR
)tpwd.texas.gov
or some variation (see REGEX_EMAIL_OBFUSCATIVE_TPWD
)walmart.com
domain is preceded by only numbers (see REGEX_EMAIL_OBFUSCATIVE_WALMART
)email
is longer than 100 charactersgmail.net
or hotmail.gov
).comcom
, .ccom
, etc.)gmailcom
).Data can be corrected by running the correct()
function with year of the Harvest Information Program supplied to the year
parameter.
corrected_data <- correct(proofed_data, year = 2024)
title
is changed to NA
if "title"
is in the errors
field
middle
is changed to NA
if "middle"
is in the errors
field
suffix
is changed to NA
if "suffix"
is in the errors
field
email
@gmail
would become @gmail.com
), for domains including:gmail, yahoo, hotmail, aol, icloud, comcast, outlook, sbcglobal, att, msn, live, bellsouth, charter, ymail, me, verizon, cox, earthlink, protonmail, pm, mail, duck, ducks
com, net, edu, gov, org
navymil, mailmil, armymil; add multiple periods if missing to usnavymil, usafmil, usarmymil, usacearmymil
errors
is updated by re-running proof()
inside of correct()
Plot duplicates with duplicatePlot()
.
duplicatePlot(current_data)
The errorPlotFields()
function can be run on all states...
errorPlotFields(proofed_data, loc = "all", year = 2024)
... or it can be limited to just one.
errorPlotFields(proofed_data, loc = "LA", year = 2024)
It is possible to add any ggplot2
components to these plots. A plot can be altered with facet_wrap
using either dl_cycle
or dl_date
. The example below demonstrates how this package's functions can interact with tidyverse
and shows an example of an errorPlotFields
with facet_wrap
(using a subset of 4 download cycles).
errorPlotFields( hipdata2024 |> filter(str_detect(dl_cycle, "0800|0901|0902|1001")), year = 2024) + theme( axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1), legend.position = "bottom") + facet_wrap(~dl_cycle, ncol = 2)
The errorPlotStates()
function plots error proportions per state. A threshold value must be set to only view states above a certain proportion of error (in the example below, threshold = 0.05
indicates an error tolerance of 5%). Bar labels are error counts.
errorPlotStates(proofed_data, threshold = 0.05)
This function should not be used unless you want to visualize an entire season of data. The errorPlotDL()
function plots proportion of error per download cycle over the course of the hunting season. Location may be specified with the loc
parameter to see a particular state over time.
errorPlotDL(hipdata2024, loc = "MI")
knitr::include_graphics(paste0(here::here(), "/man/figures/errorPlot_dl.png"))
The pullErrors()
function can be used to view all of the actual values that were flagged as errors in a particular field. In this example, we find that the suffix
field contains several values that are not accepted.
pullErrors(proofed_data, field = "suffix")
Running pullErrors()
on a field that has no errors will return a message.
pullErrors(proofed_data, field = "dove_bag")
The errorTable()
function returns error data as a tibble, which can be assessed as needed or exported to create records of download cycle errors. The basic function reports errors by both location and field.
errorTable(proofed_data)
Errors can be reported by only location by turning off the field
parameter.
errorTable(proofed_data, field = "none")
Errors can be reported by only field by turning off the loc
parameter.
errorTable(proofed_data, loc = "none")
Location can be specified using one of the 49 contiguous state abbreviations.
errorTable(proofed_data, loc = "CA")
Field can be specified (one of: r paste(c("all", "none", migbirdHIP:::REF_FIELDS_ALL), collapse = ", ")
).
errorTable(proofed_data, field = "suffix")
Total errors for a location can be pulled.
errorTable(proofed_data, loc = "CA", field = "none")
Total errors for a field in a particular location can be pulled.
errorTable(proofed_data, loc = "CA", field = "dove_bag")
After the data have been corrected, the data are ready to be written out. Use write_hip()
to do final processing to the data, which includes 1) adding in FWS strata and 2) setting NA
values to blank strings. If split = FALSE
, the final table will be saved as a single .csv
to your specified path. If split = TRUE
(default), one .csv
file per each input .txt
source file will be written to the specified directory.
write_hip(corrected_data, path = "C:/HIP/processed_data/")
The writeReport()
function can be used to automatically generate an R markdown document with figures, tables, and summary statistics. This can be done at the end of a download cycle.
writeReport( raw_path = "C:/HIP/DL0901/", temp_path = "C:/HIP/corrected_data", year = 2024, dl = "0901", dir = "C:/HIP/dl_reports", file = "DL0901_report")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.