In USFWS/migbirdHarvestData: Process USFWS Migratory Bird Harvest Information Program Data in R

Introduction
Part A: Data Import and Cleaning
- fileRename
- fileCheck
- read_hip
- glyphCheck
- shiftCheck
- clean
- issueCheck
- duplicateFinder
- duplicateFix
- bagCheck
- proof
- correct
Part B: Data Visualization and Tabulation
- Visualization
- Tabulation
  - pullErrors
  - errorTable
Part C: Writing Data and Reports
- write_hip
- writeReport

Introduction {#introduction}

The migbirdHIP package was created by the U.S. Fish and Wildlife Service (USFWS) to process, clean, and visualize Harvest Information Program (HIP) registration data.

Every year, migratory game bird hunters must register for HIP in the state(s) they plan to hunt by providing their name, address, date of birth, and hunting success from the previous year. All 49 continental states are responsible for collecting this information and submitting hunter registrations to the USFWS. Raw data submitted by the states are processed in this package.

The USFWS has been using HIP registrations since 1999 to select hunters for the Migratory Bird Harvest Survey (Diary Survey) and the Parts Collection Survey (Wing Survey). Only a small proportion of HIP registrants are chosen to complete these surveys each year. The USFWS ultimately uses these data to generate annual harvest estimates. To read more about HIP and the surveys, visit: https://www.fws.gov/harvestsurvey/ and https://www.fws.gov/library/collections/migratory-bird-hunting-activity-and-harvest-reports

Installation {#installation}

The package can be installed from the USFWS GitHub repository using:

devtools::install_github(
  "USFWS/migbirdHIP",
  quiet = TRUE,
  upgrade = FALSE,
  build_vignettes = TRUE
)

Releases {#releases}

To install a past release, use the example below and substitute the appropriate version number. Releases are documented in the README.

devtools::install_github("USFWS/migbirdHIP@v1.2.8", build_vignettes = TRUE)

Load {#load}

Load migbirdHIP after installation. The package will return a startup message with the version you have installed and what season of HIP data the package version was intended for.

library(migbirdHIP)

Functions overview {#functions-overview}

The flowchart below is a visual guide to the order in which functions are used. Some functions are only used situationally and some issues with the data cannot be solved using a function at all. The general process of handling HIP data is demonstrated in the flowchart; every exported function in the migbirdHIP package is included in it and the vignette.

knitr::include_graphics(
  paste0(here::here(), "/man/figures/migbirdHIP_flowchart.svg"))

Part A: Data Import and Cleaning {#part-a-data-import-and-cleaning}

fileRename {#filerename}

The fileRename() function renames non-standard file names to the standard format. We expect a file name to be a 2-letter capitalized state abbreviation followed by YYYYMMDD (indicating the date data were submitted).

Some states submit files using a 5-character file name format, containing 2 letters for the state abbreviation followed by a 3-digit Julian date. To convert these 5-character file names to the standard format (a requirement to read data properly with read_hip()), supply fileRename() with the directory containing HIP data. File names will be automatically overwritten with the YYYYMMDD format date corresponding to the submitted Julian date.

This function also converts lowercase state abbreviations to uppercase.

The current hunting season year must be supplied to the year parameter to accurately convert dates.

fileRename(path = "C:/HIP/raw_data/DL0901", year = 2024)
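The Julian date conversion can be sketched in base R. This only illustrates the idea and is not the package's implementation: julian_to_standard() is a hypothetical helper, and it assumes the 3-digit day of year falls in the calendar year that is passed in.

```r
# Hypothetical helper for illustration; fileRename() itself may work differently.
# Assumes the day of year belongs to the calendar year supplied.
julian_to_standard <- function(file_name, year) {
  # First two characters are the state abbreviation; force uppercase
  state <- toupper(substr(file_name, 1, 2))
  # Characters 3-5 are the day of the year (001 = January 1)
  julian_day <- as.integer(substr(file_name, 3, 5))
  file_date <- as.Date(julian_day - 1, origin = paste0(year, "-01-01"))
  paste0(state, format(file_date, "%Y%m%d"), ".txt")
}

julian_to_standard("de245.txt", year = 2024)
# "DE20240901.txt"
```

Day 245 of 2024 is September 1, so the lowercase 5-character name becomes the standard 10-character name.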

fileCheck {#filecheck}

Check if any files in the input folder have already been written to the processed folder using fileCheck().

fileCheck(
  raw_path = "C:/HIP/raw_data/DL0901/",
  processed_path = "C:/HIP/corrected_data/"
)

read_hip {#read_hip}

Read HIP data from fixed-width .txt files using read_hip(). Files must adhere to a 10-character naming convention to be read in successfully (2-letter capitalized state abbreviation followed by YYYYMMDD); if files were submitted with a 5-character format or lowercase state abbreviation, run fileRename() first.
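Before reading, it can help to see which file names do not follow the convention. The sketch below uses a base R regular expression; the file names are made up, and the check only looks at the shape of the name (it does not confirm that the date or state abbreviation is real).

```r
# Example names are invented; only the 2-letter + YYYYMMDD pattern comes from the text
file_names <- c("DE20240901.txt", "de20240901.txt", "DE245.txt", "DE2024091.txt")

# Two capital letters, eight digits, then the .txt extension
is_standard <- grepl("^[A-Z]{2}[0-9]{8}\\.txt$", file_names)

# Names that would need fileRename() or manual attention
file_names[!is_standard]
```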

Note: In the read_hip() example below, we use an internal directory to read in fake HIP data files, which allows the function to demonstrate its errors and messages. (The testing data were generated with data-raw/create_fake_HIP_data.R and are stored under inst/extdata/.) All other examples in this vignette will use C:/ drive examples for clarity.

raw_data <- read_hip(paste0(here::here(), "/inst/extdata/DL0901/"))

The `read_hip()` function allows data to be read in for:

all states (e.g., state = NA, the default)
a specific state (e.g., state = "DE")
a specific download (e.g., season = FALSE, the default; path must be to the download's subdirectory)
an entire season (e.g., season = TRUE, and path is to the directory containing all download subdirectories)

Use unique = TRUE to read in a data frame without exact duplicates, or unique = FALSE to read in all registrations including exact duplicates.

Data are read in an expected format beginning with the title field and ending with the email field. Additional fields are created by read_hip(): dl_state, dl_date, source_file, dl_cycle, dl_key, record_key.

The `read_hip()` function returns a message if:

There is a dl_state not found in the list of 49 continental US states
MMDDYYYY or DDMMYYYY format suspected in dl_date
Blank files are found in the directory
NA values detected in one or more ID fields (firstname, lastname, state, birth_date) for >10% of a file and/or >100 registrations
All emails are missing from a file
Test records are found
Any registration has a 0 in every bag field
Any registration has an NA in every bag field
Any registration contains a bag value that is not a 1-digit number
There is an NA in dl_state
There is an NA in dl_date
Presumed solo permit DNHs are found: an OR or WA registration has a hunt_mig_birds value other than 2 while all non-permit bags are 0 and at least one of band_tailed_pigeon, brant, or seaduck is 2.
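The three bag-field conditions in the list above can be sketched with base R on a toy data frame. This is an illustration under assumptions, not the code read_hip() runs: bag values are assumed to be stored as character strings, and only three of the bag fields are used.

```r
# Toy data; column names follow bag fields named in this vignette
toy <- data.frame(
  ducks_bag = c("1", "0", NA, "12"),
  geese_bag = c("2", "0", NA, "1"),
  dove_bag  = c("0", "0", NA, "3"),
  stringsAsFactors = FALSE
)
bag_cols <- c("ducks_bag", "geese_bag", "dove_bag")

# Registration has a 0 in every bag field
all_zero <- rowSums(toy[bag_cols] == "0", na.rm = TRUE) == length(bag_cols)

# Registration has an NA in every bag field
all_na <- rowSums(is.na(toy[bag_cols])) == length(bag_cols)

# Registration has at least one non-missing bag value that is not a single digit
one_digit <- sapply(toy[bag_cols], function(x) grepl("^[0-9]$", x))
not_one_digit <- rowSums(!is.na(toy[bag_cols]) & !one_digit) > 0
```

In the toy data, the second row is flagged by all_zero, the third by all_na, and the fourth by not_one_digit.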

The example above uses the default read_hip() settings to read in all of the states from download 0901, using fake HIP data files located in the extdata/DL0901/ directory of the R package.

glyphCheck {#glyphcheck}

During pre-processing, R may throw an error that says something like "invalid UTF-8 byte sequence detected". The error usually includes a field name but no other helpful information. The glyphCheck() function identifies values containing non-UTF-8 glyphs/characters and prints them with the source file in the console so they can be edited.

glyphCheck(raw_data)

shiftCheck {#shiftcheck}

Find and print any rows that have a line shift error with shiftCheck().

shiftCheck(raw_data)

clean {#clean}

The clean() function performs data cleaning and filters out bad registrations.

clean_data <- clean(raw_data)

Registrations are dropped if:

Any bag value is not a 1-digit number.
NA or 0 for every bag field
Test record is detected
One of:
- firstname, lastname, state, or birth_date is NA
- address AND email are NA
- city AND/OR zip AND email are NA

Changes include:

firstname
- Change to uppercase
- If a suffix value is detected (e.g. JR, SR, 1ST-20TH and 1-20 in Roman numerals, excluding XVIII) delete it. Delete white space around string.
lastname
- Change to uppercase
- If a suffix value is detected (e.g. JR, SR, 1ST-20TH and 1-20 in Roman numerals, excluding XVIII) delete it. Delete white space around string.
suffix
- Change to uppercase
- If a suffix value is detected in firstname or lastname, replace the suffix field with that value. Values that are searched for include JR, SR, 1ST-20TH and 1-20 in Roman numerals, excluding XVIII. Periods and commas are deleted.
zip
- Remove ending hyphen from zip codes with only 5 digits
- Remove ending 0 from zip codes with 10 digits
- Insert a hyphen in continuous 9 digit zip codes
- Insert a hyphen in 9 digit zip codes with a middle space
- Delete trailing -0000 and -____
hunt_mig_birds
- For Oregon and/or Washington, if HuntY is 0 and if band_tailed_pigeon, brant, or seaduck is 2, change hunt_mig_birds field from 0 to 2
band_tailed_pigeon
- If any permit file states submitted a 2 for band_tailed_pigeon, change the 2 to a 0
cranes
- If any permit file states submitted a 2 for crane, change the 2 to a 0

In addition to the changes listed above, the internal zipCheck() function returns a message from clean() if zip codes are detected that do not correspond to provided state of residence for >10% of a file and/or >100 registrations.
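The zip code rules listed above translate naturally into a few regular expression substitutions. The sketch below is one possible base R rendering; clean_zip() is a hypothetical helper, and the order of the steps is an assumption rather than a description of what clean() does internally.

```r
# Hypothetical helper; clean() may apply these rules differently or in another order
clean_zip <- function(zip) {
  # Remove ending hyphen from zip codes with only 5 digits
  zip <- sub("^([0-9]{5})-$", "\\1", zip)
  # Remove ending 0 from zip codes with 10 digits
  zip <- sub("^([0-9]{9})0$", "\\1", zip)
  # Insert a hyphen in 9 digit zip codes that are continuous or have a middle space
  zip <- sub("^([0-9]{5}) ?([0-9]{4})$", "\\1-\\2", zip)
  # Delete trailing -0000 and -____
  zip <- sub("-(0000|____)$", "", zip)
  zip
}

clean_zip(c("12345-", "1234567890", "123456789", "12345 6789", "12345-0000", "12345-____"))
# "12345" "12345-6789" "12345-6789" "12345-6789" "12345" "12345"
```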

issueCheck {#issuecheck}

The issueCheck() function assesses the validity of the issue_date field based on the current hunting season's HIP issuance start and end dates (not season open and close dates) for every state except Mississippi. A plot is automatically returned for past and future registrations; to skip plotting, specify plot = FALSE.

current_data <- issueCheck(clean_data, year = 2024, plot = FALSE)

Criteria evaluated and changes made:

Current record: a record's issue_date falls between the start and end dates of HIP issuance for its state; if not already, change registration_yr to current season.
Past record: a record's issue_date is before the start date of HIP issuance for its state; these records are filtered out.
Future record: a record has an issue_date after the last day of hunting for its state; the registration_yr is changed to registration_yr + 1.
Bad issue_date: a record's issue_date cannot be evaluated, likely because it's formatted incorrectly or equals "00/00/0000"; these records are filtered out.
Return message if any record's issue_date is after the file was submitted.
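The classification of issue dates can be sketched as a comparison against a start and end date. The helper and the window dates below are invented for illustration; the package uses state-specific reference dates, and issue_date is assumed here to be stored as MM/DD/YYYY text.

```r
# Hypothetical helper; the window dates used below are invented for illustration
classify_issue_date <- function(issue_date, window_start, window_end) {
  # Unparseable values such as "00/00/0000" become NA
  parsed <- as.Date(issue_date, format = "%m/%d/%Y")
  ifelse(
    is.na(parsed), "bad",
    ifelse(parsed < window_start, "past",
           ifelse(parsed > window_end, "future", "current"))
  )
}

classify_issue_date(
  c("09/15/2024", "03/01/2024", "08/01/2025", "00/00/0000"),
  window_start = as.Date("2024-08-01"),
  window_end = as.Date("2025-03-10")
)
# "current" "past" "future" "bad"
```

Under the rules above, the "past" and "bad" records would then be filtered out, and the "future" record would have its registration_yr increased by 1.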

duplicateFinder {#duplicatefinder}

The duplicateFinder() function finds hunters that have more than one registration. Registrations are grouped by firstname, lastname, state, birth_date, dl_state, and registration_yr to identify unique hunters. If the same hunter has 2 or more registrations, the fields that are not identical are counted and summarized.

Plot the duplicates with duplicatePlot().

duplicateFinder(current_data)
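To make the grouping concrete, the sketch below finds hunters with more than one registration in a toy data frame and lists which other fields differ. It uses base R only and is an approximation of the idea, not the code inside duplicateFinder(); the toy values are invented.

```r
# Toy registrations; the first two rows describe the same hunter
toy <- data.frame(
  firstname = c("ANNA", "ANNA", "BEN"),
  lastname = c("LEE", "LEE", "ORR"),
  state = c("DE", "DE", "MD"),
  birth_date = c("01/02/1980", "01/02/1980", "03/04/1975"),
  dl_state = c("DE", "DE", "MD"),
  registration_yr = c("2024", "2024", "2024"),
  email = c("a@example.com", "anna@example.com", "ben@example.com"),
  stringsAsFactors = FALSE
)
id_cols <- c("firstname", "lastname", "state", "birth_date", "dl_state", "registration_yr")

# One key per hunter, built from the identifying fields
hunter_key <- do.call(paste, c(toy[id_cols], sep = "|"))

# Number of registrations sharing each key
n_regs <- ave(seq_along(hunter_key), hunter_key, FUN = length)
duplicates <- toy[n_regs > 1, ]

# Fields outside the key that are not identical within the duplicates
other_cols <- setdiff(names(toy), id_cols)
differing <- other_cols[sapply(duplicates[other_cols], function(x) length(unique(x)) > 1)]
```

With a single duplicated hunter, as here, differing is "email". With several duplicated hunters the comparison would need to be done within each key.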

duplicateFix {#duplicatefix}

We sometimes receive multiple HIP registrations per person, which must be resolved by duplicateFix(). Duplicates are identified when more than one registration has the same firstname, lastname, state, birth_date, dl_state, and registration_yr. Only 1 HIP registration per hunter can be kept. For in-line permit states (OR and WA), permit records are submitted separately from HIP registrations, and multiple permits are allowed. To differentiate HIP and PMT records from in-line permit states, we check the values in the non-permit species fields (ducks_bag, geese_bag, dove_bag, woodcock_bag, coots_snipe, rails_gallinules). HIP registrations contain non-zero values in those columns, but permit records always have 0 values.

deduplicated_data <- duplicateFix(current_data)

To decide which HIP registration to keep from a group, we follow a series of steps.

For sea duck and brant states:

These states include: r paste0(migbirdHIP:::REF_STATES_SD_BR, collapse = ", ")

Keep registration(s) with the most recent issue date.
Exclude registrations with all 1s or all 0s in bag columns from consideration.
Keep any registrations that have a 2 for either brant or sea duck (for seaduck and brant states), or 2 for seaduck (Maine only).
If more than one registration remains, choose to keep one randomly.

For HIP registrations from in-line permit states and all other states:

These states include: r paste0(migbirdHIP:::REF_ABBR_49_STATES[!migbirdHIP:::REF_ABBR_49_STATES %in% c(migbirdHIP:::REF_STATES_SD_BR, migbirdHIP:::REF_STATES_SD_ONLY)], collapse = ", ")

Keep registration(s) with the most recent issue date.
Exclude registrations with all 1s or all 0s in bag columns from consideration, if possible.
If more than one registration remains, choose to keep one randomly.
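The steps for "all other states" can be sketched for a single group of duplicates. resolve_group() is a hypothetical helper written for this illustration, the toy values are invented, and issue_date is assumed to be MM/DD/YYYY text; duplicateFix() may implement the logic differently.

```r
# Hypothetical helper that resolves one group of duplicate registrations
resolve_group <- function(group, bag_cols) {
  # Step 1: keep registration(s) with the most recent issue date
  dates <- as.Date(group$issue_date, format = "%m/%d/%Y")
  keep <- group[dates == max(dates), , drop = FALSE]
  # Step 2: drop rows with all 1s or all 0s in the bag columns, if any others exist
  uniform <- apply(keep[bag_cols], 1, function(x) all(x == "1") || all(x == "0"))
  if (!all(uniform)) keep <- keep[!uniform, , drop = FALSE]
  # Step 3: if more than one registration remains, keep one at random
  keep[sample(nrow(keep), 1), , drop = FALSE]
}

group <- data.frame(
  record_key = c("A", "B", "C"),
  issue_date = c("08/15/2024", "09/01/2024", "09/01/2024"),
  ducks_bag = c("2", "0", "2"),
  geese_bag = c("1", "0", "1"),
  stringsAsFactors = FALSE
)
kept <- resolve_group(group, bag_cols = c("ducks_bag", "geese_bag"))
kept$record_key
# "C"
```

Record A is dropped for its older issue date and record B for having 0 in every bag column, so no random choice is needed in this example.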

A new field called record_type is added to the data after the above deduplicating process. Every HIP registration is labeled HIP. In-line permit states WA and OR send HIP and permit records separately, which are labeled HIP and PMT respectively.

bagCheck {#bagcheck}

Running bagCheck() ensures species "bag" values are in order. This function searches for values in species group columns that are not typical or expected by the US Fish and Wildlife Service. If a value outside of the normal range is detected, an output tibble is created. Each row in the output contains the state, species, unusual stratum value, and a list of the normal values we would expect.

If a value for a species group is given in the HIP data that doesn't match anything in our records, the species reported in the output will have NA values in the normal_strata column. These species are not hunted in the reported states.

bagCheck(deduplicated_data)

proof {#proof}

After data are cleaned and checked in the steps above, we run proof() to check for errors. Note that no actual corrections or data changes take place as a result of this function. The year of the Harvest Information Program must be supplied to the year parameter, which aids in checking registration_yr, issue_date and birth_date.

Values that are considered irregular are flagged in a new field called errors. For each existing field (r paste(migbirdHIP:::REF_FIELDS_ALL, collapse = ", ")), values are compared to standard expected formats. If they do not conform, the field name is pasted as a string in the errors column. Each row can have anywhere from zero errors (NA) to all field names listed. Multiple errors per registration are hyphen delimited. A typical errors value would look like: "suffix-address-zip", "zip", or "middle-email".

proofed_data <- proof(deduplicated_data, year = 2024)
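The way field names are combined into the errors value can be sketched with two fields. The rules below follow this vignette's descriptions of title and middle; the code itself is an assumption-laden illustration in base R, not the implementation of proof().

```r
# Toy registrations with some irregular values
toy <- data.frame(
  title = c("1", "7", NA),
  middle = c("A", "AB", "3"),
  stringsAsFactors = FALSE
)

# One validity rule per field (TRUE = value is acceptable)
rules <- list(
  title = function(x) is.na(x) | x %in% c("0", "1", "2"),
  middle = function(x) is.na(x) | grepl("^[A-Za-z]$", x)
)

# Logical matrix: one row per registration, one column per field, TRUE = invalid
invalid <- sapply(names(rules), function(field) !rules[[field]](toy[[field]]))

# Paste the names of invalid fields together, hyphen delimited; NA when there are none
toy$errors <- apply(invalid, 1, function(bad) {
  if (any(bad)) paste(colnames(invalid)[bad], collapse = "-") else NA_character_
})
toy$errors
# NA "title-middle" "middle"
```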

What gets flagged in the `errors` field and why:

"test_record"
- firstname and lastname are "TEST"
- lastname is "INAUDIBLE"
- firstname is one of: "INAUDIBLE", "BLANK", "USER", "TEST", "RESIDENT"
"title" if title is not one of: NA, 0, 1 or 2.
"firstname"
- firstname contains anything other than letters, apostrophe(s), space(s), and/or hyphen(s)
- firstname contains fewer than 2 letters
- firstname contains "AKA"
"middle" if middle is not exactly 1 letter or NA
"lastname"
- lastname contains anything except letters, apostrophe(s), space(s), hyphen(s), and/or period(s)
- lastname contains fewer than 2 letters
"suffix"
- suffix should be one of: I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XIX, XX, 1ST, 2ND, 3RD, 4TH, 5TH, 6TH, 7TH, 8TH, 9TH, 10TH, 11TH, 12TH, 13TH, 14TH, 15TH, 16TH, 17TH, 18TH, 19TH, 20TH, JR, SR.
- Note that XVIII is excluded (exceeds 4 character limit)
"address" if address contains a |, tab, or non-UTF8 character
"city"
- city contains anything other than letters, space(s), hyphen(s), apostrophe(s), and/or period(s)
- city contains fewer than 3 letters
"state" if state is not contained in the following list of abbreviations for US states and territories and Canadian provinces and territories:
- r paste0(c(migbirdHIP:::REF_ABBR_49_STATES, migbirdHIP:::REF_ABBR_USA, migbirdHIP:::REF_ABBR_CANADA), collapse = ", ")
"zip"
- zip does not correspond to the registration's reported state of residence
"birth_date" if the year in birth_date is more than 100 or less than 0 years ago
"hunt_mig_birds" if not equal to 1 or 2
"registration_yr" if not equal to the HIP data collection year
"email"
- email does not match universally accepted email regex (see REGEX_EMAIL)
- email is obfuscative
  - e.g., none@gmail.com, nottoday@gmail.com, fake.fake@gmail.com, n@a.com, x@y.com, brian@na.com, etc (see REGEX_EMAIL_OBFUSCATIVE_LOCALPART, REGEX_EMAIL_OBFUSCATIVE_DOMAIN, and REGEX_EMAIL_OBFUSCATIVE_ADDRESS)
  - A repeated character is detected, e.g. aaa@a.com (see REGEX_EMAIL_REPEATED_CHAR)
  - The domain is tpwd.texas.gov or some variation (see REGEX_EMAIL_OBFUSCATIVE_TPWD)
  - A walmart.com domain is preceded by only numbers (see REGEX_EMAIL_OBFUSCATIVE_WALMART)
- email is longer than 100 characters
- A common domain name (e.g., gmail, yahoo) has a common typo
- A common domain name doesn't have a matching top-level domain (e.g., gmail.net or hotmail.gov)
- The address has a bad top-level domain (e.g., .comcom, .ccom, etc.)
- The email is missing a top-level domain
- The top-level domain period is missing (e.g., gmailcom).

correct {#correct}

Data can be corrected by running the correct() function with the year of the Harvest Information Program supplied to the year parameter.

corrected_data <- correct(proofed_data, year = 2024)

Changes and corrections include:

title is changed to NA if "title" is in the errors field
middle is changed to NA if "middle" is in the errors field
suffix is changed to NA if "suffix" is in the errors field
email
- Add endings to common domains if missing (e.g. @gmail would become @gmail.com), for domains including:
  - gmail, yahoo, hotmail, aol, icloud, comcast, outlook, sbcglobal, att, msn, live, bellsouth, charter, ymail, me, verizon, cox, earthlink, protonmail, pm, mail, duck, ducks
- Add top-level domain period(s) if:
  - Missing before com, net, edu, gov, org
  - Missing in navymil, mailmil, armymil; add multiple periods if missing to usnavymil, usafmil, usarmymil, usacearmymil
errors is updated by re-running proof() inside of correct()
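The two kinds of email repair described above can be sketched with regular expressions. fix_email() is a hypothetical helper that covers only three of the listed domains and none of the military domains; a pattern this simple would also alter an unusual address that happens to end in "com" without a period, so it should be read as an illustration rather than as the rules correct() applies.

```r
# Hypothetical helper; correct() uses its own, more complete rules
fix_email <- function(email) {
  # Add an ending to a common domain if missing, e.g. "@gmail" becomes "@gmail.com"
  email <- sub("@(gmail|yahoo|hotmail)$", "@\\1.com", email)
  # Insert a missing period before the top-level domain, e.g. "gmailcom" becomes "gmail.com"
  email <- sub("([^.@])(com|net|edu|gov|org)$", "\\1.\\2", email)
  email
}

fix_email(c("a@gmail", "b@yahoocom", "c@school.edu", "d@fwsgov"))
# "a@gmail.com" "b@yahoo.com" "c@school.edu" "d@fws.gov"
```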

Part B: Data Visualization and Tabulation {#part-b-data-visualization-and-tabulation}

duplicatePlot {#duplicateplot}

Plot duplicates with duplicatePlot().

duplicatePlot(current_data)

errorPlotFields {#errorplotfields}

The errorPlotFields() function can be run on all states...

errorPlotFields(proofed_data, loc = "all", year = 2024)

... or it can be limited to just one.

errorPlotFields(proofed_data, loc = "LA", year = 2024)

It is possible to add any ggplot2 components to these plots. A plot can be altered with facet_wrap using either dl_cycle or dl_date. The example below demonstrates how this package's functions can interact with tidyverse and shows an example of an errorPlotFields with facet_wrap (using a subset of 4 download cycles).

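library(dplyr)    # filter()
library(stringr)  # str_detect()
library(ggplot2)  # theme(), element_text(), facet_wrap()

# hipdata2024 is assumed to hold proofed data for the whole 2024 season
errorPlotFields(
  hipdata2024 |>
    filter(str_detect(dl_cycle, "0800|0901|0902|1001")),
  year = 2024) +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1),
    legend.position = "bottom") +
  facet_wrap(~dl_cycle, ncol = 2)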

errorPlotStates {#errorplotstates}

The errorPlotStates() function plots error proportions per state. A threshold value must be set to only view states above a certain proportion of error (in the example below, threshold = 0.05 indicates an error tolerance of 5%). Bar labels are error counts.

errorPlotStates(proofed_data, threshold = 0.05)

errorPlotDL {#errorplotdl}

This function should not be used unless you want to visualize an entire season of data. The errorPlotDL() function plots proportion of error per download cycle over the course of the hunting season. Location may be specified with the loc parameter to see a particular state over time.

errorPlotDL(hipdata2024, loc = "MI")

knitr::include_graphics(paste0(here::here(), "/man/figures/errorPlot_dl.png"))

pullErrors {#pullerrors}

The pullErrors() function can be used to view all of the actual values that were flagged as errors in a particular field. In this example, we find that the suffix field contains several values that are not accepted.

pullErrors(proofed_data, field = "suffix")

Running pullErrors() on a field that has no errors will return a message.

pullErrors(proofed_data, field = "dove_bag")

errorTable {#errortable}

The errorTable() function returns error data as a tibble, which can be assessed as needed or exported to create records of download cycle errors. The basic function reports errors by both location and field.

errorTable(proofed_data)

Errors can be reported by only location by turning off the field parameter.

errorTable(proofed_data, field = "none")

Errors can be reported by only field by turning off the loc parameter.

errorTable(proofed_data, loc = "none")

Location can be specified using one of the 49 continental state abbreviations.

errorTable(proofed_data, loc = "CA")

Field can be specified (one of: r paste(c("all", "none", migbirdHIP:::REF_FIELDS_ALL), collapse = ", ")).

errorTable(proofed_data, field = "suffix")

Total errors for a location can be pulled.

errorTable(proofed_data, loc = "CA", field = "none")

Total errors for a field in a particular location can be pulled.

errorTable(proofed_data, loc = "CA", field = "dove_bag")

Part C: Writing Data and Reports {#part-c-writing-data-and-reports}

write_hip {#write_hip}

After the data have been corrected, they are ready to be written out. Use write_hip() to do final processing of the data, which includes 1) adding in FWS strata and 2) setting NA values to blank strings. If split = FALSE, the final table will be saved as a single .csv to your specified path. If split = TRUE (default), one .csv file per input .txt source file will be written to the specified directory.

write_hip(corrected_data, path = "C:/HIP/processed_data/")
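The split behavior and the NA-to-blank step can be sketched with base R. This is a simplified illustration that writes to a temporary directory: it leaves out the FWS strata step, the toy data are invented, and the output file names are an assumption rather than the names write_hip() produces.

```r
# Toy data with two source files and some missing values
toy <- data.frame(
  source_file = c("DE20240901.txt", "DE20240901.txt", "MD20240901.txt"),
  firstname = c("ANNA", "BEN", "CAL"),
  middle = c("J", NA, NA),
  stringsAsFactors = FALSE
)

# Write into a temporary folder so nothing on disk is overwritten
out_dir <- file.path(tempdir(), "hip_sketch")
dir.create(out_dir, showWarnings = FALSE)

# One .csv per input .txt source file; NA values are written as blank strings
for (piece in split(toy, toy$source_file)) {
  out_name <- sub("\\.txt$", ".csv", piece$source_file[1])
  write.csv(piece, file.path(out_dir, out_name), na = "", row.names = FALSE)
}

list.files(out_dir)
```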

writeReport {#writereport}

The writeReport() function can be used to automatically generate an R Markdown document with figures, tables, and summary statistics. This can be done at the end of a download cycle.

writeReport(
  raw_path = "C:/HIP/DL0901/",
  temp_path = "C:/HIP/corrected_data",
  year = 2024,
  dl = "0901",
  dir = "C:/HIP/dl_reports",
  file = "DL0901_report")

USFWS/migbirdHarvestData documentation built on July 16, 2025, 10:10 p.m.
