In rdinter/usdarnass: USDA NASS Quick Stats API

library("knitr")
opts_chunk$set(
  collapse = TRUE,
  eval = !(Sys.getenv("NASS_KEY") == ""),
  comment = "#>"
)

Introduction

usdarnass provides an alternative way to download various USDA data from https://quickstats.nass.usda.gov/ through R. You must sign up for an API key on that website for this package to work.

The USDA's documentation on Quick Stats can be found throughout https://www.nass.usda.gov/Quick_Stats/index.php. A short description of what the data entail can be summarised from the Quick Stats description on data.gov:

Quick Stats is the National Agricultural Statistics Service's (NASS) online, self-service tool to access complete results from the 1997, 2002, 2007, and 2012 Censuses of Agriculture as well as the best source of NASS survey published estimates. The census collects data on all commodities produced on U.S. farms and ranches, as well as detailed information on expenses, income, and operator characteristics. The surveys that NASS conducts collect information on virtually every facet of U.S. agricultural production.

There are two main USDA sources within Quick Stats: censuses and surveys. The census values in Quick Stats start in 1997 while the survey values can range all the way back to 1850 and then annually since 1866. Although Agricultural Censuses occur once every 5 years (1997, 2002, 2007, 2012, and most recently 2017), USDA will administer other censuses, for instance the Census of Aquaculture, which explains the additional years of available census data. At this time, there is no support for documenting the various kinds of data which can be extracted from Quick Stats and this package only serves as an R interface for downloading Quick Stats data. It is up to the user to understand the source of the data they download, most of which can be found at https://www.nass.usda.gov/Surveys/.

Package Install

usdarnass is currently on CRAN and can be installed with the typical method of using install.packages("usdarnass") in R. However, if you would like the development version then it can be installed with the use of devtools:

# install.packages("devtools")
devtools::install_github("rdinter/usdarnass")

Load package:

library("usdarnass")

Usage

If a query works on the https://quickstats.nass.usda.gov/ interface, then it will work with the usdarnass package. Keep in mind that there is a 50,000 observation limit for both the web interface and data queries with this package.

There are three main functions for this package with the first as the workhorse:

nass_data() returns a data.frame matching the query defined by the arguments set in the function call. This mimics the simple "GET DATA" command on https://quickstats.nass.usda.gov/ and requires an API key. There is a 50,000 record limit for each call.
nass_param() returns all of the possible values for a parameter in a query. It helps you work out how to subset a query that runs into the 50,000 limit.
nass_count() returns the number of records for a query. It is very useful in conjunction with nass_param() for determining which queries can return data with a nass_data() call.

All of these functions require an API Key for each query, which can be set with another function.

Key Install

nass_set_key("YOUR_KEY_IN_QUOTATIONS", overwrite = TRUE)
# First time, reload your environment so you can use the key without restarting R.
readRenviron("~/.Renviron")
# You can check it with:
Sys.getenv("NASS_KEY")

The above script will add a line to your .Renviron file to be re-used whenever you are using the package. If you are not comfortable with the package writing to your home directory, you can add the following line to your .Renviron file manually to produce the same result. The function usethis::edit_r_environ() should open your .Renviron file so you can add the following line (use your actual API key for the right-hand side value):

NASS_KEY = 'YOUR_KEY_IN_QUOTATIONS'

If you are not comfortable with either of these options, then to use the package you need to ensure that the key parameter is set to your API key in each of your function calls that queries USDA Quick Stats.
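The two ways of supplying a key described above (an explicit key argument, or the NASS_KEY environment variable) can be sketched as a small helper. The name resolve_nass_key() is hypothetical and is not part of usdarnass; the package's own handling of the key may differ.

```r
# Hypothetical helper (not part of usdarnass): pick the API key to use.
resolve_nass_key <- function(key = NULL) {
  # An explicitly supplied key takes precedence.
  if (!is.null(key) && nzchar(key)) {
    return(key)
  }
  # Otherwise fall back to the NASS_KEY environment variable.
  env_key <- Sys.getenv("NASS_KEY")
  if (!nzchar(env_key)) {
    stop("No API key found: set NASS_KEY in .Renviron or pass `key` explicitly.")
  }
  env_key
}

# Example with a placeholder value; substitute your real key.
resolve_nass_key("YOUR_KEY_IN_QUOTATIONS")
```

A check like this fails early with a readable message instead of letting a query go out without a key.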

Get Data

As previously stated, the workhorse function is nass_data(), which makes query calls and returns a data.frame as long as the query returns 50,000 or fewer observations. Each argument in the function call corresponds to one of the categories displayed on the Quick Stats web interface. The arguments accept character values, and character vectors when a category needs multiple selections. By way of an example, we can query Quick Stats for the value of agricultural land (and buildings) in Durham and Wake Counties, North Carolina, for 2012 (warning: there is a lot of information):

nc_vals <- nass_data(year = 2012,
                     short_desc = "AG LAND, INCL BUILDINGS - ASSET VALUE, MEASURED IN $",
                     county_name = c("Durham", "WAKE"),
                     state_name = "NORTH CAROLINA")
str(nc_vals)

The output of this query has a lot to digest. The main focus is on the Value variable in the resulting data.frame. Please note that the returned Value for the query is of class character. The parameter numeric_vals can be set to TRUE to return numeric values instead. The default is character because some values are suppressed, and these are coerced to NA when numeric_vals is set to TRUE.
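To see why suppressed entries end up as NA, here is a minimal base R sketch of this kind of coercion. The values are made up for illustration, "(D)" stands in for a suppression code, and this is not necessarily how numeric_vals is implemented inside the package.

```r
# Made-up examples of what a character Value column can look like.
raw_values <- c("1,234,567", "(D)", "890")

# Strip thousands separators, then convert; non-numeric codes become NA.
numeric_values <- suppressWarnings(as.numeric(gsub(",", "", raw_values)))
numeric_values

# Record which entries were codes rather than numbers before they became NA.
was_suppressed <- is.na(numeric_values) & !is.na(raw_values)
was_suppressed
```

Keeping the character column alongside the numeric one preserves the information about why a value is missing.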

Aside from the output, this particular query used four parameters: year, short_desc, county_name, and state_name. Each of these parameters has a particular set of values that can be used in a query, which can be found with the nass_param() function. The year argument is the only one that accepts either a character or numeric value. All other arguments accept only character vectors, which can hold multiple values, as county_name does in the call above. The arguments for usdarnass are not case sensitive. Whether the county is called "Durham", "durham", "DURHAM", or "dUrHaM", the parameter will ultimately be passed to the API the same way.
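As a small illustration of case-insensitive arguments, the spellings above all collapse to one value once they are normalized. Upper-casing is an assumption made for this sketch, chosen because the parameter values shown in this document are upper case; the package may normalize differently.

```r
# The four spellings from the text.
county_spellings <- c("Durham", "durham", "DURHAM", "dUrHaM")

# After upper-casing, only one distinct value remains.
unique(toupper(county_spellings))
```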

When querying the Quick Stats API, you usually do not need to set many parameters to bring a query under the 50,000 limit, but a call that exceeds the limit will produce an error. As of right now, there are 18 parameters for each query, although most of them will be left as NULL and are not required to specify an output.

Parameters

There are 18 parameters which can be included in each query to Quick Stats; however, some of them are concatenations of other parameter values. If a name in quotations follows a particular parameter, that is the drop-down menu in the Quick Stats web interface which supplies its value. Not all parameters have a drop-down menu, though:

source_desc "Program" - Source of data ("CENSUS" or "SURVEY"). Census program includes the Census of Ag as well as follow up projects. Survey program includes national, state, and county surveys.
sector_desc "Sector" - Five high level, broad categories useful to narrow down choices. ("ANIMALS & PRODUCTS", "CROPS", "DEMOGRAPHICS", "ECONOMICS", or "ENVIRONMENTAL")
group_desc "Group" - Subsets within sector (e.g., under sector_desc = "CROPS", the groups are "FIELD CROPS", "FRUIT & TREE NUTS", "HORTICULTURE", and "VEGETABLES").
commodity_desc "Commodity" - The primary subject of interest (e.g., "CORN", "CATTLE", "LABOR", "TRACTORS", "OPERATORS").
short_desc "Data Item" - A concatenation of six columns: commodity_desc, class_desc, prodn_practice_desc, util_practice_desc, statisticcat_desc, and unit_desc.
domain_desc "Domain" - Generally another characteristic of operations that produce a particular commodity (e.g., "ECONOMIC CLASS", "AREA OPERATED", "NAICS CLASSIFICATION", "SALES"). For chemical usage data, the domain describes the type of chemical applied to the commodity. The domain_desc = "TOTAL" will have no further breakouts; i.e., the data value pertains completely to the short_desc.
domaincat_desc "Domain Category" - Categories or partitions within a domain (e.g., under domain_desc = "SALES", domain categories include "$1,000 TO $9,999", "$10,000 TO $19,999", etc.).
agg_level_desc "Geographic Level" - Aggregation level or geographic granularity of the data. ("AGRICULTURAL DISTRICT", "COUNTY", "INTERNATIONAL", "NATIONAL", "REGION : MULTI-STATE", "REGION : SUB-STATE", "STATE", "WATERSHED", or "ZIP CODE")
statisticcat_desc "Category" - The aspect of a commodity being measured (e.g., "AREA HARVESTED", "PRICE RECEIVED", "INVENTORY", "SALES").
state_name "State" - State full name.
asd_desc "Ag District" - Ag statistics district name.
county_name "County" - County name.
region_desc "Region" - NASS defined geographic entities not readily defined by other standard geographic levels. A region can be less than a state (SUB-STATE) or a group of states (MULTI-STATE), and may be specific to a commodity.
zip_5 "Zip Code" - US Postal Service 5-digit zip code.
watershed_desc "Watershed" - Name assigned to the HUC.
year "Year" - The numeric year of the data, given as either a character or numeric vector. Conditional values are also possible: for example, a character vector of ">=1999" or "1999<=" will give years greater than or equal to 1999. Currently the supported comparisons are greater than, less than, and their "or equal to" forms, with the operator placed at either the beginning or the end of the string containing the year.
freq_desc "Period Type" - Length of time covered ("ANNUAL", "SEASON", "MONTHLY", "WEEKLY", "POINT IN TIME"). "MONTHLY" often covers more than one month. "POINT IN TIME" is as of a particular day.
reference_period_desc "Period" - The specific time frame, within a freq_desc.

The descriptions of parameters here are minimal as most of the actual parameters have a vast amount of available options. This is where the nass_param() function comes into play in order to give the full set of values for each parameter.

Get Parameter

The nass_param() function returns a vector of all the possible values for a parameter, conditional on any other parameters supplied in the call. We can see that there are only two sources of datasets for Quick Stats queries by asking for the values of the source_desc parameter:

nass_param("source_desc")

The first argument in nass_param() is the parameter of interest, which can take on any of the 18 values from the parameters section. This argument must be passed in a character format, so make sure to use quotations in your calls. The characters are also not case sensitive in the calls.

This function is most helpful in determining what variables are available for a certain subset. For example, if I was interested in what county level variables in Ohio are available in 2000 I might start by determining what "Group" is available at that level:

nass_param("group_desc",
           state_name = "OHIO",
           agg_level_desc = "COUNTY",
           year = 2000)

Now, if I want to further figure out what commodities are available for the "DAIRY" subset of this data but only after 2000, I would make a call of:

nass_param("commodity_desc",
           group_desc = "dairy",
           state_name = "OHIO",
           agg_level_desc = "COUNTY",
           year = ">2000")

While the year parameter does not need to be a character vector, it does accept relational operators which can modify the subsets even further for queries.
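To make the relational operators concrete, the sketch below translates a year condition into a query parameter name with an operator suffix. The helper year_to_api_param() is hypothetical and is not the package's code. The year__GE style of naming follows the operator suffixes described in the Quick Stats API documentation, and the exact mapping here should be treated as an assumption.

```r
# Hypothetical helper (not the package's code): translate a year condition
# such as ">=1999" or "1999<=" into a Quick Stats style query parameter.
year_to_api_param <- function(x) {
  x      <- gsub("\\s", "", as.character(x))
  digits <- gsub("[^0-9]", "", x)   # the year itself
  op     <- gsub("[0-9]", "", x)    # the operator, if any

  # A plain year needs no operator suffix.
  if (op == "") {
    return(c(year = digits))
  }

  # "1999<=" reads as 1999 <= year, so a trailing operator is mirrored.
  if (grepl("^[0-9]", x)) {
    op <- chartr("<>", "><", op)
  }

  suffix <- c(">=" = "GE", "<=" = "LE", ">" = "GT", "<" = "LT")[[op]]
  stats::setNames(digits, paste0("year__", suffix))
}

year_to_api_param(2000)      # plain year
year_to_api_param(">2000")   # strictly after 2000
year_to_api_param("1999<=")  # 1999 or later
```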

Get Count

Every query with nass_data() is limited to 50,000 observations. To determine the number of observations in a query, use nass_count(), which accepts all the same parameters as nass_data() but returns the number of observations as a numeric value:

nass_count()

Here we see how many observations are currently in Quick Stats as of the date this vignette was built. Clearly, the entire dataset cannot be downloaded with a single query of the API. If you want to download all of the Quick Stats data, it would be more efficient to use their ftp site ftp://ftp.nass.usda.gov/quickstats/.

By way of another example, we can look at how many observations are available related to agricultural land at the county level:

nass_count(commodity_desc = "AG LAND",
           agg_level_desc = "COUNTY")

This particular query cannot be run with nass_data() because the number of observations greatly exceeds 50,000, and indeed the query returns an error:

nass_data(commodity_desc = "AG LAND",
          agg_level_desc = "COUNTY")
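One way to avoid hitting that error is to check the count first and only download when it fits. The wrapper below is a hypothetical sketch, not a usdarnass function. The counting and downloading functions are passed in as arguments so the example runs on its own; in practice they would be nass_count() and nass_data().

```r
# Hypothetical wrapper: only download when the record count fits the limit.
# `count_fun` and `data_fun` are passed in so the sketch stays self-contained;
# in practice they would be usdarnass::nass_count and usdarnass::nass_data.
fetch_if_within_limit <- function(count_fun, data_fun, ..., limit = 50000) {
  n <- as.numeric(count_fun(...))
  if (n > limit) {
    warning("Query has ", format(n, big.mark = ","),
            " records (limit ", format(limit, big.mark = ","),
            "); subset it further.")
    return(NULL)
  }
  data_fun(...)
}

# Intended usage (requires the package and an API key, so left commented out):
# fetch_if_within_limit(usdarnass::nass_count, usdarnass::nass_data,
#                       commodity_desc = "AG LAND",
#                       agg_level_desc = "COUNTY")
```

Returning NULL with a warning, rather than stopping, lets a loop over many queries continue past the ones that are too large.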

At this point, a bit of understanding of the data and of the user's goals is needed. If only one state is of interest for the study, then subsetting the data further to that state is likely the best strategy. However, it is more likely that the user wants all the county-level data related to agricultural land. My strategy would be to look at the number of observations for each year of interest to my data request. I would take a vector of potential years and iterate a call to nass_count() over it to determine the count of values for each year:

years        <- 2000:2017
names(years) <- 2000:2017
sapply(years,
       function(x) nass_count(year = x,
                              commodity_desc = "AG LAND",
                              agg_level_desc = "COUNTY"), USE.NAMES = TRUE)

The observations here relate only to 2002, 2007, and 2012, which are agricultural census years, and it is highly likely that the category contains many variables that the user does not need. It is then best to look at the descriptions of the variables to figure out which data would be most useful:

agland_params <- nass_param("short_desc",
                            commodity_desc = "AG LAND",
                            agg_level_desc = "COUNTY",
                            year = c(2002, 2007, 2012))

length(agland_params)

sapply(agland_params[1:6],
       function(x) nass_count(short_desc = x,
                              commodity_desc = "AG LAND",
                              agg_level_desc = "COUNTY",
                              year = c(2002, 2007, 2012)))

There are many parameter values in that query, but only the first 6 are displayed to save space. While most of these data items fit within the 50,000 limit, not all do. For example, the "AG LAND - TREATED, MEASURED IN ACRES" data item exceeds the limit and cannot be downloaded in a single query. This is because the treated category actually has multiple domains, which can be seen by combining nass_param() and nass_count():

agland_domain <- nass_param("domain_desc",
                            short_desc = "AG LAND - TREATED, MEASURED IN ACRES",
                            commodity_desc = "AG LAND",
                            agg_level_desc = "COUNTY")
sapply(agland_domain,
       function(x) nass_count(domain_desc = x,
                              short_desc = "AG LAND - TREATED, MEASURED IN ACRES",
                              commodity_desc = "AG LAND",
                              agg_level_desc = "COUNTY"))

To download the above data, one would need to iterate over each of the domain_desc values in order to fit within the limits. The lapply method is likely the easiest way to accomplish this:

agland_domain_download <- lapply(agland_domain,
                                 function(x) nass_data(domain_desc = x,
                                                       short_desc = "AG LAND - TREATED, MEASURED IN ACRES",
                                                       commodity_desc = "AG LAND",
                                                       agg_level_desc = "COUNTY"))
agland_domain_data <- do.call("rbind", agland_domain_download)
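If one of the iterated calls fails, the whole lapply() above stops. A more forgiving variant is sketched below: it skips pieces that raise an error and stacks the rest. The helper download_in_chunks() is hypothetical and not part of usdarnass. The fetching function is passed in as an argument so the example is self-contained.

```r
# Hypothetical helper: fetch one piece per value, skip pieces that fail,
# and stack whatever came back into a single data.frame.
download_in_chunks <- function(values, fetch_fun) {
  pieces <- lapply(values, function(v) {
    tryCatch(fetch_fun(v), error = function(e) {
      message("Skipping '", v, "': ", conditionMessage(e))
      NULL
    })
  })
  # Drop the failed pieces before binding the rows together.
  pieces <- Filter(Negate(is.null), pieces)
  do.call("rbind", pieces)
}

# Intended usage (requires the package and an API key, so left commented out):
# agland_domain_data <- download_in_chunks(agland_domain, function(x) {
#   usdarnass::nass_data(domain_desc = x,
#                        short_desc = "AG LAND - TREATED, MEASURED IN ACRES",
#                        commodity_desc = "AG LAND",
#                        agg_level_desc = "COUNTY")
# })
```

The messages show which domain values were skipped, so those can be subset further and retried.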