knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(regcensusAPI)

Introduction

RegCensusAPI is an API client that connects to the RegData regulatory restrictions data produced by the Mercatus Center at George Mason University. RegData uses machine learning algorithms to quantify the number of regulatory restrictions in a jurisdiction. Currently, RegData is available for three countries: Australia, Canada, and the United States. In addition, regulatory restrictions data are available for jurisdictions within these countries (provinces in Canada and states in Australia and the US). You can find out more about RegData at http://www.quantgov.org.

This R API client connects to the API located at the QuantGov website. More advanced users who want to interact with the API directly can use the link above to pull data from the RegData API. Python users can access the same features provided in this package through the Python package regcensus.

Structure of the API

The API organizes data around topics, which are then divided into series. Within each series are values, which are the ultimate elements of interest. Values are available by three subgroups: agency, industry, and occupation. Presently, no series has an occupation subgroup; the subgroup is reserved for future use. Topics broadly define the data available. For example, RegData for regulatory restrictions falls under the broad topic "Regulatory Restrictions." Within the Regulatory Restrictions topic, a number of series are available, including Total Restrictions, Total Wordcount, and Total "Shall."
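The topic, series, and values hierarchy described above can be pictured with a few small data frames. This is only a sketch: the column names, IDs, and numbers are invented for illustration and are not the shapes returned by the API.

```r
# Hypothetical in-memory picture of the topic -> series -> values hierarchy.
# All IDs, column names, and numbers below are made up for illustration.
topics <- data.frame(
  topic_id   = 1,
  topic_name = "Regulatory Restrictions",
  stringsAsFactors = FALSE
)

series <- data.frame(
  series_id   = c(1, 2),
  topic_id    = c(1, 1),
  series_name = c("Total Restrictions", "Total Wordcount"),
  stringsAsFactors = FALSE
)

values <- data.frame(
  series_id = c(1, 1, 2, 2),
  year      = c(2010, 2011, 2010, 2011),
  value     = c(100, 110, 5000, 5200)  # placeholder numbers, not RegData output
)

# Walk down the hierarchy: pick a topic, find its series, then their values.
topic_series <- series[series$topic_id == 1, ]
topic_values <- merge(topic_series, values, by = "series_id")
print(topic_values[, c("series_name", "year", "value")])
```

Each value row belongs to exactly one series, and each series to exactly one topic, which is why a series id is enough to locate a value in the later examples.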

A fundamental concept in RegData is the document. In RegData, a set of documents represents a body of text (or corpus) for which we have produced regulatory restriction counts. For example, to produce data on regulatory restrictions imposed by the US Federal government, RegData uses the Code of Federal Regulations (CFR) as the source documents. Within the CFR, RegData identifies a unit of regulation as the title-part combination. The CFR is organized into 50 titles, and within each title are parts, which may or may not have subparts. Under the parts are sections. Determining this unit of analysis is critical for the context of the data produced by RegData. Producing regulatory restriction data for US states follows the same strategy but uses the state-specific regulatory code.
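To make the title-part unit concrete, here is a toy sketch that rolls section-level rows up to one row per title-part combination. The rows are invented, and summing section counts is an assumption made for illustration; the text above does not describe how RegData aggregates internally.

```r
# Made-up section-level rows; titles, parts, sections, and counts are placeholders.
sections <- data.frame(
  title        = c(12, 12, 12, 40),
  part         = c(201, 201, 202, 50),
  section      = c("201.1", "201.2", "202.1", "50.1"),
  restrictions = c(3, 5, 2, 7),
  stringsAsFactors = FALSE
)

# Build the unit of analysis: one key per title-part combination.
sections$unit <- paste(sections$title, "CFR", sections$part)

# Roll section-level counts up to the title-part unit (assumed: a simple sum).
by_unit <- aggregate(restrictions ~ unit, data = sections, FUN = sum)
print(by_unit)
```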

When requesting data through the API, you must specify the document type and indicate a preference for summary or document-level data. By default, RegCensus API returns summarized data for the period of interest. This means that if you do not specify the summary preference, you will receive the summarized data for a period. The get_series_periods helper function (described below) returns the periods available for each series.

RegCensus API defines a number of periods depending on the series. For example, the total restrictions series of Federal regulations uses two main periods: daily and annual. The daily data produces the number of regulatory restrictions issued on a particular date by the US Federal government. The same data are available on an annual basis.

There are six helper functions that return information about these key components of RegData in a list format, to help you get a sense of the data available. These functions cover topics, documents, jurisdictions, series, agencies, and years with data. For example, to view the list of topics, call get_topics. When a topic id is supplied, the function returns the details of that specific topic.

get_topics()

Each topic comprises one or more series. The get_series function returns the list of all series when no series id is provided. You can request the series under a particular topic by passing the topic id as a parameter to the get_series function.

#get_series(id = 91)

The get_series function takes two parameters: the series id and the subset of data, which is one of "all", "jurisdiction", "topic", "agency", or "industry". Any subset option other than "all" must be used together with an id, where the id refers to an entry within that subset. For example, get_series(id=1,by='agency') will return the list of series for the agency with ID 1.
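The rule that a subset option needs an accompanying id can be checked before any request is made. The helper below is hypothetical and not part of regcensusAPI; it only sketches the validation using base R's match.arg, with the allowed values taken from the list above.

```r
# Hypothetical helper (not part of regcensusAPI): validate a subset option
# before it would be passed on to get_series().
check_series_subset <- function(id = NULL,
                                by = c("all", "jurisdiction", "topic",
                                       "agency", "industry")) {
  by <- match.arg(by)  # errors on anything outside the allowed set
  if (by != "all" && is.null(id)) {
    stop("an id is required when by = '", by, "'")
  }
  list(id = id, by = by)
}

# A valid combination passes through unchanged.
valid <- check_series_subset(id = 1, by = "agency")
print(valid)

# A subset without an id is rejected; capture the message instead of stopping.
problem <- tryCatch(check_series_subset(by = "topic"),
                    error = function(e) conditionMessage(e))
print(problem)
```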

Other helper functions give you a tour of RegData. To see the jurisdictions with data in RegData, call get_jurisdictions. This function returns the complete set of jurisdictions in a list format.

get_jurisdictions()

The get_series_periods function returns a list of all series and the years with data available. The output from this function can serve as a reference for the valid values that can be passed to parameters in the get_values function. The number of records returned is the number of unique combinations of series and jurisdictions available in RegData. The function takes an optional jurisdiction id argument. In the example below, we request the periods for the United States.

get_series_periods(jurisdiction = 38)
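One way to use the periods output as a reference is to check requested years against it before calling get_values. The sketch below runs on a made-up data frame; the column names series_id and year are assumptions, and the helper available_years is hypothetical.

```r
# Made-up stand-in for the output of the periods helper; real column names
# and values may differ.
periods <- data.frame(
  series_id = c(1, 1, 1, 2, 2),
  year      = c(2016, 2017, 2018, 2017, 2018)
)

# Hypothetical helper: which of the requested years have data for a series?
available_years <- function(periods, series_id, years) {
  have <- periods$year[periods$series_id == series_id]
  intersect(years, have)
}

usable <- available_years(periods, series_id = 2, years = 2015:2018)
print(usable)
```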

Metadata

The get_* helper functions return RegData metadata. These metadata are not included in the output of the get_values functions described later.

Jurisdictions

Use the get_jurisdictions function to return a data frame with all the jurisdictions. When you supply the jurisdiction parameter, the function returns the details of just that jurisdiction. The output from get_jurisdictions can be merged with data from the get_values function. New jurisdictions are added with each iteration of RegData, so call get_jurisdictions to see the complete, current list.

#get_jurisdictions(38)

Agencies

The get_agencies function returns a data frame of all agencies with data in RegData. If an ID is supplied, the function returns the details of the single agency specified by that id. The data frame includes characteristics of the agencies. Currently, agency data are only available for federal RegData.

get_agencies()

Use the value of the agency_id field when pulling values with the get_values function.

Industries

The get_industries function returns a data frame of industries with data in the API. Presently, the only classification system available is the North American Industry Classification System (NAICS). NAICS is used both for the North American countries and for Australia, even though Australia itself uses the Australian and New Zealand Standard Industrial Classification (ANZSIC) system; for now, industry regulations for Australia are based on NAICS. As RegData expands to other countries, the industry codes will be country specific and will also include a mapping to the Standard Industrial Classification (SIC) system.

get_industries(38)

A comprehensive list of NAICS codes can be viewed at the US Census website.

Values

The get_values function is the primary function for obtaining RegData from the RegCensus API. The function takes the following parameters (required fields are in bold and optional parameters are italicized):

  • *list of jurisdiction ids, using the 'c' function, for example, c(1,2)*
  • *list of series ids*
  • *list of years*
  • *list of agencies*
  • *list of industries*
  • *dateIsRange* - specify if the list of years provided for the parameter years is a range. Default is TRUE

In the example below, we are interested in the total number of restrictions and total number of words (get_topics(1)) for the US (get_jurisdictions(38)) for all years between and including 2010 and 2018.

    get_values(jurisdiction = 38, series = c(1,2), date = c(2010,2018), document_type = 1)
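The difference between treating a pair of years as a range and treating it as a literal list can be sketched as follows. The helper expand_years and its date_is_range argument are hypothetical and only mirror the behaviour described above; they do not show how the package handles dates internally.

```r
# Hypothetical helper (not part of regcensusAPI) showing how a two-element
# date vector could be read as a range versus as a literal list of years.
expand_years <- function(date, date_is_range = TRUE) {
  if (date_is_range && length(date) == 2) {
    seq(min(date), max(date))   # every year from start to end, inclusive
  } else {
    sort(unique(date))          # only the years explicitly listed
  }
}

as_range <- expand_years(c(2010, 2018))
as_list  <- expand_years(c(2010, 2018), date_is_range = FALSE)
print(as_range)
print(as_list)
```

With the range reading, c(2010, 2018) covers nine years; with the list reading, it covers only the two years given.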
    

    Values by Subgroup

    You can obtain data for any of the three subgroups for each series - agencies, industries, and occupations (when they become available).

    Values by Agencies

    To obtain the restrictions for a specific agency (or agencies), the series id supplied must be in the list of series available by agency. To recap, the list of available series for an agency is returned by get_series(id,by='agency'), and the list of agencies with data is returned by the get_agencies function.

    #Identify all agencies
    get_agencies()
    
    
    # Call get_values() for agency 66 and series 13
    get_values(jurisdiction = 38, series = c(13), date = c(1990,2018), agency = c(66), document_type = 1)
    

    Values by Agency and Industry

    Some agency series may also have data by industry. For example, under the Total Restrictions topic, RegData includes the industry-relevant restrictions, which estimate the number of restrictions that apply to a given industry. These are available in both the main series, Total Restrictions, and the subgroup, Restrictions by Agency.

    To pull industry-relevant restrictions for an agency, call get_values with the industry parameter. The industry parameter is of type string, and valid values are the industry codes in the classification system returned by the get_industries(jurisdiction) function.

    In the example below, using series 30 (Restrictions by Agency and Industry), we request data for the two industries 111 and 33 with the following code snippet.

    get_values(jurisdiction = 38, series = c(30), date = c(1990,2000), industry = c('111','33'), agency = 66)
    

    Merging with Metadata

    To minimize the network bandwidth required to use RegCensusAPI, the data returned by the get_values functions contain minimal metadata. Once you have pulled the values with get_values, you can use various R packages to attach the metadata.

    Suppose we want to attach the agency names and other agency characteristics to the data from the last code snippet. First pull the list of agencies into a separate data frame, then merge it with the values data frame. The key for matching the data is the agency ID column (agency.agencyID in the values data frame and agencyID in the agencies data frame).

    Using the dplyr package, we can merge the agency data with the values data as in the code snippet below. The snippet loads the dplyr and pander packages, so both must be installed. For more information about these packages, run the 'help' commands (? or help(dplyr)).

    # Load the packages used for joining and for printing the table
    library(dplyr)
    library(pander)
    # Agency metadata and the values to be annotated
    agencies <- get_agencies()
    agency_by_industry <- get_values(jurisdiction = 38,
                                     series = c(30), date = c(1990,2000),
                                     industry = c('111','33'),
                                     document_type = 1,
                                     agency = c(66,111))
    # Attach agency details to each values row, matching on the agency ID
    agency_restrictions_ind <- agency_by_industry %>%
      dplyr::left_join(agencies, by=c('agency.agencyID'='agencyID'))
    
    pander::pandoc.table(agency_restrictions_ind)
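The same left join can be written with base R's merge, which may be convenient when dplyr is not available. The sketch below runs offline on made-up data frames; only the key column names (agency.agencyID and agencyID) are taken from the snippet above, and every other column and value is a placeholder.

```r
# Made-up stand-ins for the two data frames; only the key column names are
# taken from the dplyr snippet, everything else is a placeholder.
agencies <- data.frame(
  agencyID   = c(66, 111),
  agencyName = c("Agency A", "Agency B"),
  stringsAsFactors = FALSE
)
values <- data.frame(
  agency.agencyID = c(66, 66, 111),
  year            = c(1990, 1991, 1990),
  value           = c(10, 12, 7)
)

# Left join: keep every values row and attach agency details where available.
merged <- merge(values, agencies,
                by.x = "agency.agencyID", by.y = "agencyID",
                all.x = TRUE)
print(merged)
```

Setting all.x = TRUE keeps values rows even when an agency is missing from the metadata, which matches the behaviour of a left join.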
    

    Advanced Functionality

    Users also have available several advanced and specialized functions for particular research uses. These include specialized get_values functions and other functions to make document-level calls on the API.

    Specialized get_values functions

    Users also have access to two new functions that act similarly to get_values. These provide more flexibility in making calls to the API without needing to create as many arguments as get_values.

    Values by Country

    This function, get_country_values, pulls all aggregated values for a country at both the national and sub-national levels. This provides a quicker way of collecting larger amounts of data without having to name specific sub-national entities. The method is ideal for pulling aggregate data for cross-jurisdictional comparisons, or for pulling all values for one country to do a meta-level analysis. Currently, the countries available are the United States, Canada, and Australia, at both the national and sub-national levels. In future versions of RegData, data will be available for the United Kingdom.

    The example below pulls all aggregated values for the United States, including national values and sub-national values for all 50 states and the District of Columbia, for 2010 and 2011.

    get_country_values(jurisdiction = 38, series = c(1,2), date = c(2010,2011), date_is_range=TRUE, filtered=TRUE)
    

    Values by Industry

    In much the same way that get_country_values pulls all jurisdictions, get_industry_values pulls all industry values for a specified jurisdiction. This allows users to gather all aggregated industry values by jurisdiction, industry name, and/or NAICS code, from the 2-digit to the 6-digit level. This function is ideal for pulling data for intra-industry and inter-industry comparisons.

    The example below gathers data for all industries at the 3-Digit NAICS level on the national level of the United States for 2010 and 2011.

    get_industry_values(industry_type = "3-Digit", jurisdiction = 38, series = c(9), date = c(2010, 2011), date_is_range = TRUE, filtered = TRUE)
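The digit level of a NAICS code is simply the number of characters in the code, which is one reason to keep codes as strings. The sketch below labels a few illustrative codes by level; the helper naics_level is hypothetical, and its labels merely imitate the "3-Digit" style used in the call above.

```r
# Illustrative NAICS codes stored as strings; the selection is arbitrary.
codes <- c("11", "111", "1111", "33", "336", "336411")

# Hypothetical helper: label each code with its digit level.
naics_level <- function(code) paste0(nchar(code), "-Digit")

code_levels <- naics_level(codes)
three_digit <- codes[code_levels == "3-Digit"]
print(three_digit)
```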
    

    Document-level data

    Whereas the previous sections of this vignette cover aggregated data, document-level data allow users to view or download specific documents by jurisdiction or agency. These data are large, so making document-level calls is not recommended unless there is a specific use case.

    A list of document types can be viewed by jurisdiction (or all jurisdictions) through the get_document_types function. In the example below, we look at all the document types for the state of Nebraska.

    get_document_types(jurisdiction = 78)
    

    Downloading documents allows the user to collect raw data from RegData for their own analysis. This is done through the get_documents function, with arguments for jurisdiction and document type. It is HIGHLY recommended that the user make an API call with a specific jurisdiction in mind. How long a document-level pull takes depends on the jurisdiction, and attempting to pull all jurisdictions will result in an error. The example below shows how to make a call. In this case, we do not let the function run in the vignette.

    get_documents(jurisdiction = 78, document_type = 3)
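Because pulling all jurisdictions at once fails, one option is to request them one at a time and stack the results. The wrapper below is a hypothetical sketch, not part of regcensusAPI: the fetch argument stands in for a call such as get_documents, and a stub with invented output is used so that the example runs offline.

```r
# Hypothetical wrapper: call a document-fetching function once per
# jurisdiction and stack the results. `fetch` stands in for a call such as
# get_documents(); a stub is used here so the sketch runs offline.
fetch_by_jurisdiction <- function(jurisdictions, document_type, fetch) {
  pieces <- lapply(jurisdictions, function(j) {
    tryCatch(
      fetch(jurisdiction = j, document_type = document_type),
      error = function(e) {
        message("jurisdiction ", j, " failed: ", conditionMessage(e))
        NULL  # skip this jurisdiction and keep going
      }
    )
  })
  do.call(rbind, pieces)  # NULL entries are ignored by rbind
}

# Stub returning one made-up row per call; real output will look different.
stub_fetch <- function(jurisdiction, document_type) {
  data.frame(jurisdiction  = jurisdiction,
             document_type = document_type,
             document_id   = paste0("doc-", jurisdiction),
             stringsAsFactors = FALSE)
}

docs <- fetch_by_jurisdiction(c(78, 38), document_type = 3, fetch = stub_fetch)
print(docs)
```

Catching errors per jurisdiction means one failed request does not discard the documents already downloaded for the others.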



    QuantGov/regcensus-api-client-R documentation built on Jan. 4, 2022, 12:02 a.m.