In iecastro/healthdatacsv: Access data in the healthdata.gov catalog

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  message = FALSE,
  warning = FALSE
)

devtools::load_all()

healthdatacsv

healthdatacsv allows users to query the healthdata.gov API catalog and return tidy data frames. This package focuses on the data.json endpoint and will download datasets if available via file download, namely, in csv format.

Installation

You can install the development version of healthdatacsv from GitHub with:

``` {r, eval = FALSE} install.packages("remotes") remotes::install_github("iecastro/healthdatacsv")

## Examples

```r
library(healthdatacsv)

Basic examples which show you how to use healthdatacsv:

Querying API based on keywords:

fetch_catalog(keyword = "alcohol|drugs")

Querying API and downloading data for single product:

fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment") %>% 
  fetch_data()

Querying API and downloading data for multiple data products. Data will be nested in a list-column:

library(dplyr) # for table manipulation verbs

# query catalog
fetch_catalog(keyword = "alcohol")

# fetch data
fetch_catalog(keyword = "alcohol") %>% 
  dplyr::slice(1) %>% # dplyr 
  fetch_data()

fetch_data() wraps the vroom function, which helps quickly read relatively large delimited files:

# PRAMS data
# CDC surveillance for reproductive health
# query, filter, and fetch
prams <- fetch_catalog(keyword = "reproductive health") %>% 
  mutate(year = readr::parse_number(product)) %>% 
  filter(year >= 2010) %>% 
  arrange(year) %>% 
  fetch_data()

prams %>% 
  select(product, data_tbl)

Basic workflow

Say you're interested in searching for CDC datasets related to the built environment. You can use healthdatacsv to fetch the catalog of available data products:

cdc_built_env <- fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")

cdc_built_env

In this case, there is only one product available. To learn more about the dataset, you can simply pull the description:

cdc_built_env %>%
  dplyr::pull(description)

This dataset relates to state legislation on nutrition, physical activity, and obesity during 2001-2017. Data only includes enacted legislation.

To download the data, we pass the catalog object to the fetch_data function. Since there is only one dataset to download, we can unnest in the same pipe. If the catalog has more than one product that you'd like to keep, it is recommended to unnest each product separately. If the catalog consists of several time points of the same dataset, you could unnest in one pipe, given all column names are the same.

data_raw <- cdc_built_env %>%
  fetch_data() %>% #> 
  tidyr::unnest(data_tbl)

data_raw

Some helper functions

get_agencies() will query and download names of agencies listed in the catalog. You can enter partial name or initials of agency of interest to check if it has data cataloged in the API. Argument relies on regular expression to detect string matches.

get_agencies("NIH|CDC|FDA")

get_agencies("Institute|Drug")

The resulting dataframe will depend on matches to the string supplied. To pull all agencies cataloged, simply use get_agencies with the default argument:

get_agencies()

get_keywords() will extract all keywords cataloged. The function accepts as argument a full agency name (in proper title case) which can be obtained with get_agencies.

get_keywords("National Institutes of Health (NIH)")

Conversely, this function can be run with no argument for a data frame of all keywords and agencies cataloged.

Info

Development of this package partly supported by a research grant from the National Institute on Alcohol Abuse and Alcoholism - NIH Grant #R34AA026745. This product is not endorsed nor certified by either healthdata.gov or NIH/NIAAA.

Please note that the 'healthdatacsv' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

iecastro/healthdatacsv documentation built on Nov. 10, 2020, 2:37 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com