knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
claimsdb provides easy access to a sample of health insurance enrollment and claims data from the CMS Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF), as a set of relational tables or as an in-memory database using DuckDB. This package is inspired by and based on the starwarsdb package.
The data are structured as actual Medicare claims data but are fully "synthetic," after a process of alterations meant to reduce the risk of re-identification of real Medicare beneficiaries. The synthetic process that CMS adopted changes the co-variation across variables, so analysts should be cautious about drawing inferences about the actual Medicare population.
The data included in claimsdb comes from 500 randomly selected 2008 Medicare beneficiaries from Sample 2 of the DE-SynPUF, and it includes all the associated claims for these members for 2008-2009. CMS provides resources, including a codebook, FAQs, and other documents with more information about this data.
Formats: Metadata, Local Tables, Remote Database Tables
You can install the development version of claimsdb from GitHub with:
# install.packages("devtools") devtools::install_github("jfangmeier/claimsdb")
library(tibble)
All of the tables are available when you load claimsdb
library(dplyr) library(claimsdb)
The tables are also available by loading just the data from claimsdb
data(package = "claimsdb")
This includes a schema
table that describes each of the tables and the
included variables from the CMS
DE-SynPUF.
schema
schema %>% filter(TABLE == "inpatient") %>% pull(PROPERTIES)
Using the sample data in the tables, you can ask questions such as, what were the average and median prescription drug costs for males and females in 2008 and 2009?
rx_costs <- pde %>% mutate(BENE_YEAR = lubridate::year(SRVC_DT)) %>% group_by(BENE_YEAR, DESYNPUF_ID) %>% summarize(TOTAL_RX_COST = sum(TOT_RX_CST_AMT, na.rm = T), .groups = "drop") bene %>% transmute( BENE_YEAR, DESYNPUF_ID, BENE_SEX_IDENT = case_when( BENE_SEX_IDENT_CD == "1" ~ "Male", TRUE ~ "Female") ) %>% left_join( rx_costs, by = c("BENE_YEAR", "DESYNPUF_ID") ) %>% mutate(TOTAL_RX_COST = ifelse(is.na(TOTAL_RX_COST), 0, TOTAL_RX_COST)) %>% group_by(BENE_YEAR, BENE_SEX_IDENT) %>% summarize( MEAN_RX_COST = mean(TOTAL_RX_COST, na.rm = T), MEDIAN_RX_COST = median(TOTAL_RX_COST, na.rm = T))
Many organizations store claims data in a remote database, so claimsdb also includes all of the tables as an in-memory DuckDB database. This can be a great way to practice working with this type of data, including building queries with dplyr code using dbplyr.
con <- claims_connect() bene_rmt <- tbl(con, "bene") pde_rmt <- tbl(con, "pde") pde_rmt rx_costs_rmt <- pde_rmt %>% # note below that lubridate functions do not currently work on remote databases, # so you need to use date/time functions appropriate for the database. mutate(BENE_YEAR = date_part('year', SRVC_DT)) %>% group_by(BENE_YEAR, DESYNPUF_ID) %>% summarize(TOTAL_RX_COST = sum(TOT_RX_CST_AMT, na.rm = T), .groups = "drop") %>% ungroup() rx_costs_rmt bene_rmt %>% transmute( BENE_YEAR, DESYNPUF_ID, BENE_SEX_IDENT = case_when( BENE_SEX_IDENT_CD == "1" ~ "Male", TRUE ~ "Female") ) %>% left_join( rx_costs_rmt, by = c("BENE_YEAR", "DESYNPUF_ID") ) %>% mutate(TOTAL_RX_COST = ifelse(is.na(TOTAL_RX_COST), 0, TOTAL_RX_COST)) %>% group_by(BENE_YEAR, BENE_SEX_IDENT) %>% summarize( MEAN_RX_COST = mean(TOTAL_RX_COST, na.rm = T), MEDIAN_RX_COST = median(TOTAL_RX_COST, na.rm = T))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.