```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>"
)
```
The National Health and Nutrition Examination Survey (NHANES) is one of the most widely used public health datasets in the U.S., spanning over two decades of continuous data collection. Working with NHANES data directly from the CDC presents two recurring problems:
The CDC publishes each cycle as its own table (DEMO, DEMO_B, ..., DEMO_L). Combining cycles requires tracking naming conventions and reconciling type differences across waves.

nhanesdata addresses both issues by hosting pre-merged, type-harmonized datasets on Cloudflare R2 with public access. A single call to read_nhanes("demo") returns all demographics data from 1999-2023 with a year column identifying each survey cycle.
This package builds on the nhanesA package, which provides the underlying interface to NHANES data in R.
```{r}
#| label: install
#| eval: false
# install.packages("pak")
pak::pak("kyleGrealis/nhanesdata")
```

```{r}
#| label: load-packages
#| eval: true
#| message: false
#| warning: false
library(nhanesdata)
library(dplyr)
library(ggplot2)
```

```{r}
#| label: load-demo
#| eval: true
#| echo: false
demo <- read_nhanes("demo")
```

```{r}
#| label: load-demo-fake
#| eval: false
demo <- read_nhanes("demo")
glimpse(demo)
```
Every dataset includes two key columns:
- year: Survey cycle start year (1999, 2001, 2003, ..., 2017, 2021)
- seqn: Respondent sequence number (unique within a cycle, used for joining)

Dataset names are case-insensitive: "demo", "DEMO", and "Demo" all work.
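As a quick sanity check after loading, it can help to confirm that both key columns are present and to count rows per cycle. The sketch below uses a small made-up data frame (demo_toy, with invented values) in place of a real read_nhanes() result so that it runs without network access.

```r
# Toy stand-in for a loaded dataset; all values are invented for illustration.
demo_toy <- data.frame(
  seqn = c(1, 2, 9966, 9967, 21005),
  year = c(1999, 1999, 2001, 2001, 2003),
  ridageyr = c(2, 77, 41, 19, 63)
)

# Both key columns should be present in every dataset
stopifnot(all(c("seqn", "year") %in% names(demo_toy)))

# Number of rows contributed by each survey cycle
rows_per_cycle <- table(demo_toy$year)
rows_per_cycle
```

The same two checks should apply unchanged to a real dataset by swapping demo_toy for the object returned by read_nhanes().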
```{r}
#| label: age-analysis
#| eval: true
#| fig.width: 8
#| fig.height: 6
demo |>
  filter(!is.na(ridageyr)) |>
  ggplot(aes(x = ridageyr)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  facet_wrap(~year, ncol = 4) +
  labs(
    title = "NHANES Age Distribution by Survey Cycle",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_minimal()
```
When you call read_nhanes("demo"), you receive data that the CDC publishes across multiple cycle-specific tables:
| CDC Table | Survey Years |
|-----------|--------------|
| DEMO | 1999-2000 |
| DEMO_B | 2001-2002 |
| DEMO_C | 2003-2004 |
| ... | ... |
| DEMO_L | 2021-2023 |

The package merges these tables into a single demo dataset with a year column and harmonized data types across all cycles.
This matters when you need CDC documentation for a specific variable. Use get_url() to retrieve the codebook URL for any cycle-specific table:
```{r}
#| label: get-url
#| eval: false
get_url("DEMO")    # 1999-2000 codebook
get_url("DEMO_I")  # 2015-2016 codebook
```
The 2019-2020 cycle (suffix K) is excluded from all datasets. See vignette("covid-data-exclusion") for details.
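To go from a year value in a merged dataset back to the cycle-specific table name, the suffix pattern can be captured in a small lookup. The sketch below is a hypothetical helper, not part of nhanesdata. It assumes suffixes run alphabetically from B (2001) through J (2017), skips the excluded K cycle, and ends at L (2021).

```r
# Hypothetical helper (not an nhanesdata function): map a cycle start year
# to the CDC table name for a given base table.
cycle_table_name <- function(base, year) {
  # Assumed cycle start years: 1999, 2001, ..., 2017, then 2021 (K is skipped)
  starts <- c(seq(1999, 2017, by = 2), 2021)
  # The first cycle has no suffix; later cycles use _B through _J, then _L
  suffixes <- c("", paste0("_", LETTERS[2:10]), "_L")

  idx <- match(year, starts)
  if (anyNA(idx)) {
    stop("Unknown cycle start year: ", paste(year[is.na(idx)], collapse = ", "))
  }
  paste0(toupper(base), suffixes[idx])
}

cycle_table_name("demo", c(1999, 2015, 2021))
#> [1] "DEMO"   "DEMO_I" "DEMO_L"
```

The result could then be passed to get_url() to open the matching codebook. A year of 2019 raises an error here, mirroring the exclusion of the K cycle.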
Combine datasets using seqn and year as join keys:
```{r}
#| label: multiple-datasets
#| eval: false
demo <- read_nhanes("demo")
bpx <- read_nhanes("bpx")
bmx <- read_nhanes("bmx")

analysis_data <- demo |>
  inner_join(bpx, by = c("seqn", "year")) |>
  inner_join(bmx, by = c("seqn", "year")) |>
  select(year, seqn, ridageyr, riagendr, bpxsy1, bmxbmi)
```
Always join on both seqn and year. Each seqn is unique within its cycle, and joining on both columns ensures participants are matched within the same survey period.
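The effect of the join keys can be seen on toy data. The sketch below uses two made-up data frames and base R's merge() rather than real NHANES data, so it runs without network access. Joining on both keys keeps a single year column, while joining on seqn alone leaves two year columns to reconcile.

```r
# Toy stand-ins for two NHANES datasets; all values are invented.
demo_toy <- data.frame(
  seqn = c(1, 2, 3),
  year = c(1999, 1999, 2001),
  ridageyr = c(34, 51, 27)
)
bpx_toy <- data.frame(
  seqn = c(1, 3, 4),
  year = c(1999, 2001, 2001),
  bpxsy1 = c(118, 124, 131)
)

# The key pair should identify each row exactly once before joining
stopifnot(anyDuplicated(demo_toy[c("seqn", "year")]) == 0)
stopifnot(anyDuplicated(bpx_toy[c("seqn", "year")]) == 0)

# Inner join on both keys: one year column
by_both <- merge(demo_toy, bpx_toy, by = c("seqn", "year"))
names(by_both)

# Inner join on seqn only: year is duplicated as year.x and year.y
by_seqn <- merge(demo_toy, bpx_toy, by = "seqn")
names(by_seqn)
```

dplyr's inner_join() behaves the same way in this respect, adding .x and .y suffixes to any shared column left out of by.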
```{r}
#| label: filter-years
#| eval: false
demo <- read_nhanes("demo")

# Recent cycles only
recent <- demo |>
  filter(year >= 2015)

# Compare time periods
demo |>
  mutate(
    period = case_when(
      year < 2010 ~ "1999-2009",
      year < 2020 ~ "2010-2019",
      TRUE ~ "2020+"
    )
  ) |>
  group_by(period) |>
  summarise(n = n())
```
```{r}
#| label: term-search
#| eval: false
term_search("blood pressure")
```
term_search() returns variable names, table names, descriptions, and collection years. From the results you can identify the base table name (e.g., BPX) to pass to read_nhanes().
```{r}
#| label: var-search
#| eval: false
var_search("BPXSY1")
```
var_search() shows which cycles contain a specific variable.
```{r}
#| label: check-var-setup
#| echo: false
#| eval: true
bmx <- read_nhanes("bmx")
```

```{r}
#| label: check-var
#| eval: true
"bmxht" %in% names(bmx)
```
All search functions are case-insensitive.
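Once a dataset is loaded, a column's presence does not by itself say which cycles actually recorded it. The sketch below checks coverage locally, on the assumption (not confirmed by the package documentation) that a variable absent from a cycle shows up as all NA for that cycle in the merged data. It uses a made-up data frame so it runs without network access.

```r
# Toy stand-in for a merged dataset; values are invented for illustration.
bmx_toy <- data.frame(
  year = c(1999, 1999, 2001, 2001, 2003),
  bmxht = c(170.2, NA, NA, NA, 165.0)
)

# For each cycle, is there at least one non-missing value?
coverage <- tapply(!is.na(bmx_toy$bmxht), bmx_toy$year, any)

# Cycles in which the variable has any recorded data
cycles_with_data <- names(coverage)[coverage]
cycles_with_data
```

In this toy data, the 2001 cycle has no recorded heights and is left out of the result.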
```{r}
#| label: bp-plot-setup
#| include: false
#| eval: true
#| message: false
#| warning: false
bpx <- read_nhanes("bpx")

bp_combined <- demo |>
  filter(ridageyr >= 18) |>
  select(seqn, year, ridageyr, riagendr, ridreth1) |>
  inner_join(
    bpx |> select(seqn, year, bpxsy1, bpxdi1),
    by = c("seqn", "year")
  )

bp_summary <- bp_combined |>
  filter(!is.na(bpxsy1), !is.na(bpxdi1), bpxsy1 > 0, bpxdi1 > 0) |>
  mutate(
    age_group = cut(
      ridageyr,
      breaks = c(18, 30, 40, 50, 60, 70, 80, Inf),
      labels = c(
        "18-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"
      ),
      right = FALSE
    )
  ) |>
  group_by(age_group) |>
  summarize(
    n = n(),
    mean_systolic = mean(bpxsy1),
    mean_diastolic = mean(bpxdi1),
    .groups = "drop"
  )
```

```{r}
#| label: complete-example-load
#| eval: false
demo <- read_nhanes("demo")
bpx <- read_nhanes("bpx")

bp_analysis <- demo |>
  filter(ridageyr >= 18) |>
  select(seqn, year, ridageyr, riagendr, ridreth1) |>
  inner_join(
    bpx |> select(seqn, year, bpxsy1, bpxdi1),
    by = c("seqn", "year")
  ) |>
  filter(!is.na(bpxsy1), !is.na(bpxdi1), bpxsy1 > 0, bpxdi1 > 0) |>
  mutate(
    age_group = cut(
      ridageyr,
      breaks = c(18, 30, 40, 50, 60, 70, 80, Inf),
      labels = c(
        "18-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"
      ),
      right = FALSE
    )
  )

bp_summary <- bp_analysis |>
  group_by(age_group) |>
  summarize(
    n = n(),
    mean_systolic = mean(bpxsy1),
    mean_diastolic = mean(bpxdi1),
    .groups = "drop"
  )
```

```{r}
#| label: complete-example-plot
#| eval: true
#| echo: true
#| fig.width: 7
#| fig.height: 5
bp_summary |>
  ggplot(aes(x = age_group)) +
  geom_col(aes(y = mean_systolic), fill = "coral", alpha = 0.7) +
  geom_col(aes(y = mean_diastolic), fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Blood Pressure Increases with Age",
    subtitle = "Mean systolic (coral) and diastolic (blue) BP by age group",
    x = "Age Group",
    y = "Blood Pressure (mmHg)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
?read_nhanes for function documentation