Getting Started with NHANES Data
In nhanesdata: Harmonized Access to NHANES Survey Data

#| label: setup
#| include: false

knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>"
)

Motivation

The National Health and Nutrition Examination Survey (NHANES) is one of the most widely used public health datasets in the U.S., spanning over two decades of continuous data collection. Working with NHANES data directly from the CDC presents two recurring problems:

Server reliability. The CDC's data servers are frequently slow or unresponsive, which breaks reproducible workflows.
Cycle management. The CDC publishes data in two-year cycles with letter suffixes (DEMO, DEMO_B, ..., DEMO_L). Combining cycles requires tracking naming conventions and reconciling type differences across waves.

nhanesdata addresses both issues by hosting pre-merged, type-harmonized datasets on Cloudflare R2 with public access. A single call to read_nhanes("demo") returns all demographics data from 1999-2023 with a year column identifying each survey cycle.
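
Conceptually, pre-merging amounts to tagging each cycle-specific table with its start year, coercing mismatched column types to a common type, and stacking the results. The following is a minimal base R sketch of that idea on two invented toy tables. It is not the package's actual pipeline, and the names demo_a, demo_b, and demo_all are illustrative only.

```r
# Toy stand-ins for two cycle-specific tables; all values are invented.
demo_a <- data.frame(seqn = c(1, 2), ridageyr = c(34, 61))      # age stored as numeric
demo_b <- data.frame(seqn = c(3, 4), ridageyr = c("12", "45"))  # age stored as character

# Name each table by its cycle start year
cycles <- list(`1999` = demo_a, `2001` = demo_b)

harmonized <- lapply(names(cycles), function(yr) {
  df <- cycles[[yr]]
  df$ridageyr <- as.numeric(df$ridageyr)  # reconcile the type difference
  df$year <- as.integer(yr)               # record which cycle the rows came from
  df
})

# Stack the harmonized tables into one data frame
demo_all <- do.call(rbind, harmonized)
str(demo_all)
```

With read_nhanes() this work is already done, so the result arrives in the stacked form directly.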

Acknowledgments

This package builds on the nhanesA package, which provides the underlying interface to NHANES data in R.

Installation

#| label: install
#| eval: false

# install.packages("pak")
pak::pak("kyleGrealis/nhanesdata")

#| label: load-packages
#| eval: true
#| message: false
#| warning: false

library(nhanesdata)
library(dplyr)
library(ggplot2)

Loading Data

#| label: load-demo
#| eval: true
#| echo: false

demo <- read_nhanes("demo")

#| label: load-demo-fake
#| eval: false

demo <- read_nhanes("demo")
glimpse(demo)

Every dataset includes two key columns:

year: Survey cycle start year (1999, 2001, 2003, ..., 2017, 2021). There is no 2019 value because the 2019-2020 cycle is excluded.
seqn: Respondent sequence number (unique within a cycle, used for joining)

Dataset names are case-insensitive: "demo", "DEMO", and "Demo" all work.
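
Since seqn is described as unique within a cycle, the pair (seqn, year) acts as the row key. A quick sanity check after loading is to confirm that no pair repeats and to count rows per cycle. The sketch below uses an invented toy data frame (demo_toy) in place of the real download, so it runs offline; the same two calls should apply to the output of read_nhanes("demo").

```r
# Toy stand-in for read_nhanes("demo"); values are invented.
demo_toy <- data.frame(
  seqn = c(1, 2, 3, 4),
  year = c(1999, 1999, 2001, 2001),
  ridageyr = c(34, 61, 12, 45)
)

# anyDuplicated() returns 0 when no (seqn, year) pair appears more than once
n_dup <- anyDuplicated(demo_toy[c("seqn", "year")])
n_dup
#> [1] 0

# Number of rows contributed by each survey cycle
rows_per_cycle <- table(demo_toy$year)
rows_per_cycle
```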

Example: Age Distribution Over Time

#| label: age-analysis
#| eval: true
#| fig.width: 8
#| fig.height: 6

demo |>
  filter(!is.na(ridageyr)) |>
  ggplot(aes(x = ridageyr)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  facet_wrap(~year, ncol = 4) +
  labs(
    title = "NHANES Age Distribution by Survey Cycle",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_minimal()

CDC Dataset Naming

When you call read_nhanes("demo"), you receive data that the CDC publishes across multiple cycle-specific tables:

| CDC Table | Survey Years |
|-----------|--------------|
| DEMO | 1999-2000 |
| DEMO_B | 2001-2002 |
| DEMO_C | 2003-2004 |
| ... | ... |
| DEMO_L | 2021-2023 |

The package merges these tables into a single demo dataset with a year column and harmonized data types across all cycles.

This matters when you need CDC documentation for a specific variable. Use get_url() to retrieve the codebook URL for any cycle-specific table:

#| label: get-url
#| eval: false

get_url("DEMO")   # 1999-2000 codebook
get_url("DEMO_I") # 2015-2016 codebook

The 2019-2020 cycle (suffix K) is excluded from all datasets. See vignette("covid-data-exclusion") for details.
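
To go from a cycle-specific table name back to the value in the year column, the suffix letter can be mapped to a cycle start year. The helper below, cycle_start_year(), is a hypothetical illustration and not part of nhanesdata. It assumes every cycle-specific table name ends in an underscore plus a single letter, as the DEMO tables do, and treats a name without a suffix as the 1999-2000 cycle.

```r
# Hypothetical helper, not part of nhanesdata.
cycle_start_year <- function(table_name) {
  # Suffix letters advance one per two-year cycle; "A" stands in for the
  # unsuffixed 1999-2000 tables. K (2019) is listed for completeness even
  # though the package excludes that cycle; L starts in 2021.
  lookup <- setNames(c(seq(1999, 2019, by = 2), 2021), LETTERS[1:12])

  has_suffix <- grepl("_[A-Za-z]$", table_name)
  suffix <- ifelse(has_suffix, toupper(sub("^.*_", "", table_name)), "A")

  # Unknown suffixes fall through as NA
  unname(lookup[suffix])
}

cycle_start_year(c("DEMO", "DEMO_B", "DEMO_I", "DEMO_L"))
#> [1] 1999 2001 2015 2021
```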

Joining Multiple Datasets

Combine datasets using seqn and year as join keys:

#| label: multiple-datasets
#| eval: false

demo <- read_nhanes("demo")
bpx <- read_nhanes("bpx")
bmx <- read_nhanes("bmx")

analysis_data <- demo |>
  inner_join(bpx, by = c("seqn", "year")) |>
  inner_join(bmx, by = c("seqn", "year")) |>
  select(year, seqn, ridageyr, riagendr, bpxsy1, bmxbmi)

Always join on both seqn and year. Each seqn is unique within its cycle, and joining on both columns ensures participants are matched within the same survey period.
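
One practical consequence can be seen on toy data. Because every dataset carries its own year column, joining on seqn alone leaves two year columns in the result (year.x and year.y), while joining on both keys keeps a single one. The sketch below uses base R merge() as a stand-in for inner_join() so that it runs without extra packages; inner_join() applies the same .x/.y suffixes by default. The data frames demo_toy and bpx_toy are invented.

```r
# Toy stand-ins for read_nhanes("demo") and read_nhanes("bpx"); values are invented.
demo_toy <- data.frame(
  seqn = c(1, 2, 3),
  year = c(1999, 1999, 2001),
  ridageyr = c(34, 61, 45)
)
bpx_toy <- data.frame(
  seqn = c(1, 3),
  year = c(1999, 2001),
  bpxsy1 = c(118, 126)
)

# Join on both keys: one year column, rows matched within the same cycle
both_keys <- merge(demo_toy, bpx_toy, by = c("seqn", "year"))
names(both_keys)
# columns: seqn, year, ridageyr, bpxsy1

# Join on seqn only: year is duplicated as year.x and year.y
seqn_only <- merge(demo_toy, bpx_toy, by = "seqn")
names(seqn_only)
# columns: seqn, year.x, ridageyr, year.y, bpxsy1
```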

Filtering by Survey Year

#| label: filter-years
#| eval: false

demo <- read_nhanes("demo")

# Recent cycles only
recent <- demo |>
  filter(year >= 2015)

# Compare time periods (year is the cycle start year, so 2009 means 2009-2010)
demo |>
  mutate(
    period = case_when(
      year < 2010 ~ "1999-2009",
      year < 2020 ~ "2010-2019",
      TRUE ~ "2020+"
    )
  ) |>
  group_by(period) |>
  summarise(n = n())

Finding Variables

By keyword

#| label: term-search
#| eval: false

term_search("blood pressure")

term_search() returns variable names, table names, descriptions, and collection years. From there you can identify the base table name (e.g., BPX) for use with read_nhanes().

By variable name

#| label: var-search
#| eval: false

var_search("BPXSY1")

var_search() shows which cycles contain a specific variable.

In loaded data

#| label: check-var-setup
#| echo: false
#| eval: true

bmx <- read_nhanes("bmx")

#| label: check-var
#| eval: true

"bmxht" %in% names(bmx)

All search functions are case-insensitive.
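
The %in% check on a loaded data frame is plain R, however, and is case-sensitive. The examples in this vignette use lowercase column names (bmxht, bpxsy1), while CDC documentation writes them in uppercase, so a name copied from a codebook may not match. A small wrapper can normalize case first. has_var() below is a hypothetical helper, not part of nhanesdata, and bmx_toy is an invented stand-in for the real dataset.

```r
# Toy stand-in for read_nhanes("bmx"); values are invented.
bmx_toy <- data.frame(
  seqn = c(1, 2),
  year = c(1999, 1999),
  bmxht = c(170.2, 158.9)
)

# Hypothetical helper: compare names after lowercasing both sides
has_var <- function(data, var) {
  tolower(var) %in% tolower(names(data))
}

"BMXHT" %in% names(bmx_toy)  # exact match fails on case
#> [1] FALSE

has_var(bmx_toy, "BMXHT")    # case-normalized match succeeds
#> [1] TRUE
```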

Complete Example: Blood Pressure by Age Group

#| label: bp-plot-setup
#| include: false
#| eval: true
#| message: false
#| warning: false

bpx <- read_nhanes("bpx")
bp_combined <- demo |>
  filter(ridageyr >= 18) |>
  select(seqn, year, ridageyr, riagendr, ridreth1) |>
  inner_join(
    bpx |> select(seqn, year, bpxsy1, bpxdi1),
    by = c("seqn", "year")
  )
bp_summary <- bp_combined |>
  filter(!is.na(bpxsy1), !is.na(bpxdi1), bpxsy1 > 0, bpxdi1 > 0) |>
  mutate(
    age_group = cut(
      ridageyr,
      breaks = c(18, 30, 40, 50, 60, 70, 80, Inf),
      labels = c(
        "18-29", "30-39", "40-49", "50-59",
        "60-69", "70-79", "80+"
      ),
      right = FALSE
    )
  ) |>
  group_by(age_group) |>
  summarize(
    n = n(),
    mean_systolic = mean(bpxsy1),
    mean_diastolic = mean(bpxdi1),
    .groups = "drop"
  )

#| label: complete-example-load
#| eval: false

demo <- read_nhanes("demo")
bpx <- read_nhanes("bpx")

bp_analysis <- demo |>
  filter(ridageyr >= 18) |>
  select(seqn, year, ridageyr, riagendr, ridreth1) |>
  inner_join(
    bpx |> select(seqn, year, bpxsy1, bpxdi1),
    by = c("seqn", "year")
  ) |>
  filter(!is.na(bpxsy1), !is.na(bpxdi1), bpxsy1 > 0, bpxdi1 > 0) |>
  mutate(
    age_group = cut(
      ridageyr,
      breaks = c(18, 30, 40, 50, 60, 70, 80, Inf),
      labels = c(
        "18-29", "30-39", "40-49", "50-59",
        "60-69", "70-79", "80+"
      ),
      right = FALSE
    )
  )

bp_summary <- bp_analysis |>
  group_by(age_group) |>
  summarize(
    n = n(),
    mean_systolic = mean(bpxsy1),
    mean_diastolic = mean(bpxdi1),
    .groups = "drop"
  )

#| label: complete-example-plot
#| eval: true
#| echo: true
#| fig.width: 7
#| fig.height: 5

bp_summary |>
  ggplot(aes(x = age_group)) +
  geom_col(aes(y = mean_systolic), fill = "coral", alpha = 0.7) +
  geom_col(aes(y = mean_diastolic), fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Mean Blood Pressure by Age Group",
    subtitle = "Mean systolic (coral) and diastolic (blue) BP by age group",
    x = "Age Group",
    y = "Blood Pressure (mmHg)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
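
The plot above draws the diastolic bars on top of the systolic bars and relies on the subtitle to explain the colors. An alternative, sketched here under the assumption that side-by-side bars with a legend are acceptable, is to stack the two measures into long format and map the measure to fill. The data frame bp_summary_toy and its means are invented placeholders, not NHANES results. The reshaping uses base R, and the plotting step runs only if ggplot2 is installed.

```r
# Toy stand-in for bp_summary; the means are invented, not NHANES results.
age_levels <- c("18-29", "30-39", "40-49")
bp_summary_toy <- data.frame(
  age_group = factor(age_levels, levels = age_levels),
  mean_systolic = c(112, 116, 121),
  mean_diastolic = c(68, 72, 75)
)

# Long format: one row per age group and measure
bp_long <- rbind(
  data.frame(
    age_group = bp_summary_toy$age_group,
    measure = "Systolic",
    mean_bp = bp_summary_toy$mean_systolic
  ),
  data.frame(
    age_group = bp_summary_toy$age_group,
    measure = "Diastolic",
    mean_bp = bp_summary_toy$mean_diastolic
  )
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bp_long, aes(x = age_group, y = mean_bp, fill = measure)) +
    geom_col(position = "dodge") +  # bars side by side instead of overlaid
    labs(x = "Age Group", y = "Blood Pressure (mmHg)", fill = NULL) +
    theme_minimal()
  print(p)
}
```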

Next Steps

Browse the dataset catalog for the full list of available tables
Learn about creating survey design objects with proper weighting for multi-cycle analyses
Read about the 2019-2020 cycle exclusion
Use ?read_nhanes for function documentation
File issues or feature requests on GitHub


nhanesdata documentation built on March 1, 2026, 1:06 a.m.
