```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>"
)
```
NHANES uses a complex, multistage probability sampling design to select participants who represent the non-institutionalized U.S. population. Without proper survey weights, analyses will produce biased estimates. The create_design() function automates the calculation of appropriate weights when combining multiple NHANES cycles, following CDC weighting guidelines.
NHANES provides three categories of sampling weights, each reflecting different levels of participation:
- Interview weights (wtint2yr, wtint4yr): Used when all variables come from the household interview (demographics, questionnaires).
- MEC weights (wtmec2yr, wtmec4yr): Used when any variable requires a physical exam (laboratory tests, body measurements, DEXA scans).
- Fasting weights (wtsaf2yr): Used when any variable requires fasting laboratory tests (glucose, insulin, lipids).

The probability of being sampled decreases from interview to MEC to fasting subsamples. When combining variables across categories, always use the weight for the most restrictive component, that is, the one with the lowest probability of selection. For example, if your analysis includes both demographics (interview) and body measurements (MEC), use MEC weights.
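The selection rule above can be sketched as a small helper. This is only an illustration and not part of nhanesdata: the function name pick_wt_type() is hypothetical, and the component labels are assumed to match the wt_type values passed to create_design().

```r
# Hypothetical helper (not part of nhanesdata): given the data components an
# analysis draws on, return the most restrictive weight type.
pick_wt_type <- function(components) {
  # Ordered from highest to lowest probability of selection
  ordering <- c("interview", "mec", "fasting")
  stopifnot(length(components) > 0, all(components %in% ordering))
  ordering[max(match(components, ordering))]
}

# Demographics (interview) plus body measurements (MEC) -> "mec"
pick_wt_type(c("interview", "mec"))
```

Any analysis that touches a fasting laboratory variable would resolve to "fasting", whatever else it includes.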
CDC recommendations for combining cycles are based on the number of cycles present in your data, not the timespan covered. This distinction matters when you have gaps in your data.
NHANES provides 4-year weights (wtint4yr, wtmec4yr) for 1999-2000 and 2001-2002 cycles, while all subsequent cycles provide only 2-year weights. When combining multiple cycles:
- Cycles 1999 or 2001: Use 4-year weight × (2/n). The numerator is 2 because the 4-year weight represents two 2-year cycles.
- Cycles 2003+: Use 2-year weight × (1/n).
- Denominator n: Total number of cycles in your analysis.
Combining 4 cycles (1999, 2001, 2003, 2005) with MEC weights:
- 1999 and 2001 cycles: wtmec4yr * 2/4 = wtmec4yr * 0.5
- 2003 and 2005 cycles: wtmec2yr * 1/4 = wtmec2yr * 0.25

If you excluded the 2003 cycle, you would have 3 cycles total, so:

- 1999 and 2001 cycles: wtmec4yr * 2/3
- 2005 cycle: wtmec2yr * 1/3

The key principle: n is the number of cycles present, not the timespan.
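The arithmetic above can be reproduced with a few lines of base R. The helper below is a hypothetical sketch of the stated rule, not the code create_design() actually runs; the name cycle_weight_factor() is an assumption made for illustration.

```r
# Hypothetical helper (not part of nhanesdata): multiplier applied to each
# cycle's weight when n distinct cycles are pooled.
cycle_weight_factor <- function(cycle_years) {
  n <- length(unique(cycle_years))
  # 1999 and 2001 use the 4-year weight, so their numerator is 2;
  # every later cycle uses the 2-year weight with numerator 1.
  ifelse(cycle_years %in% c(1999, 2001), 2 / n, 1 / n)
}

# Four cycles: factors 0.5, 0.5, 0.25, 0.25
four_cycles <- cycle_weight_factor(c(1999, 2001, 2003, 2005))

# 2003 excluded, three cycles: factors 2/3, 2/3, 1/3
three_cycles <- cycle_weight_factor(c(1999, 2001, 2005))
```

Because n comes from the distinct cycles supplied, dropping a cycle changes every factor, which is the behaviour the key principle describes.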
```{r}
#| label: load-packages
#| eval: true
#| message: false
#| warning: false
library(nhanesdata)
library(dplyr)
library(srvyr)
```
When analyzing demographics and questionnaire data only:
```{r}
#| label: interview-example
# Load demographics data
demo <- read_nhanes("demo")

# Create design with interview weights
design_int <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2011,
  wt_type = "interview"
)

# Calculate weighted means
design_int |>
  summarize(
    mean_age = survey_mean(ridageyr, na.rm = TRUE),
    pct_female = survey_mean(riagendr == 2, na.rm = TRUE)
  )
```
When including any examination or laboratory data:
```{r}
#| label: mec-example
# Load demographics and body measures
demo <- read_nhanes("demo")
bmx <- read_nhanes("bmx")

combined <- demo |>
  left_join(bmx, by = c("seqn", "year"))

# Use MEC weights because body measures require exam participation
design_mec <- create_design(
  dsn = combined,
  start_yr = 2007,
  end_yr = 2017,
  wt_type = "mec"
)

# Weighted BMI analysis
design_mec |>
  filter(!is.na(bmxbmi)) |>
  summarize(
    mean_bmi = survey_mean(bmxbmi, na.rm = TRUE),
    pct_obese = survey_mean(bmxbmi >= 30, na.rm = TRUE)
  )
```
When including fasting laboratory measurements:
```{r}
#| label: fasting-example
# Load demographics and fasting lab data
demo <- read_nhanes("demo")
glu <- read_nhanes("glu")

combined <- demo |>
  left_join(glu, by = c("seqn", "year"))

# Use fasting weights for glucose analysis
design_fast <- create_design(
  dsn = combined,
  start_yr = 2005,
  end_yr = 2015,
  wt_type = "fasting"
)

# Analyze fasting glucose
design_fast |>
  filter(!is.na(lbxglu)) |>
  summarize(
    mean_glucose = survey_mean(lbxglu, na.rm = TRUE)
  )
```
You can specify a wide year range even if some cycles are missing from your data. The function calculates weights based only on cycles actually present:
```{r}
#| label: gaps-example
# Data might be missing 2007-2010 cycles
# Weights calculated on cycles present, not timespan
design <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2017,
  wt_type = "interview"
)
```
When creating a survey design, some participants may lack the weight variable needed for your analysis. This happens naturally in NHANES because not everyone completes every component.
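How create_design() handles this: participants without a valid weight for the chosen wt_type are filtered out, and a message reports how many were removed.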
Example message you might see:
```
Filtered out 150 participants without valid mec weights.
These participants were not in the subsample for this weight category.
Learn more:
+ CDC weighting guidance:
  https://wwwn.cdc.gov/nchs/nhanes/tutorials/Weighting.aspx
+ Survey design vignette: vignette('survey-design', package = 'nhanesdata')
```
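Before building a design, it can help to see how many rows carry no usable weight. The sketch below uses made-up values in a toy data frame and simply counts missing and zero weights separately; it makes no claim about how create_design() itself treats either case.

```r
# Toy MEC weights (made-up values): one missing, one zero
toy <- data.frame(
  seqn     = 1:5,
  wtmec2yr = c(1200.5, NA, 0, 980.2, 1510.0)
)

# Count the two cases separately, since they are not the same thing
n_missing <- sum(is.na(toy$wtmec2yr))
n_zero    <- sum(toy$wtmec2yr == 0, na.rm = TRUE)
```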
Zero weights are different from missing weights:
NHANES uses a stratified, multistage sampling design with Primary Sampling Units (PSUs) nested within strata. Variance estimation requires at least 2 PSUs per stratum. When subsetting data (e.g., filtering to diabetes patients only), you may create strata with only one PSU.
The create_design() function sets options(survey.lonely.psu = "adjust"), which handles this conservatively by centering single-PSU strata at the sample grand mean rather than the stratum mean. This approach:
For more details on lonely PSU handling, see Thomas Lumley's {survey} package documentation.
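To see whether a subset leaves any stratum with a single PSU, you can count distinct PSUs per stratum before and after subsetting. The following base R sketch uses a toy data frame with made-up values; psu_per_stratum() is a hypothetical helper, and only the column names sdmvstra and sdmvpsu are taken from NHANES.

```r
# Toy data: two strata, each with two PSUs (values are made up)
toy <- data.frame(
  sdmvstra = c(1, 1, 1, 1, 2, 2, 2, 2),
  sdmvpsu  = c(1, 1, 2, 2, 1, 1, 2, 2),
  diabetes = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Hypothetical helper: number of distinct PSUs observed in each stratum
psu_per_stratum <- function(df) {
  tapply(df$sdmvpsu, df$sdmvstra, function(x) length(unique(x)))
}

full_counts <- psu_per_stratum(toy)                  # every stratum has 2 PSUs
sub_counts  <- psu_per_stratum(toy[toy$diabetes, ])  # subset to diabetes rows

# Strata reduced to a single PSU by the subset
lonely_strata <- names(sub_counts)[sub_counts < 2]
```

In this toy example the diabetes subset keeps both PSUs in stratum 1 but only one PSU in stratum 2, so stratum 2 is the lonely one.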
The function validates that your dataset contains:
- year: NHANES cycle start year (odd years: 1999, 2001, 2003, ..., 2021)
- sdmvpsu: Primary sampling units
- sdmvstra: Sampling strata
- Weight variables for the chosen wt_type:
  - interview: wtint2yr (and wtint4yr if 1999/2001 cycles present)
  - mec: wtmec2yr (and wtmec4yr if 1999/2001 cycles present)
  - fasting: wtsaf2yr

These variables are automatically included in datasets loaded via read_nhanes().
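If you assemble a dataset by hand, a quick pre-flight check along these lines can list which required columns are absent. This is a hypothetical sketch built from the list above, not the package's own validation code; required_cols() and missing_cols() are names invented for illustration.

```r
# Hypothetical helpers (not part of nhanesdata)
required_cols <- function(wt_type, has_early_cycles) {
  wt <- switch(
    wt_type,
    interview = c("wtint2yr", if (has_early_cycles) "wtint4yr"),
    mec       = c("wtmec2yr", if (has_early_cycles) "wtmec4yr"),
    fasting   = "wtsaf2yr",
    stop("unknown wt_type: ", wt_type)
  )
  c("year", "sdmvpsu", "sdmvstra", wt)
}

missing_cols <- function(df, wt_type) {
  # 4-year weights are only needed when 1999 or 2001 cycles are present
  early <- any(df$year %in% c(1999, 2001))
  setdiff(required_cols(wt_type, early), names(df))
}

# Toy data with a 1999 cycle but no 4-year MEC weight (values are made up)
toy <- data.frame(
  year = c(1999, 2003), sdmvpsu = 1:2, sdmvstra = 1:2, wtmec2yr = c(1, 1)
)
missing_mec <- missing_cols(toy, "mec")
```

Here missing_mec contains only "wtmec4yr", because the toy data includes a 1999 cycle without the matching 4-year weight.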
- Load and combine data with read_nhanes() and {dplyr} joins
- Create the survey design with create_design()

Preprocessing before design creation is strongly recommended. Once the design object is created, filtering and recoding become more complex due to the survey structure.
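As a minimal sketch of preprocessing first, the derived variables can be added to the plain data frame before any design object exists. The rows below are made-up stand-ins for a joined demographics and body measures dataset, and the new column names female and obese are assumptions for illustration.

```r
# Toy rows standing in for a joined demo + bmx dataset (values are made up)
combined <- data.frame(
  seqn     = 1:4,
  riagendr = c(1, 2, 2, 1),
  bmxbmi   = c(24.1, 31.5, NA, 35.0)
)

# Recode on the plain data frame, before any survey design exists
combined$female <- combined$riagendr == 2
combined$obese  <- combined$bmxbmi >= 30  # stays NA where BMI is missing

# With the full NHANES columns in place, the prepared data would then go to:
# create_design(dsn = combined, start_yr = 2007, end_yr = 2017, wt_type = "mec")
```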