NHANES aims to produce national estimates on a range of health, nutrition, and other factors that accurately represent the non-institutionalized civilian U.S. population. The data is gathered using a complex, four-stage sampling design. When analyzing this data, it's crucial to use specialized survey analysis techniques that incorporate specific weights. These weights account for the intricate sampling strategy employed during data collection.
You can find the NHANES sample weights as part of the downloadable data. For a detailed understanding of these weights and their correct usage, the CDC provides extensive documentation: CDC NHANES Tutorials.
It's worth noting that the methodologies and strategies for NHANES have evolved over time. Reasons for these changes range from methodological enhancements, external events like the COVID epidemic, to a recent shift towards a more longitudinal perspective. Such changes can significantly influence any analysis. Thus, analysts should approach this data with diligence and care.
There are numerous online resources that offer worked examples for further understanding:
survey
package: R-Survey.This vignette's intent is not to serve as an exhaustive guide. Instead, we aim to showcase some fundamental functions and point you towards more in-depth online resources necessary for specialized analyses. Here we will cover the basics of setting up the survey design and performing basic tasks. For those interested in diving deeper, we encourage you to develop and share more comprehensive documents.
Using the nhanesA
and survey
packages, we will explore the average blood pressure of individuals over 40 years of age, segmented by their reported ethnicity, during the 2017-2018 cycle. To achieve this, we'll merge the demographic (DEMO_J
) and blood pressure (BXP_J
) datasets. Our examination will cover average blood pressure distributions across different sexes and ethnicities, and we'll conduct a regression analysis to understand the relationship between blood pressure and age. This is a basic demonstration; comprehensive analyses would need to consider various risk factors and align with specific epidemiological queries.
library("nhanesA") demoj = nhanes("DEMO_J") dim(demoj) ## merge DEMO_J and BXP_J using SEQN. bpxj = nhanes("BPX_J") data = merge(demoj, bpxj, by="SEQN") dim(data)
To generate accurate estimates, it's essential to establish a survey design object that integrates the weights into our analysis. This design object should be created before any subsetting or manipulation of the data. By doing this, we ensure that intricate survey design elements, like stratification and clustering, are fully captured and applied across the dataset.
It is also crucial to choose the appropriate weighting scheme for your analysis. NHANES offers various weights, such as interview, examination, fasting, and MEC laboratory weights, to name a few. The selection hinges on the specific dataset components and analyses you're focusing on. Moreover, when analyzing data across multiple cycles, it will be necessary to modify these weights to ensure accurate representation and inference.
For a comprehensive understanding, the CDC provides in-depth guidance on its website: CDC NHANES Weighting Tutorial.
library("survey") nhanesDesign <- svydesign(id = ~SDMVPSU, # Primary Sampling Units (PSU) strata = ~SDMVSTRA, # Stratification used in the survey weights = ~WTMEC2YR, # Survey weights nest = TRUE, # Whether PSUs are nested within strata data = data)
Next, we subset the data to focus on individuals over 40 years of age. At this stage, it's imperative to use the tools in the survey
package for any data manipulations. This procedure guarantees that weight adjustments are correctly applied, ensuring the validity of subsequent analyses. For comparison purposes, we also create an additional subset without the survey design framework, which allows for easy examination of unadjusted values.
# subset survey design object dfsub = subset(nhanesDesign, data$RIDAGEYR>=40) # subset the original dataset datasub = data[data$RIDAGEYR>=40,]
Next, we'll explore the variable RIDRETH1 from the DEMO_J table, which represents reported ethnicity. We'll focus on diastolic blood pressure for our analysis. For simplicity in our presentation, we'll utilize only the first measurement, represented by the variable BPXDI1 in the BPX_J table.
First we split by ethnicity and calculate mean blood pressure for each group on the unweighted data.
## mean on total data set mean(datasub$BPXDI1, na.rm = TRUE) ##split the data by ethnicity and calculate mean of the unweighted data unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE) unweighted_means
Now, we perform the same analysis using the survey weights.
adjmns = svymean(~BPXDI1, dfsub, na.rm=TRUE) adjmns # By ethnicity adjmnsbyEth = svyby(~BPXDI1, ~RIDRETH1, dfsub, svymean, na.rm=TRUE) weighted_means <- as.numeric(adjmnsbyEth$BPXDI1) combined_data <- cbind(unweighted_means, weighted_means) colnames(combined_data) <- c("Unweighted", "Weighted") combined_data
We can do this with gender and any other categorical variable:
# By Gender mns = sapply(split(datasub$BPXDI1, datasub$RIAGENDR), mean, na.rm=TRUE) adjmnsbyGen = svyby(~BPXDI1, ~RIAGENDR, dfsub, svymean, na.rm=TRUE) combined_data <- cbind(mns, adjmnsbyGen$BPXDI1) colnames(combined_data) <- c("Unweighted", "Weighted") combined_data
In addition to computing the mean, various other summary statistics, such as quantiles and variance, can be readily calculated using the survey design object. Importantly, each of these summary statistics will come with a corresponding standard error that accounts for the weights, providing insights into the precision of the estimates.
# For the unweighted data quantile(datasub$BPXDI1, c(0.25,0.5,.75), na.rm = TRUE) # For the survey weighted data svyquantile(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE) # By Gender svyby(~BPXDI1, ~RIAGENDR, dfsub, svyquantile, quantiles = c(0.5), na.rm=TRUE)
Note: The columns ci.2.5 and ci.97.5 in the output represent percentile intervals, not the traditional confidence intervals. Specifically, for a given quantile (like the 25th percentile), the values in these columns denote the range between the 2.5th percentile and the 97.5th percentile of that particular quantile estimate, when considering the survey weights.
# For the entire dataset svyvar(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE) # By ethnicity svyby(~BPXDI1, ~RIDRETH1, dfsub, svyvar, na.rm=TRUE) # By Gender svyby(~BPXDI1, ~RIAGENDR, dfsub, svyvar, na.rm=TRUE)
The svyglm
function allows us to perform regression analyses while accounting for the survey's design. In this section, we will demonstrate how to model diastolic blood pressure levels based on age and ethnicity.
svyglm
Using svyglm()
is akin to utilizing the glm()
function from base R. The key difference is that svyglm()
is specifically tailored to work with survey objects, considering the intricate sampling design and weights.
To model diastolic blood pressure (noted as BPXDI1
in our dataset) as a function of age (RIDAGEYR
) and ethnicity (RIDRETH3
), we can construct our model as outlined below. The resulting output will furnish regression coefficients, standard errors, and significance tests that have been adjusted for the complex survey design. When interpreting these results, it's crucial to remember that they resemble outputs from a generalized linear model (glm
). However, the estimates provided by svyglm()
have been weighted to produce nationally representative conclusions.
weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian()) summary(weighted_model)
Visualizing the differences between weighted and unweighted analyses can offer a clear, intuitive understanding of the impact of weights. We can use bar plots to compare mean diastolic blood pressures across different ethnicities. This plot could be extended to other metrics as well.
library(ggplot2) ## recalculating means (same as above) unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE) weighted_means <- as.numeric(adjmnsbyEth$BPXDI1) plot_data <- data.frame( Ethnicity = factor(rep(names(unweighted_means), 2), levels = names(unweighted_means)), Means = c(unweighted_means, weighted_means), Type = factor(c(rep("Unweighted", length(unweighted_means)), rep("Weighted", length(weighted_means))), levels = c("Unweighted", "Weighted")) ) # Creating the plot ggplot(plot_data, aes(x = Ethnicity, y = Means, fill = Type)) + geom_bar(stat = "identity", position = "dodge", width = 0.6) + labs(title = "Comparison of Diastolic Blood Pressure by Ethnicity", y = "Mean Diastolic Blood Pressure") + scale_x_discrete(labels = c('Mex-Amer','Other Hisp','White', 'Black', 'Other')) + scale_fill_manual(values = c("blue", "red")) + theme_minimal() + theme(legend.title = element_blank())
Visualizing regression coefficients can provide a clear understanding of how different variables impact the outcome, especially when using weighted and unweighted data.
unweighted_model <- glm(BPXDI1 ~ RIDAGEYR + RIDRETH1, data = datasub, family = gaussian()) weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian()) # For the unweighted model unweighted_summary <- summary(unweighted_model) unweighted_coefs <- unweighted_summary$coefficients[, "Estimate"] unweighted_se <- unweighted_summary$coefficients[, "Std. Error"] # For the weighted model weighted_summary <- summary(weighted_model) weighted_coefs <- as.numeric(weighted_summary$coefficients[, "Estimate"]) weighted_se <- as.numeric(weighted_summary$coefficients[, "Std. Error"]) comparison <- data.frame( Variable = names(unweighted_coefs), Unweighted = unweighted_coefs, Weighted = as.numeric(weighted_coefs) ) print(comparison) unweighted_df <- data.frame(Variable = names(unweighted_coefs), Estimate = unweighted_coefs, SE = unweighted_se, Type = "Unweighted") weighted_df <- data.frame(Variable = names(unweighted_coefs), Estimate = weighted_coefs, SE = weighted_se, Type = "Weighted") plot_data <- rbind(unweighted_df, weighted_df) ggplot(subset(plot_data, Variable!='(Intercept)'), aes(x = Estimate, y = reorder(Variable, Estimate), color = Type)) + geom_point(position = position_dodge(0.5), size = 2.5) + geom_errorbarh(aes(xmin = Estimate - SE, xmax = Estimate + SE), height = 0.2, position = position_dodge(0.5)) + geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") + labs(title = "Comparison of Regression Coefficients", x = "Coefficient Value", y = "Predictors") + theme_minimal() + scale_color_manual(values = c("Unweighted" = "blue", "Weighted" = "red")) + theme(legend.title = element_blank())
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.