The nhanesA R package was developed to allow investigators to easily explore and retrieve data from the National Health and Nutrition Examination Survey (NHANES). The survey assesses overall health and nutrition of adults and children in the United States, and is conducted by the National Center for Health Statistics (NCHS). NHANES data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm and are reported in thousands of peer-reviewed journal publications every year.
Since 1999, the NHANES survey has been conducted continuously, and the surveys during that period are referred to as "continuous NHANES" to distinguish from several prior surveys. Continuous NHANES surveys are grouped in two-year intervals, with the first interval being 1999-2000.
Most NHANES data are in the form of tables in SAS 'XPT' format. The survey is grouped into five data categories that are publicly available, as well as an additional category (Limited access data) that requires written justification and prior approval before access. Package nhanesA is intended mostly for use with the publicly available data, but some information pertaining to the limited access data can also be retrieved.
The five publicly available data categories are: - Demographics (DEMO) - Dietary (DIET) - Examination (EXAM) - Laboratory (LAB) - Questionnaire (Q). The abbreviated forms in parentheses may be substituted for the long form in nhanesA commands.
For limited access data, the available tables and variable names can be listed, but the data cannot be downloaded directly. To indicate limited access data in nhanesA functions, use: - Limited (LTD)
To quickly get familiar with NHANES data, it is helpful to display a listing of tables. Use nhanesTables to get information on tables that are available for a given category for a given year.
library(nhanesA) nhanesTables('EXAM', 2005)
df <- data.frame(matrix(1,nrow=13,ncol=2)) names(df) <- c('Data.File.Name', 'Data.File.Description') df[1,] <- list('BPX_D', 'Blood Pressure') df[2,] <- list('BMX_D', 'Body Measures') df[3,] <- list('AUX_D', 'Audiometry') df[4,] <- list('AUXTYM_D', 'Audiometry - Tympanometry') df[5,] <- list('DXXFEM_D', 'Dual Energy X-ray Absorptiometry - Femur') df[6,] <- list('OPXFDT_D', 'Ophthalmology - Frequency Doubling Technology') df[7,] <- list('OHX_D', 'Oral Health') df[8,] <- list('PAXRAW_D', 'Physical Activity Monitor') df[9,] <- list('VIX_D', 'Vision') df[10,] <- list('DXXAG_D', 'Dual Energy X-ray Absorptiometry - Android/Gynoid') df[11,] <- list( 'AUXAR_D', 'Audiometry - Acoustic Reflex') df[12,] <- list('OPXRET_D', 'Ophthalmology - Retinal Imaging') df[13,] <- list('DXXSPN_D', 'Dual Energy X-ray Absorptiometry - Spine') df
Note that the two-year survey intervals begin with the odd year. For convenience, only a single 4-digit year is entered such that nhanesTables('EXAM', 2005)
and nhanesTables('EXAM', 2006)
yield identical output.
After viewing the output, we decide we are interested in table 'BMX_D' that contains body measures data. To better determine if that table is of interest, we can display detailed information on the table contents using nhanesTableVars.
nhanesTableVars('EXAM', 'BMX_D')
df <- data.frame(matrix(1,nrow=27,ncol=2)) names(df) <- c('Variable.Name', 'Variable.Description') df[1,] <- list('BMDSTATS', 'Body Measures Component status Code') df[2,] <- list('BMIARMC', 'Arm Circumference Comment') df[3,] <- list('BMIARML', 'Upper Arm Length Comment') df[4,] <- list('BMICALF', ' Maximal Calf Comment') df[5,] <- list('BMIHEAD', 'Head Circumference Comment') df[6,] <- list('BMIHT', 'Standing Height Comment') df[7,] <- list('BMILEG', 'Upper Leg Length Comment') df[8,] <- list('BMIRECUM', 'Recumbent Length Comment') df[9,] <- list('BMISUB', 'Subscapular Skinfold Comment') df[10,] <- list('BMITHICR', 'Thigh Circumference Comment') df[11,] <- list('BMITRI', 'Triceps Skinfold Comment') df[12,] <- list('BMIWAIST', 'Waist Circumference Comment') df[13,] <- list('BMIWT', 'Weight Comment') df[14,] <- list('BMXARMC', 'Arm Circumference (cm)') df[15,] <- list('BMXARML', 'Upper Arm Length (cm)') df[16,] <- list('BMXBMI', 'Body Mass Index (kg/m**2)') df[17,] <- list('BMXCALF', 'Maximal Calf Circumference (cm)') df[18,] <- list('BMXHEAD', 'Head Circumference (cm)') df[19,] <- list('BMXHT', 'Standing Height (cm)') df[20,] <- list('BMXLEG', 'Upper Leg Length (cm)') df[21,] <- list('BMXRECUM', 'Recumbent Length (cm)') df[22,] <- list('BMXSUB', 'Subscapular Skinfold (mm)') df[23,] <- list('BMXTHICR', 'Thigh Circumference (cm)') df[24,] <- list('BMXTRI', 'Triceps Skinfold (mm)') df[25,] <- list('BMXWAIST', 'Waist Circumference (cm)') df[26,] <- list('BMXWT', 'Weight (kg)') df[27,] <- list('SEQN', 'Respondent sequence number.') df
We see that there are 27 columns in table BMX_D. SEQN is a subject identifier that is used to join information across tables.
We now import BMX_D along with the demographics table DEMO_D.
bmx_d <- nhanes('BMX_D') demo_d <- nhanes('DEMO_D')
We merge the tables and display several variables. Note that RIAGENDR, like most categorical variables, is a coded field. By default, the original coded values (1,2) are translated to (Male, Female).
bmx_demo <- merge(demo_d, bmx_d) options(digits=4) select_cols <- c('RIAGENDR', 'BMXHT', 'BMXWT', 'BMXLEG', 'BMXCALF', 'BMXTHICR') print(bmx_demo[5:8,select_cols], row.names=FALSE)
df <- data.frame(matrix(1,nrow=4,ncol=6)) names(df) <- c('RIAGENDR', 'BMXHT', 'BMXWT', 'BMXLEG', 'BMXCALF', 'BMXTHICR') df[1,] <- list('Female', 156.0, 75.2, 38.0, 36.6, 53.7) df[2,] <- list('Male', 167.6, 69.5, 40.4, 35.6, 48.0) df[3,] <- list('Female', 163.7, 45.0, 39.2, 31.7, 41.3) df[4,] <- list('Male', 182.4, 101.9, 41.5, 42.6, 50.5) print(df,row.names=FALSE)
For each variable, NHANES provides a codebook, which is a basic description of the variable and also includes the distribution or range of values. We can use nhanesCodebook to list the codebook definition for the gender field RIAGENDR in table DEMO_D.
nhanesCodebook('DEMO_D', 'RIAGENDR')
df <- data.frame(matrix(1,nrow=3,ncol=5)) names(df) <- c("Code.or.Value", "Value.Description", "Count", "Cumulative", "Skip to Item") df[1,] <- list(1, 'Male', 5080, 5080, NA) df[2,] <- list(2, 'Female', 5268, 10348, NA) df[3,] <- list('.', 'Missing', 0, 10348, NA) codelist <- list("RIAGENDR", "Gender", "Gender of the sample person", "Both males and females 0 YEARS -\r 150 YEARS", df) names(codelist) <- c('Variable Name', 'SAS Label', 'English Text', 'Target', 'RIAGENDR') codelist
As a default, the nhanes function will translate coded values. To ensure proper interpretation of variables, it is recommended to always use nhanes
with the default option of translate = TRUE. However, you may also customize the translation of coded fields manually using nhanesTranslate.
Customized translation of coded fields is a three step process.
1: Download the table using nhanes with translate = FALSE
2: Select the table variables to translate
3: Pass the table and variable list to nhanesTranslate
bpx_d <- nhanes('BPX_D', translate=FALSE) head(bpx_d[,6:11])
df <- data.frame(matrix(1,nrow=6,ncol=6)) names(df) <- c("BPQ150A", "BPQ150B", "BPQ150C", "BPQ150D", "BPAARM", "BPACSZ") df[2:6,1:4] <- 2 df[3,1] <- 1 df[3:6,6] <- 4 df[2,6] <- 3 df[4,6] <- 3 df[1,] <- NA df
bpx_d_vars <- nhanesTableVars('EXAM', 'BPX_D', namesonly=TRUE) #Alternatively may use bpx_d_vars = names(bpx_d) bpx_d <- nhanesTranslate('BPX_D', bpx_d_vars, data=bpx_d)
translated <- c('BPAARM', 'BPACSZ', 'BPAEN2', 'BPAEN3', 'BPAEN4', 'BPQ150A', 'BPQ150B', 'BPQ150C', 'BPQ150D', 'BPXPTY', 'BPXPULS', 'PEASCCT1', 'PEASCST1') message(paste(c("Translated columns:", translated), collapse = ' '))
head(bpx_d[,6:11])
df$BPAARM[df$BPAARM==1] <- 'Right' df[df==1] <- 'Yes' df[df==2] <- 'No' df[df==3] <- 'Adult (12X22)' df[df==4] <- 'Large (15X32)' df
Some discretion should be applied when translating coded columns as code translations can be quite long. To improve readability the translation string is restricted to a default length of 128 but can be set as high as 1024. Also, columns that have at least two categories (e.g. Male, Female) will be translated, but mincategories can be set to 1 to perform the translation even if only a single category is present.
The primary goal of nhanesA is to enable fully customizable processing of select NHANES tables. However, it is quite easy to download entire surveys using nhanesA functions. Say we want to download every questionnaire in the 2007-2008 survey. We first get a list of the table names by using nhanesTables with namesonly = TRUE. The tables can then be downloaded using nhanes with lapply.
q2007names <- nhanesTables('Q', 2007, namesonly=TRUE) q2007tables <- lapply(q2007names, nhanes) names(q2007tables) <- q2007names
Some NHANES measurements require special handling, e.g. due to statistical considerations. Furthermore, there are surveys conducted outside the scope of the continuous survey, but that are provided in a similar format such that nhanesA can be readily adapted to retrieve their data. Note that nhanesA cannot be used to handle accelerometer data from 2003-2006. For those data, please see package accelerometry.
Due to COVID, data collection for the 2019-2020 cycle was not completed. In order to make better use of data that were collected, data from the 2017-2018 survey were included to form a representative sample of 2017-March 2020 pre-pandemic data. The table structure and variable format of pre-pandemic tables is essentially identical to standard continuous NHANES. The table names include the prefix "P_". To list pre-pandemic tables, use 'P' or 'p' as the year.
#List all pre-pandemic tables nhanesSearchTableNames('^P_') #List table variables nhanesTableVars('EXAM', 'P_AUX', namesonly=TRUE) #List pre-pandemic EXAM tables nhanesTables('EXAM', 'P') #Table import, variable translation, and codebook display operate as usual p_dxxfem <- nhanes('P_DXXFEM') nhanesTranslate('P_BMX', 'BMDSTATS') nhanesCodebook('P_INS', 'LBDINSI')
Dual Energy X-Ray Absorptiometry (DXA) data were acquired from 1999-2006. These data were found to contain a higher amount of missing values than usual, and a multiple imputation process was applied to fill in missing and invalid values. More information may be found at https://wwwn.cdc.gov/nchs/nhanes/dxa/dxa.aspx. By default the DXA data are imported into the R environment, however, because the tables are quite large it may be desirable to save the data to a local file then import to R as needed. When nhanesTranslate is applied to DXA data, only the 2005-2006 translation tables are used as those are the only DXA codes that are currently available in html format.
#Import into R dxx_b <- nhanesDXA(2001) #Save to file nhanesDXA(2001, destfile="dxx_b.xpt") #Import supplemental data dxx_c_s <- nhanesDXA(2003, suppl=TRUE) #Apply code translations dxalist <- c('DXAEXSTS', 'DXIHE') dxx_b <- nhanesTranslate("dxxb",colnames=dxalist, data=dxx_b, dxa=TRUE)
NNYFS is a 2012 ancillary study conducted to assess physical activity and fitness levels of children and teens from ages 3-15. This study consists of questionnaires and basic measurements. There are no LAB tables. NNYFS table names include the prefix "Y_". To list tables, use 'Y' or 'y' as the year.
#List NNYFS EXAM tables nhanesTables('EXAM', 'Y') #Table import and variable translation operate as usual y_cvx <- nhanes('Y_CVX') nhanesTranslate('Y_CVX','CVXPARC')
The NHANES repository is extensive, thus it is helpful to perform a targeted search to identify relevant tables and variables. There are several nhanesA functions that allow the user to search using different criteria including the variable description, variable name, and table name pattern.
Comprehensive lists of NHANES variables are maintained for each data group. For example, the demographics variables are available at https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=Demographics. The nhanesSearch function allows the investigator to input search terms, match against the comprehensive variable descriptions, and retrieve the list of matching variables. Matching search terms (variable descriptions must contain one of the terms) and exclusive search terms (variable descriptions must NOT contain any exclusive terms) may be provided. The search can be restricted to a specific survey range as well as specific data groups.
# nhanesSearch use examples # # Search on the word bladder, restrict to the 2001-2008 surveys, # print out 50 characters of the variable description nhanesSearch("bladder", ystart=2001, ystop=2008, nchar=50) # # Search on "urin" (will match urine, urinary, etc), from 1999-2010, return table names only nhanesSearch("urin", ignore.case=TRUE, ystop=2010, namesonly=TRUE) # # Search on "urin", exclude "During", search surveys from 1999-2010, return table names only nhanesSearch("urin", exclude_terms="during", ignore.case=TRUE, ystop=2010, namesonly=TRUE) # # Restrict search to 'EXAM' and 'LAB' data groups. Explicitly list matching and exclude terms, leave ignore.case set to default value of FALSE. Search surveys from 2009 to present. nhanesSearch(c("urin", "Urin"), exclude_terms=c("During", "eaten during", "do during"), data_group=c('EXAM', 'LAB'), ystart=2009) # # Search on "tooth" or "teeth", all years nhanesSearch(c("tooth", "teeth"), ignore.case=TRUE) # # Search for variables where the variable description begins with "Tooth" nhanesSearch("^Tooth")
nhanesSearch is a versatile search function as it imports the comprehensive variable lists to a data frame. That allows for detailed conditional extraction of the variables. However, each call to nhanesSearch takes up to a minute or more to process. Faster processing can be achieved when we know the name of a specific variable of interest and we look only for exact matches to the variable name. Function nhanesSearchVarName matches a given variable name in the html directly, then only the matching elements are converted to a data frame. Consequently, a call to nhanesSearchVarName executes much faster than nhanesSearch; typically under 30s. nhanesSearchVarName is useful for finding all data tables that contain a given variable.
#nhanesSearchVarName use examples nhanesSearchVarName('BPXPULS')
bpxtables <- c("BPX_D", "BPX_E", "BPX", "BPX_C", "BPX_B", "BPX_F", "BPX_G", "BPX_H", "BPX_I", "BPX_J") bpxtables
nhanesSearchVarName('CSQ260i', includerdc=TRUE, nchar=38, namesonly=FALSE)
df <- data.frame(Variable.Name=character(2), Variable.Description=character(2), Data.File.Name=character(2), Data.File.Description=character(2), Begin.Year=integer(2), EndYear=integer(2), Component=character(2), Use.Constraints=character(2)) df[1,] <- list('CSQ260i', 'Do you now have any of the following p','CSX_G_R','Taste & Smell', 2012,2012,'Examination', 'RDC Only') df[2,] <- list('CSQ260i', 'Do you now have any of the following p','CSX_H','Taste & Smell', 2013, 2014, 'Examination', 'None') df
In order to group data across surveys, it is useful to list all available tables that follow a given naming pattern. Function nhanesSearchTableNames is used for such pattern matching. For example, if we want to work with all available body measures data we can retrieve the full list of available tables with nhanesSearchTableNames('BMX'). The search is conducted over the comprehensive table list, which is much smaller than the comprehensive variable list, such that a call to nhanesSearchTableNames takes only a few seconds.
# nhanesSearchTableNames use examples nhanesSearchTableNames('BMX')
bpxtables <- c("BMX_D", "BMX", "BMX_E", "BMX_C", "BMX_B", "BMX_F", "BMX_H", "BMX_G", "BMX_I", "BMX_J", "P_BMX") bpxtables
nhanesSearchTableNames('HPVS', includerdc=TRUE, nchar=42, details=TRUE)
df <- data.frame( Years=character(), # Data.File.Name=character(), Doc.File=character(), Data.File=character(), Date.Published=character()) df[1,] <- list('2009-2010', 'HPVSER_F Doc', 'HPVSER_F Data [XPT - 171.6 KB]', 'November 2013') df[2,] <- list('2007-2008', 'HPVSER_E Doc', 'HPVSER_E Data [XPT - 155.7 KB]', 'November 2013') df[3,] <- list('2005-2006', 'HPVSER_D Doc', 'HPVSER_D Data [XPT - 151.6 KB]', 'July 2013') df[4,] <- list('2005-2006', 'HPVSRM_D Doc', 'HPVSRM_D Data [XPT - 302.6 KB]', 'January 2015') df[5,] <- list('2007-2008', 'HPVSWR_E Doc', 'HPVSWR_E Data [XPT - 677.9 KB]', 'August 2012') df[6,] <- list('2009-2010', 'HPVSWR_F Doc', 'HPVSWR_F Data [XPT - 725.2 KB]', 'August 2012') df[7,] <- list('2011-2012', 'HPVSWR_G Doc', 'HPVSWR_G Data [XPT - 661.1 KB]', 'March 2015') df[8,] <- list('2005-2006', 'HPVSWR_D Doc', 'HPVSWR_D Data [XPT - 694.4 KB]', 'Updated November 2018') df[9,] <- list('2013-2014', 'HPVSWR_H Doc', 'HPVSWR_H Data [XPT - 716.6 KB]', 'December 2016') df[10,] <- list('2015-2016', 'HPVSWC_I Doc', 'HPVSWC_I Data [XPT - 33.3 KB]', 'November 2018') df[11,] <- list('2015-2016', 'HPVSWR_I Doc', 'HPVSWR_I Data [XPT - 667.5 KB]', 'November 2018') df[12,] <- list('2005-2006', 'HPVS_D_R Doc', 'RDC Only', 'July 2013') df[13,] <- list('2009-2010', 'HPVS_F_R Doc', 'RDC Only', 'August 2012') df[14,] <- list('2011-2012', 'HPVS_G_R Doc', 'RDC Only', 'March 2015') df[15,] <- list('2013-2014', 'HPVS_H_R Doc', 'RDC Only', 'December 2016') df[16,] <- list('2015-2016', 'HPVS_I_R Doc', 'RDC Only', 'November 2018') df[17,] <- list('2017-2018', 'HPVS_J_R Doc', 'RDC Only', 'December 2020') df
Sincerely,
Christopher J. Endres
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.