In dosgillespie/hseclean: Health Survey Data Wrangling

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.pos = 'H'
)

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(hseclean))

Introduction

Alcohol consumption data in the Health Survey for England (HSE) is recorded in four main forms:

How often someone usually drinks
For adults considering the last 12 months, what they drink on average
For adults considering the last week, when they drank the most
For children considering the last week, what they drank

Both adults and children have data on whether they drink alcohol or not, and on the frequency of drinking. The main difference between the recording of data for adults and children is that adults have a lot of data on how much and what they drink, but children only have data on the amount drunk in the last week.

The recording of data varies among years of the HSE. We consider years from 2001 onwards. The main features of these changes in recording are:

Adult drinking in the last 12 months is only recorded for years 2001, 2002, and 2011 onwards.
Adult drinking in the last 12 months is recorded in different terms for 2001/2 and 2011+.
In 2007, the recording of wine changed from asking how many glasses were drunk (with a size of 125ml assumed) to asking how many glasses of 125ml, 175ml or 250ml were drunk. Unit calculations from HSE 2007 onwards are therefore not directly comparable to earlier years’ data.

Due to the variability in recording, we only consider data on the amount drunk by adults and children from 2011 onwards.

We analyse beverage-specific alcohol consumption in terms of beer (combining normal beer, strong beer), wine (combining wine and sherry), spirits, and alcopops.

Reading the HSE data files

There are separate functions in the hseclean package to read each year of HSE data e.g. read_2016(). These functions link to where the data is stored in the project folder PR_Consumption_TA. They read in all variables related to alcohol and selected socioeconomic and other descriptor variables.

# First test that each year of data can be read successfully

# If on uni system set the root directory as
root_dir <- "X:/"

# Each function has the file path to each year of data added to it as a default

test_2001 <- read_2001(root = root_dir)
test_2002 <- read_2002(root = root_dir)
test_2003 <- read_2003(root = root_dir)
test_2004 <- read_2004(root = root_dir)
test_2005 <- read_2005(root = root_dir)
test_2006 <- read_2006(root = root_dir)
test_2007 <- read_2007(root = root_dir)
test_2008 <- read_2008(root = root_dir)
test_2009 <- read_2009(root = root_dir)
test_2010 <- read_2010(root = root_dir)
test_2011 <- read_2011(root = root_dir)
test_2012 <- read_2012(root = root_dir)
test_2013 <- read_2013(root = root_dir)
test_2014 <- read_2014(root = root_dir)
test_2015 <- read_2015(root = root_dir)
test_2016 <- read_2016(root = root_dir)
test_2017 <- read_2017(root = root_dir)

Processing socioeconomic variables

There are separate functions to process each socioeconomic variable. Detailed descriptions of what these functions do are given in vignette("covariate_data").

# Test each cleaning function on one year of data

temp <- read_2017(root = root_dir) %>%
  clean_age %>%
  clean_demographic %>% 
  clean_education %>%
  clean_economic_status %>%
  clean_family %>%
  clean_income %>%
  clean_health_and_bio

Whether someone drinks and frequency of drinking

Whether someone drinks, and how often, is calculated for adults (aged 16 years or older) and children (aged 8 to 15 years) by the function alc_drink_now_allages(). We combine the information on drinking frequency from adults and children into a single variable.

We calculate the variable drinks_now, which classes someone as either a drinker or a non-drinker. Adults are classed as drinkers if they reported drinking at all in the last 12 months, even if reporting only having 1-2 drinks a year (according to the variable dnoft). Note that this definition of a non-drinker can vary among surveys, e.g. some surveys class only having 1-2 drinks a year as a non-drinker, and this could lead to variation in estimates of the number of non-drinkers.

We calculate the variable drink_freq_7d, which is a numerical variable that describes drinking frequency. Adult drinking frequency is also inferred from the variable dnoft: the function alc_drink_freq() converts the categorical responses into the expected number of days in a week that someone drinks.

"Almost every day" = 7 days a week
"Five or six days a week" = 5.5 days a week
"Three or four days a week" = 3.5 days a week
"Once or twice a week" = 1.5 days a week
"Once or twice a month" = 0.375 days a week
"Once every couple of months" = 0.188 days a week
"Once or twice a year" = 0.029 days a week
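
The mapping above amounts to a lookup table. The following is a minimal sketch in base R, assuming the responses are held as the text labels listed above; the actual coding of dnoft and the internals of alc_drink_freq() may differ, and the names freq_lookup and to_days_per_week are hypothetical.

```r
# Hypothetical lookup from response label to expected drinking days per week.
# The values are those listed above; the labels are assumed, not the HSE coding.
freq_lookup <- c(
  "Almost every day"            = 7,
  "Five or six days a week"     = 5.5,
  "Three or four days a week"   = 3.5,
  "Once or twice a week"        = 1.5,
  "Once or twice a month"       = 0.375,
  "Once every couple of months" = 0.188,
  "Once or twice a year"        = 0.029
)

# Illustrative helper: unmatched or missing responses become NA
to_days_per_week <- function(response) {
  unname(freq_lookup[as.character(response)])
}

responses <- c("Once or twice a week", "Almost every day", NA, "Never")
days <- to_days_per_week(responses)
```

Here days holds 1.5 and 7 for the first two responses and NA for the missing and unmatched ones, so unexpected labels are easy to spot rather than being silently recoded.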

Missing data on whether or not someone currently drinks (drinks_now) is filled in using responses on whether they currently drink and whether they have always been a non-drinker (the variables dnnow, dnany and dnevr).

For children (aged 8-15 years) we infer whether someone drinks or not (drinks_now) from the variable adrinkof. Someone is a non-drinker if they responded "never" to adrinkof. The categorical responses are converted into the expected number of days in a week that someone drinks as follows:

"Almost every day" = 7 days a week
"Twice a week" = 2 days a week
"Once a week" = 1 day a week
"Once a fortnight" = 0.5 days a week
"Once a month" = 0.25 days a week
"Only a few times a year" = 0.058 days a week

Missing data on whether or not a child currently drinks (drinks_now) is supplemented by responses to when they last had an alcoholic drink (adrlast): if the last drink was less than six months ago, then we classify them as a drinker; if the last drink was six months or more ago, then we classify them as a non-drinker.
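
The fallback rule for children can be sketched as below. This is an illustration only: months_since_last is a hypothetical numeric stand-in for the adrlast responses (whose actual coding is not shown here), and the label "non_drinker" is assumed.

```r
# Toy data: drinks_now is NA where the frequency question was unanswered.
# months_since_last is a hypothetical numeric stand-in for adrlast.
children <- data.frame(
  drinks_now        = c("drinker", NA, NA, "non_drinker", NA),
  months_since_last = c(1, 2, 8, NA, NA),
  stringsAsFactors  = FALSE
)

# Fallback classification: last drink less than six months ago = drinker
from_last <- ifelse(children$months_since_last < 6, "drinker", "non_drinker")

# Use the fallback only where drinks_now is missing
children$drinks_now <- ifelse(
  is.na(children$drinks_now), from_last, children$drinks_now
)
```

Rows with an existing drinks_now value are left unchanged, and a row with neither source of information stays NA.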

# Number of sampled drinkers and non-drinkers in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  filter(age < 90, age >= 8) %>%
  group_by(imd_quintile) %>% 
  count(drinks_now) %>% 
  ggplot(aes(x = drinks_now, y = n, fill = imd_quintile)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  ylab("number of observations")

# Frequency of drinking in 2017 among drinkers
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  filter(age < 90, age >= 8, drinks_now == "drinker") %>%
  group_by(imd_quintile, age_cat) %>% 
  summarise(av_freq = mean(drink_freq_7d, na.rm = TRUE)) %>% 
  ggplot(aes(x = imd_quintile, y = av_freq, fill = age_cat)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  ylab("average number of days drink in a week")

Average amount of alcohol consumed

Assumptions about serving size and alcohol content

Some standard assumptions are made about the volume and alcohol content of the beverages that are reported to be drunk. The values that we use for these assumptions are based on those used by NatCen to create the derived variables for units of alcohol consumed in the HSE. We have made our own adjustments to the values used based on further information from market research data and figures from academic publications.

Alcohol content assumptions are the expected percentages of alcohol that each beverage contains (alcohol by volume, ABV). We use separate values for normal beer (4.4%), strong beer (8.4%), spirits (38%), sherry (17%), wine (12.5%), and alcopops (also known as "ready to drink" or RTD) (4.5%).

Beverage volume assumptions are the expected volumes (ml) of different beverage containers / serving sizes. We use separate values for normal and strong beer (half pint 284ml, small can 330ml, large can 440ml, bottle 330ml), spirits (serving 25ml), sherry (serving 50ml), wine (small glass 125ml, standard glass 175ml, large glass 250ml, bottle 750ml), and alcopops (small can 250ml, small bottle 275ml, large bottle 700ml).

# These data are stored within the hseclean package for easy use
# they can be accessed by typing 

hseclean::abv_data

hseclean::alc_volume_data
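
To show how a volume and an ABV assumption combine, here is a small worked sketch. It relies only on the definition of a UK standard unit as 10ml of pure ethanol and on the values quoted above; units_in_serving is a hypothetical helper, not a function in hseclean, and the sketch does not use the structure of abv_data or alc_volume_data.

```r
# Units in a serving: volume of pure ethanol (ml) divided by 10,
# because 1 UK unit = 10ml pure ethanol.
# units_in_serving is a hypothetical helper for illustration.
units_in_serving <- function(volume_ml, abv_percent) {
  volume_ml * (abv_percent / 100) / 10
}

wine_glass <- units_in_serving(175, 12.5)  # standard glass of wine
half_pint  <- units_in_serving(284, 4.4)   # half pint of normal beer
spirit     <- units_in_serving(25, 38)     # single serving of spirits
```

Under these assumptions a standard glass of wine is 2.1875 units, a half pint of normal beer is 1.2496 units and a serving of spirits is 0.95 units.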

Adult average weekly consumption in the last 12 months

We estimate the average amount drunk in a week (weekmean) in terms of UK standard units of alcohol (1 unit = 10ml or 8g pure ethanol). The average amount drunk is then categorised as follows:

abstainer = 0 units/week
lower_risk drinker = more than 0 but less than 14 units/week
increasing_risk drinker = 14 or more units/week but less than 35 units/week for females or less than 50 units/week for males
higher_risk drinker = 35 or more units/week for females or 50 or more units/week for males
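
The categorisation above can be sketched as a small function. This is an illustration rather than the code used inside alc_weekmean_adult(): the name categorise_drinker and the sex labels "Male" and "Female" are assumptions.

```r
# Hypothetical helper applying the thresholds listed above.
# The sex labels "Male"/"Female" are assumed for illustration.
categorise_drinker <- function(weekmean, sex) {
  # The boundary between increasing and higher risk depends on sex
  upper <- ifelse(sex == "Female", 35, 50)
  band <- ifelse(weekmean == 0, "abstainer",
          ifelse(weekmean < 14, "lower_risk",
          ifelse(weekmean < upper, "increasing_risk", "higher_risk")))
  factor(band,
         levels = c("abstainer", "lower_risk", "increasing_risk", "higher_risk"))
}

drinker_cat <- categorise_drinker(
  weekmean = c(0, 10, 14, 40, 40, 50),
  sex      = c("Male", "Female", "Male", "Female", "Male", "Male")
)
```

Note that 40 units a week is higher_risk for a female but increasing_risk for a male, and that exactly 14 units falls in the increasing_risk band.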

Separate variables are produced describing the average weekly units in four beverage categories: beer_units (including cider), wine_units (including sherry), spirit_units, rtd_units (this is alcopops). Further variables on beverage preference are produced that:

describe the percentage of the total consumption in a week that is contributed by each beverage type (perc_spirit_units, perc_wine_units, perc_beer_units, perc_rtd_units).
describe whether or not someone shows a clear beverage preference e.g. does_not_drink_spirits, drinks_some_spirits, mostly_drinks_spirits, where "mostly drinks" is defined by a single beverage comprising more than 50% of an individual's average weekly consumption.
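
A minimal sketch of the beverage preference variables for spirits, using toy data. How the package treats people with zero total consumption is not stated above, so setting their percentage to 0 is an assumption made here.

```r
# Toy weekly units by beverage for four adults
units <- data.frame(
  beer_units   = c(10, 0, 4, 6),
  wine_units   = c(2, 0, 4, 2),
  spirit_units = c(0, 0, 12, 2),
  rtd_units    = c(0, 0, 0, 0)
)

weekmean <- rowSums(units)

# Percentage of weekly units from spirits.
# Assumption: people who drink nothing are given 0 rather than NaN.
perc_spirit_units <- ifelse(weekmean > 0,
                            100 * units$spirit_units / weekmean, 0)

# Three-level preference: none, some, or more than 50% of weekly units
spirit_pref <- ifelse(perc_spirit_units == 0, "does_not_drink_spirits",
               ifelse(perc_spirit_units > 50, "mostly_drinks_spirits",
                      "drinks_some_spirits"))
```

The third person gets 60% of their units from spirits and is classed as mostly_drinks_spirits; the fourth gets 20% and is classed as drinks_some_spirits.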

The processing is done by the function alc_weekmean_adult(). The calculation has the following steps:

Convert the categorical variables to numeric variables for the frequency with which each beverage is typically consumed (normal beer, strong beer, spirits, sherry, wine, alcopops).
Convert the reported volumes usually consumed (e.g. small glass, large glass) into volumes in ml, using the beverage size assumptions above. In doing so, variations in recording among years and between the interview and self-complete questionnaire are accounted for.
Combine the volumes (ml) usually consumed with the frequency of consumption to give the average volume of each beverage type drunk each week (assuming constant consumption across the year).
Convert the expected volumes of each beverage consumed each week to UK standard units of alcohol consumed, using the alcohol content assumptions above.
Collapse normal and strong beer into a single "beer" variable by summing their units. Collapse wine and sherry into a single "wine" variable by summing their units.
Calculate total weekly units by summing across beverage categories.
Cap the total units consumed in a week at 300 units, assuming that above this already very high level of consumption estimates of variation in consumption are less reliable.
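
The later steps of this calculation can be sketched for a single adult as follows. The inputs are toy values, the short beverage names are illustrative, and the code is a simplified stand-in for what alc_weekmean_adult() does, not a copy of it.

```r
# Toy inputs for one adult: days per week each beverage is drunk,
# and the volume (ml) usually drunk on those days
freq_per_week <- c(nbeer = 1.5, sbeer = 0, spirits = 0.375,
                   sherry = 0, wine = 3.5, pops = 0)
vol_ml        <- c(nbeer = 2 * 284, sbeer = 0, spirits = 2 * 25,
                   sherry = 0, wine = 175, pops = 0)

# ABV assumptions (%) quoted earlier in this document
abv <- c(nbeer = 4.4, sbeer = 8.4, spirits = 38,
         sherry = 17, wine = 12.5, pops = 4.5)

# Average volume per week, then units (1 unit = 10ml pure ethanol)
weekly_ml    <- freq_per_week * vol_ml
weekly_units <- weekly_ml * (abv / 100) / 10

# Collapse to the four beverage categories
beverage_units <- c(
  beer_units   = unname(weekly_units["nbeer"] + weekly_units["sbeer"]),
  wine_units   = unname(weekly_units["wine"] + weekly_units["sherry"]),
  spirit_units = unname(weekly_units["spirits"]),
  rtd_units    = unname(weekly_units["pops"])
)

# Total weekly units, capped at 300
weekmean <- min(sum(beverage_units), 300)
```

For these inputs the total is well below the cap, so weekmean is simply the sum of the four beverage categories.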

# Average weekly units drunk in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  alc_weekmean_adult %>%
  filter(age < 90, age >= 16) %>%
  group_by(imd_quintile, age_cat) %>% 
  summarise(av_amount = mean(weekmean, na.rm = TRUE)) %>% 
  ggplot(aes(x = imd_quintile, y = av_amount, fill = age_cat)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  ylab("average number of units drunk in a week")

# Number of sampled people in each drinker category in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  alc_weekmean_adult %>%
  filter(age < 90, age >= 16) %>%
  group_by(imd_quintile) %>% 
  count(drinker_cat) %>% 
  mutate(drinker_cat = factor(drinker_cat, 
    levels = c("abstainer", "lower_risk", "increasing_risk", "higher_risk"))) %>%
  ggplot(aes(x = drinker_cat, y = n, fill = imd_quintile)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  ylab("number of observations")

Adult consumption on the heaviest drinking day in the last week

The function alc_sevenday_adult() processes the information from the questions on adult (16 or more years old) drinking in the last seven days:

The number of days on which they drank in the last seven (n_days_drink).
The characteristics of drinking on the heaviest drinking day.

We estimate the number of UK standard units of alcohol drunk on the heaviest drinking day (peakday) by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content. We also estimate the units of each beverage type drunk on the heaviest drinking day (d7nbeer_units, d7sbeer_units, d7spirits_units, d7sherry_units, d7wine_units, d7pops_units).

Binge drinking status is then categorised into the variable binge_cat, with levels did_not_drink, binge and no_binge, where a binge day is defined by males drinking over 8 units and females drinking over 6 units.
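
The binge categorisation can be sketched as below. The function name categorise_binge and the sex labels are hypothetical, and treating peakday equal to 0 as "did not drink in the last week" is an assumption made for the illustration.

```r
# Hypothetical helper applying the binge thresholds described above.
# Assumption: peakday == 0 identifies people who did not drink.
categorise_binge <- function(peakday, sex) {
  # A binge day is over 8 units for males and over 6 units for females
  threshold <- ifelse(sex == "Male", 8, 6)
  band <- ifelse(peakday == 0, "did_not_drink",
          ifelse(peakday > threshold, "binge", "no_binge"))
  factor(band, levels = c("did_not_drink", "no_binge", "binge"))
}

binge_cat <- categorise_binge(
  peakday = c(0, 6, 7, 8, 9),
  sex     = c("Female", "Female", "Female", "Male", "Male")
)
```

Because the thresholds are strict ("over"), a female drinking exactly 6 units and a male drinking exactly 8 units are both classed as no_binge.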

Note that in 2007 new questions were added asking which glass size was used when wine was consumed. Unit calculations from HSE 2007 onwards are therefore not directly comparable to earlier years’ data.

Missing data is imputed using the means of people who did drink in the last seven days, stratified by year, sex, IMD quintile and age category (0-1, 2-4, 5-7, 8-10, 11-12, 13-15, 16-17, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90+).
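
Imputation by stratified means can be sketched with base R's ave(). The toy data below are stratified by sex and age category only, for brevity; the approach described above also stratifies by year and IMD quintile. The column name peakday_imputed is hypothetical.

```r
# Toy data: peakday is missing for two people who drank in the last week
d <- data.frame(
  sex     = c("Male", "Male", "Male", "Female", "Female", "Female"),
  age_cat = c("25-29", "25-29", "25-29", "30-34", "30-34", "30-34"),
  peakday = c(4, 8, NA, 2, NA, 6)
)

# Mean of the observed values within each stratum, repeated for each row
stratum_mean <- ave(
  d$peakday, d$sex, d$age_cat,
  FUN = function(x) mean(x, na.rm = TRUE)
)

# Fill only the missing values; observed values are kept as they are
d$peakday_imputed <- ifelse(is.na(d$peakday), stratum_mean, d$peakday)
```

The missing male value is replaced by 6 (the mean of 4 and 8) and the missing female value by 4 (the mean of 2 and 6).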

# drinking in last 7 days in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  alc_weekmean_adult %>%
  alc_sevenday_adult %>%
  filter(age < 90, age >= 16) %>%
  group_by(imd_quintile, age_cat, sex) %>% 
  summarise(n_days7 = mean(n_days_drink, na.rm = TRUE), 
            amount7 = mean(peakday, na.rm = TRUE)) %>% 
  ggplot(aes(x = n_days7, y = amount7, colour = age_cat, shape = sex)) +
  geom_point(size = 3, alpha = .5) +
  facet_wrap(~ imd_quintile, nrow = 1) +
  theme_minimal() +
  ylab("average amount drunk on heaviest drinking day") +
  xlab("average number of days drunk on in last 7")

# Number of sampled people in each binge drinker category in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  alc_weekmean_adult %>%
  alc_sevenday_adult %>%
  filter(age < 90, age >= 16) %>%
  group_by(imd_quintile) %>% 
  count(binge_cat) %>% 
  mutate(binge_cat = factor(binge_cat, 
    levels = c("did_not_drink", "no_binge", "binge"))) %>%
  ggplot(aes(x = binge_cat, y = n, fill = imd_quintile)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  ylab("number of observations")

Children's consumption in the last week

The function alc_sevenday_child() processes the information on drinking by children (ages 13-15) in the last seven days. The data on children's drinking comes in the form of survey questions on whether or not they have drunk each beverage type in the last week, and if so, how much of each was drunk. The main output is the variable total_units7_ch - the total units drunk in the last seven days.

We estimate the number of UK standard units of alcohol drunk in the last 7 days by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content.

The information from this question is also used to update the drinks_now variable, so that it describes whether or not someone drinks for both adults and children.

Due to high missingness in this variable, we assume that anyone who has missing data for this variable does not drink. This means that we are likely to under-estimate the number of children who drink.

# Number of sampled drinkers by age and sex in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  alc_weekmean_adult %>%
  alc_sevenday_adult %>%
  alc_sevenday_child %>%
  filter(age < 90, age >= 13) %>%
  group_by(age_cat, sex) %>% 
  count(drinks_now) %>% 
  filter(drinks_now == "drinker") %>%
  ggplot(aes(x = age_cat, y = n, shape = sex, colour = sex)) +
  geom_point(size = 3, alpha = .5) +
  facet_wrap(~ sex, nrow = 1) +
  theme_minimal() +
  ylab("number of observations")

# Average weekly units drunk by age and sex in 2017
read_2017(root = "X:/") %>%
  clean_age %>%
  clean_demographic %>%
  alc_drink_now_allages %>%
  alc_weekmean_adult %>%
  alc_sevenday_adult %>%
  alc_sevenday_child %>%
  filter(age < 90, age >= 13) %>%
  mutate(weekamount = ifelse(age %in% 13:15, total_units7_ch, weekmean)) %>%
  group_by(age_cat, sex) %>% 
  summarise(av_amount = mean(weekamount, na.rm = TRUE)) %>% 
  ggplot(aes(x = age_cat, y = av_amount, colour = sex, shape = sex)) +
  geom_point(size = 3, alpha = .5) +
  facet_wrap(~ sex, nrow = 1) +
  theme_minimal() +
  ylab("expected number of units drunk in a week")