Click here to go to the Health Survey for England data page on the STAPM website
This vignette sets out a description of how the data related to tobacco and alcohol consumption in the Health Survey for England are processed for use in the Sheffield Tobacco and Alcohol Policy Modelling (STAPM). The vignette covers: (1) the processing of alcohol consumption data; the processing of tobacco consumption data; the processing of socio-demographic covariates relevant to our tobacco and alcohol modelling; the imputation of missing data. This is a working set of notes to help keep track of the methods used to process the data.
The Sheffield Alcohol Policy Model (SAPM) has been used to examine the effects of pricing policies, advertising restrictions and advice on why and how to reduce drinking [@Holmes2014;@purshouse2013modelling] (see the range of publications and projects on the Sheffield Alcohol Research Group website). Patterns of alcohol consumption in the SAPM modelling are informed primarily by the Health Survey for England data, with additional information e.g. on the division of alcohol consumption between the on- and off-trade provided by the Living Costs and Food Survey.
The hseclean R package was written to help standardise the processing of the Health Survey for England data to produce inputs for the STAPM simulation modelling. This helps data storage and processing to be consistent across the STAPM projects. The code reads, cleans, filters and combines data from multiple survey years.
Alcohol consumption data in the Health Survey for England (HSE) is recorded in four main forms:
Both adults and children have data on whether they drink alcohol or not, and on the frequency of drinking. The main difference between the recording of data for adults and children is that adults have a lot of data on how much and what they drink, but children only have data on the amount drunk in the last week.
The recording of data varies among years of the HSE. We consider years from 2001 onwards. The main features of these changes in recording are:
Due to the variability in recording, we only consider data on the amount drunk by adults and children from 2011 onwards.
We analyse beverage-specific alcohol consumption in terms of beer (combining normal beer, strong beer), wine (combining wine and sherry), spirits, and alcopops.
Calculated for adults (aged 16 years or older) and children (aged 8 to 15 years) by the function hseclean::alc_drink_now_allages()
. We combine the information on drinking frequency from adults and children into a single variable.
We calculate the variable drinks_now
, which classes someone as either a drinker or a non-drinker. Adults are classed as drinkers if they reported drinking at all in the last 12 months, even if reporting only having 1-2 drinks a year (according to the variable dnoft
). Note that this definition of a non-drinker can vary among surveys, e.g. some surveys class only having 1-2 drinks a year as a non-drinker, and this could lead to variation in estimates of the number of non-drinkers.
We calculate the variable drink_freq_7d
, which is a numerical variable that described drinking frequency. Adult drinking frequency is also inferred from the variable dnoft
: the function hseclean::alc_drink_freq()
converts the categorical responses into the expected number of days in a week that someone drinks.
Missing data on whether or not someone currently drinks (drinks_now
) is supplemented by responses to if currently drinks or if always non-drinker (the variables dnnow
, dnany
and dnevr
).
For children (aged 8-15 years) we infer whether someone drinks or not (drinks_now
) from the variable adrinkof
. Someone is a non-drinker if they responded never
to adrinkof
. The categorical responses are converted into the expected number of days in a week that someone drinks as follows
Missing data on whether or not a child currently drinks (drinks_now
) is supplemented by responses to when they last had an alcoholic drink (adrlast
): if the last drink was less than six months ago, then we classify them as a drinker; if the last drink was six months or more ago, then we classify them as a non-drinker.
Some standard assumptions are made about the volume and alcohol content of the beverages that are reported to be drunk. The values that we use for these assumptions are based on those used by Natcen to create the derived variables for units of alcohol consumed in the HSE. We have made our own adjustments to the values used based on further information from market research data and figures from academic publications.
Alcohol content assumptions are the expected percentages of alcohol that each beverage contains (alcohol by volume, ABV). We use separate values for normal beer (4.4\%), strong beer (8.4\%), spirits (38\%), sherry (17\%), wine (12.5\%), and alcopops (also known as "ready to drink" or RTD) (4.5\%).
Beverage volume assumptions are the expected volumes (ml) of different beverage containers / serving sizes. We use separate values for normal and strong beer (half pint 284ml, small can 330ml, large can 440ml, bottle 330ml), spirits (serving 25ml), sherry (serving 50ml), wine (small glass 125ml, standard glass 175ml, large glass 250ml, bottle 750ml), and alcopops (small can 250ml, small bottle 275ml, large bottle 700ml).
We estimate the average amount drunk in a week (weekmean
) in terms of UK standard units of alcohol (1 unit = 10ml or 8g pure ethanol). The average amount drunk is then categorised as follows:
abstainer
= 0 units/week lower_risk
drinker = less than 14 units/week increasing_risk
drinker = 14 or more units/week but less than 35 units/week for females or less than 50 units/week for males higher_risk
drinker = 35 or more units/week for females or 50 or more units/week for males Separate variables are produced describing the average weekly units in four beverage categories: beer_units
(including cider), wine_units
(including sherry), spirit_units
, rtd_units
(this is alcopops). Further variables on beverage preference are produced that:
per_spirit_units
, perc_wine_units
, perc_beer_units
, perc_rtd_units
). does_not_drink_spirits
, drinks_some_spirits
, mostly_drinks_spirits
, where "mostly drinks" is defined by a single beverage comprising more that 50\% of an individuals average weekly consumption. The processing is done by the function hseclean::alc_weekmean_adult()
. The calculation has the following steps:
The function hseclean::alc_sevenday_adult()
processes the information from the questions on adult (16 or more years old) drinking in the last seven days:
n_days_drink
. We estimate the number of UK standard units of alcohol drunk on the heaviest drinking day (peakday
) by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content. We further estimate their total units drunk of each beverage type on the heaviest drinking day (d7nbeer_units
, d7sbeer_units
, d7spirits_units
, d7sherry_units
, d7wine_units
, d7pops_units
).
Binge drinking status is then categorised into the variable binge_cat
, with levels did_not_drink
, binge
and no_binge
, where a binge day in defined by males drinking over 8 units and females drinking over 6 units.
Note that in 2007 new questions were added asking which glass size was used when wine was consumed. Therefore the post HSE 2007 unit calculations are not directly comparable to previous years’ data.
Missing data is imputed using the means of people who did drink in the last seven days, stratified by year, sex, IMD quintile and age category (0-1, 2-4, 5-7, 8-10, 11-12, 13-15, 16-17, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90+).
The function hseclean::alc_sevenday_child()
processes the information on drinking by children (ages 13-15) in the last seven days. The data on children's drinking comes in the form of survey questions on whether or not they have drunk each beverage type in the last week, and if so, how much of each was drunk. The main output is the variable total_units7_ch
- the total units drunk in the last seven days.
We estimate the number of UK standard units of alcohol drunk in the last 7 days by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content.
The information from this question is also used to update the drinks_now
variable to make it a variable that describes whether or not adults and children drink.
Due to high missingness in this variable, we assume that anyone who has missing data for this variable does not drink. This means that we are likely to under-estimate the number of children who drink.
For the Sheffield Tobacco Policy Model (STPM), we use HSE data from years 2001 to the latest available (although survey weights started to be used from 2003, which is likely to make the data from 2003 slightly more reliable). We use these data to inform the trends in smoking prevalence, the socio-demographic variation in smoking prevalence, and as inputs to a procedure that we use to infer the age-specific probabilities of smoking initiation and quitting (see our smktrans R package). Our upper age limit is 89 years, but otherwise we make use of all ages.
The purpose of this vignette is to explain how we use the HSE data to inform the patterns of tobacco smoking, and to explain how hseclean
supports this.
Questions about cigarette smoking have been asked of adults aged 16 and over as part of the HSE series since 1991 - we use data from 2001 to the latest year available. We use data on children (12-15 years) and adults (16+ years). There is often a special section in the annual HSE report devoted to describing trends in cigarette smoking e.g.HSE 2015.
The function hseclean::smk_status()
categorises cigarette smoking into current, former and never regular cigarette smokers. If some smokes either regularly or occasionally, then they are classified as a current regular cigarette smoker. People who used to smoke regularly or occasionally are classified as former smokers; people who have only tried a cigarette once or twice are classified as never smokers. We create a smoking status variable for children aged 8-15 years and adults aged >= 16 years. Ever-smokers are people who are either current or former smokers.
The function hseclean::smk_quit()
is in development, and will process the data on the motivation to quit smoking, the reasons for quitting smoking, and the support used to stop smoking. It currently produces only one variable - whether someone wants to quit smoking (y/n).
The function hseclean::smk_former()
cleans the data for former smokers on the time since quitting and time spent as a regular smoker. The main issue to overcome is that in the HSE 2015+, time since quit and time spent as a smoker is provided in categories rather than single years. We simulate the single years by just picking a value at random within the time interval, using hseclean::num_sim()
. We then fill missing data for these variables as follows:
The function hseclean::smk_life_history()
cleans the data on the ages when smokers started and stopped being regular cigarette smokers. For each individual smoker, the data recorded in the HSE implies a single age at which a smoker started to smoke and, if they stopped, an age at which they did so. This provides a simplified view of what might be a complicated life history of smoking, e.g. smoking to different frequencies or levels, or starting and stopping multiple times.
Both the start age and stop age will have error in them e.g. due to uncertainty in respondent recall, and, for years 2015+, due to the reporting in categories of time intervals rather than single years, which we then impute introducing random error. Start age is likely to be biased towards earlier ages, because for adult smokers and former smokers with missing values we use the age first tried a cigarette, and for children the reported start age does not necessarily mean the start of regular smoking, it is just the age at which they started to smoke.
We also create a variable for the age at which an individual was censored from our data sample - this is their age at the survey + 1 year.
Any missing data is assigned the average start or stop age for each age, sex and IMD quintile.
The function hseclean::smk_amount()
cleans the data that describe how much, what and to what level of addiction people smoke. The main variable is the average number of cigarettes smoked per day. For adults, this is calculated from questions about how many cigarettes are smoked typically on a weekday vs. a weekend (this is a weighted average to account for more weekdays in a week than weekends). For children, this is based on asking how many cigarettes were smoked in the last week. Missing values are imputed as the average amount smoked for an age, sex and IMD quintile subgroup.
We categorise cigarette preferences based on the answer to 'what is the main type of cigarette smoked'. For years 2013 and later of the HSE, questions were added that ask how many handrolled vs. machine rolled cigarettes are smoked on a weekday vs. a weekend.
We also categorise the amount smoked, and use information on the time from waking until smoking the first cigarette (this latter variable has a high level of missingness). Together these two variables allow calculation of the heaviness of smoking index.
Taking the survey design into account is important when estimating the mean and confidence intervals around summary statistics computed from the data i.e. it is not possible to accurately estimate sampling error without accounting for survey design. The survey
R package [@Rsurvey] has a collection of functions that incorporate survey design into the calculation of summary statistics. The survey
package is used by the function prop_summary()
in hseclean
to estimate the uncertainty around proportions calculated from a binary variable - prop_summary()
was designed to simplify the process of estimating smoking prevalence from the HSE data, stratified by a specified set of variables.
The suppliers of the HSE data introduced tighter information governance rules in 2015, which meant that they stopped providing variables that could be used to identify the age in single years of an individual, and also stopped providing information on number of children in the household. These variables can still be obtained, but only after applying for the secure-access version of the data, which we do not do. Therefore, in our processing of the standard-access version of the data, we use imputation methods to overcome the added restrictions.
The first thing to consider is the influence of survey sampling design, which is variable among years. The variables that describe the sampling structure are cluster
and PSU
(probabilistic sampling unit).
In most years there are also survey weights, which are calculated after the survey data has been collected, that when applied are supposed to make the survey sample representative of the general population e.g. if a particular subgroup has been under-sampled, then it receives a higher survey weight. As we understand the HSE methods, the survey weights supplied with the data consider only the age and sex distribution of the population, and do not consider the distribution of socio-economic or health characteristics. The definition and structure of the survey weights provided with the data varies between years, and is described in the dataset documentation for each year of data. For example, some key changes
hseclean
contains separate functions for reading the survey data for each year, e.g. read_2001()
, and a description of the survey weights has been added to the help files of those functions. Any processing or combining of survey weights is done in the functions that read each year of data. The function clean_surveyweights()
assigns any missing weights the average weight for each year, and standardises the weights to sum to 1 within each year. The resulting survey weight variable for each year is wt_int
.
From 2015 onwards, the HSE no longer supplies age in single years (to prevent individual identification). For our modelling, we require age in single years, so we apply a method that randomly assigns an age in single years to individuals for who we only have an age category. The age categories we work with are: 0-1, 2-4, 5-7, 8-10, 11-12, 13-15, 16-17, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90+. These categories are the finest scale version of age that is available for years 2015+. We then select only individuals younger than 90 years for our modelling.
This processing is done by the function clean_age()
that calls the function num_sim()
to simulate single years of age. For years 2015+, we also use num_sim()
to convert the categorical variables for years since quitting smoking and years spent as a smoker to single years of age.
The function clean_demographic()
creates variables for ethnicity, sex and quintiles of the Index of Multiple Deprivation (IMDq).
1 = Male, 2 = Female.
5_most_deprived, 4, 3, 2, 1_least_deprived.
Previous SAPM modelling has used a simple white/non-white classification. The ONS recommend a harmonised ethnicity measure for use in social surveys (ONS, 2017). The use of ethnicity measures is also discussed in Connelly et al. 2016, who recommend testing the sensitivity of analyses to different specifications. We try to map the HSE categories to the ONS recommended groups for England. However, over the years, the HSE is not clear or consistent in how they have categorised chinese and arab as 'asian' or 'other'. In an attempt to harmonise, we have pooled the asian and other categories.
Following inspection of the data, the white/non-white classification does look appropriate, especially given the likely limited sample sizes - so the 2 level variable has also been created. Previous Sheffield modelling in the Sheffield Alcohol Policy Model has also used the white/non-white classification.
Individuals in the HSE are not assigned a Townsend quintile of deprivation, but for a project that investigated the cost of alcohol to primary care in England, we needed to predict the Townsend quintile of each individual so that we could use it to stratify our summary of alcohol consumption.
The function use_townsend()
adds a Townsend variable to the data. It produces a version of the Health Survey for England data that has the Townsend Index in it, based on the probabilistic mapping between the 2015 English Index of Multiple Deprivation and the Townsend Index from the 2001 census.
It does so based on a matrix (stored in hseclean::imdq_to_townsend
) that maps quintiles of the Index of Multiple Deprivation onto the Townsend Index of Deprviation. To produce this we used area-level Office for National Statistics data to estimate the statistical association between the two metrics of deprivation. We used estimates of the Townsend Index from 2001 Census data at Ward level, and the Index of Multiple Deprivation 2015 (IMD 2015) at Lower-layer Super Output Area (LSOA) level. First, we mapped the 2001 definitions of Wards to the 2001 definitions of LSOAs. Second, we mapped the 2001 definitions of LSOAs to the 2011 definitions of LSOAs that are used by the IMD 2015.
The function clean_economic_status()
creates a variety of variables to classify economic status.
The issues around using occupation-based social classifications for social survey research are discussed by Connelly et al. [-@Connelly2016a]. They advise using a range of alternative measures, and not creating new measures beyond what is already established.
The classifications considered are:
The main education variable produced by the function clean_education()
is a four category description of the age at which someone finished full-time education. The categories are:
If someone was still in full time education at the time of the survey, then if they were younger than 18 years, we assumed they would leave at 16-18, and if they were older than 18 years, we assumed they would leave at 19 years or over.
A further education variable is also produced - which indicates whether an individual reached a degree as their top qualification or not. Here a degree is defined as an "NVQ4/NVQ5/Degree or equiv".
The function clean_family()
processes the data on the number of children in the household and the relationship status of each respondent.
kids
is the number of children aged 0-15 years who live in the household. If a 3 year old lives in a household with 2 siblings, aged 6 and 8 years, then we might expect them to be recorded as living in a household with 3 children under age 15 years. The variable is created by combining the HSE data on children and infants in the household. It is categorised into: 0, 1, 2, 3+ children under age 15 years.
The problem with the Health Survey for England is that from 2015 onwards, the number of children in the household is not provided as this information could be identifiable (you can get it if you apply and pay for a secure dataset). Therefore, for years 2015+, the number of children in the household is completely missing and needs to be imputed.
We impute the number of children for years 2015+ automatically in the function clean_family()
, based on the correlation between the number of children and a range of demographic and socioeconomic variables in 2012-2014, the last three years for which data on kids is available. This imputation is based on the fit of a multinomial model in package(nnet)
. The model object is saved in the hseclean
package as the object hseclean::impute_kids_model
, and is drawn upon by the clean_family()
function to impute the data as needed. This imputation won't work unless the required demographic and socio-economic variables have already been cleaned prior to running clean_family()
. There will still be missing values in kids
if there are missing values in the predictor variables required by the model. These missing values can be taken care of in a multiple imputation procedure (see vignette(missing_data)).
In previous versions of modelling for the Sheffield Alcohol Policy Model, relationship status has been described as married/not-married. Here, we include more detail by using:
The function clean_income()
processes the data on income.
There are a few different options for classifying income - the need to have a measure that is consistent across years of the Health Survey for England has led us to use equivalised income quintiles only. (Past SAPM modelling has used years of the HSE for which a continous variable for equivalised income was provided - and calculated our own income groups - but in later years, this continuous income variable is not available.)
In the past SAPM modelling, a measure of in "poverty" vs. "not in poverty" has been used, where the poverty threshold is defined as 60\% of the median income for any year. For years in which we only have income quintiles available, it is not possible to make an exact calculation of poverty, but being in poverty will coincide approximately with the lowest 2 income quintiles.
It would also be possible from the Health Survey for England to classify people as being in receipt of benefits or not, but this is not currently implemented in hseclean
, and would have to have some thought on how to deal with the changing definitions of benefits over time.
The function clean_health_and_bio()
cleans data on presence/absence of certain categories of health condition, and on height and weight.
There are a set of 15 categories of long-lasting illnesses (occurring for or expected to last at least 12 months) that are ascertained consistently across all years of the HSE. These are:
- Cancer
- Endocrine or metabolic condition
- Mental health condition
- Nervous system condition
- Eye condition
- Ear condition
- Heart or circulatory system condition
- Respiratory condition
- Digestive condition
- Genito-urinary condition
- Skin condition
- Musculo-skeletal condition
- Infectious disease
- Blood and related organs condition
- Other complaints
Height (cm) and weight (kg). Weight is estimated above 130kg. Missing values of height and weight are replaced by the mean height and weight for each age, sex and IMD quintile. BMI is calculated according to kg / m^2.
To prepare for the process that imputes missing data, run the full set of functions to read and clean the data. It is important to note that there has already been some filling-in of missing data done by these cleaning functions - using simple rules,
clean_age()
function has randomly assigned single years of age within each age category. The number of children in the household is missing for years 2015+. This is imputed in the function clean_family()
based on the fit of a multinomial model to years 2012-2014 (see vignette("covariate_data")
).
The function select_data()
has the option to filter the data to retain only complete cases for certain variables. To prepare the data, the example code below filters out any incomplete data on key survey variables (age, sex, year, quarter, psu, cluster, imd_quintile). It also filters out any incomplete information on the key smoking and drinking variables, "cig_smoker_status" and "drinks_now".
The variable with the most missingness in the data is income5cat (19% missing). In this example, the other variables to be imputed are: kids, ethnicity_4cat, eduend4cat, degree, relationship_status, nssec3_lab, activity_lstweek.
To conduct the multiple imputation, we use the R package mice
[@Rmice]. The process of running the multiple imputation can take a long time and consume a lot of RAM. There is a range of mice
documentation and tutorials online.
In hseclean
, multiple imputation is implemented in a basic way by the impute_data_mice()
function.
mice
fits a chained series of regression equations that predict the missing values of variables based on their relationships with other selected variables in the data. The impute_data_mice()
function currently only imputes categorical variables, which could be one of three types: "logreg" - binary Logistic regression; "polr" - ordered Proportional odds model; "polyreg" - unordered Polytomous logistic regression.
In running the multiple imputation, the number of iterations of the imputed data is selected (choosing a small number e.g. < 5 helps keep the size of the resulting imputed data manageable), and the variables to either be predicted or to inform the prediction are selected. If a variable is just going to inform the prediction of the other variables but is not going to be predicted itself, then the model type is set to "", otherwise to one of "logreg", "polr" or "polyreg".
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.