knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.pos = 'H'
)

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(hseclean))
For the Sheffield Tobacco Policy Model (STPM) we use Health Survey for England (HSE) data from 2001 to the latest available year. We use these data to inform the trends in smoking prevalence and the socio-demographic variation in smoking prevalence, and as inputs to a procedure that infers the age-specific probabilities of smoking initiation and quitting (see our smoke.trans R package). Our upper age limit is 89 years, but otherwise we make use of all ages.
The purpose of this vignette is to explain how we use the HSE data to inform the patterns of tobacco smoking, and to explain how hseclean supports this.    
hseclean contains functions to clean covariates in the data, which are explained in vignette("covariate_data"). Here we mention only the important things to consider for the processing of smoking data.    
Questions about cigarette smoking have been asked of adults aged 16 and over as part of the HSE series since 1991; we use data from 2001 to the latest year available. We use data on children (12-15 years) and adults (16+ years). The annual HSE report often has a special section devoted to describing trends in cigarette smoking, e.g. HSE 2015.
The function smk_status() categorises cigarette smoking into current, former and never regular cigarette smokers. If someone smokes either regularly or occasionally, then they are classified as a current regular cigarette smoker. People who used to smoke regularly or occasionally are classified as former smokers; people who have only tried a cigarette once or twice are classified as never smokers. We create a smoking status variable for children aged 8-15 years and adults aged >= 16 years. Ever-smokers are people who are either current or former smokers.
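The classification rules above can be sketched in base R as follows. This is only an illustration: the input variables smokes_now and smoked_before and their answer codes are hypothetical, and the real HSE variables that smk_status() reads differ between survey years.

```r
# Illustrative only: smokes_now / smoked_before and their codes are hypothetical,
# not the raw HSE variables used by smk_status().
classify_status <- function(smokes_now, smoked_before) {
  status <- rep(NA_character_, length(smokes_now))

  # Regular or occasional smoking now counts as current smoking
  is_current <- smokes_now %in% c("regularly", "occasionally")

  # Not smoking now, but smoked regularly or occasionally in the past
  is_former <- smokes_now %in% "no" &
    smoked_before %in% c("regularly", "occasionally")

  # Not smoking now, and never smoked or only tried once or twice
  is_never <- smokes_now %in% "no" &
    smoked_before %in% c("once_or_twice", "never")

  status[is_current] <- "current"
  status[is_former] <- "former"
  status[is_never] <- "never"
  status
}

toy <- data.frame(
  smokes_now = c("regularly", "occasionally", "no", "no", "no", NA),
  smoked_before = c(NA, NA, "occasionally", "once_or_twice", "never", NA)
)

toy$cig_smoker_status <- classify_status(toy$smokes_now, toy$smoked_before)

# Ever-smokers are current or former smokers; unknown status stays missing
toy$ever_smoker <- ifelse(
  is.na(toy$cig_smoker_status),
  NA,
  toy$cig_smoker_status %in% c("current", "former")
)
```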
The function smk_quit() is in development, and will process the data on the motivation to quit smoking, the reasons for quitting smoking, and the support used to stop smoking. It currently produces only one variable - whether someone wants to quit smoking (y/n).   
The function smk_former() cleans the data for former smokers on the time since quitting and the time spent as a regular smoker. The main issue to overcome is that in the HSE 2015+, time since quitting and time spent as a smoker are provided in categories rather than single years. We simulate the single years by picking a value at random within the time interval, using num_sim(). We then fill missing data for these variables as follows:
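As an illustration of the interval simulation step only (not of the missing-data rules), the sketch below draws a whole number of years uniformly at random within each reported category. It does not reproduce the internals of num_sim(): the function simulate_years(), the category labels and the interval bounds are all hypothetical.

```r
set.seed(2015)

# Hypothetical categories with their lower and upper bounds in whole years
bands <- data.frame(
  label = c("less than 1 year", "1-4 years", "5-9 years", "10-19 years"),
  lower = c(0, 1, 5, 10),
  upper = c(0, 4, 9, 19)
)

# Draw one integer uniformly from [lower, upper] for each reported category.
# Categories that are missing or not in `bands` return NA.
simulate_years <- function(category, bands) {
  idx <- match(category, bands$label)
  lower <- bands$lower[idx]
  upper <- bands$upper[idx]
  floor(lower + runif(length(idx)) * (upper - lower + 1))
}

reported <- c("1-4 years", "5-9 years", NA, "less than 1 year", "10-19 years")
years_since_quit <- simulate_years(reported, bands)
```

Because each draw is random, repeated runs give different single-year values unless the seed is fixed, which is the source of the random error mentioned for years 2015+.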
The function smk_life_history() cleans the data on the ages when smokers started and stopped being regular cigarette smokers. For each individual smoker, the data recorded in the HSE implies a single age at which a smoker started to smoke and, if they stopped, an age at which they did so. This provides a simplified view of what might be a complicated life history of smoking, e.g. smoking to different frequencies or levels, or starting and stopping multiple times.    
Both the start age and the stop age contain error, e.g. from uncertainty in respondent recall and, for years 2015+, from the reporting of time intervals in categories rather than single years, which we then impute, introducing random error. Start age is likely to be biased towards earlier ages for two reasons: for adult smokers and former smokers with missing values we use the age at which they first tried a cigarette, and for children the reported start age is simply the age at which they started to smoke, which is not necessarily the start of regular smoking.
We also create a variable for the age at which an individual was censored from our data sample - this is their age at the survey + 1 year.
Any missing data is assigned the average start or stop age for each age, sex and IMD quintile.
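The censoring age and the subgroup-mean fill described above might look like the following base R sketch. The toy data, the quintile labels and the helper fill_mean() are hypothetical, and smk_life_history() may handle details such as rounding differently.

```r
# Toy data: values and quintile labels are made up for illustration
toy <- data.frame(
  age = c(30, 30, 30, 45, 45, 45),
  sex = c("Male", "Male", "Male", "Female", "Female", "Female"),
  imd_quintile = c("Q5", "Q5", "Q5", "Q1", "Q1", "Q1"),
  smk_start_age = c(16, NA, 18, 15, 17, NA)
)

# Age at which the individual is censored: age at the survey + 1 year
toy$censor_age <- toy$age + 1

# Replace missing values with the mean of the observed values in the group
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the fill within each age, sex and IMD quintile subgroup
toy$smk_start_age <- ave(
  toy$smk_start_age,
  toy$age, toy$sex, toy$imd_quintile,
  FUN = fill_mean
)
```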
The function smk_amount() cleans the data that describe how much, what and to what level of addiction people smoke. The main variable is the average number of cigarettes smoked per day. For adults this is calculated from questions about how many cigarettes are smoked typically on a weekday vs. a weekend. For children, this is based on asking how many cigarettes were smoked in the last week. Missing values are imputed as the average amount smoked for an age, sex and IMD quintile subgroup.  
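One plausible way to turn the weekday and weekend answers into a daily average is to weight them by the number of days of each type in a week. The sketch below assumes that weighting (5 weekdays, 2 weekend days), and assumes the children's weekly total is divided by 7; the function names are hypothetical and smk_amount() may compute this differently.

```r
# Assumption: a week has 5 weekdays and 2 weekend days, so the daily average
# is the day-weighted mean of the two reported amounts.
cigs_per_day_adult <- function(weekday, weekend) {
  (5 * weekday + 2 * weekend) / 7
}

# Assumption: children report a total for the last week, averaged over 7 days.
cigs_per_day_child <- function(last_week) {
  last_week / 7
}

adult_example <- cigs_per_day_adult(weekday = 10, weekend = 17)  # (50 + 34) / 7
child_example <- cigs_per_day_child(last_week = 14)              # 14 / 7
```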
We categorise cigarette preferences based on the answer to 'what is the main type of cigarette smoked'. In later years of the HSE, new questions are added that ask how many handrolled vs. machine rolled cigarettes are smoked on a weekday vs. a weekend.
We also categorise the amount smoked, and use information on the time from waking until smoking the first cigarette (this latter variable has a high level of missingness). Together these two variables allow calculation of the heaviness of smoking index.
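For reference, the heaviness of smoking index is conventionally the sum of two scores of 0 to 3: one for cigarettes per day (10 or fewer, 11-20, 21-30, 31 or more) and one for minutes from waking to the first cigarette (more than 60, 31-60, 6-30, 5 or fewer). The sketch below applies that conventional scoring to numeric inputs. The function hsi_score() is hypothetical, and the HSE records time to first cigarette in categories, so the variables produced by smk_amount() may be coded differently.

```r
# Conventional heaviness of smoking index scoring (0 = lightest, 6 = heaviest).
# hsi_score() is an illustrative helper, not an hseclean function.
hsi_score <- function(cigs_per_day, minutes_to_first_cig) {

  # Amount smoked: [0,10] -> 0, (10,20] -> 1, (20,30] -> 2, (30,Inf] -> 3
  amount_pts <- cut(
    cigs_per_day,
    breaks = c(0, 10, 20, 30, Inf),
    labels = FALSE,
    include.lowest = TRUE
  ) - 1

  # Time to first cigarette: [0,5] -> 3, (5,30] -> 2, (30,60] -> 1, (60,Inf] -> 0
  time_pts <- 4 - cut(
    minutes_to_first_cig,
    breaks = c(0, 5, 30, 60, Inf),
    labels = FALSE,
    include.lowest = TRUE
  )

  amount_pts + time_pts
}

hsi <- hsi_score(
  cigs_per_day = c(5, 15, 25, 40, 12),
  minutes_to_first_cig = c(120, 45, 10, 3, NA)
)
```

A missing time to first cigarette gives a missing index, which matters here because that variable has a high level of missingness.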
The data is stored in X:/ScHARR/PR_Consumption_TA/Data/. The following code will read, clean, filter and combine the data. 
# Write a bespoke function that does just the cleaning jobs required.
cleandata <- function(data) {

  data <- clean_age(data)
  data <- clean_demographic(data)
  data <- smk_status(data)
  data <- smk_former(data)
  data <- smk_life_history(data)
  data <- smk_amount(data)

  data <- select_data(
    data,
    ages = 12:89,
    years = 2001:2017,

    # The variables to retain
    keep_vars = c("wt_int", "psu", "cluster", "year", "age", "age_cat", "censor_age", "sex", "imd_quintile",
                  "cig_smoker_status", "smk_start_age", "smk_stop_age", "years_since_quit", "giveup_smk",
                  "cigs_per_day", "smoker_cat", "banded_consumption", "cig_type", "time_to_first_cig"),

    # The variables that must have complete cases
    complete_vars = c("cig_smoker_status", "wt_int", "psu", "cluster", "year", "censor_age")
  )

  return(data)
}

# Choose the required years and combine
hse_data <- combine_years(list(
  cleandata(read_2014()),
  cleandata(read_2015()),
  cleandata(read_2016())
))

# clean the survey weights
hse_data <- clean_surveyweights(hse_data)

# change some variable names
setnames(hse_data,
         c("smk_start_age", "cig_smoker_status", "years_since_quit"),
         c("start_age", "smk.state", "time_since_quit"))
Taking the survey design into account is important when estimating the mean and confidence intervals around summary statistics computed from the data i.e. it is not possible to accurately estimate sampling error without accounting for survey design. The survey R package [@Rsurvey] has a collection of functions that incorporate survey design into the calculation of summary statistics. The survey package is used by the function prop_summary() in hseclean to estimate the uncertainty around proportions calculated from a binary variable - prop_summary() was designed to simplify the process of estimating smoking prevalence from the HSE data, stratified by a specified set of variables.   
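To show the kind of calculation involved, the sketch below uses the survey package directly on simulated data to estimate a proportion and its confidence interval by subgroup. It is not the implementation of prop_summary(): the toy data are made up, and treating cluster as the stratum and psu as the primary sampling unit is an assumption about how the HSE design variables map onto svydesign().

```r
library(survey)

set.seed(1)
n <- 400

# Simulated data: 4 strata, each containing 5 primary sampling units
toy <- data.frame(
  cluster = rep(1:4, each = 100),
  psu = rep(rep(1:5, each = 20), times = 4),
  wt_int = runif(n, 0.5, 2),
  sex = sample(c("Male", "Female"), n, replace = TRUE),
  smoker = rbinom(n, 1, 0.2)
)

# Describe the design: PSUs nested within strata, with interview weights
design <- svydesign(
  ids = ~psu,
  strata = ~cluster,
  weights = ~wt_int,
  data = toy,
  nest = TRUE
)

# Weighted proportion of smokers by sex, with design-based standard errors
est <- svyby(~smoker, ~sex, design, svymean)

# 95% confidence intervals around each proportion
ci <- confint(est)
```

The standard errors in est reflect the clustering and weighting, which a simple unweighted proportion would ignore.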
Using prop_summary(), calculate the proportion of smokers, stratified by year, sex and quintiles of the Index of Multiple Deprivation.  
prop_smokers <- prop_summary(
  data = hse_data,
  var_name = "smk.state",
  levels_1 = "current",
  levels_0 = c("former", "never"),
  strat_vars = c("year", "sex", "imd_quintile")
)