In dosgillespie/hseclean: Health Survey Data Wrangling

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.pos = 'H'
)

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(testthat))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(hseclean))

in development

Introduction

The Scottish Health Survey (SHeS) is a series of annual surveys covering health and health-related behaviours. Prior to 2008, the survey had been carried out on three separate occasions (1995, 1998, 2003). We use data from years 2008-2018 to inform the trends in smoking prevalence, the socio-demographic variation in smoking prevalence, and as inputs to a procedure that we use to infer the age-specific probabilities of smoking initiation and quitting (see our smoke.trans R package). Our upper age limit is 89 years, but otherwise we make use of all ages and all booster samples are automatically incorporated in the data we use.

The aim is to combine the Health Survey for England and the Scottish Health Survey into one dataset. The majority of variables are the same, but some are missing and some are coded differently.

To support the processing of the HSE data, there is the hseclean R package, which contains a collection of functions to read, process and summarise the HSE data. To support the processing of the SHeS data, we have included additional code in the hseclean functions to clean the SHeS data. The purpose of this vignette is to explain how we use the SHeS data to inform the patterns of cigarette consumption, and to explain how the hseclean package supports this.

Things still to think about:
- survey design variables
- number of children
- economic/activity status variables: work out what to do with the 'doing something else' category.
- Quality of life
- Relationship status: living as married?
- what to do with missing nssec8 for children?

Read data

hseclean contains separate functions for reading the survey data for each year, e.g. read_SHeS_2008().

Survey design variables

The first thing to consider is the influence of survey sampling design, which varies among years. The variables that describe the sampling structure are strata and PSU (primary sampling unit).

There are also survey weights, which are calculated after the survey data have been collected and which, when applied, are intended to make the survey sample representative of the general population, e.g. if a particular subgroup has been under-sampled, then it receives a higher survey weight. The definition and structure of the survey weights are described in the dataset documentation, and a description of the survey weights has been added to the help files of the read functions. There are different weights for children and adults. The children's weights adjust for the selection of just two children per household and for differences between responding and non-responding households.

Any processing or combining of survey weights is done in the functions that read each year of data. The function clean_surveyweights() assigns any missing weights the average weight for each year, and standardises the weights to sum to 1 within each year. The resulting survey weight variable for each year is wt_int.
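The two steps described above (fill missing weights with the year average, then rescale to sum to 1 within each year) can be sketched in base R as follows. This is only an illustration under stated assumptions: standardise_weights() is a hypothetical helper, not the package's clean_surveyweights(), and the column name year is assumed.

```r
# Hypothetical helper illustrating the logic; not the hseclean implementation.
standardise_weights <- function(data, weight_col = "wt_int", year_col = "year") {
  w <- data[[weight_col]]
  yr <- data[[year_col]]

  # Mean of the non-missing weights within each year, aligned to the rows
  year_mean <- ave(w, yr, FUN = function(x) mean(x, na.rm = TRUE))

  # Replace missing weights with the average weight for that year
  w[is.na(w)] <- year_mean[is.na(w)]

  # Rescale so that the weights sum to 1 within each year
  data[[weight_col]] <- w / ave(w, yr, FUN = sum)
  data
}

# Toy data (made up for illustration)
toy <- data.frame(
  year = c(2008, 2008, 2008, 2009, 2009),
  wt_int = c(1, NA, 3, 2, 2)
)

toy <- standardise_weights(toy)
year_sums <- tapply(toy$wt_int, toy$year, sum)
```

After the call, year_sums holds one total per survey year, which makes it easy to check that the weights have been standardised.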

Demographic and socio-economic variables

The function clean_demographic() creates variables for ethnicity, sex and quintiles of the Index of Multiple Deprivation (IMDq).

Sex

1 = Male, 2 = Female.

IMD quintiles

SHeS uses the Scottish Index of Multiple Deprivation. We keep this as a separate variable to the English IMD variable as each country calculated its own slightly different version of IMD. However, there has been a study harmonising IMD measures across the four UK nations [@Abel2016] that we could look at in the future if we want to compare across countries.

5_most_deprived, 4, 3, 2, 1_least_deprived.

Ethnicity

HSE

Previous SAPM modelling has used a simple white/non-white classification. The ONS recommends a harmonised ethnicity measure for use in social surveys (ONS, 2017). The use of ethnicity measures is also discussed in Connelly et al. (2016), who recommend testing the sensitivity of analyses to different specifications.

In our analysis of the Health Survey for England, we tried to map the HSE categories to the ONS recommended groups for England. However, variability in the data over the years meant that the same categories were only feasible for years up to 2014; thereafter we have to use a 2-category ethnicity variable.

- White (English, Irish, Scottish, Welsh, other European)
- Mixed / multiple ethnic groups
- Asian / Asian British (includes African-Indian, Indian, Pakistani, Bangladeshi), plus Other ethnic group (includes Chinese, Japanese, Filipino, Vietnamese, Arab)
- Black / African / Caribbean / Black British (includes Caribbean, African)

SHeS

In 2008, the variable for ethnicity EthnicI categorised ethnicity into 13 groups:
- White: Scottish
- White: Other British
- White: Irish
- White: Any other white background (write in)
- Mixed: Any mixed background
- Asian, Asian Scottish or Asian British: Indian
- Asian, Asian Scottish or Asian British: Bangladeshi
- Asian, Asian Scottish or Asian British: Chinese
- Asian, Asian Scottish or Asian British: Any other Asian background (write in)
- Black, Black Scottish or Black British: Caribbean
- Black, Black Scottish or Black British: African
- Black, Black Scottish or Black British: Any other black background (write in)
- Any other ethnic group (write in)

It was renamed in 2009 when the list of categories was expanded. The new variable, ethnic09, categorises ethnicity into 21 groups:
- White: Scottish
- White: English
- White: Welsh
- White: Northern Irish
- White: British
- White: Irish
- White: Gypsy/Traveller
- White: Polish
- White: Other white ethnic group (write in)
- Mixed: Any mixed or multiple ethnic groups (write in)
- Asian: Pakistani, Pakistani Scottish or Pakistani British
- Asian: Indian, Indian Scottish or Indian British
- Asian: Bangladeshi, Bangladeshi Scottish or Bangladeshi British
- Asian: Chinese, Chinese Scottish or Chinese British
- Asian: Other (write in)
- Black: African, African Scottish or African British
- Black: Caribbean, Caribbean Scottish or Caribbean British
- Black: Black, Black Scottish or Black British
- Black: Other Black ethnic group (write in)
- Other ethnic group: Arab
- Other ethnic group: Other (write in)

It was renamed again in 2012 when the list of categories was revised. The new variable for ethnicity is ethnic12, categorised by 19 groups:
- White: Scottish
- White: Other British
- White: Irish
- White: Gypsy/Traveller
- White: Polish
- White: Other (write in)
- Mixed: Any mixed or multiple ethnic groups (write in)
- Asian: Pakistani, Pakistani Scottish or Pakistani British
- Asian: Indian, Indian Scottish or Indian British
- Asian: Bangladeshi, Bangladeshi Scottish or Bangladeshi British
- Asian: Chinese, Chinese Scottish or Chinese British
- Asian: Other (write in)
- African, African Scottish or African British
- African: Other (write in)
- Caribbean or Black: Caribbean, Caribbean Scottish or Caribbean British
- Caribbean or Black: Black, Black Scottish or Black British
- Caribbean or Black: Other (write in)
- Other ethnic group: Arab, Arab Scottish or Arab British
- Other ethnic group: Other (write in)

The wording changed slightly in 2013, but the coding remained the same.

From 2014, the variable for ethnicity, ethnic5, categorises ethnicity into 5 groups:
- White: Scottish
- White: Other British
- White: Other
- Asian
- Other minority ethnic

Economic status

The function clean_economic_status() creates a variety of variables to classify economic status.

The issues around using occupation-based social classifications for social survey research are discussed by Connelly et al. [-@Connelly2016a]. They advise using a range of alternative measures, but not creating new measures beyond what is already established.

The classifications considered from the Health Survey for England are:

- Employed / in paid work or not.
- The NS-SEC measure, which was constructed to measure the employment relations and conditions of occupations (i.e. it classifies people based on their employment occupation). It is therefore not that good at classifying people who are not employed for various reasons.
- The NRS social grade system. This measure is the one used in the Tobacco and Alcohol Toolkit studies, but is not reported in the Health Survey for England. We create this variable by recategorising the NS-SEC 8 level variable. This is important to facilitate the link of analysis to the Toolkit Study.
- Manual vs. non-manual occupation. In the 2017 Tobacco control plan for England, there was a specific target to reduce the difference in rates of smoking between people classified with a manual or non-manual occupation. We create this variable from the 3 level NS-SEC classification by grouping Managerial and professional with intermediate occupations to give the non-manual group.
- Economic status - retired / employed / unemployed.
- Activity status for last week that adds more detail such as 'in education' and 'looking after home or family'.

In SHeS, there is no variable for 'Paidwk' (Paid work in last 7 days). We need to find a variable that indicates whether someone is employed or not.
The variables the data does have include:
- nssec3
- nssec8
- nactiv/econact - equivalent of activity status

Not done yet, but we need to think through the best variables to include to accurately represent economic/activity status.

Whether working in last week

This variable was called ‘Nactiv’ in SHeS08; the wording of the question changed slightly in 2009 but the name was kept the same as in 2008 by mistake. In SHeS10, this variable changed to Nactiv09.

There is also another variable: econac08 - the economic status of respondent.
- 1: In education
- 2: In paid employment, self-employed, or on government training
- 3: Permanently unable to work
- 4: Looking for/intending to look for paid work
- 5: Retired
- 6: Looking after home/family
- 7: Doing something else.

In 2012, econac08 changed to econac12, but kept the same responses. We use econac08/econac12 to represent the economic status of the respondent.
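As a minimal sketch of how the two differently named variables could be read into one labelled variable, the code below maps the numeric codes listed above onto text labels. The helper label_econ_status(), the output column econ_status and the label strings are all hypothetical names chosen for illustration; only econac08, econac12 and the code meanings come from the text.

```r
# Labels for codes 1-7, in the order listed above (label strings are illustrative)
econ_labels <- c(
  "in_education",
  "employed",
  "unable_to_work",
  "looking_for_work",
  "retired",
  "looking_after_home_family",
  "other"
)

# Hypothetical helper: use econac12 if present, otherwise econac08
label_econ_status <- function(data) {
  src <- intersect(c("econac12", "econac08"), names(data))[1]
  if (is.na(src)) stop("neither econac12 nor econac08 found in data")

  # Codes outside 1-7 (e.g. negative missing-value codes) become NA
  data$econ_status <- factor(data[[src]], levels = 1:7, labels = econ_labels)
  data
}

# Toy data (made up): -9 stands in for a missing-value code
toy <- data.frame(econac08 = c(2, 5, 7, NA, -9))
toy <- label_econ_status(toy)
```

Using factor() with explicit levels means that any code not in the list is turned into NA rather than silently kept.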

Education

The main education variable produced by the function clean_education() is a four category description of the age at which someone finished full-time education. The categories are:
- never went to school,
- left at 15 years or younger,
- left at 16-18,
- left at 19 years or over.

If someone was still in full time education at the time of the survey, then if they were younger than 18 years, we assumed they would leave at 16-18, and if they were older than 18 years, we assumed they would leave at 19 years or over.
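The categorisation rule, including the assumption for people still in full-time education, might be sketched as below. The function classify_eduend(), its argument names and the label strings are hypothetical. The text does not say how someone aged exactly 18 and still in education is treated; this sketch places them in the "19 years or over" group, which is an assumption.

```r
edu_levels <- c(
  "never_went_to_school",
  "left_at_15_or_younger",
  "left_at_16_18",
  "left_at_19_or_over"
)

# Hypothetical helper; all arguments are vectors of the same length
classify_eduend <- function(end_age, current_age, still_in_education, never_went) {
  # Age finished education: up to 15, 16-18, 19 and over (for whole-year ages)
  out <- as.character(
    cut(end_age, breaks = c(-Inf, 15, 18, Inf), labels = edu_levels[2:4])
  )

  # Still in full-time education: assign the assumed future leaving category
  # (age exactly 18 is treated as "19 or over" here, an assumption)
  still <- !is.na(still_in_education) & still_in_education
  out[still] <- ifelse(current_age[still] < 18, edu_levels[3], edu_levels[4])

  # Never went to school overrides everything else
  never <- !is.na(never_went) & never_went
  out[never] <- edu_levels[1]

  factor(out, levels = edu_levels)
}

# Toy respondents (made up)
eduend <- classify_eduend(
  end_age            = c(15, 17, 21, NA, NA, NA),
  current_age        = c(40, 35, 50, 16, 20, 60),
  still_in_education = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
  never_went         = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)
```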

A further education variable is also produced, which indicates whether or not an individual's top qualification is a degree. Here a degree is defined as an "NVQ4/NVQ5/Degree or equiv".

Family

The function clean_family() processes the data on the number of children in the household and the relationship status of each respondent.

Number of children in the household

The number of children in the household is not supplied explicitly. There is a variable 'hhdtypb2' describing the household type, but it is not explicit enough to record the number of children in the household.

- Single adult household: 1 adult aged 16-64, no children
- Single parent household: 1 adult of any age and 1 or more children
- Single older household: 1 adult 65+, no children
- Small family: 2 adults of any age and 1 or 2 children
- Older smaller family: 1 adult under 65 and 1 adult 65+, or 2 adults 65+, and no children
- Large adult: 3+ adults, no children
- Small adult: 2 adults under 65 and no children
- Large family: 2 adults of any age and 3+ children, or 3+ adults and 1+ children

Therefore, the number of children in the household is completely missing and needs to be imputed.

Need to look at what is done for HSE years from 2015 onwards, as will need to do the same with SHeS.

Relationship status

In previous versions of modelling (e.g. the alcohol binge model) relationship status has been described as married/not-married. Here, we include more detail by using:
- single
- married, civil partnership or cohabiting
- separated, divorced, widowed

Height and weight

Height (cm) and weight (kg). Weight is estimated above 130kg. Missing values of height and weight are replaced by the mean height and weight for each age, sex and IMD quintile.
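Filling missing values with a subgroup mean is used in several places in this vignette. A minimal base R sketch of the idea is given below. The helper impute_subgroup_mean() and the column names age_cat, sex and imd_quintile are assumptions made for illustration, not the names used inside the package.

```r
# Hypothetical helper: replace NAs in `var` with the mean of the subgroup
# defined by the columns in `by`
impute_subgroup_mean <- function(data, var, by = c("age_cat", "sex", "imd_quintile")) {
  x <- data[[var]]

  # One grouping factor for the observed combinations of the `by` columns
  g <- interaction(data[by], drop = TRUE)

  # Subgroup mean of the non-missing values, aligned to the rows
  group_mean <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))

  # If a whole subgroup is missing, its mean is NaN and the value stays missing
  x[is.na(x)] <- group_mean[is.na(x)]
  data[[var]] <- x
  data
}

# Toy data (made up)
toy <- data.frame(
  age_cat = c("16-24", "16-24", "16-24", "25-34", "25-34"),
  sex = c(1, 1, 1, 2, 2),
  imd_quintile = c("3", "3", "3", "1_least_deprived", "1_least_deprived"),
  height = c(170, NA, 180, 160, NA)
)

toy <- impute_subgroup_mean(toy, "height")
```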

Income

The function clean_income() processes the data on income.

There are a few different options for classifying income - the need to have a measure that is consistent across years of the Health Survey for England has led us to use equivalised income quintiles only.

In the past a measure of in poverty / not in poverty has been used, where the poverty threshold is defined as 60% of the median income for any year. For years in which we only have income quintiles available, it is not possible to make an exact calculation of poverty. But it will coincide approximately with the lowest 2 income quintiles.

We keep the income variable in 5 quintiles as in the HSE. SHeS includes individual equivalised income as well as income quintiles; we include the former as the variable eqvinc_15.

Health and biometric variables

The function clean_health_and_bio() cleans data on presence/absence of certain categories of health condition, and on height and weight.

Health conditions

In line with what we do in the Health Survey for England, we include the 14 categories of health conditions. These are:
- Cancer
- Endocrine or metabolic condition
- Mental health condition
- Nervous system condition
- Eye condition
- Ear condition
- Heart or circulatory system condition
- Respiratory condition
- Digestive condition
- Genito-urinary condition
- Skin condition
- Musculo-skeletal condition
- Infectious disease
- Blood and related organs condition

Quality of life

tbc

Cigarette smoking variables

Questions about cigarette smoking have been asked of adults aged 16 and over.

Cigarette smoking status

The function smk_status() categorises cigarette smoking into current, former and never regular cigarette smokers. If someone smokes either regularly or occasionally, then they are classified as a current regular cigarette smoker. People who used to smoke regularly or occasionally are classified as former smokers, but people who have only tried a cigarette once or twice are classified as never smokers. Ever-smokers are people who are either current or former smokers.

The only issue with using smk_status() for both datasets is that SHeS does not include smoking data for children, and therefore there are missing variables required to run the current code. May need to create an NA column for missing variables: kcigreg, and kcigevr.
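One possible way to add the missing child smoking variables as NA columns before calling the cleaning functions is sketched below. The helper add_missing_columns() and the column adult_var are hypothetical; kcigreg, kcigevr and kcigage are the variable names mentioned in this vignette. Whether the package functions then run unchanged on such data has not been checked here.

```r
# Hypothetical helper: add any variables in `vars` that are absent from `data`
# as columns filled with NA; existing columns are left untouched
add_missing_columns <- function(data, vars) {
  for (v in setdiff(vars, names(data))) {
    data[[v]] <- NA
  }
  data
}

# Toy SHeS-like data without the child smoking variables (made up)
toy <- data.frame(adult_var = c(1, 4, 2))

toy <- add_missing_columns(toy, c("kcigreg", "kcigevr", "kcigage"))
```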

Former smoking

The function smk_former() cleans the data on the time since quitting and time spent as a regular smoker among former smokers. We fill missing data:

- For children 8-15 years, we assume that missing values for former smokers = 1 year.
- For adults, we fill missing values with the average value for each age, sex and IMD quintile subgroup.

The issue with the smk_former() function is that in the Health Surveys for England 2015+, time since quitting and time spent as a smoker are provided in categories rather than single years. Additional code is therefore used for years 2015 and later, which we don't need for SHeS, so we will need to adapt the function for SHeS.

Smoking life-histories

The function smk_life_history() cleans the ages that define when smokers started and stopped being regular cigarette smokers. For each individual smoker, the data recorded implies a single age at which a smoker started to smoke and, if they stopped, an age at which they did so. This provides a simplified view of what might be a complicated life history of smoking, e.g. smoking to different frequencies or levels, or starting and stopping multiple times.

Both the start age and stop age will have error in them e.g. due to uncertainty in respondent recall. Start age is likely to be biased towards earlier ages, because for adults with missing values we use the age first tried a cigarette, and for children the variable for start age does not necessarily mean the start of regular smoking, it is just the age at which they started to smoke.

We also create a variable for the age at which an individual was censored from our data sample - this is their age at the survey + 1 year.

Any missing data is assigned the average start or stop age for each age, sex and IMD quintile.

The only issue with using smk_life_history() for both datasets is that SHeS does not include smoking data for children, and therefore there are missing variables required to run the current code. May need to create an NA column for missing variables: kcigage.

Amount and type of cigarette smoked by current smokers

The function smk_amount() cleans the variables that describe how much, what and to what level of addiction people smoke. The main variable is the average number of cigarettes smoked per day. For adults this is calculated from questions about how many cigarettes are smoked typically on a weekday vs. a weekend. For children, this is based on asking how many cigarettes were smoked in the last week. Missing values are imputed as the average amount smoked for an age, sex and IMD quintile subgroup.

For the HSE, we categorise cigarette preferences based on the answer to 'what is the main type of cigarette smoked'. In later years of the Health Survey for England, new questions are added that ask how many handrolled vs. machine rolled cigarettes are smoked on a weekday vs. a weekend. We currently don't use those questions because they were not asked in all years.

We also categorise the amount smoked, and use information on the time from waking until smoking the first cigarette. This latter variable has a high level of missingness. Together these categorical variables allow calculation of the heaviness of smoking index.
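For orientation, the heaviness of smoking index is conventionally the sum of two scores of 0-3: one for cigarettes smoked per day and one for the time from waking to the first cigarette. The sketch below uses the conventional cut points and assumes numeric inputs; heaviness_of_smoking() and its argument names are hypothetical, and the survey may record time to first cigarette in categories rather than minutes. It is not confirmed here that smk_amount() uses exactly these cut points.

```r
# Hypothetical helper using the conventional scoring:
#   cigarettes per day: 10 or fewer = 0, 11-20 = 1, 21-30 = 2, 31 or more = 3
#   minutes to first cigarette: over 60 = 0, 31-60 = 1, 6-30 = 2, 5 or fewer = 3
heaviness_of_smoking <- function(cigs_per_day, mins_to_first_cig) {
  # cut() with labels = FALSE returns the interval number (1 to 4)
  cpd_score <- cut(cigs_per_day, breaks = c(-Inf, 10, 20, 30, Inf), labels = FALSE) - 1

  # Shorter time to first cigarette gives a higher score
  ttfc_score <- 4 - cut(mins_to_first_cig, breaks = c(-Inf, 5, 30, 60, Inf), labels = FALSE)

  # Index runs from 0 to 6; NA if either component is missing
  cpd_score + ttfc_score
}

# Toy smokers (made up)
hsi <- heaviness_of_smoking(
  cigs_per_day      = c(5, 15, 25, 40, 20),
  mins_to_first_cig = c(120, 45, 10, 3, NA)
)
```

Because the index is NA whenever either component is missing, the high level of missingness in the time to first cigarette carries through to the index.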

Alcohol data

Alcohol consumption data in SHeS are recorded in the same way as in the Health Survey for England (HSE), in four main forms:
- How often someone usually drinks
- For adults considering the last 12 months, what they drink on average
- For adults considering the last week, when they drank the most
- For children considering the last week, what they drank

We analyse beverage-specific alcohol consumption in terms of beer (combining normal beer, strong beer), wine (combining wine and sherry), spirits, and alcopops.

Assumptions about serving size and alcohol content

We make the same assumptions as we do in the HSE. This is described more fully in the vignette alcohol_data.

Whether someone drinks and frequency of drinking

We class someone as a current drinker if they reported drinking at all in the last 12 months, even if they reported having only 1-2 drinks a year. No alcohol data are collected for children, so we retain under 16s in our data but assign them NAs for all alcohol questions. Any missing data are supplemented by the responses to the questions on whether the respondent currently drinks and whether they have always been a non-drinker. This processing is done by the function alc_drink_now_allages().

In 2010, some of the variables are named differently to the other years:
- nberqbt7 and sberqbt are not variables - we set these variables to zero if missing.
- sberqbt is not a variable, it is called sbeer4.

Adult average consumption in the last 12 months

We estimate the average amount drunk in terms of UK standard units of alcohol (1 unit = 10ml or 8g pure ethanol). The processing is done by the function alc_weekmean_adult(). The calculation has the following steps:
- Convert the categorical variables to numeric variables for the frequency with which each beverage is typically consumed (normal beer, strong beer, spirits, sherry, wine, alcopops).
- Convert the reported volumes usually consumed (e.g. small glass, large glass) into volumes in ml, using the beverage size assumptions above.
- Combine the volumes (ml) usually consumed with the frequency of consumption to give the average volume of each beverage type drunk each week (assuming constant consumption across the year).
- Convert the expected volumes of each beverage consumed each week to UK standard units of alcohol consumed, using the alcohol content assumptions above.
- Collapse normal and strong beer into a single "beer" variable by summing their units. Collapse wine and sherry into a single "wine" variable by summing their units.
- Calculate total weekly units by summing across beverage categories.
- Calculate the beverage "preference vector", the percentage of total consumption contributed by the consumption of each of four beverage types (beer, wine, spirits, alcopops).
- Cap the total units consumed in a week at 300 units, assuming that above this already very high level of consumption estimates of variation in consumption are less reliable.
- Categorise average weekly consumption into "abstainer", "lower risk" (less than 14 units/week), "increasing risk" (greater than or equal to 14 units/week, and less than 35 units/week for women or less than 50 units/week for men), "higher risk".
- Categorise beverage preferences - for each of the four beverages, "does not drink", "drinks some" (less than or equal to 50% of consumption), "mostly drinks".
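The later steps of this calculation can be sketched in base R as below, starting from weekly units for the four beverage types. This is an illustration under stated assumptions rather than the code of alc_weekmean_adult(): the function names, the output column names and the label strings are hypothetical, abstainers are identified here simply as people with zero total units, and the preference vector is set to zero for them. Sex is coded 1 = Male, 2 = Female as described earlier.

```r
# Units from a volume (ml) and an alcohol content (% ABV):
# 1 UK unit = 10 ml pure ethanol
units_from_volume <- function(vol_ml, abv_percent) {
  vol_ml * (abv_percent / 100) / 10
}

# Hypothetical helper: inputs are weekly units of each beverage type
summarise_weekly_drinking <- function(beer, wine, spirits, alcopops, sex) {
  units <- cbind(beer = beer, wine = wine, spirits = spirits, alcopops = alcopops)

  # Total weekly units across beverage types
  total <- rowSums(units)

  # Preference vector: percentage of total units from each beverage
  # (set to 0 for people with zero total units, an assumption)
  pref <- 100 * units / ifelse(total > 0, total, 1)

  # Cap total weekly units at 300
  total <- pmin(total, 300)

  # Sex-specific upper bound of the "increasing risk" band
  upper <- ifelse(sex == 1, 50, 35)

  category <- ifelse(total == 0, "abstainer",
    ifelse(total < 14, "lower_risk",
      ifelse(total < upper, "increasing_risk", "higher_risk")))

  data.frame(weekmean = total, category = category, pref, stringsAsFactors = FALSE)
}

# Toy respondents (made up)
out <- summarise_weekly_drinking(
  beer     = c(0, 6, 20, 20, 400),
  wine     = c(0, 4, 10, 10, 0),
  spirits  = c(0, 0, 10, 10, 0),
  alcopops = c(0, 0, 0, 0, 0),
  sex      = c(2, 1, 2, 1, 1)
)
```

In the toy data, the third and fourth respondents both drink 40 units a week but fall into different risk categories because the threshold differs by sex.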

Adult consumption on the heaviest drinking day in the last week

alc_sevenday_adult() processes the information from the questions on drinking in the last seven days - how many times drank and characteristics of the heaviest drinking day. We estimate the number of UK standard units of alcohol drunk on the heaviest drinking day by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content. We estimate their total units on heaviest drinking day (peakday) and categorise their binge drinking status (did_not_drink, binge, no_binge).

Normally, if one of the constituent variables was missing, the whole variable would be marked as missing. However, due to high missingness, we just assume any missing = 0, and so are likely to make underestimates.
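The effect of treating missing components as zero can be seen in a small sketch. The helper peakday_units() and the toy matrix of units by beverage are hypothetical and only illustrate the difference between the two treatments of missing values.

```r
# Hypothetical helper: total units on the heaviest drinking day.
# With missing_as_zero = TRUE, missing beverage components count as 0;
# with FALSE, any missing component makes the total missing.
peakday_units <- function(beverage_units, missing_as_zero = TRUE) {
  rowSums(beverage_units, na.rm = missing_as_zero)
}

# Toy units by beverage on the heaviest day (made up)
toy <- cbind(
  beer    = c(4, NA, NA),
  wine    = c(2, 3, NA),
  spirits = c(0, 1, NA)
)

lenient <- peakday_units(toy)                          # missing counted as 0
strict  <- peakday_units(toy, missing_as_zero = FALSE) # missing propagates
```

A respondent with every component missing gets a total of zero under the lenient rule, which is one way the underestimation mentioned above can arise.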

References

dosgillespie/hseclean documentation built on May 2, 2020, 1:15 a.m.
