hseclean: Health Survey Data Wrangling

Documented in read_2004

#' Read the Health Survey for England 2004
#'
#' Reads and does basic cleaning on the Health Survey for England 2004.
#'
#' @section Survey details:
#'
#' The Health Survey for England 2004 was designed to provide data at both national and regional level
#' about the population living in private households in England. **The sample design of the 2004 survey
#' had two parts: a general population sample that followed the same pattern as in previous years and a
#' minority ethnic ‘boost’ sample, designed solely to yield interviews with members of the seven largest
#' minority ethnic groups in England: Black Caribbean, Black African, Indian, Pakistani, Bangladeshi,
#' Chinese and Irish.**
#'
#' The general population sample was half the size of the usual sample, and involved the selection of 6,552
#' addresses from the Postcode Address File (PAF) in 312 wards, issued over a twelve-month period
#' from January to December 2004. Up to ten adults and up to two children in each household were
#' interviewed, and a nurse visit was arranged for those participants in minority ethnic groups who
#' consented.
#'
#' In the ethnic boost sample, 41,436 addresses were randomly selected from PAF, within another 483
#' wards, issued over the same twelve-month period, January to December 2004. All sampled addresses were
#' fully screened and only informants from the specified minority ethnic groups were eligible for
#' inclusion in the survey. Among those eligible informants at an address, up to four adults and three
#' children were selected for interview, with a random selection of participants if there were more than
#' this number in an eligible household.
#' In order to increase further the number of Chinese informants in 2004, the sample was supplemented
#' with an extra sample consisting of people with ‘Chinese sounding’ surnames obtained from the
#' Electoral Register (for further information see HSE 2004 report, Volume 2, Methodology and
#' documentation).
#'
#' For informants from the specified minority ethnic groups (whether identified in the general population
#' sample or the minority ethnic sample), an interview with each eligible person was followed by a nurse
#' visit, both using computer-assisted interviewing. The main focus of the 2004 survey for adults from
#' minority ethnic backgrounds was cardiovascular disease (CVD) and related risk factors. At the nurse
#' visit, questions were asked about prescribed medication, vitamin supplements and nicotine
#' replacements. The nurse took the blood pressure of those aged 5 and over, measured lung function of
#' those aged 7-15, and made waist and hip measurements for those aged 11 and over. Saliva samples
#' were collected from 4-15 year olds and blood samples from those aged 11 and over including fasting
#' blood from those aged 16 and over. Blood and saliva samples were sent to a laboratory for analysis.
#'
#' Informants in the general population sample, unless they were members of the specified minority
#' ethnic groups, were given a shortened version of the questionnaire covering core topics only.
#' For all informants, information was obtained directly from persons aged 13 and over. Information
#' about children under 13 was obtained from a parent with the child present.
#'
#' @section Weighting:
#'
#' General Population Data (HSE04gpa.sav)
#'
#' Prior to 2003, the weighting strategy for the core sample in the HSE was to apply selection weights
#' only – no attempt was made to reduce non-response bias through weighting. However, following a
#' review of the weighting, non-response was also included in the weighting strategy for the HSE 2003.
#' We have followed the same approach for weighting the HSE 2004 general population sample data.
#'
#' Two sets of non-response weights have been generated for the general population sample: household
#' weights which adjust for non-contact and refusal of households, and interview weights which also
#' adjust for the additional non-response among individuals in participating households.
#'
#' **The household weight (wt_hhld) is a household level weight that corrects the distribution of
#' household members to match population estimates for sex/age groups and GOR.** These weights were
#' generated using calibration weighting, with the household selection weights as starting values. (The
#' household selection weights correct for the selection of at most three households at addresses
#' with more than three.) Note that the population control totals used for the calibration weighting were
#' the ONS projected mid-year population estimates for 2004, but with a small adjustment to exclude
#' (our best estimate of) the population aged 65 and over living in communal establishments.
#'
#' For analyses at the individual level, the weighting variable to use is wt_int. These weights are
#' generated separately for adults and children:
#' \itemize{
#' \item for adults (aged 16 or more), **the interview weights are a combination of the household weight
#' and a component which adjusts the sample to reduce bias from individual non-response within
#' households**;
#' \item for children (aged 0 to 15), **the weights are generated from the household weights and the child
#' selection weights – the selection weights correct for only including a maximum of two children in
#' a household.** The combined household and child selection weights were adjusted to ensure that the
#' weighted age/sex distribution matched that of all children in co-operating households.
#' }
#' For analysis of children aged 0-15 in the General Population Sample, taking into account child
#' selection only and not adjusting for non-response, the child_wt variable can be used.
#'
#' **Minority Ethnic Group Data (HSE04etha.sav)**
#'
#' For the HSE 2004, as well as the general population sample, boost samples were selected in areas
#' with (relatively) higher proportions of people in minority ethnic groups. All respondents, whether
#' from the general population sample or the boost sample, are included in the minority ethnic group
#' sample. As well as the main interview, respondents in the minority ethnic group sample were also
#' eligible for nurse visits and to have blood taken. Therefore, three sets of weights were generated:
#' interview weights, plus nurse and blood weights.
#'
#' The first stage of the weighting process was to generate weights for the probability of selecting an
#' address in the minority ethnic sample. Addresses in areas with different ethnic profiles had different
#' chances of being selected for the HSE - selection weights were generated which corrected for this.
#' These weights were combined with the selection weights for households within addresses and for
#' individuals within households to give the interview weights (wt_int).
#'
#' All respondents in the minority ethnic group sample were eligible for a nurse visit and were also
#' asked to give a sample of blood (if aged 11 or more). As there was drop-out at both these stages,
#' separate weights were generated for the nurse visit sample (wt_nurse) and blood sample (wt_blood).
#'
#' @section Missing values:
#'
#' \itemize{
#' \item -1 Not applicable: Used to signify that a particular variable did not apply to a given respondent,
#' usually because of internal routing. For example, men in women-only questions.
#' \item -2 Schedule not applicable: Used mainly for variables on the self-completions when the
#' respondent was not of the given age range, also used for children without legal guardians in the
#' home who could not participate in the nurse schedule.
#' \item -6 Schedule not obtained: Used to signify that a particular variable was not answered because the
#' respondent did not complete or agree to a particular schedule (i.e. nurse schedule or self-completions).
#' \item -7 Refused/ not obtained: Used only for variables on the nurse schedules, this code indicates that a
#' respondent refused a particular measurement or test or the measurement was attempted but not
#' obtained or not attempted.
#' \item -8 Don't know, Can't say.
#' \item -9 No answer/ Refused
#' }
#'
#' @param root Character - the root directory.
#' @param file_generalpop Character - the file path and name of the general population data file.
#' @param file_ethnicboost Character - the file path and name of the ethnic boost data file.
#' @param select_cols Character string - select either:
#' "all" - keep all variables in the survey data;
#' "tobalc" - keep a reduced set of variables associated with tobacco and alcohol consumption and a selected set of
#' survey design and socio-demographic variables that are needed for the functions within the hseclean package to work.
#'
#' @importFrom data.table :=
#'
#' @return Returns a data table.
#'
#' @export
#'
#' @examples
#'
#' \dontrun{
#'
#' data_2004 <- read_2004(
#'   root = "X:/",
#'   file_generalpop =
#'     "ScHARR/PR_Consumption_TA/HSE/HSE 2004/UKDA-5439-tab/tab/hse04gpa.tab",
#'   file_ethnicboost =
#'     "ScHARR/PR_Consumption_TA/HSE/HSE 2004/UKDA-5439-tab/tab/hse04etha.tab"
#' )
#'
#' }
#'
read_2004 <- function(
    root = c("X:/", "/Volumes/Shared/")[1],
    file_generalpop =
      "HAR_PR/PR/Consumption_TA/HSE/Health Survey for England (HSE)/HSE 2004/UKDA-5439-tab/tab/hse04gpa.tab",
    file_ethnicboost =
      "HAR_PR/PR/Consumption_TA/HSE/Health Survey for England (HSE)/HSE 2004/UKDA-5439-tab/tab/hse04etha.tab",
    select_cols = c("tobalc", "all")[1]
) {

  ##################################################################################
  # General population

  data <- data.table::fread(
    paste0(root, file_generalpop),
    na.strings = c("NA", "", "-1", "-2", "-6", "-7", "-8", "-9", "-90", "-90.0", "N/A"))

  data.table::setnames(data, names(data), tolower(names(data)))

  if(select_cols == "tobalc") {

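    # Alcohol and smoking variables are picked by column position,
    # so this relies on the column order of the general population file
    alc_vars <- colnames(data[ , 1507:1574])
    smk_vars <- colnames(data[ , 1436:1506])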
    health_vars <- paste0("compm", 1:15)

    other_vars <- Hmisc::Cs(
      mintb, addnum,
      area, cluster, wt_int,
      hserial,pserial,
      age, sex,
      ethcind,
      imd2004, econact, nssec3, nssec8,
      #econact2, #paidwk,
      activb, #HHInc,
      children, infants,
      educend, topqual3,
      #eqv5,
      eqvinc,

      marstatb, # marital status inc cohabitees

      # how much they weigh
      htval, wtval)

    names <- c(other_vars, alc_vars, smk_vars, health_vars)

    names <- tolower(names)

    data <- data[ , names, with = FALSE]

  }

  data.table::setnames(data,
                       c("area", "imd2004", "d7unit", "marstatb", "ethcind", "pserial"),
                       c("psu", "qimd", "d7unitwg", "marstat", "ethnicity_raw", "hse_id"))

  data[ , psu := paste0("2004_", psu)]
  data[ , cluster := paste0("2004_", cluster)]

  data[ , year := 2004]
  data[ , country := "England"]

  data[ , quarter := c(1:4)[findInterval(mintb, c(1, 4, 7, 10))]]
  data[ , mintb := NULL]

  # For combining data with the ethnic boost sample,
  # make the interview weights for the general population sample sum to 1
  data[ , wt_int := wt_int / sum(wt_int, na.rm = TRUE)]

  ##################################################################################
  # Ethnic boost sample

  data_ethnicboost <- data.table::fread(
    paste0(root, file_ethnicboost),
    na.strings = c("NA", "", "-1", "-2", "-6", "-7", "-8", "-9", "-90", "-90.0", "N/A"))

  data.table::setnames(data_ethnicboost, names(data_ethnicboost), tolower(names(data_ethnicboost)))

  if(select_cols == "tobalc") {

    data_ethnicboost <- data_ethnicboost[ , names, with = FALSE]

  }

  data.table::setnames(data_ethnicboost,
                       c("area", "imd2004", "d7unit", "marstatb", "ethcind", "pserial"),
                       c("psu", "qimd", "d7unitwg", "marstat", "ethnicity_raw", "hse_id"))

  data_ethnicboost[ , psu := paste0("2004_", psu, "eb")]
  data_ethnicboost[ , cluster := paste0("2004_", cluster, "eb")]

  data_ethnicboost[ , year := 2004]
  data_ethnicboost[ , country := "England"]

  data_ethnicboost[ , quarter := c(1:4)[findInterval(mintb, c(1, 4, 7, 10))]]
  data_ethnicboost[ , mintb := NULL]

  # For combining data with the general population sample,
  # make the interview weights for the ethnic boost sample sum to 1
  data_ethnicboost[ , wt_int := wt_int / sum(wt_int, na.rm = TRUE)]

  # Due to uncertainty about how to combine the ethnic boost and general population samples,
  # and because smoking prevalence in the combined sample looks too low for this year,
  # use only the general population sample
  #return(data.table::rbindlist(list(data, data_ethnicboost), use.names = TRUE)[])
  return(data[])
}
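
# ----------------------------------------------------------------------------
# Illustrative sketch only; not part of hseclean. The helper names
# month_to_quarter() and normalise_weights() are hypothetical, and the sketch
# assumes the interview month is coded 1-12. It restates in base R the two
# transformations that read_2004() applies to each sample: mapping the
# interview month to a quarter, and rescaling the interview weights to sum to 1.

month_to_quarter <- function(month) {
  # findInterval() counts how many break points are <= each month,
  # so months 1-3 give 1, 4-6 give 2, 7-9 give 3 and 10-12 give 4
  findInterval(month, c(1, 4, 7, 10))
}

normalise_weights <- function(w) {
  # Missing weights stay NA and are ignored when computing the total
  w / sum(w, na.rm = TRUE)
}

# Possible usage:
# month_to_quarter(c(1, 4, 9, 12))   # 1 2 3 4
# normalise_weights(c(2, NA, 6))     # 0.25 NA 0.75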