hseclean: Health Survey Data Wrangling

#' Read HSE 2007
#'
#' Reads and does basic cleaning on the Health Survey for England 2007.
#'
#' The Health Survey for England 2007 was designed to provide data at both national and
#' regional level about the population living in private households in England. The sample for
#' the HSE 2007 comprised two components: the core (general population) sample and a
#' boost sample of children aged 2-15. The core sample was designed to be representative of
#' the population living in private households in England and should be used for analyses at
#' the national level. The core sample was split in two for some modules of the 2007 survey;
#' further details are shown in Appendix A.
#'
#' A random sample of 720 PSUs (Primary Sampling Units) was selected for the core and the
#' boost sample; an additional 180 PSUs were used to supplement the child boost sample. The
#' PSUs were selected with probability proportional to the total number of addresses within
#' them. Once selected, the PSUs were randomly allocated to the 12 months of the year (60
#' per month in the core sample, 15 per month in the additional child boost) so that each
#' quarter provided a nationally representative sample.
#'
#' Within each of the 720 core PSUs a sample of 36 addresses was selected. The selected
#' addresses were randomly allocated to either the core or child boost sample: 10 addresses to
#' the core sample and 26 to the child boost sample. In total therefore, there were 10 core
#' addresses allocated within each PSU, giving a total sample of 7,200 (720 x 10) core
#' addresses, and 18,720 child boost addresses (720 x 26).
#'
#' For the 180 additional child boost PSUs, a random sample of 41 addresses was selected in
#' each PSU, giving a total sample of 7,380 addresses (180 x 41) for the additional child boost
#' sample. The total child boost sample was thus 26,100 addresses (18,720 from the child
#' boost sample in core points and 7,380 from the additional child boost sample).
#'
#' For the HSE core sample, all adults aged 16 years or older at each household were selected
#' for the interview (up to a maximum of ten adults). However, a limit of two was placed on the
#' number of interviews carried out with children aged 0-15. For households with three or more
#' children, interviewers selected two children at random.
#'
#' At boost addresses interviewers screened for households containing at least one child aged
#' 2-15 years. For households which included eligible children, up to two were selected by the
#' interviewer for inclusion in the survey.
#'
#' An interview with each eligible person was followed by a nurse visit, both using
#' computer-assisted interviewing (CAPI). The 2007 survey for adults focused on lifestyle behaviour,
#' knowledge and attitudes. Adults were asked modules of questions on general health, alcohol
#' consumption, smoking, and fruit and vegetable consumption. Knowledge and attitudes were
#' covered in self-completion questionnaires.
#'
#' Children aged 13-15 were interviewed themselves, and parents of children aged 0-12 were
#' asked about their children, with the interview including questions on eating habits (fat and
#' sugar consumption) and fruit and vegetable consumption. Only children in the boost sample
#' were asked about physical activity.
#'
#' WEIGHTING
#'
#' Individual weight
#'
#' For analyses at the individual level, the weighting variable to use is (wt_int). These weights
#' are generated separately for adults and children:
#' \itemize{
#' \item for adults (aged 16 or more), the interview weights are a combination of the household
#' weight and a component which adjusts the sample to reduce bias from individual nonresponse within households;
#' \item for children (aged 0 to 15), the weights are generated from the household weights and
#' the child selection weights – the selection weights correct for only including a maximum
#' of two children in a household. The combined household and child selection weights were
#' adjusted to ensure that the weighted age/sex distribution matched that of all children in
#' co-operating households.
#' }
#'
#' For analysis of children aged 0-15 in both the Core and the Boost sample, taking into
#' account child selection only and not adjusting for non-response, the (wt_child) variable can
#' be used. For analysis of children aged 2-15 in the Boost sample only, the (wt_childb)
#' variable can be used.
#'
#' MISSING VALUES
#'
#' \itemize{
#' \item -1 Not applicable: Used to signify that a particular variable did not apply to a given respondent
#' usually because of internal routing. For example, men in women only questions.
#' \item -2 Schedule not applicable: Used mainly for variables on the self-completions when the
#' respondent was not of the given age range, also used for children without legal guardians in the
#' home who could not participate in the nurse schedule.
#' \item -8 Don't know, Can't say.
#' \item -9 No answer/ Refused
#' }
#'
#' @param root Character - the root directory. Only the first element is used.
#' @param file Character - the file path and name, relative to the root directory.
#' @importFrom data.table :=
#' @return Returns a data table. Note that:
#' \itemize{
#' \item Missing data ("NA", "", "-1", "-2", "-6", "-7", "-9", "-90", "-90.0", "N/A") is replaced with NA,
#' except -8 ("don't know"), which is retained as data.
#' \item All variable names are converted to lower case.
#' \item The cluster and primary sampling unit (PSU) have the year prepended to them.
#' }
#' @export
#'
#' @examples
#'
#' \dontrun{
#'
#' data_2007 <- read_2007("X:/", "ScHARR/PR_Consumption_TA/HSE/HSE 2007/UKDA-6112-tab/tab/hse07ai.tab")
#'
#' }
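#'
#' # A minimal sketch, runnable without the survey data, of how read_2007()
#' # derives the quarter from the interview month. `month` is a hypothetical
#' # stand-in for the mintb column, assumed here to hold months coded 1 to 12.
#' month <- c(1, 3, 4, 6, 7, 9, 10, 12)
#'
#' # findInterval() returns the index of the last break point that is <= month,
#' # so months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
#' quarter <- c(1:4)[findInterval(month, c(1, 4, 7, 10))]
#'
#' data.frame(month, quarter)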
#'
read_2007 <- function(
  root = c("X:/", "/Volumes/Shared/"),
  file = "ScHARR/PR_Consumption_TA/HSE/HSE 2007/UKDA-6112-tab/tab/hse07ai.tab"
) {

  ##################################################################################
  # General population

  data <- data.table::fread(
    paste0(root[1], file),
    na.strings = c("NA", "", "-1", "-2", "-6", "-7", "-9", "-90", "-90.0", "N/A")
  )

  data.table::setnames(data, names(data), tolower(names(data)))

  alc_vars <- colnames(data[ , 482:595])
  smk_vars <- colnames(data[ , c(1540:1557, 1808:2010)])
  health_vars <- paste0("compm", 1:15)

  other_vars <- Hmisc::Cs(
    mintb, addnum,
    area, cluster, wt_int,
    age, sex,
    ethinda,
    imd2007, econact, nssec3, nssec8,
    #econact2, #paidwk,
    activb, #HHInc,
    children, infants,
    educend, topqual3,
    eqv5, #eqvinc,

    marstatc, # marital status inc cohabitees

    # how much they weigh
    htval, wtval)

  names <- c(other_vars, alc_vars, smk_vars, health_vars)

  names <- tolower(names)

  data <- data[ , names, with = FALSE]

  data.table::setnames(data, c("imd2007", "area", "marstatc", "ethinda"), c("qimd", "psu", "marstat", "ethnicity_raw"))

  data[ , psu := paste0("2007_", psu)]
  data[ , cluster := paste0("2007_", cluster)]

  data[ , year := 2007]
  data[ , country := "England"]
  
  data[ , quarter := c(1:4)[findInterval(mintb, c(1, 4, 7, 10))]]
  data[ , mintb := NULL]

  return(data[])
}
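
# ----------------------------------------------------------------------------------
# Usage sketch (an illustration only, not part of the package code).
#
# The documentation above names wt_int as the weighting variable for analyses
# at the individual level. Assuming read_2007() has returned a data table that
# contains the columns sex, wtval and wt_int, a weighted mean could be computed
# roughly as below. The `toy` table and its values are made up for illustration,
# and coding sex as 1 and 2 is an assumption rather than something stated in
# the documentation.
toy <- data.table::data.table(
  sex    = c(1, 1, 2, 2),
  wtval  = c(80, 90, NA, 70),
  wt_int = c(1, 3, 2, 2)
)

# Weighted mean of wtval within each sex group. With na.rm = TRUE,
# stats::weighted.mean() drops rows where wtval is NA together with their weights.
toy_means <- toy[
  ,
  .(mean_wtval = stats::weighted.mean(wtval, wt_int, na.rm = TRUE)),
  by = sex
]

print(toy_means)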