hseclean: Health Survey Data Wrangling

#' Read HSE 2002
#'
#' Reads and does basic cleaning on the Health Survey for England 2002.
#'
#' As well as providing a sample designed to give a cross-section of the population, HSE 2002 also
#' focussed on the health of a number of specific groups, including: infants and children (aged 0-15),
#' young adults (aged 16-24) and mothers with infants aged under 1. Addresses sampled in each postal
#' sector were systematically allocated to one of two groups: Sample I (29 addresses) or Sample II (9
#' addresses). Sample I was designed to boost the proportion of children, young people and mothers of
#' infants, and Sample II to provide a sample of the general population. At Sample I addresses all
#' persons aged 0-24 were eligible for inclusion in the survey, as were all mothers of infants aged under
#' 1 (there was no upper age limit for the mothers). At Sample II addresses all persons were eligible for
#' interview. At both Sample I and II addresses, where there were more than two children aged 0-15, two
#' children were selected at random. Information was obtained directly from persons aged 13 and over.
#' Information about children aged under 13 was obtained from a parent, with the child present.
#'
#' An interview with each eligible person (Stage 1) was followed by a visit by a nurse (Stage 2), who
#' made a number of measurements and in some cases obtained a blood sample and a saliva sample.
#' Both interviewers and nurses used computer-assisted interviewing. Blood and saliva samples were
#' sent to a laboratory for analysis.
#'
#' WEIGHTING
#'
#' In HSE 2002, the sample was boosted in order to obtain greater numbers of children, young adults
#' (aged 16-24) and mothers of infants under 1. While children aged 0-15 and young adults aged 16-24
#' were sampled from all selected addresses, adults aged 25 and over were selected only at Sample II
#' addresses (i.e. they were selected at only 9 out of the 38 addresses included within each postcode
#' sector). Consequently, in HSE 2002, those aged 25 and over were under-represented in the final
#' dataset. Different weights were applied to different age groups as explained below:
#' \itemize{
#' \item Children aged 0-15: To compensate for limiting the number of children interviewed in a household
#' to two (the sampling fraction therefore being lower in households containing three or more children) it
#' has become necessary to weight the child sample. This ‘child weight’ is the total number of children
#' aged 0-15 in the household divided by the number of selected children in the household. The
#' weighted sample was then adjusted to ensure that the age/sex distribution matched that of all children
#' in co-operating households.
#' \item Young adults aged 16-24: As all people in the household in this age range were selected for
#' interview, the sample in this age group has a weight of 1.
#' \item Adults aged 25 and over: The under-representation of adults aged 25+ in the sample is addressed by
#' giving all adults aged 25 and over a weight of 38/9 in the final dataset. The
#' exception is natural mothers of children under the age of 1, who were selected at all addresses and
#' hence were not under-represented.
#' \item The variable child_wt contains the appropriate weights for each of the three age groups described
#' above. These weights were then scaled by a constant factor so that the weighted sample size across the
#' sample as a whole was the same as the unweighted sample size. The scaled weight variable is tablewt.
#' \item The tables in the published volumes of the HSE2002 have been weighted using the child_wt variable.
#' For analysis relating to adults aged 16 and over using both the boost and general population samples,
#' the variable tablewt should be used.
#' }
#'
#' MISSING VALUES
#'
#' \itemize{
#' \item -1 Not applicable: Used to signify that a particular variable did not apply to a given respondent,
#' usually because of internal routing (for example, men in women-only questions).
#' \item -2 Schedule not applicable: Used mainly for variables on the self-completions when the
#' respondent was not of the given age range, also used for children without legal guardians in the
#' home who could not participate in the nurse schedule.
#' \item -6 Schedule not obtained: Used to signify that a particular variable was not answered because the
#' respondent did not complete or agree to a particular schedule (i.e. nurse schedule or self-completions).
#' \item -7 Refused/ not obtained: Used only for variables on the nurse schedules, this code indicates that a
#' respondent refused a particular measurement or test or the measurement was attempted but not
#' obtained or not attempted.
#' \item -8 Don't know, Can't say.
#' \item -9 No answer/ Refused
#' }
#'
#' @param root Character - the root directory. Only the first element is used.
#' @param file Character - the file path and name.
#' @importFrom data.table :=
#' @return Returns a data table. Note that:
#' \itemize{
#' \item Missing data ("NA", "", "-1", "-2", "-6", "-7", "-9", "-90", "-90.0", "N/A") are replaced with NA,
#' except -8 ("don't know"), which is retained as data.
#' \item All variable names are converted to lower case.
#' \item A single sampling cluster is assigned.
#' \item The primary sampling unit (psu) identifiers have the year prepended to them.
#' }
#' @export
#'
#' @examples
#'
#' \dontrun{
#'
#' data_2002 <- read_2002("X:/", "ScHARR/PR_Consumption_TA/HSE/HSE 2002/UKDA-4912-tab/tab/hse02ai.tab")
#'
#' }
#'
read_2002 <- function(
  root = c("X:/", "/Volumes/Shared/"),
  file = "ScHARR/PR_Consumption_TA/HSE/HSE 2002/UKDA-4912-tab/tab/hse02ai.tab"
) {

  data <- data.table::fread(
    paste0(root[1], file),
    na.strings = c("NA", "", "-1", "-2", "-6", "-7", "-9", "-90", "-90.0", "N/A")
  )

  # Namespace-qualified because only `:=` is imported from data.table
  data.table::setnames(data, names(data), tolower(names(data)))

  # Alcohol and smoking variables are picked out by column position in the raw file
  alc_vars <- colnames(data[ , 1532:1648])
  smk_vars <- colnames(data[ , 1469:1531])
  health_vars <- paste0("compm", 1:15)

  other_vars <- Hmisc::Cs(
    mintb, addnum,
    area, child_wt, #tablewt,#cluster, #wt_int,
    age, sex,
    ethnici,
    nimd, econact, nssec3, nssec8,
    #econact2, #paidwk,
    activb, #HHInc,
    children, infants,
    educend, topqual3,
    eqv5,
    #eqvinc,

    marstatb, # marital status inc cohabitees

    # height and weight measurements
    htval, wtval)

  names <- c(other_vars, alc_vars, smk_vars, health_vars)

  names <- tolower(names)

  data <- data[ , names, with = FALSE]

  data.table::setnames(data,

           c("area", "nimd", "d7unit", "child_wt", "marstatb", "ethnici",
             "nberf", "sberf", "spirf", "sherf", "winef", "popsf",
             "nberqhp", "nberqsm", "nberqlg", "nberqbt", "nberqpt",
             "sberqhp", "sberqsm", "sberqlg", "sberqbt", "sberqpt",
             "sherqgs", "spirqme"),

           c("psu", "qimd", "d7unitwg", "wt_int", "marstat", "ethnicity_raw",
             "nbeer", "sbeer", "spirits", "sherry", "wine", "pops",
             "nbeerq1", "nbeerq2", "nbeerq3", "nbeerq4", "nbeerq5",
             "sbeerq1", "sbeerq2", "sbeerq3", "sbeerq4", "sbeerq5",
             "sherryq", "spiritsq"))


  data[ , psu := paste0("2002_", psu)]
  data[ , cluster := "2002_all"]

  data[ , year := 2002]
  data[ , country := "England"]
  
  # Map mintb values 1-12 to quarters 1-4 (1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4)
  data[ , quarter := c(1:4)[findInterval(mintb, c(1, 4, 7, 10))]]
  data[ , mintb := NULL]

  return(data[])
}
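
The weighting rules described in the documentation above can be illustrated with a small base R sketch. This is not code from hseclean: the function name `sketch_hse2002_weight` and its arguments are hypothetical, and the toy respondents are made up. The sketch assumes that mothers of infants aged 25 and over get a weight of 1 (the documentation says only that they are the exception to the 38/9 weight), and it leaves out the age/sex adjustment applied to the child sample.

```r
# Hypothetical helper, for illustration only (not part of hseclean).
# Returns the unscaled weight for each respondent:
#   - children aged 0-15: children in household / children selected
#   - young adults aged 16-24: 1
#   - adults aged 25+: 38/9, except mothers of infants (assumed here to be 1)
sketch_hse2002_weight <- function(age, n_children_hh, n_children_selected,
                                  mother_of_infant) {
  ifelse(
    age <= 15,
    n_children_hh / n_children_selected,
    ifelse(age <= 24 | mother_of_infant, 1, 38 / 9)
  )
}

# Toy respondents: two children from a three-child household, one young adult,
# one adult aged 25+, and one mother of an infant aged 25+.
age                 <- c(4, 9, 20, 40, 31)
n_children_hh       <- c(3, 3, NA, NA, NA)
n_children_selected <- c(2, 2, NA, NA, NA)
mother_of_infant    <- c(FALSE, FALSE, FALSE, FALSE, TRUE)

raw_wt <- sketch_hse2002_weight(age, n_children_hh, n_children_selected,
                                mother_of_infant)

# Scale by a constant so the weighted sample size equals the unweighted one
scaled_wt <- raw_wt * length(raw_wt) / sum(raw_wt)
```

After scaling, the weights sum to the number of respondents, which is the property the documentation describes for tablewt.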