knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This vignette explains how you can add variables to the cchsflow package. There are two types of variables that can be added:

  1. Existing CCHS variables to be harmonized across cycles.
  2. Derived variables based on harmonized CCHS cycles.

How to add existing CCHS variables to cchsflow

When adding variables that already exist across CCHS cycles, there are two worksheets that need to be specified:

  1. variable_details.csv: This worksheet maps variables across CCHS cycles.
  2. variables.csv: This worksheet lists all the variables that exist in cchsflow

Example of an existing CCHS variable: Body mass index (BMI)

This example will show how the existing CCHS BMI variable was developed using variable_details.csv and variables.csv. Note this variable is different from the derived BMI variable that is also included in cchsflow. For this article, a sample variable_details.csv & variables.csv will be loaded to demonstrate how to add variables.

variables <- read.csv(file.path(getwd(), '../inst/extdata/sample_variables.csv'))
variable_details <- read.csv(file.path(getwd(),'../inst/extdata/sample_variable_details.csv'))

Specifying the variable on variable_details.csv

Columns

  1. variable: the most common variable name for BMI is HWTGBMI. This should be written for each row.
library(knitr)
library(kableExtra)
kable(variable_details[1:6, 1], col.names = 'variable')
  1. dummyVariable: BMI is a continuous variable, so it does not have dummy variables.
kable(variable_details[1:6, 1:2])
  1. typeEnd: BMI was captured in the CCHS as a continuous variable. It does not make much sense to transform it into a categorical variable, so the toType should be cont in each row of BMI.
kable(variable_details[1:6, 1:3])
  1. databaseStart: BMI was captured in all CCHS surveys between 2001 and 2018, so in the first row with the continuous "category" and the else row, the CCHS identifiers will be listed this column:
kable(variable_details[c(1, 6), 1:4])
kable(variable_details[2:3, 1:4])
kable(variable_details[4:5, 1:4])
  1. variableStart: In the 2001, 2003, and 2005 CCHS surveys the BMI variable differs from the common name, while in the CCHS surveys from 2007-2018, the BMI variable is the same as the common name. However, the values for not applicable and missing categories changes after 2003. Therefore for the first & else rows, the variableStart column will look like this:
kable(variable_details[c(1, 6), 1:5])
kable(variable_details[2:3, 1:5])
kable(variable_details[4:5, 1:5])
  1. typeStart: As mentioned previously, BMI was measured as a continuous variable in the CCHS, so the fromType should be cont in each row of BMI.
kable(variable_details[1:6, 1:6])
  1. recEnd: Since this is a continuous variable, the first row (the main "category") has copy written. For the not applicable rows NA::a is written. For the missing and else rows NA::b is written. The haven package is used for tagging NA in numeric variables.
kable(variable_details[1:6, 1:7])
  1. numValidCat: Since this is a continuous variable, there are no actual categories; so N/A is written in each row.
kable(variable_details[1:6, 1:8])
  1. catLabel: For the first row BMI is written. Not applicable rows not applicable is written. Missing rows: missing. Else row: else
kable(variable_details[1:6, 1:9])
  1. catLabelLong: For the first row, body mass index is written to give further detail on what BMI is. The other rows remain the same.
kable(variable_details[1:6, 1:10])
  1. units: BMI is measured in kg/m^2^, so kg/m2 is written in each row.
kable(variable_details[1:6, 1:11])
  1. recStart: Going through the CCHS data documentation from 2001 to 2018, it was found that the lowest BMI value was 11.91 and the highest BMI value was 57.9. Therefore the recFrom for the first row is written as [11.91,57.9]. In the 2001 and 2003 CCHS surveys not applicable was coded as 999.6 so the recFrom for this row would be [999.6,999.6]. Similarly, in the 2001 and 2003 CCHS surveys don't know was coded as 999.7, refusal was coded as 999.8, and not stated was coded as 999.9. Therefore the recFrom for the missing row for CCHS 2001 and 2003 would be [999.7,999.9]. In the not applicable row for the 2005 CCHS survey onwards, the recFrom is [999.96,999.96]. In the missing row for CCHS 2005 onwards, the recFrom is [999.97,999.99]. For the else row, just write else.
kable(variable_details[1:6, 1:12])
  1. catStartLabel: For the first row, BMI / self-report (D,G) is written as it is written in CCHS documentation. The other rows remain the same, and the values for each missing category are stated in the missing rows.
kable(variable_details[1:6, 1:13])
  1. variableStartShortLabel: Writing BMI for each row is sufficient for this variable.
kable(variable_details[1:6, 1:14])
  1. variableStartLabel: As per CCHS documentation, the label for this variable is BMI / self-report - (D,G).
kable(variable_details[1:6, 1:15])
  1. notes: As described previously, there are differences between CCHS surveys with regards to coding the not applicable and missing categories. These are documented in this section. Aside from this, there are other changes and differences that should also be documented. In the 2001 CCHS survey, this variable was restricted to participants aged 20-64. As well, don't know (999.97) and refusal (999.98) were not asked in this survey.
kable(variable_details[1:6, ])

Specifying the variable on variables.csv

Once mapped and specified on variable_details.csv, the BMI variable can now be specified on variables.csv

library(knitr)
library(kableExtra)
kable(variables[1, ])

How to create derived variables and add them to cchsflow

Along with specifying the variable on variable_details.csv and variables.csv, a previous step is required in creating derived variables and that is creating a custom function that creates the derived variable from existing CCHS variables.

CustomFunctionName <- function(Vars from variableStart following same order){
  outputVar <- {Code on passed vars that generates a single value output}

  return(outputVar)
}

Example of a derived variable: Smoking pack-years

Pack-years is a complex derived variable often used by researchers to quantify the amount of cigarette use over a period of time. Even given its complex nature, pack-years can still be calculated. This derived variable incorporates numerous CCHS smoking variables, along with age.

Step 1. Creating a derived function

With complex derived variables, sometimes it is necessary to create functions within the custom function. For pack-years, a nested function was used to create an intermediate smoking variable that was used in the main function.

Pack_years_fun <-
  function(SMKDSTY, DHHGAGE_cont, SMK_09A_B, SMKG09C, SMKG203_cont, SMKG207_cont,
           SMK_204, SMK_05B, SMK_208, SMK_05C, SMKG01C_cont, SMK_01A) {
    #Time since quit for former daily smokers
    tsq_ds_fun <- function(SMK_09A_B, SMKG09C) {
      SMKG09C <-
        ifelse2(SMKG09C==1, 4,
        ifelse2(SMKG09C==2, 8,
        ifelse2(SMKG09C==3, 12, NA)))
      tsq_ds <-
        ifelse2(SMK_09A_B==1, 0.5,
        ifelse2(SMK_09A_B==2, 1.5,
        ifelse2(SMK_09A_B==3, 2.5,
        ifelse2(SMK_09A_B==4, SMKG09C, NA))))
    }
    tsq_ds<-tsq_ds_fun(SMK_09A_B, SMKG09C)
    # PackYears for Daily Smoker
    ifelse2(SMKDSTY==1, pmax(((DHHGAGE_cont - SMKG203_cont)*(SMK_204/20)), 0.0137),
    # PackYears for Occasional Smoker (former daily)     
    ifelse2(SMKDSTY==2, pmax(((DHHGAGE_cont - SMKG207_cont - tsq_ds)*(SMK_208/20)), 0.0137) 
            + (pmax((SMK_05B*SMK_05C/30), 1)*tsq_ds),
    # PackYears for Occasional Smoker (never daily)      
    ifelse2(SMKDSTY==3, (pmax((SMK_05B*SMK_05C/30), 1)/20)*(DHHGAGE_cont - SMKG01C_cont),
    # PackYears for former daily smoker (non-smoker now)      
    ifelse2(SMKDSTY==4, pmax(((DHHGAGE_cont - SMKG207_cont - tsq_ds)*(SMK_208/20)), 0.0137),
    # PackYears for former occasional smoker (non-smoker now) who smoked at least 100 cigarettes lifetime      
    ifelse2(SMKDSTY==5 & SMK_01A==1, 0.0137,
    # PackYears for former occasional smoker (non-smoker now) who have not smoked at least 100 cigarettes lifetime      
    ifelse2(SMKDSTY==5 & SMK_01A==2, 0.007,
    # Non-smoker      
    ifelse2(SMKDSTY==6, 0, NA)))))))
  }

More information on what each smoking variable means can be found in the Reference section.

Steps 2 and 3. Specifying pack-years in variable_details.csv and variables.csv

This is how the variable_details.csv sheet would look for the derived pack-years row

kable(variable_details[9,])

And this is how the variables.csv sheet would look for the derived pack-years row

kable(variables[3,])

Adding labels to a derived variable

For a continuous derived variable like pack-years, the labels specified in variables.csv are sufficient for the variable to be properly labelled. For categorical derived variables, extra rows will need to be added on variable_details.csv so that labels are generated for each category. The example below shows how binge_drinker, a derived categorical variable flagging respondents who binge drink, is specified in variable_details.csv.

kable(variable_details[10:12,])

As you can see, the first row for binge_drinker specifies the function for the derived variable and the base variables included. The second and third rows specify the categories of the variables, which are then labelled.

Creating a derived variable using derived variables

It is possible to create a derived variable that involves derived variables. When creating the custom function for it, use the derived variable name inside the function. Similarly, when specifying the variable in variable_details.csv and variables.csv, use the derived variable in the variableStart column. The example below shows how number_conditions, a derived categorical variable that counts the number of chronic conditions that uses the derived respiratory condition variable, is specified in variable_details.csv and variables.csv.

kable(variable_details[13:20,])
kable(variables[4,])

Creating a derived variable that exists in other CCHS cycles

In order to harmonize a variable across all cycles, you may need to create a derived variable for the years in which the variable is not present. An example of this is RACDPAL, a variable that was derived from 2003 to 2018. To create and specify RACDPAL for 2001, the same derived variable conventions apply with a few slight differences. In variables.csv, the DerivedVar is enclosed with the database in which it is derived.

kable(variables[5, 9], col.names = 'variableStart')

This is to specify that for the 2001 CCHS cycle, it is derived from those base variables; while in later cycles it is simply recoded from the database.

In variable_details.csv, additional rows are created for the 2001 cycle so that rec_with_table() understands that in 2001 it will be derived from base variables, while in later cycles it is a simple recode.

kable(variable_details[21:30, ])


Big-Life-Lab/cchsflow documentation built on Feb. 23, 2024, 12:04 a.m.