knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
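This vignette explains how you can add variables to the cchsflow package. There are two types of variables that can be added: variables that already exist across CCHS cycles, and derived variables that are created from existing CCHS variables.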
When adding variables that already exist across CCHS cycles, there are two worksheets that need to be specified:
variable_details.csv: This worksheet maps variables across CCHS cycles.
variables.csv: This worksheet lists all the variables that exist in cchsflow.
This example will show how the existing CCHS BMI variable was developed using variable_details.csv and variables.csv. Note this variable is different from the derived BMI variable that is also included in cchsflow. For this article, a sample variable_details.csv and variables.csv will be loaded to demonstrate how to add variables.
variables <- read.csv(file.path(getwd(), '../inst/extdata/sample_variables.csv'))
variable_details <- read.csv(file.path(getwd(), '../inst/extdata/sample_variable_details.csv'))
variable_details.csv
The variable name is HWTGBMI. This should be written for each row.
library(knitr)
library(kableExtra)
kable(variable_details[1:6, 1], col.names = 'variable')
kable(variable_details[1:6, 1:2])
cont is written in each row of BMI.
kable(variable_details[1:6, 1:3])
kable(variable_details[c(1, 6), 1:4])
kable(variable_details[2:3, 1:4])
kable(variable_details[4:5, 1:4])
kable(variable_details[c(1, 6), 1:5])
kable(variable_details[2:3, 1:5])
kable(variable_details[4:5, 1:5])
cont is written in each row of BMI.
kable(variable_details[1:6, 1:6])
copy is written. For the not applicable rows, NA::a is written. For the missing and else rows, NA::b is written. The haven package is used for tagging NA in numeric variables.
kable(variable_details[1:6, 1:7])
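As a rough illustration of what tagged NA values look like in practice, the sketch below builds a small numeric vector by hand with haven. The vector `bmi` and its values are made up for this example; the assumption is that tag "a" stands for not applicable and tag "b" for missing, mirroring NA::a and NA::b above.

```r
library(haven)

# Hypothetical recoded BMI values: one valid value, one "not applicable"
# (tag "a"), and one "missing" (tag "b"). Tagged NAs only work in double vectors.
bmi <- c(24.5, tagged_na("a"), tagged_na("b"))

is.na(bmi)              # tagged NAs still behave like ordinary NA
na_tag(bmi)             # recovers the tag for each element (NA if untagged)
is_tagged_na(bmi, "a")  # TRUE only for the value tagged "a"
```

Because the tags survive inside an ordinary numeric vector, the two kinds of missingness can be told apart later without changing how functions such as mean(x, na.rm = TRUE) treat them.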
N/A is written in each row.
kable(variable_details[1:6, 1:8])
BMI is written. Not applicable rows: not applicable. Missing rows: missing. Else row: else.
kable(variable_details[1:6, 1:9])
body mass index is written to give further detail on what BMI is. The other rows remain the same.
kable(variable_details[1:6, 1:10])
kg/m2 is written in each row.
kable(variable_details[1:6, 1:11])
The recFrom for the valid BMI values is [11.91,57.9]. In the 2001 and 2003 CCHS surveys, not applicable was coded as 999.6, so the recFrom for this row would be [999.6,999.6]. Similarly, in the 2001 and 2003 CCHS surveys, don't know was coded as 999.7, refusal was coded as 999.8, and not stated was coded as 999.9. Therefore the recFrom for the missing row for CCHS 2001 and 2003 would be [999.7,999.9]. In the not applicable row for the 2005 CCHS survey onwards, the recFrom is [999.96,999.96]. In the missing row for CCHS 2005 onwards, the recFrom is [999.97,999.99]. For the else row, just write else.
kable(variable_details[1:6, 1:12])
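To make the interval notation concrete, here is a minimal base R sketch of how the recFrom ranges for CCHS 2005 onwards partition raw values. It is not how cchsflow implements recoding; `recode_rule_sketch` is a hypothetical helper written only for this illustration, and it simply reports which row of the specification a value falls under.

```r
# Sketch only: not cchsflow's implementation. `recode_rule_sketch` is a
# hypothetical helper for the CCHS 2005-onwards BMI intervals described above.
recode_rule_sketch <- function(x) {
  rule <- rep("else", length(x))                            # default: the else row
  rule[which(x >= 11.91 & x <= 57.9)] <- "copy"              # valid range [11.91,57.9]
  rule[which(x == 999.96)] <- "not applicable"               # [999.96,999.96]
  rule[which(x >= 999.97 & x <= 999.99)] <- "missing"        # [999.97,999.99]
  rule
}

recode_rule_sketch(c(22.4, 999.96, 999.98, 5))
```

The last value, 5, lies outside every listed interval, so it falls through to the else row.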
BMI / self-report (D,G) is written as it is written in CCHS documentation. The other rows remain the same, and the values for each missing category are stated in the missing rows.
kable(variable_details[1:6, 1:13])
BMI for each row is sufficient for this variable.
kable(variable_details[1:6, 1:14])
BMI / self-report - (D,G).
kable(variable_details[1:6, 1:15])
kable(variable_details[1:6, ])
variables.csv
Once mapped and specified on variable_details.csv, the BMI variable can now be specified on variables.csv.
library(knitr)
library(kableExtra)
kable(variables[1, ])
In addition to specifying the variable on variable_details.csv and variables.csv, creating a derived variable requires a prior step: writing a custom function that creates the derived variable from existing CCHS variables.
CustomFunctionName <- function(Vars from variableStart following same order) {
  outputVar <- {Code on passed vars that generates a single value output}
  return(outputVar)
}
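A filled-in version of this template might look like the sketch below. The function name `bmi_sketch` and the arguments `height_m` and `weight_kg` are hypothetical names chosen for illustration; they are not cchsflow identifiers, and this is not the package's own derived BMI function.

```r
# Hypothetical example of the template above. `bmi_sketch`, `height_m`, and
# `weight_kg` are illustrative names, not cchsflow identifiers.
bmi_sketch <- function(height_m, weight_kg) {
  # BMI = weight in kilograms divided by the square of height in metres
  output_var <- weight_kg / height_m^2
  return(output_var)
}

bmi_sketch(height_m = 1.75, weight_kg = 70)
```

The arguments are listed in the order the base variables would appear in variableStart, and the function is vectorized, so it returns one value per respondent when given whole columns.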
Pack-years is a complex derived variable often used by researchers to quantify the amount of cigarette use over a period of time. Despite its complexity, pack-years can be calculated with a custom function. This derived variable incorporates numerous CCHS smoking variables, along with age.
With complex derived variables, sometimes it is necessary to create functions within the custom function. For pack-years, a nested function was used to create an intermediate smoking variable that was used in the main function.
Pack_years_fun <- function(SMKDSTY, DHHGAGE_cont, SMK_09A_B, SMKG09C,
                           SMKG203_cont, SMKG207_cont, SMK_204, SMK_05B,
                           SMK_208, SMK_05C, SMKG01C_cont, SMK_01A) {
  # Time since quit for former daily smokers
  tsq_ds_fun <- function(SMK_09A_B, SMKG09C) {
    SMKG09C <- ifelse2(SMKG09C == 1, 4,
               ifelse2(SMKG09C == 2, 8,
               ifelse2(SMKG09C == 3, 12, NA)))
    tsq_ds <- ifelse2(SMK_09A_B == 1, 0.5,
              ifelse2(SMK_09A_B == 2, 1.5,
              ifelse2(SMK_09A_B == 3, 2.5,
              ifelse2(SMK_09A_B == 4, SMKG09C, NA))))
  }
  tsq_ds <- tsq_ds_fun(SMK_09A_B, SMKG09C)
  # PackYears for Daily Smoker
  ifelse2(SMKDSTY == 1,
    pmax(((DHHGAGE_cont - SMKG203_cont) * (SMK_204 / 20)), 0.0137),
  # PackYears for Occasional Smoker (former daily)
  ifelse2(SMKDSTY == 2,
    pmax(((DHHGAGE_cont - SMKG207_cont - tsq_ds) * (SMK_208 / 20)), 0.0137) +
      (pmax((SMK_05B * SMK_05C / 30), 1) * tsq_ds),
  # PackYears for Occasional Smoker (never daily)
  ifelse2(SMKDSTY == 3,
    (pmax((SMK_05B * SMK_05C / 30), 1) / 20) * (DHHGAGE_cont - SMKG01C_cont),
  # PackYears for former daily smoker (non-smoker now)
  ifelse2(SMKDSTY == 4,
    pmax(((DHHGAGE_cont - SMKG207_cont - tsq_ds) * (SMK_208 / 20)), 0.0137),
  # PackYears for former occasional smoker (non-smoker now) who smoked at least 100 cigarettes lifetime
  ifelse2(SMKDSTY == 5 & SMK_01A == 1, 0.0137,
  # PackYears for former occasional smoker (non-smoker now) who have not smoked at least 100 cigarettes lifetime
  ifelse2(SMKDSTY == 5 & SMK_01A == 2, 0.007,
  # Non-smoker
  ifelse2(SMKDSTY == 6, 0, NA)))))))
}
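To see the arithmetic in one branch, the sketch below isolates the daily-smoker case (SMKDSTY == 1) in plain base R. It is a simplification for illustration only: `pack_years_daily_sketch` and its argument names are hypothetical, and the mapping of the arguments to DHHGAGE_cont, SMKG203_cont, and SMK_204 (age, age started smoking daily, cigarettes per day) is an assumption based on how those variables are used in the formula above.

```r
# Simplified sketch of the daily-smoker branch only (SMKDSTY == 1).
# `pack_years_daily_sketch` and its argument names are hypothetical.
pack_years_daily_sketch <- function(age, age_started_daily, cigs_per_day) {
  # years smoked * (cigarettes per day / 20), with a floor of 0.0137
  pmax((age - age_started_daily) * (cigs_per_day / 20), 0.0137)
}

# 30 years of smoking at half a pack per day
pack_years_daily_sketch(age = 50, age_started_daily = 20, cigs_per_day = 10)  # 15

# Started this year: the product is 0, so the floor applies
pack_years_daily_sketch(age = 20, age_started_daily = 20, cigs_per_day = 10)  # 0.0137
```

Using pmax() rather than max() keeps the calculation vectorized, so the floor is applied to each respondent separately.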
More information on what each smoking variable means can be found in the Reference section.
variable_details.csv and variables.csv
This is how the variable_details.csv sheet would look for the derived pack-years row:
kable(variable_details[9,])
And this is how the variables.csv sheet would look for the derived pack-years row:
kable(variables[3,])
For a continuous derived variable like pack-years, the labels specified in variables.csv are sufficient for the variable to be properly labelled. For categorical derived variables, extra rows will need to be added on variable_details.csv so that labels are generated for each category. The example below shows how binge_drinker, a derived categorical variable flagging respondents who binge drink, is specified in variable_details.csv.
kable(variable_details[10:12,])
As you can see, the first row for binge_drinker specifies the function for the derived variable and the base variables included. The second and third rows specify the categories of the variable, which are then labelled.
It is possible to create a derived variable that involves other derived variables. When creating the custom function for it, use the derived variable name inside the function. Similarly, when specifying the variable in variable_details.csv and variables.csv, use the derived variable in the variableStart column. The example below shows how number_conditions, a derived categorical variable that counts the number of chronic conditions and uses the derived respiratory condition variable, is specified in variable_details.csv and variables.csv.
kable(variable_details[13:20, ])
kable(variables[4, ])
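The idea of passing an already-derived variable into another custom function can be sketched as follows. This is not the number_conditions function shipped with cchsflow: the function name, the argument names, and the coding (1 = has the condition, 2 = does not) are all assumptions made for this illustration.

```r
# Hypothetical sketch, not cchsflow's number_conditions function.
# Assumed coding for every argument: 1 = has the condition, 2 = does not.
# `resp_condition_der` stands in for a previously derived variable.
count_conditions_sketch <- function(resp_condition_der, heart_disease, diabetes) {
  conditions <- cbind(resp_condition_der, heart_disease, diabetes)
  # count, per respondent, how many conditions are coded 1
  rowSums(conditions == 1)
}

count_conditions_sketch(
  resp_condition_der = c(1, 2),
  heart_disease      = c(1, 2),
  diabetes           = c(2, 2)
)
```

The derived input is treated exactly like a base CCHS variable inside the function; the only requirement is that it has been created before this function is called.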
In order to harmonize a variable across all cycles, you may need to create a derived variable for the years in which the variable is not present. An example of this is RACDPAL, a variable that was derived from 2003 to 2018. To create and specify RACDPAL for 2001, the same derived variable conventions apply with a few slight differences. In variables.csv, the DerivedVar is enclosed with the database in which it is derived.
kable(variables[5, 9], col.names = 'variableStart')
This is to specify that for the 2001 CCHS cycle, it is derived from those base variables; while in later cycles it is simply recoded from the database.
In variable_details.csv, additional rows are created for the 2001 cycle so that rec_with_table() understands that in 2001 it will be derived from base variables, while in later cycles it is a simple recode.
kable(variable_details[21:30, ])