knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
The worksheet variable_details
maps variables across datasets. variable_details
also contains the instructions for function rec_with_table
on how to recode the variables listed in the worksheet variables
.
Specifically, the function rec_with_table
recodes variables listed in column variable_details$variableStart
to the variables listed in variable_details$variable
.
library(readr) library(DT) library(knitr) library(kableExtra) source("../R/test_data_generator.R") datatable(variable_details, options = list(pageLength = 2)) cat("In the dataframe `variable_details` there are", nrow(variable_details), "rows and", ncol(variable_details), "columns", "\n\n")
variable_details
rec_with_table()
uses the tagged_na()
function from the haven package to tag not applicable responses as NA(a)
, and missing values (don't know, refusal, not stated) as NA(b)
. As you will see later, not applicable values are recoded to NA::a
, and missing values are recoded to NA::b
. See tagged_na_usage for more information about tagged_na()
.
Each row in variable_details
gives instructions to recode a single category of a final variable. Typically, a single category of a final variable maps to a single category of an original variable. Therefore, in variable_details
you'll have one row per category.
There are two exceptions to this rule.
Original categories are sometimes combined into a single final variable category. For example,
Most often survey responses "don't know,” "refusal," and "not stated" are combined into a final single "missing" category.
Original age group categories are often combined into larger age groups categories (e.g., original five-year age group categories are combined into final ten-year age group categories).
Original datasets with different categories map to separate rows for a final variable. For example,
The final variable, “Sex,” has categories “1” = male and “2” = female.
Note: We recommend not combining variables if the variable has an important change between datasets. variable_detailsdetails$notes
is used to identify issues that may be relevant when recoding the variable or category.
Here we explain the columns in the worksheet variable_details
.
Table 1. Summary of the source and level of information used for each column in the worksheet 'variable_details'.
library(knitr) library(kableExtra) var_desc_table <- data.frame( Data_source = c(rep("recoded",5), rep("original",3), "both"), Dataset = c(rep("",5), "databaseStart", rep("", 2), "notes"), Variable = c("variable *", "typeEnd *", rep("", 3), "variableStart *", "typeStart *", "variableStartlabel", "notes"), Categories = c("recEnd *", "numValidCat", "catLabel", "catLongLabel", "units", "recStart *", "catStartLabel", "", "notes") )
kable(var_desc_table) %>% kable_styling(full_width = T) %>% collapse_rows(columns = 1, valign = "top") %>% add_header_above(c("", "Level of information" = 3)) %>% footnote(symbol = c("Columns required for the function rec_with_table. All other columns are recommended but optional."))
The order of the columns within the worksheet variable_details
is not important because recodeflow functions use the column names.
kable(variable_details[7:13, 1], col.names = 'variable')
cont
; if the recoded variable that is categorical fill in as cat
.kable(variable_details[c(7:13), c(1:2)])
cat
and continuous variables are denoted as cont
.kable(variable_details[c(7:13), c(1:3)])
kable(variable_details[c(7:13), c(1:4)])
variableStart: the original names of the variables as they are listed in the original datasets, separated by commas.
If the variable name in a particular dataset is different from the recoded variable name, write out the dataset name, add two colons, and write out the original variable name for that dataset.
kable(variable_details[c(7:13), c(1:5)])
kable(variable_details[c(7:13), c(1:6)])
numValidCat = N/A
. Note that for categories: not applicable, missing, and else, are not included in the category count. This column is not necessary for the function rec_with_table()
.kable(variable_details[c(7:13), c(1:7)])
NA::a
and for missing and else categories, write NA::b
. For continuous variables that are not recoded in type, you would write in this column copy
so that the function copies the values without any recoding.kable(variable_details[c(7:13), c(1:8)])
For categorical variables that have changed in type (i.e from: cat -> to: cont), you will have to develop values that make the most sense to your analysis. For example, for variables that change from: cat -> to: cont, you can choose the categories' lower bound, mid-point, or upper-bound for the continuous value.
catLabel: short form label describing the category of a particular variable.
kable(variable_details[c(7:13), c(1:9)])
kable(variable_details[c(7:13), c(1:10)])
N/A
. Note, the function will not work if there different units between the rows of the same variable (e.g. if the concentration of a single molecule has two records one recorded as mg/dl and the other recorded as g/dl).kable(variable_details[c(7:13), c(1:11)])
The rules for each category of a new variable are a string in recStart
and value in recEnd
. These recode pairs are the same syntax as interval notation in which a closed range of values are specified using square brackets. See Notations for intervals for more information on interval notation. Recode pairs are obtained from the recStart and recEnd columns
value range is indicated by a comma, e.g. recStart= [1,4]; recEnd = 1
(recodes all values from 1 to 4 into 1}
value range for double vectors (with fractional part), all values within the specified range are recoded; e.g. recStart = [1,2.5]; recEnd = 1
recodes 1 to 2.5 into 1, but 2.55 would not be recoded (since it's not included in the specified range).
NA is used for missing values (don't know, refusal, not stated)
else is used all other values, which have not been specified yet, are indicated by else
, e.g. recStart = "else"; recEnd = NA
(recode all other values (not specified in other rows) to "NA")}
copy the else
token can be combined with copy
, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. recStart = "else"; recEnd = "copy"
kable(variable_details[c(7:13), c(1:9)])
bllflow
helper functions. See bllflow documentation.kable(variable_details[c(7:13), c(1:13)])
recode-with-table
function. Things to include here would be changes in wording across datasets, missing/changes in categories, and changes in variable type across datasets.kable(variable_details[c(7:13), c(1:14)])
The same naming convention applies to derived variables with the exception of two columns:
DerivedVar::[var1, var2, var3]
In recEnd, write Func:: followed with the exact name of the custom function used to create the derived variable.
Func::derivedFunction
A derived variable looks like this in variable_details.csv
kable(variable_details[64,1:14])
Certain derived variables may need information from a reference table to complete their derivation. For example, imagine a derived variable that has the average sodium consumption for an individual specific to the individual's age and sex. It makes sense to put this information in a table that maps the individual's age and sex to their sodium consumption like below,
kable( data.frame( age = c(18, 18, 19, 19), sex = c("male", "male", "female", "female"), sodium_consumption = c(1800, 1500, 2100, 1600) ) )
with the derivation function shown below,
sodium_consumption <- function(age, sex, reference_table) { return(reference_table[reference_table$age == age & reference_table$sex == sex, ]$sodium_consumption) }
The reference_table argument in the above function expects a data frame.
Including a table in the start variables for a derived variable can be done using the syntax, tables::<table_name>
where sodium_consumption
derived variable, the entry in the variables sheet would be,
kable( data.frame( variable = c("sodium_consumption"), typeEnd = c("cont"), databaseStart = c("database_one"), variableStart = c("DerivedVar::[age, sex, tables::sodium_reference_table]"), typeStart = c("N/A"), recEnd = c("Func::sodium_consumption"), numValidCategories = c("N/A"), recStart = c("N/A"), catLabel = c(""), catLabelLong = c("") ) )
When using the rec_with_table
function, the tables
argument can be used to include all the tables mentioned in the variables sheet.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.