knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(ipumsPMA)
library(kableExtra)
library(tibble)  # tibble()
library(dplyr)   # mutate(), case_when()
library(readr)   # read_csv()

my_config <- tibble(
  new_var = c("newbinary", "newnominal", "newcontinuous"),
  mnemonic = c("ar_sc_info_itn", "frs_result", "age"),
  samples = c("bf2017a_nh, ke2017a_nh"),
  universe = "FILL IN UNIVERSE",
  desc = c("A binary variable", "A nominal variable", "A continuous variable")
)

my_config2 <- tibble(
  new_var = "fpsecurrent",
  mnemonic = "cur_se_yn",
  samples = "et2019a_hh",
  universe = "FILL IN UNIVERSE",
  desc = "Currently experiencing side effects"
)

my_dds <- dds_list()

Basic usage

Translation tables harmonize input values from all included samples into output values and labels. They also contain information about the universe for each sample, as well as the variable name and label.

The function tt_make completes most of the work required to make a translation table (TT), starting either with a CSV config file called "new_mnemonics.csv" or with a TT skeleton produced by other Python-based tools.

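It accepts a number of arguments, covered under "Advanced options", that allow you to do things like specify a unit of analysis, pre-load data dictionaries, load variable metadata from a PMA tracking sheet, specify sample-specific universe statements, and merge new samples into an existing TT.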

Starting from a config file: new_mnemonics.csv

A config file can be made with the function config_make or a Python script. For use with tt_make, it should contain five columns, as in this example:

my_config %>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T
  ) 

Contents of a config file

Notice that PMA staff do not add universe statements to the config file: we either draft these in our tracking sheets, or we create them by hand. The text "FILL IN UNIVERSE" is a placeholder.

The mnemonic column shows the name of the variable used in the input data. If a mnemonic appears in "new_mnemonics.csv", it is because a function like config_make searched all existing data dictionaries and determined that the mnemonic has never appeared in any sample we have processed to date. This does not necessarily mean that the variable will get a brand new translation table (see the discussion on merging below). The samples column lists every new sample that contains the new mnemonic, separated by commas.

The new_var column shows the integrated variable name that we plan to use for the mnemonic (this is the variable name that will appear on the PMA website). It will provide the name for our new translation table, as well.

The desc column contains the description found in the input data. PMA normally negotiates new integrated descriptions in our tracking sheets.
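As a rough illustration of the layout described above, the sketch below builds a one-row config in base R, checks that the five expected columns are present, and splits the comma-separated samples column into individual sample names. The helper split_samples is hypothetical and is not part of ipumsPMA.

```r
# A one-row config with the five columns described above
config <- data.frame(
  new_var  = "newbinary",
  mnemonic = "ar_sc_info_itn",
  samples  = "bf2017a_nh, ke2017a_nh",
  universe = "FILL IN UNIVERSE",
  desc     = "A binary variable",
  stringsAsFactors = FALSE
)

# Report any expected column that is absent from the config
expected_cols <- c("new_var", "mnemonic", "samples", "universe", "desc")
missing_cols <- setdiff(expected_cols, names(config))

# Hypothetical helper: split a comma-separated samples string
# and strip the whitespace around each sample name
split_samples <- function(x) {
  trimws(strsplit(x, ",", fixed = TRUE)[[1]])
}

sample_names <- split_samples(config$samples)
print(missing_cols)
print(sample_names)
```

A check like this can be handy right after importing a config file, before passing it to tt_make.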

Import a config file into R

Config files made with config_make always appear in the "config_files" subfolder of the PMA admin folder. They can be imported with the function read_csv:

my_config <- read_csv("/pkg/ipums/pma/admin/config_files/07-Jul-2020_18.01/new_mnemonics.csv")

Write a new TT Excel file

I called my imported object my_config, so that I can call it by name inside the function tt_make (I also use write = F to preview output, and unit = "hhf" to assign the correct output to the input value -6):

tt_make(
  config = my_config,
  new_variable = "newbinary",
  unit = "hhf",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "hhf",
    dds = my_dds) %>%
    replace(is.na(.), "")%>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T) %>%
  scroll_box(height = "500px")
))
tt

Iterate through all variables in a config file

Often, we want to speed through all of the variables in a config file in one command. To do so, you might use a for-loop to execute the function for each row of the config file. Setting write = T (or excluding the write argument) ensures that one new TT Excel file will be written for each variable in the config file.

for(var in my_config$new_var){
  tt_make(
    config = my_config,
    new_variable = var,
    unit = "hhf",
    write = T
  )
}

Starting from a TT Skeleton

Optionally, you might wish to start by moving a pre-made TT skeleton to your working directory. tt_make will search for a skeleton automatically if the argument config is not provided. If a skeleton cannot be found in the working directory, you will receive a FileNotFoundError.

Usually, PMA staff will use their personal "in progress" folder as a working directory. Be sure to set your working directory in R, or else your skeleton will not be found.

setwd("Z:/pma/variables/tt_work/Matt/in_progress")
tt_make(
  new_variable = "newbinary",
  unit = "hhf",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "hhf",
    dds = my_dds) %>%
    replace(is.na(.), "")%>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T) %>%
  scroll_box(height = "500px")
))
tt

Iterate through all skeletons in your working directory

You may tell R to rummage through your working directory, find all files ending in "_tt.xls", and use tt_make for each one. To avoid overwriting files that you have already processed, move them to a new folder first.

# "\\." matches a literal dot and "$" anchors the pattern to the end of the file name
files <- list.files(pattern = "_tt\\.xls$")
varnames <- sub(pattern = "_tt\\.xls$", replacement = "", x = files)
for(var in varnames){
  tt_make(
    new_variable = var,
    unit = "hhf",
    write = T
  )
}
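Because list.files and gsub treat the pattern as a regular expression, an unescaped dot matches any character and an unanchored pattern can match in the middle of a file name. The sketch below uses a made-up vector of file names (not a real directory listing) to show how an escaped, anchored pattern keeps only the true skeleton files.

```r
# Hypothetical file names; only the first two are TT skeletons
files <- c(
  "newbinary_tt.xls",
  "age_tt.xls",
  "notes_tt_xls.txt",
  "newbinary_tt.xls.bak"
)

# "\\." matches a literal dot and "$" anchors the match to the end of the name
is_skeleton <- grepl("_tt\\.xls$", files)

# Strip the suffix to recover the variable names
varnames <- sub("_tt\\.xls$", "", files[is_skeleton])
print(varnames)
```

The unanchored pattern "_tt.xls" would also have matched the last two names, which would then be passed to tt_make as if they were variables.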

Binary, Continuous, and Nominal Variables Compared

How does tt_make guess output values and labels? First, it handles various PMA standard "missing" codes with help from the "unit" argument (these are values 90-99 for variables with a width of 2, usually padded by "9" for variables wider than 2).
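To make the padding rule concrete, here is a minimal sketch of how a two-digit missing code could be widened for a wider variable. The helper pad_missing_code is hypothetical; it illustrates the usual convention described above and is not the code tt_make uses.

```r
# Hypothetical helper: widen a two-digit missing code (90-99)
# by left-padding it with "9" until it reaches the variable width
pad_missing_code <- function(code, width) {
  stopifnot(code >= 90, code <= 99, width >= 2)
  as.integer(paste0(strrep("9", width - 2), code))
}

print(pad_missing_code(96, 2))
print(pad_missing_code(96, 3))
print(pad_missing_code(96, 4))
```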

Most PMA variables are binary, where the appropriate output is almost always 0 = No and 1 = Yes. When tt_make receives input values from a data dictionary, it checks to see if the range of valid responses runs from 0 to 1; if so, it assumes that the variable is binary with Yes/No response options:

tt_make(
  config = my_config,
  new_variable = "newbinary",
  unit = "hhf",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    unit = "hhf",
    dds = my_dds,
    write = F
  )%>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt
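The binary check described above can be sketched roughly as follows. This is an illustration only: the function looks_binary is hypothetical, and it assumes that negative input values (such as -6 and -7) are missing codes rather than valid responses.

```r
# Hypothetical check: ignore negative inputs (assumed to be missing codes)
# and test whether the remaining valid values run from 0 to 1
looks_binary <- function(input_values) {
  valid <- input_values[!is.na(input_values) & input_values >= 0]
  length(valid) > 0 &&
    all(valid %in% c(0, 1)) &&
    min(valid) == 0 &&
    max(valid) == 1
}

print(looks_binary(c(-7, -6, 0, 1)))    # valid responses are 0 and 1
print(looks_binary(c(-7, -6, 1, 2, 3))) # more than two valid responses
print(looks_binary(c(-6, 15, 49)))      # range does not start at 0
```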

PMA samples also normally include a number of continuous variables. These may ultimately be partially or fully recoded, but tt_make performs a full recode and leaves it up to the user to manually remove rows later (removal tends to be faster and more accurate than creating recodes from scratch).

tt_make(
  config = my_config,
  new_variable = "newcontinuous",
  unit = "hhf",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newcontinuous",
    unit = "hhf",
    dds = my_dds,
    write = F
  )%>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt

Unfortunately, tt_make provides very little help with nominal variables. In this case, the input values are copied as output values, and the labels "FILL IN LABEL" are assigned to each. Standard "missing" codes, however, are handled normally.

tt_make(
  config = my_config,
  new_variable = "newnominal",
  unit = "hhf",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newnominal",
    unit = "hhf",
    dds = my_dds,
    write = F
  )%>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt

Advanced options

Try these additional arguments to help tt_make work faster and more accurately.

Specify a unit of analysis

In the above examples, the argument unit = "hhf" ensures that the input value -6 gets recoded as 96 - Not interviewed (household questionnaire). What happens if that argument is not provided?

tt_make(
  config = my_config,
  new_variable = "newbinary",
  # unit = "hhf",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    # unit = "hhf",
    dds = my_dds) %>%
    replace(is.na(.), "")%>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T) %>%
  scroll_box(height = "500px")
))
tt

Above, the function treats -6 like a nominal input: it returns the input value as the output value, and it gives "FILL IN LABEL" as the output value label. Note that the value -7 should not need to be disambiguated because it never appears in SDP files: it is returned correctly for our HHF samples.

What if we mis-label the unit with "sdp"?

tt_make(
  config = my_config,
  new_variable = "newbinary",
  unit = "sdp",
  write = F
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "sdp",
    dds = my_dds) %>%
    replace(is.na(.), "")%>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T) %>%
  scroll_box(height = "500px")
))
tt

Now, the input value -6 is recoded as 94 - Not interviewed (SDP questionnaire). No change has been made to the input value -7, because it really should not appear in SDP samples.
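The three outcomes for the input value -6 can be summarized as a small lookup. The output codes and labels below are taken from the examples above; the function recode_minus_six is hypothetical and only mirrors the behavior described, not the internals of tt_make.

```r
# Hypothetical lookup for the input value -6, keyed on the unit argument
recode_minus_six <- function(unit = NULL) {
  if (is.null(unit)) {
    # No unit given: the value is passed through like a nominal input
    return(list(output = -6, label = "FILL IN LABEL"))
  }
  switch(
    unit,
    hhf = list(output = 96, label = "Not interviewed (household questionnaire)"),
    sdp = list(output = 94, label = "Not interviewed (SDP questionnaire)"),
    stop("Unknown unit: ", unit)
  )
}

str(recode_minus_six())
str(recode_minus_six("hhf"))
str(recode_minus_six("sdp"))
```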

Pre-load data dictionaries

By default, each time you run tt_make, the function will import all of the PMA data dictionaries for quick reference. This may be unnecessary, and it can take quite a while if you have a slow connection to the IPUMS server. Likewise, it will prove to be a pain if you run tt_make several times in a row.

Instead, consider pre-loading the data dictionaries into the R environment with dds_list(). Then, you can reference the list by name inside of tt_make.

my_dds <- dds_list()
tt_make(
  config = my_config,
  new_variable = "newbinary",
  unit = "sdp",
  dds = my_dds
)

Load variable metadata from a PMA tracking sheet

To save yourself from tedious copy / paste work later on, try automatically importing metadata from any PMA tracking sheet with tracking_get. It is particularly good practice to use our tracking sheets as a place to compare names and labels for variables that will appear together in a group on the website. Edit the tracking sheet to create metadata for a batch of variables and, when finished, import the tracking sheet into R (each change made to the tracking sheet will require a new import before it appears in R):

my_trak <- tracking_get("hhf", "new")
tt_make(
  config = my_config,
  new_variable = "newbinary",
  unit = "hhf",
  write = F,
  trackingSheet = my_trak
)

Note: take care when changing variable names in this workflow. If you are working from an existing TT skeleton and you change a variable name in the tracking sheet, you must also change the name of the relevant translation table skeleton: failure to do so will cause tt_make to look in the tracking sheet for the name in the skeleton file path, and it will generate an error when this variable cannot be found. Similarly, if you are working from a config file, you must change the variable name in both the config file and the tracking sheet.

Universe statements: If the tracking sheet contains information for a variable in its universe column and no additional universe statements are provided with the univ_cases argument, tt_make will also import universe information from the tracking sheet and apply it to all new samples. If new samples have different universe statements, provide them with the univ_cases argument (see below).

Specify sample-specific universe statements

With help from a conditional function like dplyr's case_when (or if_else), you can specify universe statements for one or more samples within tt_make. This helper function is passed to tt_make as an "expression", rather than a character string; this allows you to use case_when in a familiar way (as if nested within a recoding function like mutate).

Important: use conditional statements referencing a variable called "sample", as shown below.

tt_make(
  config = my_config,
  new_variable = "newbinary",
  write = F,
  unit = "hhf",
  dds = my_dds,
  univ_cases = case_when(
    sample == "bf2017a_nh" ~ "A universe for BF.",
    sample == "ke2017a_nh" ~ "A universe for KE."
  )
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "hhf",
    dds = my_dds,
    univ_cases = case_when(
      sample == "bf2017a_nh" ~ "A universe for BF.",
      sample == "ke2017a_nh" ~ "A universe for KE."
    )
  ) %>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt

Above, we tell the function that "in case the sample is bf2017a_nh, literally write A universe for BF; otherwise, in case the sample is ke2017a_nh, literally write A universe for KE".

Suppose we want to use the same universe for two samples:

tt_make(
  config = my_config,
  new_variable = "newbinary",
  write = F,
  unit = "hhf",
  dds = my_dds,
  univ_cases = case_when(
    sample %in% c("bf2017a_nh","ke2017a_nh") ~ "A common universe."
  )
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "hhf",
    dds = my_dds,
    univ_cases = case_when(
      sample %in% c("bf2017a_nh","ke2017a_nh") ~ "A common universe."
    )
  ) %>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt

Above, the two samples are combined in the same row in the universe block: they appear together in the first column (separated by a space), and - because we exhausted all the samples in our TT - the value "[all]" was written in the second column.

It turns out that, even if we had not combined both samples together in the sample logical test, tt_make would have combined them if we provided the same universe statement for each. This is particularly useful when merging new samples to an old TT (see below).

tt_make(
  config = my_config,
  new_variable = "newbinary",
  write = F,
  unit = "hhf",
  dds = my_dds,
  univ_cases = case_when(
    sample == "bf2017a_nh" ~ "A common universe.",
    sample == "ke2017a_nh" ~ "A common universe."  
  )
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "hhf",
    dds = my_dds,
    univ_cases = case_when(
      sample == "bf2017a_nh" ~ "A common universe.",
      sample == "ke2017a_nh" ~ "A common universe."  
    )
  ) %>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt
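The grouping behavior can be pictured with a short base R sketch: samples that share identical universe text are collected into one space-separated entry. This is only an illustration of the idea, not the code inside tt_make, and the third sample and its universe are made up for contrast.

```r
# Universe text keyed by sample name (the third entry is hypothetical)
universes <- c(
  bf2017a_nh = "A common universe.",
  ke2017a_nh = "A common universe.",
  et2019a_hh = "A universe for ET."
)

# Group sample names by universe text, then join each group with a space
grouped <- vapply(
  split(names(universes), universes),
  paste,
  character(1),
  collapse = " "
)
print(grouped)
```

Note that the match is on the exact text: a stray space or a missing period would put the samples on separate rows.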

Another trick with case_when: you can use shorthand logic to signify "all remaining logical cases". If such shorthand is the only logic provided, all samples will get the same universe:

tt_make(
  config = my_config,
  new_variable = "newbinary",
  write = F,
  unit = "hhf",
  dds = my_dds,
  univ_cases = case_when(
    TRUE ~ "A common universe."
  )
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config,
    new_variable = "newbinary",
    write = F,
    unit = "hhf",
    dds = my_dds,
    univ_cases = case_when(
      TRUE ~ "A common universe."
    )
  ) %>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt

Here, the condition TRUE holds for every sample, so all samples get the universe "A common universe."

The function case_when can take as many logical statements as necessary to exhaust all new samples: simply separate each logical test by a comma, and remember to use ~ before the universe statement in each case. You may also use the function if_else or any other logical test, but the use of case_when is highly recommended because it is highly readable and can easily incorporate several complex logical tests without nesting.
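To see how several tests and a fallback combine, the sketch below evaluates case_when directly on a plain character vector named sample, outside of tt_make. The sample names and universe text are placeholders chosen for illustration.

```r
library(dplyr)

# A plain vector standing in for the "sample" variable that tt_make exposes
sample <- c("bf2017a_nh", "ke2017a_nh", "et2019a_hh")

# Tests are tried in order; the first one that is TRUE supplies the value
universe <- case_when(
  sample == "bf2017a_nh" ~ "A universe for BF.",
  sample %in% c("ke2017a_nh", "ug2018a_hh") ~ "A universe for KE and UG.",
  TRUE ~ "A common universe."
)

print(data.frame(sample, universe))
```

Because the tests are evaluated in order, the TRUE fallback belongs last; placed first, it would claim every sample.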

Merge new samples to an existing TT

Functions like config_make identify new variables by iterating through all PMA data dictionaries, searching for any prior mnemonic that matches those found in the new input data. If, upon reviewing your config file, you determine that a given variable is, in fact, just an old variable that has been assigned a new name, you may use tt_make to merge new samples into an old TT.

Consider the following row from an example config file:

my_config2
my_config2 %>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T
  ) 

Initially, we followed the logic of config_make and gave this variable a new integrated name fpsecurrent. However, upon further review, we should realize that a similar variable already exists: it is called fpsenow and is available for the old sample "ug2018a_hh". Its translation table is stored in the PMA variables folder, and it already has a description file:

py$TranslationTable("fpsenow", "pma")$ws
py$TranslationTable("fpsenow", "pma")$ws%>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T
  ) 
cat(py$VariableDescription("fpsenow", "pma")$text)

In order to merge our new sample into this old TT, we first need to change the name we provided in the config file to match the name in the old TT.

my_config2 <- my_config2 %>%
  mutate(new_var = case_when(
    new_var == "fpsecurrent" ~ "fpsenow",
    TRUE ~ new_var  # leave any other names unchanged
  ))
my_config2
my_config2 %>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T
  ) 
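One caveat when renaming with case_when: any row that matches none of the conditions becomes NA unless a fallback is supplied. The sketch below uses a made-up two-row config to compare the two outcomes.

```r
library(dplyr)

# Hypothetical two-row config; only the first name should change
config <- data.frame(
  new_var = c("fpsecurrent", "newbinary"),
  stringsAsFactors = FALSE
)

# Without a fallback, the unmatched row is set to NA
without_fallback <- config %>%
  mutate(new_var = case_when(new_var == "fpsecurrent" ~ "fpsenow"))

# With a fallback, the unmatched row keeps its original name
with_fallback <- config %>%
  mutate(new_var = case_when(
    new_var == "fpsecurrent" ~ "fpsenow",
    TRUE ~ new_var
  ))

print(without_fallback)
print(with_fallback)
```

The same applies to running a rename twice: on the second pass nothing matches, so a rename without a fallback would replace the corrected name with NA.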

Next, we can simply run tt_make with merge = TRUE. We must also include univ_cases, or else our new sample will not be included in the universe block.

tt_make(
  config = my_config2,
  new_variable = "fpsenow",
  unit = "hhf",
  dds = my_dds,
  write = F,
  merge = T,
  univ_cases = case_when(sample == "et2019a_hh" ~ "A universe for ET2019A_HH")
) 
invisible(capture.output(
  tt <- tt_make(
  config = my_config2,
  new_variable = "fpsenow",
  unit = "hhf",
  dds = my_dds,
  write = F,
  merge = T,
  univ_cases = case_when(sample == "et2019a_hh" ~ "A universe for ET2019A_HH")
) %>%
    replace(is.na(.), "")%>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = T) %>%
  scroll_box(height = "500px")
))
tt

Suppose we discovered that our new sample has the same universe logic as "ug2018a_hh". If we type out the same universe verbatim, tt_make will match the two samples together in the same row.

tt_make(
  config = my_config2,
  new_variable = "fpsenow",
  unit = "hhf",
  dds = my_dds,
  write = F,
  merge = T,
  univ_cases = case_when(sample == "et2019a_hh" ~ "Women aged 15-49 who are currently using a family planning method and have ever experienced side effects from a family planning method.")
)
invisible(capture.output(
  tt <- tt_make(
    config = my_config2,
    new_variable = "fpsenow",
    unit = "hhf",
    dds = my_dds,
    write = F,
    merge = T,
    univ_cases = case_when(sample == "et2019a_hh" ~ "Women aged 15-49 who are currently using a family planning method and have ever experienced side effects from a family planning method.")
  )%>%
    replace(is.na(.), "")%>%
    kable("html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      fixed_thead = T) %>%
    scroll_box(height = "500px")
))
tt


mgunther87/ipumsPMA documentation built on Aug. 1, 2020, 12:22 a.m.