In mgunther87/ipumsPMA: Common functions for IPUMS PMA staff

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
knitr::opts_knit$set(root.dir = "../data")
library(ipumsPMA)
library(kableExtra)

Basic usage

PMA description files contain the text for the DESCRIPTION and COMPARABILITY tabs on our website, as well as the text shown on the CODES tab whenever a variable contains values that aren't recoded by a translation table. Each file is divided by XML tags that direct different blocks of text to the correct tab on the website.

Rather than generate description files manually, PMA generates them with a template via desc_make. This function solicits a few pieces of user input, and then inserts them Mad Libs-style into this template:

<vardesc>

<var>
VARNAME
</var>

<desc>
VARNAME reports INDICATED.

The question associated with this variable was included in the SURVEY.
</desc>

<comp>
There are no comparability issues.
</comp>

<comment>
</comment>

</vardesc>

desc_make works with information from an existing translation table and combines it with user-supplied arguments to build text from a template. This translation table may be referenced by variable name only if it currently exists in the PMA variables folder; otherwise, it must be referenced by the full path to the location where it currently exists. Note: desc_make will always output a description file to the user's working directory, and not the directory where the translation table was found (unless the two are set to be the same). Usually, your working directory should be your "in_progress" folder.
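The Mad Libs-style substitution can be pictured with a few lines of base R. This is only an illustrative sketch: fill_template() is a hypothetical helper written for this example, not the code inside desc_make, and the template string is a shortened copy of the desc block shown above.

```r
# Illustrative sketch only: fill_template() is a hypothetical helper,
# not the implementation used by desc_make.
fill_template <- function(template, values) {
  for (placeholder in names(values)) {
    # fixed = TRUE treats the placeholder as literal text, not a regex
    template <- gsub(placeholder, values[[placeholder]], template, fixed = TRUE)
  }
  template
}

# Shortened copy of the <desc> block from the template
template <- paste0(
  "VARNAME reports INDICATED.\n\n",
  "The question associated with this variable was included in the SURVEY."
)

filled <- fill_template(
  template,
  list(
    VARNAME = "ABOREV",
    INDICATED = "whether the woman has ever had an abortion",
    SURVEY = "female questionnaire"
  )
)

cat(filled, "\n")
```

Each all-caps placeholder is replaced wherever it occurs, which is why a variable name can appear several times in the finished text.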

Suppose I'm working with the translation table for ABOREV, which is in my "in_progress" folder. While option 1 would find the requested translation table, it would write a new description file to the user's current working directory, which is not explicitly specified. Option 2 is a better choice because it sets the working directory first; this allows us to refer to the translation table by file name, and to know precisely where the new description file will be written:

# Option 1 (functional)
desc_make(
  "Z:/pma/variables/tt_work/Matt/in_progress/aborev_tt.xls"
)

# Option 2 (best practice)
setwd("Z:/pma/variables/tt_work/Matt/in_progress/")
desc_make("aborev_tt.xls")

desc_make("aborev_tt.xls", write = F)

Notice that, throughout the description text, "VARNAME" was replaced by the variable name found in the translation table. The remaining all-caps text can be replaced with the other arguments in desc_make. Note that the remaining examples assume the working directory has been set, as in Option 2 above, and use preview-only mode via the argument write = F (see discussion below).

desc_make(
  "aborev_tt.xls",
  INDICATED = "whether the woman has ever had an abortion",
  SURVEY = "female questionnaire",
  write = F
)

Adding a "condition" statement

Sometimes, it may be helpful to suggest something about the universe of a variable in the description section (particularly if the universe is very complex). In that case, we use the argument CONDITION to put a conditional clause at the beginning of the description, just before the variable name first appears:

desc_make(
  "aborev_tt.xls",
  CONDITION = "For women aged 15-49 who have ever been pregnant",
  INDICATED = "whether the woman has ever had an abortion",
  SURVEY = "female questionnaire",
  write = F
)

Notice that a comma is inserted automatically after the conditional clause, and that we made the subject of the clause a plural noun.
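As a rough picture of that behavior, the sketch below joins an optional clause to a sentence with a comma. The helper add_condition() is hypothetical and exists only for this illustration; how desc_make actually assembles the sentence is not shown in this vignette.

```r
# Illustrative sketch only: add_condition() is a hypothetical helper.
add_condition <- function(sentence, condition = NULL) {
  # With no condition, the sentence is returned unchanged
  if (is.null(condition)) {
    return(sentence)
  }
  # Otherwise the clause comes first, followed by a comma and a space
  paste0(condition, ", ", sentence)
}

with_condition <- add_condition(
  "ABOREV reports whether the woman has ever had an abortion.",
  condition = "For women aged 15-49 who have ever been pregnant"
)

print(with_condition)
```

Because the comma is supplied for you, the text passed to CONDITION should not end with punctuation of its own.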

Comparability

If there are differences between the available samples for a particular variable, we note them in the comparability section. For example, in the case of ABOREV, the universe for "ug2018a_hh" includes only women aged 15-19 who indicated that they'd ever had an unwanted pregnancy in a previous question; the universe for "ng2018a_hh" is broader, in that it includes all women aged 15-49.

desc_make(
  "aborev_tt.xls",
  CONDITION = "For women aged 15-49 who have ever been pregnant",
  INDICATED = "whether the woman has ever had an abortion",
  SURVEY = "female questionnaire",
  COMPARABILITY = "There is a minor difference in the universe parameters between samples: the Uganda 2018 sample includes only women aged 15-19 who indicated that they'd ever had an unwanted pregnancy; the universe for the Niger 2018 sample includes all women aged 15-49.",
  write = F
)

Manual recode values

desc_make will automatically create a section to display recoded values for any translation table containing a "partial recode" sample (marked by the number 2 in the "no recode" row). For example, the variable BIRTHYR is only partially recoded: we use the raw input data as a year, except where it represents a code for a type of "missing" data. The translation table recodes only those missing-data codes, changing them from their original values to the standard values used throughout the PMA project.

desc_make(
  "birthyr_tt.xls",
  INDICATED = "the woman's year of birth",
  SURVEY = "female questionnaire",
  write = F
)

Output codes and labels are simply transferred from the "tt" block of the translation table, and then inserted below some additional template language for the new manualcode section. The width of the variable is also calculated automatically (as in "This is a 4-digit variable").
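One plausible way to arrive at a width like this is to take the longest output code, as in the sketch below. The codes are made up for illustration and the calculation is an assumption, not a description of desc_make's internals.

```r
# Made-up output codes for illustration; stored as text so that any
# leading zeros would be counted
codes <- c("9996", "9997", "9998")

# Assumed rule: the variable width is the length of the longest code
width <- max(nchar(codes))

sentence <- sprintf("This is a %d-digit variable.", width)
print(sentence)
```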

Notice that, by default, the template language reproduces the text provided by INDICATED. Here's what that template looks like if INDICATED is not provided:

desc_make("birthyr_tt.xls", write = F)

Sometimes, you may not want to simply reproduce INDICATED in the manualcode section (perhaps because it's too long, or because it seems awkward). You may override the default behavior with the argument MANUAL_IND:

desc_make(
  "birthyr_tt.xls",
  INDICATED = "the woman's year of birth",
  SURVEY = "female questionnaire",
  MANUAL_IND = "a year", 
  write = F
)

Writing / Overwriting

desc_make will automatically write a new description file in your working directory. It will overwrite any file in that location that shares the name of the file it intends to write. If you want to prevent this, use the argument write = F, and R will merely generate a preview of the output in your console; this is particularly handy if you're doing a huge batch job (see below).
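If you would rather be warned before anything is replaced, a small guard like the one below can be run first. It is a sketch in base R: safe_write() and the file name aborev_desc.txt are hypothetical, and the example writes to a temporary folder rather than a real working directory.

```r
# Illustrative sketch only: safe_write() is a hypothetical helper that
# refuses to replace an existing file unless overwrite = TRUE.
safe_write <- function(text, path, overwrite = FALSE) {
  if (file.exists(path) && !overwrite) {
    message("Skipping existing file: ", path)
    return(invisible(FALSE))
  }
  writeLines(text, path)
  invisible(TRUE)
}

# Hypothetical output file in a temporary folder
out <- file.path(tempdir(), "aborev_desc.txt")
unlink(out)  # start from a clean slate

first <- safe_write("first draft", out)    # file is created
second <- safe_write("second draft", out)  # skipped, file already exists

readLines(out)
```

The same file.exists() check can be used on its own to list which description files already exist before starting a batch job.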

As it turns out, desc_make produces one of two kinds of files depending on your operating system. Mac users get a .doc Word document, and no further action is needed. Windows users get a .txt file that must be manually saved as a .doc Word document (this is because Pandoc can only write .docx files at this time, and these files don't work well with our XML system). Sorry, Windows users!

Batch jobs

You might be wondering, "does desc_make actually save us any time?" If you only ever use the function to handle one variable at a time, the answer is "maybe not". However, desc_make becomes incredibly valuable when used to iterate through several variables that share a lot of common language. Take, for example, the long series of variables asking about household possessions.

Recommended: using named lists

If I wanted to create description files for several of these "possession" variables all at once, I could create a named list where each item contains the text I would like to change between iterations. I then pass new text to INDICATED in each iteration, but keep the rest of the arguments the same:

possessions <- list(
  aircon = "air conditioning",
  bed = "a bed",
  bike = "a bike",
  biostove = "an improved biomass stove",
  boatwmotor = "a boat with a motor",
  boatnomotor = "a boat with no motor",
  bwtv = "a black and white television"
)

for(var in names(possessions)){
  path <- paste0(var,"_tt.xls")
  desc_make(
    path,
    INDICATED = paste("whether the household has", possessions[[var]]),
    SURVEY = "female questionnaire",
    write = F
  )
}
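Before running a loop like this on dozens of files, it can help to look at the file names and INDICATED text it will produce. The sketch below builds both with base R only, using a shortened copy of the list above; it does not call desc_make.

```r
# Shortened copy of the named list from the example above
possessions <- list(
  aircon = "air conditioning",
  bed = "a bed",
  bike = "a bike"
)

# Build the INDICATED text for every item in one step
indicated <- vapply(
  possessions,
  function(item) paste("whether the household has", item),
  character(1)
)

# Build the matching translation table file names
paths <- paste0(names(possessions), "_tt.xls")

# Show both side by side for a quick visual check
preview <- data.frame(path = paths, INDICATED = unname(indicated))
print(preview)
```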

Recommended: importing from tracking sheets

Creating a named list can also be tedious if you have lots of similar variables: in our example, there are more than 60 "possession" variables! Happily, it's possible to use PMA tracking sheets to save you the trouble (this approach also minimizes the opportunities to make a typo).

The seven variables in our example are currently in the Household and Female tracking sheet on the Updated vars tab. The function tracking_get() can import this tab as a tibble, which we can then manipulate to generate text for each variable's INDICATED argument. Here, the variable labels happen to be stored in the "Notes/issues" column.

Notice: if we put all of the translation tables together in a subfolder (which I call "possessions"), we can iterate through them without bothering to specify any variable names at all. I do this with the help of the function list.files(), after I move my working directory downward one level with setwd("./possessions") (this step is important, so that the translation tables can still be found by desc_make).

# Get the names of all files in my "possessions" folder
setwd("./possessions")
files <- list.files()

# Remove the suffix "_tt.xls" from each file name, leaving just the variable name
# (the dot is escaped and the pattern is anchored so only the literal suffix matches)
vars <- gsub(
  x = files,
  pattern = "_tt\\.xls$",
  replacement = ""
)

# Make variable names uppercase to match the tracking sheet
vars <- toupper(vars)

# Get the tracking sheet (filter() and select() come from dplyr)
library(dplyr)
tracking <- tracking_get("hhf", "updated") %>%
  filter(`Integrated varname` %in% vars) %>%
  select(`Integrated varname`, `Notes/issues`)

# Use "print" to show the filtered / selected results
print(tracking)
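A note on the file-name step: in a regular expression, an unescaped dot matches any character, so a pattern such as "_tt.xls" can remove more than the literal suffix. The sketch below compares a loose and a strict pattern on made-up file names; the third name is deliberately contrived to show the difference.

```r
# Made-up file names; the last one is contrived to expose the problem
files <- c("aircon_tt.xls", "bed_tt.xls", "boat_tt_xls_tt.xls")

# Loose pattern: "." matches any character and the match is not anchored
loose <- toupper(gsub("_tt.xls", "", files))

# Strict pattern: the dot is escaped and "$" anchors the match to the end
strict <- toupper(sub("_tt\\.xls$", "", files))

print(loose)
print(strict)
```

For ordinary names the two patterns agree, so the strict version costs nothing and protects against surprises in larger folders.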

vars <- c(
  "aircon",
  "bed",
  "bike",
  "biostove",
  "boatnomotor",
  "boatwmotor",
  "bwtv"
) %>% toupper()

tracking <- readRDS("../data/tracking.rds")

print(tracking)

Each variable label in the "Notes/issues" column can be transformed into the text for INDICATED simply by replacing the word "Has" with "whether the household has" using gsub:

for(var in vars){

  # pull() extracts the label as a character vector, which is what gsub() expects
  IND <- tracking %>%
    filter(`Integrated varname` == var) %>%
    pull(`Notes/issues`) %>%
    gsub(pattern = "Has", replacement = "whether the household has")

  path <- paste0(tolower(var), "_tt.xls")

  desc_make(
    path, 
    INDICATED = IND,
    SURVEY = "female questionnaire",
    write = F
  )

}
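The substitution itself can be tried without the tracking sheet. The sketch below uses a small stand-in data frame with made-up rows and simplified column names (varname and label are assumptions, not the real tracking sheet headers), and it anchors the pattern with "^" so that only a leading "Has" is replaced.

```r
# Stand-in for the tracking sheet: made-up rows, simplified column names
tracking_demo <- data.frame(
  varname = c("AIRCON", "BED", "BIKE"),
  label = c("Has air conditioning", "Has a bed", "Has a bike"),
  stringsAsFactors = FALSE
)

# Replace only a leading "Has"; sub() is vectorized over the whole column
indicated <- sub("^Has", "whether the household has", tracking_demo$label)

# Name each result by its translation table file for easy lookup
names(indicated) <- paste0(tolower(tracking_demo$varname), "_tt.xls")

print(indicated)
```

Building the whole vector first makes it easy to proofread every sentence before any description file is generated.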

Use with caution: translation table metadata

Similar to the last approach, this approach uses text substitution with a label imported directly from the translation table. This can be convenient if you're working with variable labels that can be easily subjected to regular substitutions; however, as we will see in our example, this is harder than it might appear!

Assume that I'm starting back in my "in_progress" folder. I again begin by moving my working directory "down" one level to the "possessions" folder, which must contain one translation table for each of the variables I'm working on. This time, I use full.names = TRUE to return the full file path for each translation table; this will allow me to use the py TranslationTable module to quickly reference metadata from the correct file.

# Get the full path to each file in my "possessions" folder
setwd("./possessions")
paths <- list.files(path = getwd(), full.names = TRUE)

paths <- paste0(tolower(vars), "_tt.xls")

I'll be using two pieces of metadata from the Python module in each iteration: the variable name and the variable label. For example, with the first variable in paths:

# Return the variable name:
py$TranslationTable(paths[1], "pma")$variable

# Return the variable label:
py$TranslationTable(paths[1], "pma")$variable_label

Notice: these labels were made without the words "a" or "an", which can create difficulties for text substitution. For example, with the label for BED, we will need to manually determine whether to insert "a" or "an" before the word "bed". For this reason, it is normally advisable to use tracking sheets as a place to develop description-appropriate language, rather than importing labels directly from translation tables.

py$TranslationTable(paths[2], "pma")$variable_label

In each iteration through paths, I use gsub to replace the word "Has" with "whether the household has". Notice that the articles "a" and "an" will need to be inserted manually later on:

for(path in paths){
  tt <- py$TranslationTable(path, "pma")
  VAR <- tt$variable
  IND <- gsub(
    x = tt$variable_label,
    pattern = "Has",
    replacement = "whether the household has"
  )
  desc_make(
    path, 
    INDICATED = IND,
    SURVEY = "female questionnaire",
    write = F
  )
}
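It is tempting to automate the missing articles with a simple vowel rule. The sketch below does exactly that on made-up labels (add_article() is a hypothetical helper), and its third result shows why the output still needs a human review.

```r
# Illustrative sketch only: add_article() is a hypothetical helper that
# picks "an" before a vowel letter and "a" otherwise.
add_article <- function(noun_phrase) {
  starts_with_vowel <- grepl("^[aeiou]", noun_phrase, ignore.case = TRUE)
  paste(ifelse(starts_with_vowel, "an", "a"), noun_phrase)
}

# Made-up labels in the article-free style described above
labels <- c("Has bed", "Has improved biomass stove", "Has air conditioning")

# Drop the leading "Has ", then rebuild the phrase with an article
nouns <- sub("^Has ", "", labels)
indicated <- paste("whether the household has", add_article(nouns))

print(indicated)
```

The first two results read naturally, but the third becomes "an air conditioning", which is wrong because the phrase takes no article at all. A letter-based rule cannot detect cases like this, which is one more reason to prepare description-ready wording in the tracking sheet.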