
Comparing R Substitution Solutions

Background

When I started writing about methods for better "parametric programming" interfaces for dplyr in December of 2016, I encountered three divisions in the audience:

Roughly I suggested two possible methods for making the task easier:

I mention dates to point out that this is something I have been inviting public comment on for some time.

Things change. Since the above time:

The rlang/tidyeval strategy is to capture un-evaluated user expressions (as a new object called a "quosure") and evaluate them under new language rules (with new bindings and something called an "overscope"). Also note that the rlang/tidyeval strategy requires full integration or re-writing of packages in terms of rlang/tidyeval; it isn't something you mix in or turn on and off.
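
To make that concrete, here is a minimal sketch (my own toy example, not from the original article) of the capture-and-evaluate idea using rlang:

dat <- data.frame(x = c(1, 2, 3))
col <- rlang::sym("x")            # turn a column name (held as a string) into a symbol
q <- rlang::quo((!!col) * 10)     # capture an un-evaluated expression as a quosure
rlang::eval_tidy(q, data = dat)   # evaluate it, with the data frame supplying the bindings
## [1] 10 20 30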

Some points I think that have been under-represented in previous discussions include:

The second point I think is particularly interesting. It means:

An R user who does not consider themselves an expert programmer could be maintaining code that they understand, but could not be expected to create from scratch.

Or:

Let's have some sympathy for the part-time R user.

This is the point we will emphasize in our new example.

The example

The design and discussion of substitution solutions should be driven by concrete, realistic use cases. Working from larger examples gives us a taste of what working with each solution is like in practice. So, let's pretend to discuss social science (instead of programming).

Suppose an analyst, psychologist, medical doctor, or scientist is building an assessment for some aspects of behavior and anxiety.

Often such assessments involve moving through a multiple-choice questionnaire and collecting a number of points that depend on the answers selected. One such assessment is the Generalized Anxiety Disorder 7 questionnaire (or GAD-7). It is a very simple system, as can be seen below.
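
As a rough sketch of how such point scoring works (made-up answer values, not the actual GAD-7 items or weights):

answers <- c(2, 1, 3, 0, 2, 1, 2)   # hypothetical points earned on each of seven questions
sum(answers)                        # the assessment total for this subject
## [1] 11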

One can treat such a test score as a classifier and assess it in terms of sensitivity, specificity, and different correspondence measures.
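For example (an illustrative calculation with invented labels, not data from the article), a score threshold turns the total into a yes/no call that can be compared against known diagnoses:

truth     <- c(TRUE, TRUE, FALSE, FALSE, TRUE)   # invented "true diagnosis" labels
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE)   # invented "score >= threshold" calls
sum(predicted & truth)/sum(truth)                # sensitivity (true positive rate)
## [1] 0.6666667
sum(!predicted & !truth)/sum(!truth)             # specificity (true negative rate)
## [1] 0.5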

An obvious extension of such tests is to give a different number of points in different categories for each multiple-choice answer. For example we could imagine such a test where each answer gave a varying number of points in one of two categories called "withdrawal behavior" and "positive re-framing" (both in the sense of coping behaviors).

For example, our scientist might record the results of two subjects taking a test as follows:

d <- data.frame(
  subjectID = c(1,                   
                1,
                2,                   
                2),
  surveyCategory = c(
    'withdrawal behavior',
    'positive re-framing',
    'withdrawal behavior',
    'positive re-framing'
  ),
  assessmentTotal = c(5,                 
                      2,
                      3,                  
                      4),
  stringsAsFactors = FALSE
)

print(d)
##   subjectID      surveyCategory assessmentTotal
## 1         1 withdrawal behavior               5
## 2         1 positive re-framing               2
## 3         2 withdrawal behavior               3
## 4         2 positive re-framing               4
# or in "wide form":
library("cdata")
moveValuesToColumns(d, 
                    columnToTakeKeysFrom = 'surveyCategory',
                    columnToTakeValuesFrom = 'assessmentTotal',
                    rowKeyColumns = 'subjectID')
##   subjectID positive re-framing withdrawal behavior
## 1         1                   2                   5
## 2         2                   4                   3

A natural question is: how does one assign weights to each answer? One way would be to administer the test to a number of people the experimenter has classified as having either of the above-mentioned behaviors, and then perform a logistic regression mapping assessment answers to the probability of a given diagnosis for this population. By re-scaling the weights and rounding them to small integers we could have a test point system that is very close to performing a logistic regression classification. We may be able to use the same assessment questions in a much more decisive manner than assigning all questions the same number of points.
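
A minimal sketch of that weighting idea (simulated data and made-up question columns, purely illustrative, not part of the original study design):

set.seed(2017)
training <- data.frame(q1 = rbinom(100, size = 3, prob = 0.5),   # simulated answers
                       q2 = rbinom(100, size = 3, prob = 0.5))
training$diagnosis <- as.numeric(runif(100) < plogis(training$q1 + 0.5*training$q2 - 2))
model <- glm(diagnosis ~ q1 + q2, family = binomial(), data = training)
round(coef(model)[-1]/min(abs(coef(model)[-1])))   # re-scaled, rounded question weights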

This sort of idea is what one would expect from a mixed and collaborating team that includes medical experts, statistics experts, and programmers. After some work our team might work out that scoring the assessment can be done by the simple R dplyr pipeline:

suppressPackageStartupMessages(library("dplyr"))

scale <- 0.237

d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale)))
## # A tibble: 4 x 4
## # Groups:   subjectID [2]
##   subjectID      surveyCategory assessmentTotal probability
##       <dbl>               <chr>           <dbl>       <dbl>
## 1         1 withdrawal behavior               5   0.6706221
## 2         1 positive re-framing               2   0.3293779
## 3         2 withdrawal behavior               3   0.4410258
## 4         2 positive re-framing               4   0.5589742

For each subject we take the row with maximal probability as the diagnosis. The diagnosis was already obvious from the original scores; the main addition is that the diagnosis confidence is now available as a probability estimate.
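
For the two-category case this transform is just the logistic sigmoid applied to the scaled score difference, which is what ties the point system back to the logistic regression idea above. A quick check using the d and scale values defined earlier:

plogis((5 - 2) * scale)   # reproduces the 0.6706221 reported above for subject 1
## [1] 0.6706221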

Each step of the above pipeline is learn-able:

Suppose this assessment is tested and works well. It is then plausible that the team might ask their R expert to help them construct a much more complicated dplyr pipeline that better formats the results. Under Harlan Mills' "Surgical Team" proposal (made famous in Fred Brooks' The Mythical Man-Month) we expect effective data science teams to have a diversity of deep expertise (not everybody knows everything, but a lot is known by the total team). We expect a well-staffed research team to include the statistician who worked out the sigmoid transform above, and a programmer who works out the pipeline we give below.

d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale))) %>%
  arrange(probability, surveyCategory) %>%
  mutate(isDiagnosis = row_number() == n()) %>%
  filter(isDiagnosis) %>%
  ungroup() %>%
  select(subjectID, surveyCategory, probability) %>%
  rename(diagnosis = surveyCategory) %>%
  arrange(subjectID)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

This is indeed a long (and expert-level) pipeline. But the principle is:

Let's take this deliberately long (so as to be a strong test) example and see how hard the pipeline is to re-use under different methodologies.

Re-use

An issue that comes up is: can the team re-use the pipeline on another project? Suppose in their next project the ID column isn't "subjectID" but it is "patientID" (and so on). Obviously they can copy and paste the original pipeline and change the names (which is not a bad practice for the first few re-uses).

But once this procedure is going to be used many times it is a good idea to wrap it up or genericize it so it can be safely re-adapted (so the users can't accidentally forget to change one name in one place).

I will now walk through a number of approaches to this in terms of how hard they are on the researcher. We are assuming their R expert does the wrapping for them, but then must explain the concepts to the part-time R user so they truly understand and can maintain the tools they are using.

For our example we assume all the column names are coming from variables set somewhere else (in another R script, or coming from a spreadsheet that is read into R, or some other source). The nature of the columns is constant from analysis to analysis, but the exact names used may vary. For our example the column names are:

idCol        <- "subjectID"
categoryCol  <- "surveyCategory"
linkScoreCol <- "assessmentTotal"
indicatorCol <- "isDiagnosis"
probScoreCol <- "probability"
outcomeCol   <- "diagnosis"

wrapr solution

In my opinion the easiest solution (in terms of cognitive load) is wrapr::let(). The R expert would share the following code:

library("wrapr")

let(
  c(
    IDCOL        = idCol,
    CATEGORYCOL  = categoryCol,
    LINKSCORECOL = linkScoreCol,
    INDICATORCOL = indicatorCol,
    PROBSCORECOL = probScoreCol,
    OUTCOMECOL   = outcomeCol
  ),

  d %>%
    group_by(IDCOL) %>%
    mutate(PROBSCORECOL =
             exp(LINKSCORECOL * scale)/
             sum(exp(LINKSCORECOL * scale))) %>%
    arrange(PROBSCORECOL, CATEGORYCOL) %>%
    mutate(INDICATORCOL = row_number() == n()) %>%
    filter(INDICATORCOL) %>%
    ungroup() %>%
    select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
    rename(OUTCOMECOL = CATEGORYCOL) %>%
    arrange(IDCOL)
)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

The concept is:

"let() works as if you had written the code with the names substituted as shown in the c() block."

And there is ample documentation showing how this can be used. Notice creating this code is completely mechanical (replace concrete names with the all-caps placeholders) and the execution has an easy mental model (the placeholders are replaced with the names stored in the variables).
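
As a tiny side illustration of that mental model (my own toy example, separate from the article's pipeline):

let(
  c(SCORECOL = "assessmentTotal"),
  d %>% summarize(total = sum(SCORECOL))   # runs as if we had typed: sum(assessmentTotal)
)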

In this solution the adapted code looks like the original code.

replyr solution

The next easiest method in concept is replyr_apply_f_mapped().

The R expert would write the following, and the part-time R user (with some coaching) could maintain it.

library("replyr")

d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL        = idCol,
      CATEGORYCOL  = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL   = outcomeCol
    ),

    f = . %>%
      group_by(IDCOL) %>%
      mutate(PROBSCORECOL =
               exp(LINKSCORECOL * scale)/
               sum(exp(LINKSCORECOL * scale))) %>%
      arrange(PROBSCORECOL, CATEGORYCOL) %>%
      mutate(INDICATORCOL = row_number() == n()) %>%
      filter(INDICATORCOL) %>%
      ungroup() %>%
      select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
      rename(OUTCOMECOL = CATEGORYCOL) %>%
      arrange(IDCOL)
  )
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

What the code does is exactly this:

The concept is:

replyr_apply_f_mapped() renames the columns to the chosen standard names, applies the user-supplied function, and then maps the names back.

Below is an illustrative example showing the column names seen inside and outside the user supplied function.

print(colnames(d))
## [1] "subjectID"       "surveyCategory"  "assessmentTotal"
d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL        = idCol,
      CATEGORYCOL  = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL   = outcomeCol
    ),

    f = function(df) {
      df$PROBSCORECOL <- 1
      print(colnames(df))
      return(df)
    }
  ) %>%
  colnames()
## [1] "IDCOL"        "CATEGORYCOL"  "LINKSCORECOL" "PROBSCORECOL"

## [1] "subjectID"       "surveyCategory"  "assessmentTotal" "probability"

This is teachable and something the part-time R user can correctly extend and maintain, though the user may need to learn about wrapping a pipeline as an anonymous function (the ". %>%" notation).
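
For reference, here is a one-line aside (my own, not from the article) showing what that notation does: a pipe that starts with "." builds a re-usable function.

f <- . %>% head(2)   # roughly equivalent to: function(x) head(x, 2)
f(d)
##   subjectID      surveyCategory assessmentTotal
## 1         1 withdrawal behavior               5
## 2         1 positive re-framing               2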

rlang/tidyeval solution

For the rlang/tidyeval solution the expert writes the following code:

IDSYM        <- rlang::sym(idCol)
CATEGORYSYM  <- rlang::sym(categoryCol)
LINKSCORESYM <- rlang::sym(linkScoreCol)
INDICATORSYM <- rlang::sym(indicatorCol)
PROBSCORESYM <- rlang::sym(probScoreCol)
OUTCOMESYM   <- rlang::sym(outcomeCol)

d %>%
  group_by(!!IDSYM) %>%
  mutate(!!PROBSCORESYM :=
           exp((!!LINKSCORESYM) * scale)/
           sum(exp((!!LINKSCORESYM) * scale))) %>%
  arrange(!!PROBSCORESYM, !!CATEGORYSYM) %>%
  mutate(!!INDICATORSYM := row_number() == n()) %>%
  filter(!!INDICATORSYM) %>%
  ungroup() %>%
  select(!!IDSYM, !!CATEGORYSYM, !!PROBSCORESYM) %>%
  rename(!!OUTCOMESYM := !!CATEGORYSYM) %>%
  arrange(!!IDSYM)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

Several points have to be taught to the part-time R user if this code is to be maintained:

The above are just some syntax edge-cases; we haven't even gone into teaching rlang::sym(), "!!", and the theory and semantics of quasi-quotation.
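
As one small illustration (my own aside, with a hypothetical column name) of such an edge-case: ordinary "=" cannot take an unquoted expression on its left-hand side, which is why the pipeline above must use ":=" for assignments:

NEWCOL <- rlang::sym("doubledTotal")            # a hypothetical new column name
d %>% mutate(!!NEWCOL := assessmentTotal * 2)   # with plain "=" this would not even parse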

seplyr solution

seplyr is an experiment to see what a referentially transparent (or completely value oriented) interface to dplyr would look like. Please don't think of seplyr as an adapter (though it is, it sends all work to dplyr), but as an illustration of what a completely value-oriented dplyr might look like (i.e., one that did not capture un-evaluated user code through non-standard evaluation). Roughly seplyr is an experiment of the form: "what if one tried harder with something like the new dplyr::*_at() verbs."

Most of the seplyr methods are named *_se() and are designed to be very similar to their dplyr equivalents (and some are nearly identical to dplyr::*_at() methods, rename_se() being a notable exception).

library("seplyr")
suppressPackageStartupMessages(library("glue"))

d %>%
  group_by_se(idCol) %>%
  mutate_se(probScoreCol :=
           glue('exp({linkScoreCol} * scale)/
                  sum(exp({linkScoreCol} * scale))')) %>%
  arrange_se(c(probScoreCol, categoryCol)) %>% 
  mutate_se(indicatorCol := "row_number() == n()") %>%
  filter_se(indicatorCol) %>%
  ungroup() %>%
  select_se(c(idCol, categoryCol, probScoreCol)) %>%
  rename_se(outcomeCol := categoryCol) %>%
  arrange_se(idCol)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

The concept is:

"Only mutate needs non-standard evaluation."

seplyr accepts general expressions in many more places, but with proper organization and the use of a few temporary columns you really only need the full generality in mutate().

seplyr has its own issues:

The lesson from seplyr is that the mutate() verb does indeed need some kind of expression manipulation tooling (direct string manipulation feels too crude). However, for the rest of the verbs the value-oriented notation is in fact quite natural, and in no sense inferior to the dplyr originals.

Conclusion

Name substitution is a reasonable need that arises when re-using R work or when trying to iterate over column names. I have been publicly exploring variations of substitution systems so that R users can make an informed choice of one or more that best meets their needs and addresses their personal trade-offs between power, safety, readability, and teachability. These sections are not each independent "yet another way of performing name substitution", but parts of a public conversation that should be had before name substitution is considered settled and fixed in stone.

A part-time R user will not have the background to quickly compare all of the available substitution systems. In fact such a user will only come to need a substitution system when they have a problem, so by definition they are in the middle of some other task. It is up to expert partners to evaluate and explain alternatives.

There is a temptation that if you are going to teach only one system it might as well be rlang/tidyeval, as "that is what now comes with dplyr". I feel this is a false savings: while rlang/tidyeval "is already in dplyr", the rlang/tidyeval concepts and details are not "already in the user" (and in fact include a fair number of irregular exceptions that need to be taught and memorized).

Our preference is: wrapr::let(). wrapr::let() delivers a lot of (safe) power for a modest amount of cognitive load. Each of the above systems involves different trade-offs and compromises, and we feel one must really try a few in production before having an expert opinion.


