```r
knitr::opts_chunk$set(echo = TRUE)
library(etlTurtleNesting)
library(drake)
library(wastdr)
library(pointblank)
library(magrittr)
library(reactable)
library(gt)
readd(odkc_ex_2019)
readd(user_mapping_2019)
loadd(odkc_ex_2019)
loadd(user_mapping_2019)
```

Currency

Step 0: Add new data collectors to WAStD

Coordinators of data capture programs supply us with a spreadsheet in this exact format:

We add columns:

Step 1: wastdr takes a wild guess

During the data import from ODK Central to WAStD, wastdr maps each distinct ODK Collect "username" as written by data collectors to actual WAStD user profiles.

The matching is done by fuzzyjoin::stringdist_left_join using the Jaro-Winkler distance between the ODKC username and the combined WAStD name and aliases.

Details about the distance measures are here.

The Jaro-Winkler distance was chosen as it returns the highest number of correct matches.
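
To illustrate the kind of matching described above, here is a minimal sketch on made-up data. It is not the wastdr implementation: the data frames, the column names `wastd_name` and `pk`, the one-row-per-alias layout, and the settings `p = 0.1`, `ignore_case = TRUE` and `max_dist = 1` are all assumptions chosen for the example.

```r
library(fuzzyjoin)
library(dplyr)

# Hypothetical ODK Collect usernames as typed by data collectors
odkc_users <- tibble::tibble(
  odkc_username = c("FlorianM", "flo mayer", "Jane Doe")
)

# Hypothetical WAStD profiles, one row per name or alias (assumed layout)
wastd_users <- tibble::tibble(
  pk = c(1L, 1L, 2L),
  wastd_name = c("Florian Mayer", "FlorianM", "Jane Doe")
)

matches <- fuzzyjoin::stringdist_left_join(
  odkc_users,
  wastd_users,
  by = c("odkc_username" = "wastd_name"),
  method = "jw",        # Jaro distance family from the stringdist package
  p = 0.1,              # Winkler prefix weight, passed on to stringdist
  ignore_case = TRUE,
  max_dist = 1,         # keep every candidate pair, since "jw" lies in [0, 1]
  distance_col = "dist"
) %>%
  # Keep only the closest WAStD candidate for each ODK Collect username
  dplyr::group_by(odkc_username) %>%
  dplyr::slice_min(dist, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()

matches
```

In stringdist, `method = "jw"` with the default `p = 0` gives the plain Jaro distance, so a non-zero `p` is needed to get the Winkler prefix adjustment. Usernames that exactly equal a name or alias come back with a `dist` of 0.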

Step 2: Manually update WAStD user aliases to improve mapping

In the automated validation output we show all user mappings with a dissimilarity above a threshold. This should separate mismatches from matches.

The validation table shows columns:

For each mismatch, find the correct WAStD user and add the odkc_username to the "aliases" field of the WAStD user profile, then "save" or "save and continue editing" the profile. Aliases must be separated by commas. For example, Florian Mayer's aliases can be "Flo Mayer, FlorianM, FloJo", which will then map precisely to "Flo Mayer", "FlorianM", and "FloJo".
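
The sketch below shows why the comma matters. It assumes the aliases field is split on commas and trimmed of surrounding whitespace. How WAStD and wastdr actually parse the field is not shown here, so treat the splitting logic as a hypothetical stand-in.

```r
# Aliases entered with commas, as recommended above
aliases <- "Flo Mayer, FlorianM, FloJo"
alias_vec <- trimws(strsplit(aliases, ",", fixed = TRUE)[[1]])
alias_vec

# The same aliases with one comma forgotten: two names fuse into one alias,
# which no longer matches either username exactly
aliases_bad <- "Flo Mayer FlorianM, FloJo"
alias_vec_bad <- trimws(strsplit(aliases_bad, ",", fixed = TRUE)[[1]])
alias_vec_bad
```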

Note: updating WAStD will not refresh this report automatically - we have to re-run the data import.

When we re-run the user matching with fresh data, this user should match up. Some won't, and we handle those in the next step.

The QA validation will pick up all username matches with a dissimilarity of 0.01 or higher. Perfect matches will have a dissimilarity below 0.01.
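
As a rough sketch of how the 0.01 threshold partitions a mapping table, the following base R example uses invented usernames and invented `dist` values. Only the column names `odkc_username`, `wastd_matched` and `dist` are taken from this report. The CSV export to a temporary file is an illustrative stand-in, not the report's own download mechanism.

```r
# Made-up mapping results; the dist values are invented for illustration
mapping <- data.frame(
  odkc_username = c("FlorianM", "flo mayer", "J Doe"),
  wastd_matched = c("Florian Mayer", "Florian Mayer", "Jane Doe"),
  dist = c(0, 0.12, 0.2),
  stringsAsFactors = FALSE
)

threshold <- 0.01

# Rows at or above the threshold are flagged for manual review
mismatches <- mapping[mapping$dist >= threshold, ]

# Rows below the threshold count as matches
matched <- mapping[mapping$dist < threshold, ]

# Write the flagged rows to a spreadsheet-friendly CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(mismatches, csv_path, row.names = FALSE)
```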

You can download a spreadsheet of mismatches from the CSV button:

```r
a <- user_mapping_2019 %>%
    annotate_user_mapping() %>%
    pointblank::create_agent() %>%
    pointblank::col_vals_lt("dist", 0.01) %>%
    pointblank::interrogate()
a
```

Alternatively, you can work through the issues directly in the table below.

Your actions:

```r
a %>%
  pointblank::get_data_extracts() %>%
  magrittr::extract2(1) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(active_at = get_user_area(odkc_ex_2019, odkc_username)) %>%
  dplyr::arrange(active_at, -dist) %>%
  gt::gt() %>%
  gt::fmt_markdown(columns = TRUE) %>%
  gt::cols_label(
    odkc_username = gt::html("<h5>They wrote</h5>\n<small>ODK Collect username</small>"),
    active_at = gt::html("<small>Active at</small>"),
    odkc_un_trim = gt::html("<h5>We searched</h5>\n<small>Cleaned ODK name</small>"),
    wastd_matched = gt::html("<h5>We matched</h5>\n<small>WAStD User</small>"),
    search_wastd = gt::html("<h5>Search</h5>\n<small>likely candidates</small>"),
    dist = gt::html("<h5>Dissimilarity</h5>\n<small>smaller = better</small>")
  ) %>%
  gt::tab_spanner(
    label = "WAStD User match",
    columns = vars(odkc_username,
                   active_at,
                   odkc_un_trim,
                   wastd_matched, search_wastd, dist)
  ) %>%
  gt::tab_spanner(
    label = "WAStD User profile (chosen as most likely)",
    columns = vars(role, pk, username, name, nickname, aliases, email, phone)
  )
```


dbca-wa/etlTurtleNesting documentation built on Nov. 18, 2022, 8:03 a.m.