knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(convo)

Define

Define stubs and optional metadata in YAML:

```{yaml, code=xfun::read_utf8(system.file("", "ex-convo.yml", package = "convo")), eval = FALSE}

Read into R:

```r
filepath <- system.file("", "ex-convo.yml", package = "convo")
convo <- read_convo(filepath)
print(convo)

Check

Before using our convo to evaluate names, we may want to check that it itself follows good practices. There are a few exploratory tools to help us identify suboptimal aspects. To illustrate, let's consider a slightly more problematic controlled vocabulary that what we read in above.

First, the pivot_convo() function can help us determine if any stubs are used in multiple levels, which could potentially cause confusion or misinterpretation. In the example below, we see that the stub "CAT" is used in level 1 to denote a categorical and in level 2 to denote the animal "CAT".

convo_draft <- list(c("IND", "AMT", "CAT"), c("DOG", "CAT"))
pivot_convo(convo_draft)

We can also also cluster stubs and search for possible redundancy with the cluster_convo() function. Here, the level 1 stubs are mostly reasonable (but "IND" and "IS" might both be redundant and represent binary variables), but the level 2 stubs are highly redundant with "ACCOUNT", "ACCT", and "ACCNT" likely all representing the same concept.

convo_draft <- list(c("IND", "IS", "AMT", "AMOUNT", "CAT", "CD"),
              c("ACCOUNT", "ACCT", "ACCNT", "PROSPECT", "CUSTOMER"))
clusts <- cluster_convo(convo_draft)
plot(clusts[[1]])
plot(clusts[[2]])

In the plots above, we can see that clustering does not help surface the problematic level 1 duplication, but it might help us notice the level 2 redundancies. Thus, this is a manual, exploratory tool but not guaranteed to highlight all problems.

From here forward, we return to using the original convo read in above for demonstration.

Evaluate & Compare

Evaluate a set of names (e.g. variable or file names) against the convo object to find violations:

col_names <- c("ID_A", "IND_A", "XYZ_D", "AMT_B", "AMT_Q", "ID_A_1234", "ID_A_12")
evaluate_convo(convo, col_names)

Compare stub lists and identify new potential entries (stubs used in variables but not in controlled vocabulary):

convo_colnames <- parse_stubs(col_names)
convo_colnames
compare_convo(convo_colnames, convo, fx = "setdiff")

(Note that if your names are separated by a different delimiter than _, you may pass that to parse_stubs() and most other functions shown in this demo using the sep = argument.)

If desired, newly uncovered stubs can be manually added to the convo:

convo2 <- add_convo_stub(convo, level = 2, stub = "B", desc = "Type B")
convo2 

Or, alternatively, the compare_convo() function can accept a "union" option to merge needed new stubs between two objects:

convo_union <- compare_convo(convo_colnames, convo, fx = "union")
convo_union

Record

After conducting set operations on a convo as shown above, a new convo YAML specification can be written back out to YAML:

write_convo(convo_union, filename = "new-convo.yml", path = tempdir())

```{yaml, code= file.path(tempdir(), "new-convo.yml"), eval = FALSE}

## Validate 

Generate a `pointblank` agent for data validation:

```r
filepath <- system.file("", "ex-convo.yml", package = "convo")
convo <- read_convo(filepath)
agent <- create_pb_agent(convo, data.frame(IND_A = 1, IND_B = 5, DT_B = as.Date("2020-01-01")))
pointblank::interrogate(agent)

Or create a pointblank YAML file for portability:

tmp <- gsub("\\\\", "/", tempdir())
write_pb(convo, c("IND_A", "AMT_B"), filename = "convo-validation.yml", path = tmp)
filepath <- system.file("", "ex-convo.yml", package = "convo")
convo <- read_convo(filepath)
write_pb(convo, c("IND_A", "AMT_B"), filename = "convo-validation.yml", path = tempdir())

```{yaml, code=xfun::read_utf8(file.path(tempdir(), "convo-validation.yml")), eval = FALSE}

## Document

Make data dictionary:

```r
vars <- c("AMT_A_2019", "IND_C_2020")
desc_df <- describe_names(vars, convo, desc_str = "{level1} of {level2} in given year")
DT::datatable(desc_df)

Visualize data as controlled vocabulary components:

vbls <-  c("AMT_A_2019", "AMT_B", "AMT_C", "IND_A", "IND_B_2020")
vbls_df <- parse_df(vbls)
viz_names(vbls_df)

Describe an overall convo specification:. Optionally include the "contracts" (or validation checks) in the documentation (also powered by pointblank):

desc_df <- describe_convo(convo, include_valid = TRUE, for_DT = TRUE)
DT::datatable(desc_df, escape = FALSE)

Beyond Column Names

convo can manage controlled vocabularies beyond column names, as well. For example, convo can help document files is a large project.

filenames <- c("analysis/validation-a.Rmd",
               "analysis/validation-b.Rmd",
               "analysis/analysis.Rmd",
               "analysis/report.Rmd",
               "src/script-a.sql",
               "src/script-b.sql")
filenames_clean <- gsub("\\.[A-Za-z]+$", "", basename(filenames))
parse_stubs(filenames_clean, sep = "-")


emilyriederer/convo documentation built on Oct. 5, 2021, 4:08 p.m.