```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(convo)
```
Define stubs and optional metadata in YAML:
```{yaml, code=xfun::read_utf8(system.file("", "ex-convo.yml", package = "convo")), eval = FALSE}
```

Read into R:

```r
filepath <- system.file("", "ex-convo.yml", package = "convo")
convo <- read_convo(filepath)
print(convo)
```
Before using our `convo` to evaluate names, we may want to check that the vocabulary itself follows good practices. A few exploratory tools help identify suboptimal aspects. To illustrate, let's consider a slightly more problematic controlled vocabulary than the one we read in above.
First, the `pivot_convo()` function can help us determine whether any stubs are used in multiple levels, which could cause confusion or misinterpretation. In the example below, the stub "CAT" is used in level 1 to denote a categorical variable and in level 2 to denote the animal "CAT".
```r
convo_draft <- list(c("IND", "AMT", "CAT"), c("DOG", "CAT"))
pivot_convo(convo_draft)
```
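To make the idea of a stub shared across levels concrete, here is a minimal base R sketch of the same check. It does not use `convo` and is not how `pivot_convo()` is implemented; the names `levels_long`, `n_levels`, and `shared_stubs` are illustrative only.

```r
# Draft vocabulary: one character vector of stubs per level
convo_draft <- list(c("IND", "AMT", "CAT"), c("DOG", "CAT"))

# Long table of (stub, level) pairs
levels_long <- data.frame(
  stub  = unlist(convo_draft),
  level = rep(seq_along(convo_draft), lengths(convo_draft))
)

# Count the number of distinct levels in which each stub appears
n_levels <- tapply(levels_long$level, levels_long$stub, function(x) length(unique(x)))

# Stubs that appear in more than one level are candidates for confusion
shared_stubs <- names(n_levels)[n_levels > 1]
shared_stubs
```

Here only "CAT" is flagged, since it is the one stub that appears in both levels.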
We can also cluster stubs and search for possible redundancy with the `cluster_convo()` function. Here, the level 1 stubs are mostly reasonable (although "IND" and "IS" might be redundant, since both could represent binary variables), but the level 2 stubs are highly redundant, with "ACCOUNT", "ACCT", and "ACCNT" likely all representing the same concept.
```r
convo_draft <- list(
  c("IND", "IS", "AMT", "AMOUNT", "CAT", "CD"),
  c("ACCOUNT", "ACCT", "ACCNT", "PROSPECT", "CUSTOMER")
)
clusts <- cluster_convo(convo_draft)
plot(clusts[[1]])
plot(clusts[[2]])
```
In the plots above, we can see that clustering does not help surface the problematic level 1 duplication, but it might help us notice the level 2 redundancies. This is therefore a manual, exploratory tool and is not guaranteed to highlight every problem.
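For intuition about why similar spellings end up close together, the sketch below clusters the level 2 stubs by edit distance using only base R (`utils::adist()` and `stats::hclust()`). Whether `cluster_convo()` uses this particular distance or linkage is an assumption here, not something stated in this document.

```r
stubs <- c("ACCOUNT", "ACCT", "ACCNT", "PROSPECT", "CUSTOMER")

# Pairwise generalized Levenshtein (edit) distances between stubs
dist_mat <- utils::adist(stubs)
dimnames(dist_mat) <- list(stubs, stubs)

# Hierarchical clustering on the edit distances (complete linkage by default)
clust <- stats::hclust(stats::as.dist(dist_mat))

# Cut the tree into three groups and list the members of each group
groups <- stats::cutree(clust, k = 3)
split(names(groups), groups)
```

With these inputs the three "ACC"-prefixed stubs fall into a single group, which is the kind of signal worth a manual look. Short stubs such as "IND" and "IS" share few characters, so a purely string-based distance is unlikely to relate them.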
From here forward, we return to using the original `convo` read in above for demonstration.
Evaluate a set of names (e.g. variable or file names) against the `convo` object to find violations:
```r
col_names <- c("ID_A", "IND_A", "XYZ_D", "AMT_B", "AMT_Q", "ID_A_1234", "ID_A_12")
evaluate_convo(convo, col_names)
```
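Conceptually, evaluating names means splitting each name into stubs and checking each stub against the allowed values for its level. The base R sketch below illustrates that idea with a hypothetical `vocab` list; it is not the contents of `ex-convo.yml`, and it only handles exact matches, not any pattern-based stubs a real `convo` may define.

```r
# Hypothetical vocabulary: allowed stubs per level (assumed for illustration)
vocab <- list(c("ID", "IND", "AMT"), c("A", "B", "C"))
col_names <- c("ID_A", "IND_A", "XYZ_D", "AMT_B", "AMT_Q")

# Split each name into its stubs
stubs <- strsplit(col_names, "_", fixed = TRUE)

# For each level, collect the names whose stub at that level is not allowed
violations <- lapply(seq_along(vocab), function(i) {
  # s[i] is NA when a name has fewer than i stubs
  level_stub <- vapply(stubs, function(s) s[i], character(1))
  col_names[!is.na(level_stub) & !(level_stub %in% vocab[[i]])]
})
names(violations) <- paste0("level", seq_along(vocab))
violations
```

In this toy case "XYZ_D" is flagged at both levels and "AMT_Q" at level 2 only.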
Compare stub lists and identify new potential entries (stubs used in variables but not in controlled vocabulary):
```r
convo_colnames <- parse_stubs(col_names)
convo_colnames
compare_convo(convo_colnames, convo, fx = "setdiff")
```
(Note that if your names are separated by a delimiter other than `_`, you may pass it to `parse_stubs()` and most other functions shown in this demo using the `sep =` argument.)
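As a rough picture of what parsing with a custom separator involves, the following base R sketch collects the unique stubs found at each position of a set of names. The helper `stubs_by_level()` is hypothetical and not part of `convo`; the structure returned by `parse_stubs()` may differ.

```r
# Hypothetical helper: unique stubs per level for a given separator
stubs_by_level <- function(names, sep = "_") {
  parts <- strsplit(names, sep, fixed = TRUE)
  n_levels <- max(lengths(parts))
  lapply(seq_len(n_levels), function(i) {
    # p[i] is NA when a name has fewer than i stubs
    level_stub <- vapply(parts, function(p) p[i], character(1))
    unique(level_stub[!is.na(level_stub)])
  })
}

parsed <- stubs_by_level(c("amt-a-2019", "ind-b", "amt-c"), sep = "-")
parsed
```

Names with fewer stubs than the longest name simply contribute nothing to the deeper levels.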
If desired, newly uncovered stubs can be manually added to the `convo`:
```r
convo2 <- add_convo_stub(convo, level = 2, stub = "B", desc = "Type B")
convo2
```
Alternatively, the `compare_convo()` function accepts a `"union"` option to merge any newly needed stubs from two objects:
```r
convo_union <- compare_convo(convo_colnames, convo, fx = "union")
convo_union
```
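The set operations here work level by level. As an illustration only, the base R sketch below applies `setdiff()` and `union()` across two hypothetical stub lists with `Map()`. It assumes both lists have the same number of levels and ignores any metadata (descriptions, validation rules) that a real `convo` object carries.

```r
# Hypothetical stub lists, one character vector per level
vocab_known <- list(c("ID", "IND", "AMT"), c("A", "C"))
vocab_seen  <- list(c("ID", "AMT", "XYZ"), c("A", "B"))

# Stubs seen in the data but missing from the known vocabulary, per level
new_stubs <- Map(setdiff, vocab_seen, vocab_known)

# Merged vocabulary containing every stub from either source, per level
merged <- Map(union, vocab_known, vocab_seen)

new_stubs
merged
```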
After conducting set operations on a `convo` as shown above, the new `convo` specification can be written back out to YAML:
```r
write_convo(convo_union, filename = "new-convo.yml", path = tempdir())
```
```{yaml, code=xfun::read_utf8(file.path(tempdir(), "new-convo.yml")), eval = FALSE}
```

## Validate

Generate a `pointblank` agent for data validation:

```r
filepath <- system.file("", "ex-convo.yml", package = "convo")
convo <- read_convo(filepath)
agent <- create_pb_agent(convo, data.frame(IND_A = 1, IND_B = 5, DT_B = as.Date("2020-01-01")))
pointblank::interrogate(agent)
```
Or create a `pointblank` YAML file for portability:
```r
filepath <- system.file("", "ex-convo.yml", package = "convo")
convo <- read_convo(filepath)

# Use forward slashes in the temp directory path (relevant on Windows)
tmp <- gsub("\\\\", "/", tempdir())
write_pb(convo, c("IND_A", "AMT_B"), filename = "convo-validation.yml", path = tmp)
```
```{yaml, code=xfun::read_utf8(file.path(tempdir(), "convo-validation.yml")), eval = FALSE}
```

## Document

Make a data dictionary:

```r
vars <- c("AMT_A_2019", "IND_C_2020")
desc_df <- describe_names(vars, convo, desc_str = "{level1} of {level2} in given year")
DT::datatable(desc_df)
```
Visualize data as controlled vocabulary components:
```r
vbls <- c("AMT_A_2019", "AMT_B", "AMT_C", "IND_A", "IND_B_2020")
vbls_df <- parse_df(vbls)
viz_names(vbls_df)
```
Describe an overall `convo` specification. Optionally include the "contracts" (or validation checks) in the documentation (also powered by `pointblank`):
```r
desc_df <- describe_convo(convo, include_valid = TRUE, for_DT = TRUE)
DT::datatable(desc_df, escape = FALSE)
```
`convo` can also manage controlled vocabularies beyond column names. For example, `convo` can help document files in a large project.
```r
filenames <- c(
  "analysis/validation-a.Rmd", "analysis/validation-b.Rmd",
  "analysis/analysis.Rmd", "analysis/report.Rmd",
  "src/script-a.sql", "src/script-b.sql"
)
# Drop directories and file extensions before parsing stubs
filenames_clean <- gsub("\\.[A-Za-z]+$", "", basename(filenames))
parse_stubs(filenames_clean, sep = "-")
```