create_dictionary: Create a Dictionary for a Data Table or Data Frame
In PHSKC-APDE/rads: Assisted computation of King County public health data

create_dictionary

R Documentation

Create a Dictionary for a Data Table or Data Frame

Description

This function generates a dictionary of the variables in a given data.table or data.frame, detailing the variable types and a sample of unique values. Optionally, it can incorporate descriptions and notes from a reference data.table or data.frame.

Usage

create_dictionary(ph.data,
                  source,
                  suppress = NULL,
                  varsort = FALSE,
                  max_unique_values = 8,
                  truncation_threshold = 5,
                  ph.ref = NULL)

Arguments

`ph.data`	A data.table or data.frame containing the dataset to be analyzed.
`source`	A character string indicating the source of the data, e.g., `"[my_schema].[my_table]"`.
`suppress`	A character vector of variable names that will have sample values suppressed in the dictionary output (e.g., Social Security Numbers). Default is NULL.
`varsort`	A logical value (`TRUE` or `FALSE`) indicating whether to sort the dictionary alphabetically by variable name. Default is FALSE, which keeps the original column order.
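`max_unique_values`	An integer indicating the maximum number of unique values to display for each variable. If a variable has more unique values than this, the displayed values are truncated (non-numeric variables) or summarized as a minimum and maximum (numeric, date, and datetime variables). Default is 8.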
`truncation_threshold`	An integer indicating how many values to display before truncating with an ellipsis, when the number of unique values exceeds max_unique_values. Only applies to non-numeric variables. Default is 5.
`ph.ref`	An optional reference data.table or data.frame with columns: `source`, `varname`, `desc`, and `notes`. It will be merged onto the new data dictionary. Default is NULL.

Details

The create_dictionary function generates a dictionary from the provided data.table or data.frame, indicating the variable types and listing unique values for each variable. Different variable types are handled as follows:

Character and logical variables: If the number of unique values exceeds max_unique_values, the function displays the first truncation_threshold values followed by an ellipsis.
Factor variables: The function displays factor levels and their corresponding integer codes. These are displayed following the rules for character values.
Numeric variables (integer, numeric): If the number of unique values exceeds max_unique_values, the function displays the minimum and maximum values.
Date and datetime variables: Treated similarly to numeric variables, showing minimum and maximum values if there are too many unique values.
Other types: For non-atomic types (e.g., lists), the function suggests checking the original dataset structure.
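
The display rules above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by rads: the helper name `summarize_values()`, the `", "` separator, the `"to"` range wording, and the message for non-atomic types are all hypothetical.

```r
# Hypothetical helper, NOT part of rads: mimics the documented display rules.
summarize_values <- function(x, max_unique_values = 8, truncation_threshold = 5) {
  # Non-atomic columns (e.g., lists) are not summarized
  if (!is.atomic(x)) {
    return("(non-atomic: check the original dataset structure)")
  }
  vals <- sort(unique(x[!is.na(x)]))
  # Few enough unique values: show them all
  if (length(vals) <= max_unique_values) {
    return(paste(as.character(vals), collapse = ", "))
  }
  # Numeric, date, and datetime: report the minimum and maximum only
  if (is.numeric(x) || inherits(x, c("Date", "POSIXct"))) {
    return(paste(as.character(min(vals)), "to", as.character(max(vals))))
  }
  # Character, logical, factor: first few values followed by an ellipsis
  shown <- as.character(head(vals, truncation_threshold))
  paste0(paste(shown, collapse = ", "), ", ...")
}

summarize_values(c("a", "b", "c", "d"))  # "a, b, c, d"
summarize_values(LETTERS)                # "A, B, C, D, E, ..."
summarize_values(0:5000)                 # "0 to 5000"
```

Note that `truncation_threshold` only matters on the last branch, which matches the documented statement that it applies to non-numeric variables.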

Users can hide the unique values of sensitive variables in ph.data (e.g., phone numbers) by naming them in the suppress argument. Additionally, if a reference data.table or data.frame (ph.ref) is provided, the function merges its descriptions and notes into the output.
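
As a rough illustration of what suppression and the reference merge amount to, the sketch below works on a hand-built stand-in for a dictionary. It is not the rads implementation: the tables `dict` and `ref`, the placeholder text `"_suppressed_"`, and the choice of a left join on `source` and `varname` are assumptions made for the example.

```r
library(data.table)

# Hypothetical stand-in for a dictionary; real output has more columns.
dict <- data.table(
  source  = "test dataset",
  varname = c("xID", "xbinary", "xnumeric"),
  values  = c("A12345, B67890", "0, 1", "0.01 to 99.9")
)

# Hypothetical reference table with the four documented columns.
ref <- data.table(
  source  = "test dataset",
  varname = c("xID", "xbinary"),
  desc    = c("ID", "Binary variable"),
  notes   = c("Sample IDs", "Generic binary")
)

# Mask sample values for sensitive variables (placeholder text is an assumption).
suppress <- c("xID")
dict[varname %in% suppress, values := "_suppressed_"]

# Left join: every dictionary row is kept, and variables missing from the
# reference table get NA in desc and notes. sort = FALSE keeps dict's row order.
merged <- merge(dict, ref, by = c("source", "varname"), all.x = TRUE, sort = FALSE)
print(merged)
```

A left join is used here so that variables absent from the reference table still appear in the result, which is consistent with the documented statement that `desc` and `notes` are only filled when reference information is available.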

Value

A data.table with the following columns:

source: Character: The source of the data.
varname: Character: The name of the variable.
vartype: Character: The type of the variable (e.g., factor, character, logical, integer, numeric, date, datetime, other).
values: Character: A sample of unique values or a range if the number of unique values exceeds max_unique_values.
factor_labels: Character: Labels for factor levels if the variable is a factor.
desc: Character: Description of the variable. This column is only filled if a ph.ref data frame is provided.
notes: Character: Additional notes about the variable. This column is only filled if a ph.ref data frame is provided.
dict_updated: Date: The date the dictionary was created, i.e., the date you ran this function.
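
For readers who want a feel for how the vartype labels relate to R classes, here is a minimal sketch. The helper `classify_vartype()` is hypothetical and the order of checks is an assumption; the package may classify columns differently.

```r
# Hypothetical helper, NOT part of rads: maps a column to a vartype label.
classify_vartype <- function(x) {
  # Class-based checks come first because factors and dates are stored
  # as integers or doubles underneath.
  if (is.factor(x)) return("factor")
  if (inherits(x, "Date")) return("date")
  if (inherits(x, c("POSIXct", "POSIXlt"))) return("datetime")
  if (is.character(x)) return("character")
  if (is.logical(x)) return("logical")
  if (is.integer(x)) return("integer")
  if (is.numeric(x)) return("numeric")
  "other"
}

df <- data.frame(
  id   = c("a", "b"),
  flag = c(TRUE, FALSE),
  n    = 1:2,
  x    = c(0.5, 1.5),
  day  = as.Date(c("2000-01-01", "2000-01-02")),
  stringsAsFactors = FALSE
)

# One label per column, returned as a named character vector
vartypes <- vapply(df, classify_vartype, character(1))
print(vartypes)
```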

Examples

library(data.table)
library(rads)  # provides create_dictionary()
dt <- data.table(
  xID = paste0(sample(LETTERS, size = 1000, replace = TRUE),
               sample(c(12345L:99999L), size = 1000, replace = TRUE)),
  xlogical = sample(c(TRUE, FALSE), size = 1000, replace = TRUE),
  xchar_long = sample(c(LETTERS), size = 1000, replace = TRUE),
  xchar_short = sample(c('a', 'b', 'c', 'd'), size = 1000, replace = TRUE),
  xfactor = factor(sample(1L:4L, size = 1000, replace = TRUE),
                   levels = 1L:4L,
                   labels = c('One', 'Two', 'Three', 'Four')),
  xbinary = sample(c(0, 1), size = 1000, replace = TRUE),
  xinteger_long = sample(c(0L:5000L), size = 1000, replace = TRUE),
  xinteger_short = sample(c(0:4), size = 1000, replace = TRUE),
  xnumeric = runif(1000, 0, 100),
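  # Sample from Date sequences directly so the Date class is preserved
  # (as.Date() on bare numbers needs an origin in R versions before 4.3)
  xdate_long = sample(seq(as.Date('1900-01-01'), as.Date('1999-12-31'), by = 'day'),
                      size = 1000,
                      replace = TRUE),
  xdate_short = sample(seq(as.Date('2000-01-01'), as.Date('2000-01-04'), by = 'day'),
                       size = 1000,
                       replace = TRUE),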
  xdatetime_long = as.POSIXct(
    runif(1000,
          min = as.numeric(as.POSIXct('2023-01-01 00:00:00')),
          max = as.numeric(Sys.time())),
    origin = "1970-01-01"),
  xother = sample(list(c(1:3), c(2:4), c(3:5), c(4:6)),
                  size = 1000,
                  replace = TRUE)
)

dictionary1 <- create_dictionary(ph.data = dt,
                                 source = 'test dataset',
                                 suppress = c('xID'),
                                 varsort = FALSE)
print(dictionary1[])

ph.ref <- data.table(
  source = rep('test dataset', 6),  # one entry per varname below
  varname = c('xID', 'xbinary', 'xlogical', 'xchar_long', 'xfactor',
             'xinteger_long'),
  desc = c('ID', 'Binary variable', 'Logical variable',
           'Character variable with long names',
           'Factor variable with labels',
           'Integer variable with long range'),
  notes = c('Sample IDs', 'Generic binary', 'Important', 'Check values',
            'Categorical data', 'Range from 0 to 5000')
)

dictionary2 <- create_dictionary(ph.data = dt,
                                 source = 'test dataset',
                                 suppress = c('xID'),
                                 varsort = FALSE,
                                 ph.ref = ph.ref)
print(dictionary2[])

PHSKC-APDE/rads documentation built on April 14, 2025, 10:47 a.m.