create_dictionary: Create a Dictionary for a Data Table or Data Frame

Description

This function generates a dictionary of the variables in a given data.table or data.frame, detailing the variable types and a sample of unique values. Optionally, it can incorporate descriptions and notes from a reference data.table or data.frame.

Usage

create_dictionary(ph.data,
                  source,
                  suppress = NULL,
                  varsort = FALSE,
                  max_unique_values = 8,
                  truncation_threshold = 5,
                  ph.ref = NULL)

Arguments

ph.data

A data.table or data.frame containing the dataset to be analyzed.

source

A character string indicating the source of the data, e.g., "[my_schema].[my_table]".

suppress

A character vector of variable names that will have sample values suppressed in the dictionary output (e.g., Social Security Numbers). Default is NULL.

varsort

A logical value (TRUE or FALSE) indicating whether to sort the dictionary alphabetically by variable name. Default is FALSE, which keeps the original column order.

max_unique_values

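An integer indicating the maximum number of unique values to display for each variable. If a variable has more unique values than this, the displayed values are truncated (non-numeric variables) or summarized as a minimum and maximum (numeric, date, and datetime variables). Default is 8.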

truncation_threshold

An integer indicating how many values to display before truncating with an ellipsis, when the number of unique values exceeds max_unique_values. Only applies to non-numeric variables. Default is 5.

ph.ref

An optional reference data.table or data.frame with the columns source, varname, desc, and notes. Its descriptions and notes are merged onto the new data dictionary. Default is NULL.

Details

The create_dictionary function generates a dictionary from the provided data.table or data.frame, indicating the variable types and listing unique values for each variable. Different variable types are handled as follows:

  • Character and logical variables: If the number of unique values exceeds max_unique_values, the function displays the first truncation_threshold values followed by an ellipsis.

  • Factor variables: The function displays factor levels and their corresponding integer codes. These are displayed following the rules for character values.

  • Numeric variables (integer, numeric): If the number of unique values exceeds max_unique_values, the function displays the minimum and maximum values.

  • Date and datetime variables: Treated similarly to numeric variables, showing minimum and maximum values if there are too many unique values.

  • Other types: For non-atomic types (e.g., lists), the function suggests checking the original dataset structure.
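
The display rules listed above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper name summarize_values, the separator strings, the placeholder message for non-atomic columns, and the choice to sort values before truncating are all hypothetical.

```r
# Hypothetical helper: NOT the rads implementation, only a sketch of the
# display rules described in the Details list.
summarize_values <- function(x, max_unique_values = 8, truncation_threshold = 5) {
  # Non-atomic columns (e.g., lists) are not summarized
  if (!is.atomic(x)) {
    return("check the original dataset structure")
  }
  # Assumption: values are sorted before display; the real ordering may differ
  vals <- sort(unique(x[!is.na(x)]))
  # Few enough unique values: show all of them
  if (length(vals) <= max_unique_values) {
    return(paste(as.character(vals), collapse = ", "))
  }
  # Numeric, date, and datetime variables: show the minimum and maximum
  if (is.numeric(x) || inherits(x, c("Date", "POSIXct"))) {
    return(paste0(as.character(min(vals)), " to ", as.character(max(vals))))
  }
  # Character, logical, and factor variables: first few values, then an ellipsis
  shown <- as.character(head(vals, truncation_threshold))
  paste0(paste(shown, collapse = ", "), ", ...")
}

summarize_values(c("a", "b", "c", "d"))   # "a, b, c, d"
summarize_values(LETTERS)                 # "A, B, C, D, E, ..."
summarize_values(0:5000)                  # "0 to 5000"
summarize_values(list(1:3, 2:4))          # "check the original dataset structure"
```

With the defaults, a character variable with 26 unique values shows only its first 5, while an integer variable with 5001 unique values collapses to its range.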

Users can hide the unique values of sensitive variables (e.g., phone numbers in ph.data) by listing them in the suppress parameter. Additionally, if a reference data.table or data.frame (ph.ref) is provided, the function merges its descriptions and notes into the output.
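
To illustrate what merging a reference table into a dictionary might look like, here is a minimal base R sketch. It assumes the join is a left join keyed on source and varname, which the documentation does not state explicitly. The dict and ref objects are toy stand-ins with illustrative values, not actual create_dictionary output.

```r
# Toy dictionary rows standing in for create_dictionary() output (illustrative)
dict <- data.frame(
  source  = "test dataset",
  varname = c("xID", "xbinary", "xnumeric"),
  vartype = c("character", "numeric", "numeric"),
  stringsAsFactors = FALSE
)

# Reference table with the four columns expected of ph.ref
ref <- data.frame(
  source  = "test dataset",
  varname = c("xID", "xbinary"),
  desc    = c("ID", "Binary variable"),
  notes   = c("Sample IDs", "Generic binary"),
  stringsAsFactors = FALSE
)

# Assumption: a left join on source + varname, so every dictionary row is kept
# and variables without a reference entry get NA for desc and notes
merged <- merge(dict, ref, by = c("source", "varname"), all.x = TRUE, sort = FALSE)
print(merged)
```

In this sketch, xnumeric has no entry in ref, so its desc and notes are NA after the merge.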

Value

A data.table with the following columns:

source

Character: The source of the data.

varname

Character: The name of the variable.

vartype

Character: The type of the variable (e.g., factor, character, logical, integer, numeric, date, datetime, other).

values

Character: A sample of unique values or a range if the number of unique values exceeds max_unique_values.

factor_labels

Character: Labels for factor levels if the variable is a factor.

desc

Character: Description of the variable. This column is only filled if a ph.ref data frame is provided.

notes

Character: Additional notes about the variable. This column is only filled if a ph.ref data frame is provided.

dict_updated

Date: The date the dictionary was created, i.e., the date you ran this function.

Examples

library(data.table)
dt <- data.table(
  xID = paste0(sample(LETTERS, size = 1000, replace = TRUE),
               sample(c(12345L:99999L), size = 1000, replace = TRUE)),
  xlogical = sample(c(TRUE, FALSE), size = 1000, replace = TRUE),
  xchar_long = sample(c(LETTERS), size = 1000, replace = TRUE),
  xchar_short = sample(c('a', 'b', 'c', 'd'), size = 1000, replace = TRUE),
  xfactor = factor(sample(1L:4L, size = 1000, replace = TRUE),
                   levels = 1L:4L,
                   labels = c('One', 'Two', 'Three', 'Four')),
  xbinary = sample(c(0, 1), size = 1000, replace = TRUE),
  xinteger_long = sample(c(0L:5000L), size = 1000, replace = TRUE),
  xinteger_short = sample(c(0:4), size = 1000, replace = TRUE),
  xnumeric = runif(1000, 0, 100),
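  # origin is supplied explicitly because as.Date() on numbers requires it in R < 4.3
  xdate_long = as.Date(sample(c(as.Date('1900-01-01'):as.Date('1999-12-31')),
                       size = 1000,
                       replace = TRUE),
                       origin = '1970-01-01'),
  xdate_short = as.Date(sample(c(as.Date('2000-01-01'):as.Date('2000-01-04')),
                        size = 1000,
                        replace = TRUE),
                        origin = '1970-01-01'),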
  xdatetime_long = as.POSIXct(
                    runif(1000,
                          min = as.numeric(as.POSIXct('2023-01-01 00:00:00')),
                          max = as.numeric(Sys.time())),
                    origin = "1970-01-01"),
  xother = sample(list(c(1:3), c(2:4), c(3:5), c(4:6)),
                  size = 1000,
                  replace = TRUE)
)

dictionary1 <- create_dictionary(ph.data = dt,
                                 source = 'test dataset',
                                 suppress = c('xID'),
                                 varsort = FALSE)
print(dictionary1[])

ph.ref <- data.table(
  source = rep('test dataset', 6),
  varname = c('xID', 'xbinary', 'xlogical', 'xchar_long', 'xfactor',
             'xinteger_long'),
  desc = c('ID', 'Binary variable', 'Logical variable',
           'Character variable with long names',
           'Factor variable with labels',
           'Integer variable with long range'),
  notes = c('Sample IDs', 'Generic binary', 'Important', 'Check values',
            'Categorical data', 'Range from 0 to 5000')
)

dictionary2 <- create_dictionary(ph.data = dt,
                                 source = 'test dataset',
                                 suppress = c('xID'),
                                 varsort = FALSE,
                                 ph.ref = ph.ref)
print(dictionary2[])


PHSKC-APDE/rads documentation built on April 14, 2025, 10:47 a.m.