View source: R/create_dictionary.R
create_dictionary | R Documentation |
This function generates a dictionary of the variables in a given data.table or data.frame, detailing the variable types and a sample of unique values. Optionally, it can incorporate descriptions and notes from a reference data.table or data.frame.
create_dictionary(ph.data,
                  source,
                  suppress = NULL,
                  varsort = FALSE,
                  max_unique_values = 8,
                  truncation_threshold = 5,
                  ph.ref = NULL)
ph.data |
A data.table or data.frame containing the dataset to be analyzed. |
source |
A character string indicating the source of the data, e.g.,
|
suppress |
A character vector of variable names that will have sample values suppressed in the dictionary output (e.g., Social Security Numbers). Default is NULL. |
varsort |
A logical value ( |
max_unique_values |
An integer indicating the maximum number of unique values to display for each variable. If a variable has more unique values than this, it will be truncated or summarized. Default is 8. |
truncation_threshold |
An integer indicating how many values to display before truncating with an ellipsis, when the number of unique values exceeds max_unique_values. Only applies to non-numeric variables. Default is 5. |
ph.ref |
An optional reference data.table or data.frame with columns:
|
The create_dictionary function generates a dictionary from the provided data.table or data.frame, indicating the variable types and listing unique values for each variable. Different variable types are handled as follows:
Character and logical variables: If the number of unique values exceeds max_unique_values, the function displays the first truncation_threshold values followed by an ellipsis.
Factor variables: The function displays factor levels and their corresponding integer codes. These are displayed following the rules for character values.
Numeric variables (integer, numeric): If the number of unique values exceeds max_unique_values, the function displays the minimum and maximum values.
Date and datetime variables: Treated similarly to numeric variables, showing minimum and maximum values if there are too many unique values.
Other types: For non-atomic types (e.g., lists), the function suggests checking the original dataset structure.
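The display rules above can be illustrated with a small base R sketch. The helper sketch_values() is hypothetical and is not part of the package; the separators, the " to " range format, and the message text are assumptions, so the real function's output may look different.

```r
# Hypothetical helper illustrating the display rules described above.
# It is NOT the package's implementation; the output format is assumed.
sketch_values <- function(x, max_unique_values = 8, truncation_threshold = 5) {
  if (!is.atomic(x)) {
    # Non-atomic columns (e.g., lists) are not summarized
    return("check the original dataset structure")
  }
  vals <- sort(unique(x[!is.na(x)]))
  if (length(vals) <= max_unique_values) {
    # Few enough unique values: show them all
    return(paste(as.character(vals), collapse = ", "))
  }
  if (is.numeric(x) || inherits(x, c("Date", "POSIXct"))) {
    # Numeric, date, and datetime columns: show only the range
    return(paste0(as.character(min(vals)), " to ", as.character(max(vals))))
  }
  # Character, logical, and factor columns: first few values, then an ellipsis
  shown <- as.character(vals[seq_len(truncation_threshold)])
  paste0(paste(shown, collapse = ", "), ", ...")
}

sketch_values(c("a", "b", "c", "d"))  # all four values are shown
sketch_values(LETTERS)                # first five letters, then an ellipsis
sketch_values(0:5000)                 # minimum and maximum only
```

Factors fall through to the character branch here because is.numeric() returns FALSE for them, which mirrors the statement that factor levels follow the rules for character values.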
Users can hide the unique values of sensitive variables (e.g., phone numbers in ph.data) using the suppress parameter. Additionally, if a reference data.table or data.frame (ph.ref) is provided, the function merges its descriptions and notes into the output.
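As a rough illustration of suppression and the reference merge, the following data.table sketch works on a toy dictionary. The column names follow the ph.ref example in this documentation; the placeholder text "_suppressed_" and the use of a left join are assumptions, not confirmed details of how create_dictionary works internally.

```r
library(data.table)

# Toy dictionary and reference table (illustrative values only)
dict <- data.table(varname = c("xID", "xbinary", "xnumeric"),
                   values  = c("A12345, B67890", "0, 1", "0.1 to 99.9"))
ref  <- data.table(varname = c("xID", "xbinary"),
                   desc    = c("ID", "Binary variable"),
                   notes   = c("Sample IDs", "Generic binary"))

# Hide sample values for sensitive variables (placeholder text is assumed)
suppress <- c("xID")
dict[varname %in% suppress, values := "_suppressed_"]

# Left join keeps every variable in the dictionary; variables that are
# missing from the reference table get NA for desc and notes
merged <- merge(dict, ref, by = "varname", all.x = TRUE, sort = FALSE)
print(merged)
```

A left join is used in this sketch so that variables without a reference entry still appear in the result, with empty description and notes.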
A data.table with the following columns:
Character: The source of the data.
Character: The name of the variable.
Character: The type of the variable (e.g., factor, character, logical, integer, numeric, date, datetime, other).
Character: A sample of unique values or a range if the number of unique values exceeds max_unique_values.
Character: Labels for factor levels if the variable is a factor.
Character: Description of the variable. This column is only filled if a ph.ref data frame is provided.
Character: Additional notes about the variable. This column is only filled if a ph.ref data frame is provided.
Date: The date the dictionary was created, i.e., the date you ran this function.
library(data.table)

# Build a test dataset with one column per variable type
dt <- data.table(
  xID = paste0(sample(LETTERS, size = 1000, replace = TRUE),
               sample(c(12345L:99999L), size = 1000, replace = TRUE)),
  xlogical = sample(c(TRUE, FALSE), size = 1000, replace = TRUE),
  xchar_long = sample(c(LETTERS), size = 1000, replace = TRUE),
  xchar_short = sample(c('a', 'b', 'c', 'd'), size = 1000, replace = TRUE),
  xfactor = factor(sample(1L:4L, size = 1000, replace = TRUE),
                   levels = 1L:4L,
                   labels = c('One', 'Two', 'Three', 'Four')),
  xbinary = sample(c(0, 1), size = 1000, replace = TRUE),
  xinteger_long = sample(c(0L:5000L), size = 1000, replace = TRUE),
  xinteger_short = sample(c(0:4), size = 1000, replace = TRUE),
  xnumeric = runif(1000, 0, 100),
  # origin is given explicitly because as.Date() requires it for
  # numeric input in R versions before 4.3.0
  xdate_long = as.Date(sample(c(as.Date('1900-01-01'):as.Date('1999-12-31')),
                              size = 1000,
                              replace = TRUE),
                       origin = '1970-01-01'),
  xdate_short = as.Date(sample(c(as.Date('2000-01-01'):as.Date('2000-01-04')),
                               size = 1000,
                               replace = TRUE),
                        origin = '1970-01-01'),
  xdatetime_long = as.POSIXct(
    runif(1000,
          min = as.numeric(as.POSIXct('2023-01-01 00:00:00')),
          max = as.numeric(Sys.time())),
    origin = "1970-01-01"),
  xother = sample(list(c(1:3), c(2:4), c(3:5), c(4:6)),
                  size = 1000,
                  replace = TRUE)
)

# Example 1: dictionary without a reference table, hiding the ID values
dictionary1 <- create_dictionary(ph.data = dt,
                                 source = 'test dataset',
                                 suppress = c('xID'),
                                 varsort = FALSE)
print(dictionary1[])

# Example 2: dictionary with descriptions and notes from a reference table
# (source must have one entry per variable listed in varname)
ph.ref <- data.table(
  source = rep('test dataset', 6),
  varname = c('xID', 'xbinary', 'xlogical', 'xchar_long', 'xfactor',
              'xinteger_long'),
  desc = c('ID', 'Binary variable', 'Logical variable',
           'Character variable with long names',
           'Factor variable with labels',
           'Integer variable with long range'),
  notes = c('Sample IDs', 'Generic binary', 'Important', 'Check values',
            'Categorical data', 'Range from 0 to 5000')
)
dictionary2 <- create_dictionary(ph.data = dt,
                                 source = 'test dataset',
                                 suppress = c('xID'),
                                 varsort = FALSE,
                                 ph.ref = ph.ref)
print(dictionary2[])