Defines functions count_f_str_check separate_int_dig parse_fmt parse_hug_char gather_settings f_str

Documented in f_str

### Formatting

#' Create a `f_str` object
#' `f_str` objects are intended to be used within the function
#' `set_format_strings`. The `f_str` object carries information that powers a
#' significant amount of layer processing. The `format_string` parameter is
#' capable of controlling the display of a data point and decimal precision. The
#' variables provided in `...` control which data points are used to populate
#' the string formatted output.
#' @details Format strings are one of the most powerful components of 'Tplyr'.
#'   Traditionally, converting numeric values into strings for presentation can
#'   consume a good deal of time. Values and decimals need to align between
#'   rows, rounding before trimming is sometimes forgotten - it can become a
#'   tedious mess that is realistically not an important part of the analysis
#'   being performed. 'Tplyr' makes this process as simple as we can, while
#'   still allowing flexibility to the user.
#'   Tplyr provides both manual and automatic decimal precision formatting. The
#'   display of the numbers in the resulting data frame is controlled by the
#'   `format_string` parameter. For manual precision, just like dummy values may
#'   be presented on your mocks, integer and decimal precision is specified by
#'   the user providing a string of 'x's for how you'd like your numbers
#'   formatted. If you'd like 2 integers with 3 decimal places, you specify your
#'   string as 'xx.xxx'. 'Tplyr' does the work to get the numbers in the right
#'   place.
#'   To take this a step further, automatic decimal precision can also be
#'   obtained based on the collected precision within the data. When creating
#'   tables where results vary by some parameter, different results may call for
#'   different degrees of precision. To use automatic precision, use a single
#'   'a' on either the integer and decimal side. If you'd like to use increased
#'   precision (i.e. you'd like mean to be collected precision +1), use 'a+1'.
#'   So if you'd like both integer and and decimal precision to be based on the
#'   data as collected, you can use a format like 'a.a' - or for collected+1
#'   decimal precision, 'a.a+1'.  You can mix and match this with manual formats
#'   as well, making format strings such as 'xx.a+1'.
#'   If you want two numbers on the same line, you provide two sets of x's. For
#'   example, if you're presenting a value like "mean (sd)" - you could provide
#'   the string 'xx.xx (xx.xxx)', or perhaps 'a.a+1 (a.a+2). Note that you're
#'   able to provide different integer lengths and different decimal precision
#'   for the two values. Each format string is independent and relates only to
#'   the format specified.
#'   As described above, when using 'x' or 'a', any other character within the
#'   format string will stay stationary. So for example, if your format string
#'   is 'xx (xxx.x)', your number may format as '12 ( 34.5)'. So the left side
#'   parenthesis stays fixed. In some displays, you may want the parenthesis to
#'   'hug' your number. Following this example, when allotting 3 spaces for the
#'   integer within parentheses, the parentehsis should shift to the right,
#'   making the numbers appear '12  (34.5)'. Using `f_str()` you can achieve
#'   this by using a capital 'X' or 'A'. For this example, the format string
#'   would be 'xx (XXX.x)'.
#'   There are a two rules when using 'parenthesis hugging':
#'   - Capital letters should only be used on the integer side of a number
#'   - A character must precede the capital letter, otherwise there's no
#'   character to 'hug'
#'   The other parameters of the `f_str` call specify what values should fill
#'   the x's. `f_str` objects are used slightly differently between different
#'   layers. When declaring a format string within a count layer, `f_str()`
#'   expects to see the values `n` or `distinct_n` for event or distinct counts,
#'   `pct` or `distinct_pct` for event or distinct percentages, or `total` or
#'   `distinct_total` for denominator calculations. Note that in an `f_str()`
#'   for a count layer 'A' or 'a' are based on n counts, and therefore don't
#'   make sense to use in percentages. But in descriptive statistic layers,
#'   `f_str` parameters refer to the names of the summaries being performed,
#'   either by built in defaults, or custom summaries declared using
#'   [set_custom_summaries()]. See [set_format_strings()] for some more notes
#'   about layers specific implementation.
#'   An `f_str()` may also be used outside of a Tplyr table. The function
#'   [apply_formats()] allows you to apply an `f_str` within the context of
#'   [dplyr::mutate()] or more generally a vectorized function.
#' @section Valid `f_str()` Variables by Layer Type:
#'   Valid variables allowed within the `...` parameter of `f_str()` differ by
#'   layer type.
#'   - Count layers
#'     - `n`
#'     - `pct`
#'     - `total`
#'     - `distinct_n`
#'     - `distinct_pct`
#'     - `distinct_total`
#'   - Shift layers
#'     - `n`
#'     - `pct`
#'     - `total`
#'   - Desc layers
#'     - `n`
#'     - `mean`
#'     - `sd`
#'     - `median`
#'     - `var`
#'     - `min`
#'     - `max`
#'     - `iqr`
#'     - `q1`
#'     - `q3`
#'     - `missing`
#'     - Custom summaries created by [set_custom_summaries()]
#' @param format_string The desired display format. X's indicate digits. On the
#'   left, the number of x's indicates the integer length. On the right, the
#'   number of x's controls decimal precision and rounding. Variables are
#'   inferred by any separation of the 'x' values other than a decimal.
#' @param ... The variables to be formatted using the format specified in
#'   \code{format_string}.
#' @param empty The string to display when the numeric data is not available.
#'   For desc layers, an unnamed character vector will populate within the
#'   provided format string, set to the same width as the fitted numbers. Use a
#'   single element character vector, with the element named '.overall' to
#'   instead replace the whole string.
#' @return A `f_str` object
#' @export
#' @md
#' @examples
#' f_str("xx.x (xx.x)", mean, sd)
#' f_str("a.a+1 (a.a+2)", mean, sd)
#' f_str("xx.a (xx.a+1)", mean, sd)
#' f_str("xx.x, xx.x, xx.x", q1, median, q3)
#' f_str("xx (XXX.x%)", n, pct)
#' f_str("a.a+1 (A.a+2)", mean, sd)
f_str <- function(format_string, ..., empty=c(.overall='')) {

  # Capture the variables off of the ellipsis
  vars <- enexprs(...)
  # Null out the names for vars
  names(vars) <- NULL

  # Check format string class
  assert_has_class(format_string, "character")

  # Do a pre-check of the format string to catch invalid auto specifications
  if (str_detect(format_string, "AA|aa")) {
    stop(paste0("In f_str(), only use a single 'A' or 'a' on the integer or",
                " decimal side to trigger auto precision."), call.=TRUE)

  # Parse out the format string sections
  rx <- get_format_string_regex()
  formats <- str_extract_all(format_string, rx)[[1]]

  # Duplicate any '%' to escape them
  format_string_1 <- str_replace_all(format_string, "%", "%%")

  # Make the sprintf ready string
  repl_str <- str_replace_all(format_string_1, rx, "%s")

  # Make sure that if two formats were found, two variables exist
  assert_that(length(formats) == length(vars),
              msg = paste0("In `f_str` ", length(formats), " formats were entered in the format string ",
                           format_string, "but ", length(vars), " variables were assigned."))

  # Pull out the integer and decimal
  settings <- map(formats, gather_settings)

  # A value in settings will be <0 if it's an auto format
  auto_precision <- any(map_lgl(settings, ~ any(as.logical(.[c('auto_int', 'auto_dec')]))))
  hug_formatting <- any(map_lgl(settings, ~ !is.na(.['hug_char'])))

  # All ellipsis variables are names
  assert_that(all(map_lgl(vars, function(x) class(x) == "name")),
              msg = "In `f_str` all values submitted via `...` must be variable names.")

    list(format_string = format_string,
         vars = vars,
         formats = formats,
         settings = settings,
         size = nchar(format_string),
         repl_str = repl_str,
         auto_precision = auto_precision,
         hug_formatting = hug_formatting,

#' Gather the settings for a specific format string section
#' This function will collect specific settings about a format string section,
#' including integer and decimal length, whether autos were turned on, and hug
#' character settings/
#' @param x A character string representing a format string section
#' @return A named list of settings
#' @noRd
gather_settings <- function(x) {

  settings <- list(
    int = 0,
    dec = 0,
    auto_int = FALSE,
    auto_dec = FALSE,
    hug_char = NA_character_

  settings <- parse_hug_char(x, settings)
  settings <- separate_int_dig(x, settings)


#' Find if a hug character exists and attach to settings
#' @param x Format string section
#' @param settings A list of settings for a format string section
#' @return List of settings
#' @noRd
parse_hug_char <- function(x, settings) {

  # Find hugging
  if (str_detect(x, "X|A")) {

    # Look for characters preceding X or A that aren't X or A
    hug_char_rx <- regex("([^XA]+)[XA]")

    # Search the hug character and pull out all matches
    # x is guaranteed to be a single element vector so pull out first
    # element of the list
    hug_char_match <- str_match_all(x, hug_char_rx)[[1]]

    # If no rows, then X or A was used with no specified hug character
    if (nrow(hug_char_match) == 0) {
        paste0("In f_str(), an 'X' or 'A' was used but no hug character ",
               "was specified, such as a parenthesis. Use 'X' or 'A' to bind ",
               "a character within a format string."),

    # The match matrix can't be more than one row. If it is, it was probably
    # because X or A were placed before and after a decimal, so show the user
    if (nrow(hug_char_match) > 1) {
      err_msg <- paste0(
        "In f_str(), invalid format string specification. The following section",
        " failed to parse:\n\t'", x,
        "'\nThe issue is present with a hug character. Was 'X' or 'A' used after",
        " a decimal?"
      stop(err_msg, call.=FALSE)

    # If X or A was used after the decimal at all, that's also invalid so error
    # out as well
    if (str_detect(hug_char_match[1,1], fixed("."))) {
        paste0("In f_str(), 'X' or 'A' can only be used on the left side of a",
               " decimal within a format string."),

    # The hug char is in a capture group, so we pull it out of the match
    settings$hug_char <- hug_char_match[1,2]


#' Parse a portion of a string format
#' After the string is split by the decimal, parse what remains
#' Auto formats will start at -1 and decrement by the + value
#' @param x Portioned string format
#' @return A numeric value. >0 is literal length, <0 is auto format
#' @noRd
parse_fmt <- function(x) {
  # If it's an auto format, grab the output value
  if (grepl('a|A', x)) {
    # Pick out the digit
    add <- replace_na(as.double(str_extract(x, '\\d+')), 0)
    # Auto formats will be -1 - the specified precision
    val <- 0 + add
    # Give an attribute that there's an auto format
    attr(val, 'auto') <- TRUE
  } else {
    val <- nchar(x)
    # Not auto format
    attr(val, 'auto') <- FALSE

#' Evaluate a portion of a format string to check the integer and digit lengths
#' @param x Format string section
#' @param settings A list of settings for a format string section
#' @return List of settings
#' @noRd
separate_int_dig <- function(x, settings){

  # Count the characters on each side of the decimal
  fields <- str_split(x, "\\.")[[1]]
  # Label the split segments
  names(fields) <- c('int', 'dec')[1:length(fields)]

  # Parse out length and auto info from each field and apply to settings
  num_chars <- map(fields, parse_fmt)
  auto <- map_lgl(num_chars, ~attr(.x, 'auto'))

  settings[names(num_chars)] <- as.numeric(num_chars)
  settings[paste0("auto_", names(auto))] <- auto

  # If a hug character is specified,subtract if from the integer length
  if (!is.na(settings$hug_char) && settings$auto_int) {
    settings$int <- settings$int + (nchar(settings$hug_char) - 1)


#' Helper for changing values on count f_str
#' @param ... The object passed to `set_format_strings` or `set_count_layer_formats`
#' @noRd
count_f_str_check <- function(...) {
  # Catch the arguments from the function call so useful errors can be thrown
  check <- enquos(...)

  # Make sure that all of the attachments were `f_str` objects
  for (i in seq_along(check)) {

    if (is_named(check)) {
      msg = paste0("In `set_format_string` entry `",names(check)[[i]],"` is not an `f_str` object. All assignmentes made within",
                   " `set_format_string` must be made using the function `f_str`. See the `f_str` documentation.")
    } else {
      msg = paste0("In `set_format_string` entry ",i," is not an `f_str` object. All assignmentes made within",
                   " `set_format_string` must be made using the function `f_str`. See the `f_str` documentation.")

    assert_that(class(quo_get_expr(check[[i]])) == "f_str" || (is_call(quo_get_expr(check[[i]])) && call_name(check[[i]]) == "f_str"),
                msg = msg)

  # Grab the named parameters
  params <- list2(...)

  # Currently supported format names
  valid_names <- c("n_counts", "riskdiff")

  # Raise error if names were invalid
  if (is_named(params)) {
    assert_that(all(names(params) %in% valid_names),
                msg = paste('Invalid format names supplied. Count layers only accept the following format names:',
                            paste(valid_names, collapse = ", "))

  } else {
    # If unnamed, then only one argument should have been supplied
    assert_that(length(params) == 1, msg = "If names are not supplied, count layers can only have on format supplied.")
    # Force the name in of n_counts
    names(params) <- "n_counts"

  # Check content of each f_str based on their supplied name
  for (name in names(params)) {

    if (name == "n_counts") {
      assert_that(all(params[['n_counts']]$vars %in% c("n", "pct", "distinct", "distinct_n", "distinct_pct", "total", "distinct_total")),
                  msg = "f_str for n_counts in a count_layer can only be n, pct, distinct, distinct_pct, total, or distinct_total")

      # Check to make sure both disintct(old), and distinct_n(new) aren't passed
      assert_that(!all(c("distinct", "distinct_n") %in% params[["n_counts"]]$vars),
                  msg = "You can't pass both distinct and distinct_n, just use distinct_n")

      # Check to make sure duplicated parameters aren't passed
      assert_that(length(params[["n_counts"]]$vars) == length(unique(params[["n_counts"]]$vars)),
                  msg = "You've passed duplicate parameters to `set_format_strings`")

      # Replace the disinct with distinct_n
      if (any(params[["n_counts"]]$vars %in% "distinct")) {
        warning("The use of 'distinct' in count f_strs is discouraged. It was replaced with 'distinct_n' for consistancy.")
      params[["n_counts"]]$vars[params[["n_counts"]]$vars %in% "distinct"] <- "distinct_n"

    } else if (name == "riskdiff") {
      assert_that(all(params[['riskdiff']]$vars %in% c('comp', 'ref', 'dif', 'low', 'high')),
                  msg = "f_str for riskdiff in a count_layer can only be comp, ref, dif, low, or high")


Try the Tplyr package in your browser

Any scripts or data that you put into this service are public.

Tplyr documentation built on May 29, 2024, 10:37 a.m.