R/frequencies.R
In qol: Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs

Documented in frequencies

#' Display Frequency Tables of Single Variables
#'
#' @description
#' [frequencies()] produces two kinds of tables for a quick overview of single variables.
#' The first table is for a broader overview and contains mean, sd, min, max, freq and missings.
#' The second table is the actual frequency table which shows the weighted sums, percentages
#' and unweighted frequencies per expression.
#'
#' @param data_frame A data frame in which are the variables to tabulate.
#' @param variables A vector of single variables to create frequency tables for.
#' @param formats A list in which is specified which formats should be applied to which variables.
#' @param by Compute tables stratified by the expressions of the provided variables.
#' @param weight Put in a weight variable to compute weighted results.
#' @param titles Specify one or more table titles.
#' @param footnotes Specify one or more table footnotes.
#' @param style A list of options can be passed to control the appearance of excel outputs.
#' Styles can be created with [excel_output_style()].
#' @param output The following output formats are available: console (default), text,
#' excel and excel_nostyle.
#' @param na.rm FALSE by default. If TRUE removes all NA values from the variables.
#' @param print TRUE by default. If TRUE prints the output, if FALSE doesn't print anything. Can be used
#' if one only wants to catch the output data frame.
#' @param monitor FALSE by default. If TRUE, outputs two charts to visualize the functions time consumption.
#'
#' @details
#' [frequencies()] is based on the 'SAS' procedure Proc Freq, which provides
#' efficient and readable ways to output frequency tables.
#'
#' To create a frequency table you only need to provide a single variable.
#' Nothing special about this. The real power comes into play, when you output your
#' tables as a fully styled 'Excel' workbook. Setting up a custom, reusable style is
#' as easy as setting up options like: provide a color for the table header, set the
#' font size for the row header, should borders be drawn for the table cells yes/no,
#' and so on.
#'
#' You also can provide multiple single variables to generate multiple, evenly designed
#' tables, all at once. For just a quick overview, rather than fully designed
#' tables, you can also just output the tables in ASCII style format.
#'
#' @return
#' Returns a list of two data tables containing the results for the frequency tables.
#'
#' @seealso
#' Creating a custom table style: [excel_output_style()], [modify_output_style()],
#' [number_format_style()], [modify_number_formats()].
#'
#' Creating formats: [discrete_format()] and [interval_format()].
#'
#' Functions that can handle formats and styles: [crosstabs()], [any_table()].
#'
#' Additional functions that can handle styles: [export_with_style()]
#'
#' Additional functions that can handle formats: [summarise_plus()], [recode()],
#' [recode_multi()]
#'
#' @examples
#' # Example data frame
#' my_data <- dummy_data(1000)
#'
#' # Define titles and footnotes. If you want to add hyperlinks you can do so by
#' # adding "link:" followed by the hyperlink to the main text.
#' titles <- c("This is title number 1 link: https://cran.r-project.org/",
#'             "This is title number 2",
#'             "This is title number 3")
#' footnotes <- c("This is footnote number 1",
#'                "This is footnote number 2",
#'                "This is footnote number 3 link: https://cran.r-project.org/")
#'
#' # Output frequencies tables
#' my_data |> frequencies(sex)
#' my_data |> frequencies(c(age, education),
#'                        weight = weight)
#'
#' # Also works with characters
#' my_data |> frequencies("sex")
#' my_data |> frequencies(c("age", "education"),
#'                        weight = "weight")
#'
#' # Applying formats and titles
#' sex. <- discrete_format(
#'     "Total"  = 1:2,
#'     "Male"   = 1,
#'     "Female" = 2)
#'
#' my_data |> frequencies(sex, formats(sex = sex.),
#'                        titles    = titles,
#'                        footnotes = footnotes)
#'
#' # Split frequencies by expressions of another variable
#' my_data |> frequencies(sex, by = education)
#'
#' # Get a list with two data tables for further usage
#' result_list <- my_data |> frequencies(sex, formats(sex = sex.))
#'
#' # Output in text file
#' my_data |> frequencies(sex, output = "text")
#'
#' # Output to Excel
#' my_data |> frequencies(sex, output = "excel")
#'
#' # With individual styling
#' my_style <- excel_output_style(header_back_color = "0077B6",
#'                                font              = "Times New Roman")
#'
#' my_data |> frequencies(sex, output = "excel", style = my_style)
#'
#' @export
frequencies <- function(data_frame,
                        variables,
                        formats   = c(),
                        by        = c(),
                        weight    = NULL,
                        titles    = c(),
                        footnotes = c(),
                        style     = excel_output_style(),
                        output    = "console",
                        na.rm     = FALSE,
                        print     = TRUE,
                        monitor   = FALSE){

    # Measure the time
    start_time <- Sys.time()

    monitor_df <- NULL |> monitor_start("Error handling", "Preparation")

    # First convert data frame to data table
    if (!data.table::is.data.table(data_frame)){
        data_frame <- data.table::as.data.table(data_frame)
    }

    # Evaluate formats early, otherwise apply formats can't evaluate them in unit
    # test situation.
    formats_list <- as.list(substitute(formats))[-1]

    formats <- stats::setNames(
        lapply(formats_list, function(expression){
            # Catch expression if passed as string
            if (is.character(expression)) {
                tryCatch(get(expression, envir = parent.frame()),
                         error = function(e) NULL)
            }
            # Catch expression if passed as symbol
            else{
                tryCatch(eval(expression, envir = parent.frame()),
                         error = function(e) NULL)
            }
        }),
        names(formats_list))

    # Look up variable names in format data frame to check whether there is an
    # interval or discrete format
    flag_interval <- FALSE

    for (current_var in names(formats)){
        format_df          <- formats[[current_var]]
        interval_variables <- c("from", "to")
        actual_variables   <- names(format_df)[1:2]

        if (identical(interval_variables, actual_variables)){
            flag_interval <- TRUE
            break
        }
    }

    ###########################################################################
    # Error handling
    ###########################################################################

    # Convert to character vectors
    variables_temp <- sub("^list\\(", "c(", gsub("\"", "", deparse(substitute(variables))))

    if (substr(variables_temp, 1, 2) == "c("){
        variables <- as.character(substitute(variables))
    }
    else if (!is_error(variables)){
        # Do nothing. In this case variables already contains the substituted variable names
        # while variables_temp is evaluated to the symbol passed into the function.
    }
    else{
        variables <- variables_temp
    }

    # Remove extra first character created with substitution
    variables <- variables[variables != "c"]

    provided_variables <- variables
    invalid_variables  <- variables[!variables %in% names(data_frame)]
    variables          <- variables[variables %in% names(data_frame)]

    if (length(invalid_variables) > 0){
        message(" ! WARNING: The provided analysis variable '", paste(invalid_variables, collapse = ", "), "' is not part of\n",
                "            the data frame. This variable will be omitted during computation.")
    }

    if (length(variables) == 0){
        message(" X ERROR: No valid analysis variables provided. Frequencies will be aborted.")
        return(invisible(NULL))
    }

    # Convert to character vectors
    by_temp <- sub("^list\\(", "c(", gsub("\"", "", deparse(substitute(by))))

    if (substr(by_temp, 1, 2) == "c("){
        by <- as.character(substitute(by))
    }
    else if (!is_error(by)){
        # Do nothing. In this case variables already contains the substituted variable names
        # while variables_temp is evaluated to the symbol passed into the function.
    }
    else{
        by <- by_temp
    }

    # Remove extra first character created with substitution
    by <- by[by != "c"]

    provided_by <- by
    invalid_by  <- by[!by %in% names(data_frame)]
    by          <- by[by %in% names(data_frame)]

    if (length(invalid_by) > 0){
        message(" ! WARNING: The provided by variable '", paste(invalid_by, collapse = ", "), "' is not part of\n",
                "            the data frame. This variable will be omitted during computation.")
    }

    invalid_by <- by[by %in% variables]

    if (length(invalid_by) > 0){
        message(" X ERROR: The provided by variable '", paste(invalid_by, collapse = ", "), "' is also part of\n",
                "          the frequency variables which is not allowed. Frequencies will be aborted.")
        return(invisible(NULL))
    }

    # Create temporary weight column if none is provided.
    # Also get the name of the weight variable as string.
    weight_temp <- sub("^list\\(", "c(", gsub("\"", "", deparse(substitute(weight))))

    if (weight_temp == "NULL" || substr(weight_temp, 1, 2) == "c("){
        weight_var <- ".temp_weight"
        data_frame[[".temp_weight"]] <- 1

        if (substr(weight_temp, 1, 2) == "c("){
            message(" ! WARNING: Only one variable for weight allowed. Evaluations will be unweighted.")
        }
    }
    else if (!is_numeric(data_frame[[weight_temp]])){
        weight_var <- ".temp_weight"
        data_frame[[".temp_weight"]] <- 1

        message(" ! WARNING: Provided weight variable is not numeric. Unweighted results will be computed.")
    }
    else{
        weight_var <- weight_temp

        # NA values in weight lead to errors therefor convert them to 0
        if (anyNA(data_frame[[weight_temp]])){
            message(" ~ NOTE: Missing values in weight variable '", weight_temp, "' will be converted to 0.")
        }
        data_frame[[weight_temp]] <- data.table::fifelse(is.na(data_frame[[weight_temp]]), 0, data_frame[[weight_temp]])

        # @Hack: so I don't have to check if .temp_weight exists later on
        data_frame[[".temp_weight"]] <- 1
    }

    # Check for invalid output option
    if (!tolower(output) %in% c("console", "text", "excel", "excel_nostyle")){
        message(" ! WARNING: Output format '", output, "' not available. Using 'console' instead.")

        output <- "console"
    }
    else{
        output <- tolower(output)
    }

    ###########################################################################
    # Frequency starts
    ###########################################################################

    monitor_df <- monitor_df |> monitor_next("Mean summary", "Summary")
    message("\n > Computing stats.")

    # For mean tabulation only numeric variables are allowed. So here all character
    # variables are excluded temporarily.
    vars_mean <- variables

    for (variable in vars_mean){
        if (is.character(data_frame[[variable]])){
            vars_mean <- vars_mean[vars_mean != variable]
        }
    }

    # In case no by variables are specified
    if (length(by) == 0){
        # Compute stats for mean summary
        mean_tab <- suppressMessages(data_frame |>
             summarise_plus(values     = vars_mean,
                            statistics = c("mean", "sd", "min", "max", "freq", "missing"),
                            weight     = weight_var,
                            notes      = FALSE,
                            na.rm      = na.rm) |>
                 drop_type_vars())
    }
    # In case by variables are specified
    else{
        mean_tab <- suppressMessages(data_frame |>
             summarise_plus(class      = by,
                            values     = vars_mean,
                            statistics = c("mean", "sd", "min", "max", "freq", "missing"),
                            formats    = formats,
                            weight     = weight_var,
                            nesting    = "all",
                            types      = by,
                            notes      = FALSE,
                            na.rm      = na.rm))

        if (!is.null(mean_tab)){
            mean_tab <- mean_tab |>
                fuse_variables("by_vars", by) |>
                collapse::fsubset(DEPTH > 0)
        }
    }

    # Put results for each variable below each other instead of beside
    mean_columns <- c("mean", "sd", "min", "max", "freq", "miss")
    all_rows     <- list()

    for (variable in vars_mean){
        # Extract all stats for a single variable and give them uniform names
        var_row <- suppressMessages(mean_tab |>
            keep("TYPE", "by_vars",
                 grep(variable,
                      names(mean_tab),
                      value = TRUE)))

        if (length(by) == 0){
            names(var_row) <- mean_columns
        }
        else{
            names(var_row) <- c("TYPE", "by_vars", mean_columns)
        }

        # Put variable name in front and put formatted row below the other ones
        # if there are any.
        var_row              <- cbind(variable, var_row)
        all_rows[[variable]] <- var_row
    }

    mean_tab <- data.table::rbindlist(all_rows, use.names = TRUE, fill = TRUE)

    # Summarise results for a frequency output per variable
    monitor_df <- monitor_df |> monitor_next("Freq summary", "Summary")

    data_frame[["var"]] <- 1

    # In case no by variables are specified
    if (length(by) == 0){
        freq_tab <- suppressMessages(data_frame |>
             summarise_plus(class      = variables,
                            values     = "var",
                            statistics = c("sum", "pct_group", "freq"),
                            formats    = formats,
                            weight     = weight_var,
                            nesting    = "single",
                            notes      = FALSE,
                            na.rm      = na.rm))
    }
    # In case by variables are specified
    else{
        group_vars   <- c(by, variables)
        combinations <- as.vector(outer(by, variables, paste, sep = "+"))
        combinations <- c("total", combinations)

        freq_tab <- suppressMessages(data_frame |>
             summarise_plus(class      = group_vars,
                            values     = "var",
                            statistics = c("sum", "pct_group", "freq"),
                            formats    = formats,
                            weight     = weight_var,
                            nesting    = "all",
                            types      = combinations,
                            notes      = FALSE,
                            na.rm      = na.rm)) |>
             fuse_variables("fused_vars", variables) |>
             fuse_variables("by_vars", by)

        freq_tab[1, "by_vars"] <- "total"
        freq_tab[["BY"]]       <- sub("\\+.*", "", freq_tab[["TYPE"]])
        freq_tab[["TYPE"]]     <- sub(".*\\+", "", freq_tab[["TYPE"]])
    }

    if (is.null(mean_tab) || is.null(freq_tab)){
        message(" X ERROR: Frequencies could not be computed.")
        return(invisible(NULL))
    }

    # Prepare table format for output
    message(" > Formatting tables.")

    if (output %in% c("console", "text")){
        monitor_df <- monitor_df |> monitor_next("Format tables", "Format tables")

        # In case no by variables are provided
        if (length(by) == 0){
            complete_mean  <- format_mean_text(mean_tab, vars_mean, mean_columns)
            complete_freq  <- format_freq_text(freq_tab, variables, formats,
                                               by, titles, footnotes)

            complete_table <- c(complete_mean, complete_freq)
        }
        # In case there are  by variables are provided
        else{
            complete_table <- format_by_text(mean_tab, freq_tab, variables, mean_columns,
                                             formats, by, titles, footnotes, na.rm)
        }
    }
    else if (output == "excel" || output == "excel_nostyle"){
        wb <- openxlsx2::wb_workbook() |>
            prepare_styles(style)

        monitor_df <- monitor_df |> monitor_end()

        # In case no by variables are provided
        if (length(by) == 0){
            wb_list <- format_mean_excel(mean_tab, mean_columns, style, output, wb = wb,
                                         monitor_df = monitor_df)
            wb_list <- format_freq_excel(wb_list[[1]], freq_tab, variables, formats,
                                         by, titles, footnotes, style, output, monitor_df = wb_list[[2]])

            wb         <- wb_list[[1]]
            monitor_df <- wb_list[[2]]
        }
        # In case there are  by variables are provided
        else{
            wb_list <- format_by_excel(mean_tab, freq_tab, variables, mean_columns,
                                       formats, by, titles, footnotes, style, output,
                                       na.rm, wb, monitor_df)

            wb         <- wb_list[[1]]
            monitor_df <- wb_list[[2]]
        }
    }

    # Output formatted table into different formats
    if (print){
        monitor_df <- monitor_df |> monitor_next("Output tables", "Output tables")

        if (output %in% c("console")){
            cat(paste(complete_table, collapse = "\n"), "\n\n")
        }
        else if (output == "text"){
            temp_file <- tempfile(fileext = ".txt")
            writeLines(complete_table, temp_file)
            file.show(temp_file)
        }
        else if (output == "excel" || output == "excel_nostyle"){
            if (is.null(style[["file"]])){
                if(interactive()){
                    wb$open()
                }
            }
            else{
                wb$save(file = style[["file"]], overwrite = TRUE)
            }
        }
    }

    monitor_df <- monitor_df |> monitor_end()
    monitor_df |> monitor_plot(draw_plot = monitor)

    end_time <- round(difftime(Sys.time(), start_time, units = "secs"), 3)
    message("\n- - - 'frequencies' execution time: ", end_time, " seconds\n")

    invisible(list("mean" = mean_tab,
                   "freq" = freq_tab))
}

###############################################################################
# Format numbers with decimal points as thousands
###############################################################################
#' Enhance Readability for Large Numbers
#'
#' @description
#' Format numbers with big marks to make them more readable in the text output.
#' Also set up number of decimals.
#'
#' @param number The number to format.
#' @param width Width provided by the largest number in the table column to format
#' all numbers i a column at an even width.
#' @param stat Stat that determines the numbers formatting.
#' @param decimals Number of decimals.
#'
#' @return
#' Returns a formatted number.
#'
#' @noRd
format_number <- function(number,
                          width,
                          stat,
                          decimals = 1){

    # Format numbers to make them more readable
    if (stat %in% c("freq", "miss", "var_freq", "var_cum_freq")){
        output_format <- format(number,
                                format       = "d",
                                decimal.mark = ",",
                                big.mark     = ".",
                                scientific   = FALSE,
                                nsmall       = 0)
    }
    else if (stat %in% c("min", "max")){
        output_format <- format(round(number, decimals),
                                format       = "d",
                                decimal.mark = ",",
                                big.mark     = ".",
                                scientific   = FALSE)
    }
    # Any other stat with decimal numbers
    else{
        output_format <- format(round(number, decimals),
                                format       = "f",
                                decimal.mark = ",",
                                big.mark     = ".",
                                scientific   = FALSE,
                                nsmall       = decimals)
    }

    # Remove NA
    output_format <- sub("NA", "", output_format)

    # Output new format
    format(output_format,
           width   = width,
           justify = "right")
}

###############################################################################
# Get maximum column width for equal formatting
###############################################################################
#' Get Maximum Column Width
#'
#' @description
#' Get maximum column width based on the largest text or number written in the column.
#'
#' @param data_frame The data frame which contains the columns to measure.
#' @param columns_to_format The actual columns from the data frame to be measured for
#' their widest texts or numbers.
#' @param decimals Number of decimals for numbers.
#'
#' @return
#' Returns a vector of column widths.
#'
#' @noRd
get_column_width <- function(data_frame,
                             columns_to_format,
                             decimals = 1){
    all_widths <- c()

    # Get the maximum column width by looking up which is the largest number
    # individually in each column.
    for (column in columns_to_format){
        max_width <- nchar(column)

        if (column %in% c("freq", "miss", "var_freq", "var_cum_freq")){
            max_width <- max(max_width,
                             nchar(format(data_frame[[column]],
                                          format     = "d",
                                          big.mark   = ",",
                                          scientific = FALSE,
                                          nsmall     = 0)))
        }
        # Any other stat with decimal numbers
        else{
            max_width <- max(max_width,
                             nchar(format(round(data_frame[[column]], decimals),
                                          format     = "f",
                                          big.mark   = ",",
                                          scientific = FALSE,
                                          nsmall     = decimals)))
        }

        all_widths <- c(all_widths, max_width)
    }

    all_widths
}

###############################################################################
# Format mean table for console output
###############################################################################
#' Format Mean Table Output (Text Based)
#'
#' @description
#' Format a short summarising table for all numeric variables.
#'
#' @param mean_tab The data frame which contains the information for this summarising
#' table.
#' @param variables The variables that should appear in the table.
#' @param mean_columns The column names from the data frames that should appear
#' in the table.
#'
#' @return
#' Returns a character vector with all formatted rows for the table.
#'
#' @noRd
format_mean_text <- function(mean_tab,
                             variables,
                             mean_columns){
    if (nrow(mean_tab) == 0){
        return(c())
    }

    if ("TYPE" %in% names(mean_tab)){
        mean_tab <- mean_tab |>
            dropp("TYPE", "by_vars")
    }

    # Get the maximum width of the provided variable names to determine the width
    # of the first column.
    first_column_width <- nchar("variable")

    for (variable in variables){
        first_column_width <- max(first_column_width,
                                  nchar(mean_tab[["variable"]]))
    }

    # Set header row formatting. Loop through all header columns and give
    # each column the optimal width.
    column_widths  <- get_column_width(mean_tab, mean_columns)
    header_columns <- c()

    for (i in seq_len(length(mean_columns))){
        header_columns <- c(header_columns,
                            format(mean_columns[i],
                                   width   = column_widths[i],
                                   justify = "right"))
    }

    # Put together the first two rows of the table as complete table header
    header_row <- paste0(format("variable",
                                width   = first_column_width,
                                justify = "left"), "  |   ",
                         paste(header_columns, collapse = "   "))

    complete_header <- c(header_row,
                         strrep("-", nchar(header_row)))

    # Format table column by column. Basically concatenate first column text and
    # corresponding values together while keeping the individual maximum column
    # width in mind.
    first_column <- paste0(format(mean_tab[["variable"]],
                                  width   = first_column_width,
                                  justify = "left"), "  |   ")

    # Loop through stat columns and format each number in a readable way
    formatted_cols <- vector("list", length(mean_columns))

    for (column in seq_len(length(mean_columns))){
        formatted_cols[[column]] <- format_number(mean_tab[column + 1],
                                                  width = column_widths[column],
                                                  stat  = names(mean_tab[column + 1]))
    }

    # Convert list to matrix
    formatted_matrix <- do.call(cbind, formatted_cols)

    # Convert matrix to single formatted rows
    all_rows <- paste0(first_column,
                       apply(formatted_matrix, 1, paste, collapse = "   "))

    # Output formatted result
    c("", complete_header, all_rows)
}

###############################################################################
# Format mean table for excel output
###############################################################################
#' Format Mean Table Output (Excel Based)
#'
#' @description
#' Format a short summarising table for all numeric variables.
#'
#' @param mean_tab The data frame which contains the information for this summarising
#' table.
#' @param mean_columns The column names from the data frames that should appear
#' in the table.
#' @param style A list containing the styling elements.
#' @param output Determines whether to style the output or to just quickly paste
#' the data.
#' @param wb An already created workbook to add more sheets to.
#' @param index Index of the current variable expression. Used for computation with
#' by variables.
#' @param monitor_df Data frame which stores the monitoring values.
#'
#' @return
#' Returns a list containing a formatted Excel workbook as well as the monitoring
#' data frame.
#'
#' @noRd
format_mean_excel <- function(mean_tab,
                              mean_columns,
                              style = excel_output_style(),
                              output,
                              wb    = NULL,
                              index = NULL,
                              monitor_df){
    monitor_df <- monitor_df |> monitor_start("Excel prepare (mean)", "Format mean")

    if (nrow(mean_tab) == 0){
        monitor_df <- monitor_df |> monitor_end()

        return(list(wb, monitor_df))
    }

    # Drop TYPE if present
    if ("TYPE" %in% names(mean_tab)){
        mean_tab <- mean_tab |> dropp("TYPE", "by_vars")
    }

    # Round values according to style options
    mean_tab <- mean_tab |>
        collapse::fmutate(mean = round(mean, style[["number_formats"]][["mean_decimals"]]),
                          sd   = round(sd,   style[["number_formats"]][["sd_decimals"]]),
                          min  = round(min,  style[["number_formats"]][["min_decimals"]]),
                          max  = round(max,  style[["number_formats"]][["max_decimals"]]))

    # Build header row
    header     <- c("variable", mean_columns)
    table_data <- rbind(header, as.matrix(mean_tab[, c("variable", mean_columns)]))

    # Get table ranges
    mean_ranges <- get_mean_tab_ranges(mean_tab, style)

    # If function is called with by variables the sheet names have to be differentiated
    if (!is.null(index)){
        wb$add_worksheet(paste0("means", index),
                         grid_lines = style[["grid_lines"]])
    }
    # If function is called without by variables
    else{
        wb$add_worksheet("means", grid_lines = style[["grid_lines"]])
    }

    # Add table data and format according to style options
    monitor_df <- monitor_df |> monitor_next("Excel data (mean)", "Format mean")

    wb$add_data(x          = mean_tab,
                start_col  = style[["start_column"]],
                start_row  = style[["start_row"]],
                na.strings = style[["na_symbol"]])

    # Only do the formatting when user specified it. With the excel_nostyle
    # option this whole part gets omitted to get a very quick unformatted
    # excel output.
    if (output == "excel"){

        monitor_df <- monitor_df |> monitor_next("Excel cell styles (mean)", "Format mean")
        wb <- wb |> handle_cell_styles(mean_ranges, style)

        # Set up inner table number formats
        col_index <- 1

        for (type in c("mean", "sd", "min", "max", "freq", "missing")){
            wb$add_cell_style(dims                = mean_ranges[[paste0("mean_col_ranges", col_index)]],
                              apply_number_format = TRUE,
                              num_fmt_id          = wb$styles_mgr$get_numfmt_id(paste0(type, "_numfmt")))

            col_index <- col_index + 1
        }

        # If there are width or heights defined
        monitor_df <- monitor_df |> monitor_next("Excel widths/heights (mean)", "Format mean")

        column_width <- style[["column_widths"]]
        row_heights  <- style[["row_heights"]]

        if (length(column_width) == 1 && length(row_heights) == 1){
            if (column_width != "auto" || row_heights != "auto"){
                wb <- wb |> handle_col_row_dimensions(mean_ranges,
                                                      ncol(mean_tab) + (style[["start_column"]] - 1),
                                                      nrow(mean_tab) + (style[["start_row"]] - 1),
                                                      style)
            }
        }
        # If both widths and heights should be determined automatically.
        else{
            wb <- wb |> handle_col_row_dimensions(mean_ranges,
                                                  style[["start_column"]],
                                                  nrow(mean_tab) + (style[["start_row"]] - 1),
                                                  style)
        }

        # If autofit columns is selected, set a manual size after the first table
        # column which is sufficient in most cases. The integrated autofit often
        # gives to narrow columns.
        if (length(column_width) == 1){
            if (column_width == "auto"){
                start_col <- style[["start_column"]] + 1

                wb$set_col_widths(cols   = start_col:mean_ranges[["table.end"]],
                                  widths = 10)
            }
        }

        wb$add_named_region(dims = mean_ranges[["whole_tab_range"]], name = "table", local_sheet = TRUE)
        wb$add_named_region(dims = mean_ranges[["table_range"]],     name = "data",  local_sheet = TRUE)
    }

    # Return workbook
    monitor_df <- monitor_df |> monitor_end()

    list(wb, monitor_df)
}

###############################################################################
# Format freq table for console output
###############################################################################
#' Format Frequency Table Output (Text Based)
#'
#' @description
#' Format a frequency table for each variable which shows the counts, sums and
#' percentages per variable expression.
#'
#' @param freq_tab The data frame which contains the information for this frequency
#' table.
#' @param variables The variables that should appear in the table.
#' @param formats A list in which is specified which formats should be applied to
#' which variable.
#' @param by Separate the frequency output by the expressions of the provided variables.
#' @param titles Character vector of titles to display above the table.
#' @param footnotes Character vector of footnotes to display under the table.
#'
#' @return
#' Returns a character vector with all formatted rows for the tables.
#'
#' @noRd
format_freq_text <- function(freq_tab,
                             variables,
                             formats,
                             by,
                             titles,
                             footnotes){
    complete_tabs <- c()

    # Generate a separate frequency table for each variable
    for (variable in variables){
        # Compute additional stats
        var_tab <- compute_cumulative(freq_tab, variable, formats)

        # If no weight variable was applied
        if (identical(var_tab[["var_sum"]], var_tab[["var_freq"]])){
            freq_columns <- c("var_freq", "pct_group", "var_cum_freq",
                              "var_cum_pct")

            freq_top_columns    <- c("    ", "       ", "cumulative", "cumulative")
            freq_bottom_columns <- c("freq", "percent",       "freq",    "percent")
        }
        # If weight variable was applied
        else{
            freq_columns <- c("var_sum", "pct_group", "var_cum_sum",
                              "var_cum_pct", "var_freq", "var_cum_freq")

            freq_top_columns    <- c("   ", "       ", "cumulative", "cumulative", "    ", "cumulative")
            freq_bottom_columns <- c("sum", "percent",        "sum",    "percent", "freq",       "freq")
        }

        var_tab <- var_tab |>
            data.table::setcolorder(c("fused_vars", freq_columns))

        # Get the maximum width of the provided variable names to determine the width
        # of the first column.
        first_column_width <- max(nchar(variable), nchar("total"))

        first_column_width <- max(first_column_width,
                                  nchar(stats::na.omit(var_tab[["fused_vars"]])))

        # Set header row formatting. Loop through all header columns and give
        # each column the optimal width.
        column_widths         <- get_column_width(var_tab, freq_columns, decimals = 3)
        header_top_columns    <- c()
        header_bottom_columns <- c()

        var_tab <- var_tab |> collapse::frename("pct_group" = "var_pct_group")

        for (i in seq_len(length(freq_bottom_columns))){
            header_top_columns <- c(header_top_columns,
                                    format(freq_top_columns[i],
                                           width   = column_widths[i],
                                           justify = "right"))

            header_bottom_columns <- c(header_bottom_columns,
                                       format(freq_bottom_columns[i],
                                              width   = column_widths[i],
                                              justify = "right"))
        }

        # Put together the first two rows of the table as complete table header
        header_top_row <- paste0(format("",
                                        width   = first_column_width,
                                        justify = "left"), "  |  ",
                                 paste(header_top_columns,    collapse = "  "), "\n")

        header_row <- paste0(header_top_row,
                             format(variable,
                                    width   = first_column_width,
                                    justify = "left"), "  |  ",
                             paste(header_bottom_columns, collapse = "  "))

        complete_header <- c(header_row,
                             strrep("-", nchar(header_top_row)))

        # Format table column by column. Basically concatenate first column text and
        # corresponding values together while keeping the individual maximum column
        # width in mind.
        first_column <- paste0(format(var_tab[["fused_vars"]],
                                      width   = first_column_width,
                                      justify = "left"), "  |  ")

        # Loop through stat columns and format each number in a readable way
        formatted_cols <- vector("list", length(freq_columns))

        for (column in seq_len(length(freq_columns))){
            # Weighted sums
            if (column %in% c(1, 3)){
                formatted_cols[[column]] <- format_number(var_tab[column + 1],
                                                          width    = column_widths[column],
                                                          stat     = names(var_tab[column + 1]),
                                                          decimals = 3)
            }
            # Percentages and Unweighted freq
            else{
                formatted_cols[[column]] <- format_number(var_tab[column + 1],
                                                          width = column_widths[column],
                                                          stat  = names(var_tab[column + 1]))
            }
        }

        # Convert list to matrix
        formatted_matrix <- do.call(cbind, formatted_cols)

        # Convert matrix to single formatted rows
        all_rows <- paste0(first_column,
                           apply(formatted_matrix, 1, paste, collapse = "  "))

        # If no multilabel formats are applied separate total row at the bottom from the rest
        if (!is_multilabel(formats, variable)){
            rows <- length(all_rows)

            all_rows <- c(all_rows[1:(rows - 1)],
                          strrep("-", nchar(header_top_row)),
                          all_rows[rows])
        }
        # If multilabel formats are applied remove total row
        else{
            all_rows <- all_rows[-length(all_rows)]

            # Only do this step with no by variables here otherwise the separate
            # by function down below will handle this message, so that it appears only
            # once and not for each loop.
            if (length(by) == 0){
                message(" ~ NOTE: The format for variable '", variable, "' is a multilabel.\n",
                        "         In this case cumulative results aren't computed properly.")
            }
        }

        # Insert titles and footnotes
        complete_table <- c(complete_header, all_rows)

        if (length(titles) > 0){
            complete_table <- c(titles, "", complete_table)
        }
        if (length(footnotes) > 0){
            complete_table <- c(complete_table, "____________", footnotes)
        }

        # Output formatted result
        complete_tabs <- c(complete_tabs, "\n", complete_table, "\n")

    }

    # Output formatted result
    complete_tabs
}


#' Compute Cumulative Variables
#'
#' @description
#' Compute separate cumulative variables for frequencies, weighted sums and
#' percentages.
#'
#' @param freq_tab The data frame which contains the information for this frequency
#' table.
#' @param variabls The currently computed variable.
#' @param formats A list in which is specified which formats should be applied to
#' which variable.
#'
#' @return
#' Returns a data frame with added cumulative variables.
#'
#' @noRd
compute_cumulative <- function(freq_tab,
                               variable,
                               formats){
    # Sort total to bottom if no by variables are specified
    if (!"BY" %in% names(freq_tab)){
        # Reduce data frame to the necessary rows
        var_tab <- freq_tab |> collapse::fsubset(TYPE %in% c("total", variable))

        # Generate cumulative variables first to have a complete data frame
        var_tab <- var_tab |> data.table::setorder(-"DEPTH")
    }
    # Compute new total if by variables are specified because otherwise the grand total would be
    # used for every single expression.
    else{
        total_temp <- freq_tab |>
            collapse::fsubset(TYPE %in% variable) |>
            collapse::fsummarise(across(c("var_sum", "var_pct_group", "var_freq"), collapse::fsum))

        var_tab <- data.table::rbindlist(list(freq_tab, total_temp), use.names = TRUE, fill = TRUE)
        var_tab <- var_tab |>
            collapse::fsubset(TYPE %in% c(variable, NA))
    }

    var_tab[["fused_vars"]][nrow(var_tab)] <- "total"

    var_tab[["var_cum_sum"]]  <- collapse::fcumsum(var_tab[["var_sum"]])
    var_tab[["var_cum_pct"]]  <- collapse::fcumsum(var_tab[["var_pct_group"]])
    var_tab[["var_cum_freq"]] <- collapse::fcumsum(var_tab[["var_freq"]])

    # Erase total cumulative values
    var_tab[["var_cum_sum"]][nrow(var_tab)]  <- NA
    var_tab[["var_cum_pct"]][nrow(var_tab)]  <- NA
    var_tab[["var_cum_freq"]][nrow(var_tab)] <- NA

    # Order columns in order of freq table
    var_tab |> collapse::frename("var_pct_group" = "pct_group")
}

###############################################################################
# Format freq table for excel output
###############################################################################
#' Format Frequency Table Output (Excel Based)
#'
#' @description
#' Format a frequency table for each variable which shows the counts, sums and
#' percentages per variable expression.
#'
#' @param wb An already created workbook to add more sheets to.
#' @param freq_tab The data frame which contains the information for this frequency
#' table.
#' @param variables The variables that should appear in the table.
#' @param formats A list in which is specified which formats should be applied to
#' which variable.
#' @param by Separate the frequency output by the expressions of the provided variables.
#' @param titles Character vector of titles to display above the table.
#' @param footnotes Character vector of footnotes to display under the table.
#' @param style A list containing the styling elements.
#' @param output Determines whether to style the output or to just quickly paste
#' the data.
#' @param by_info Text which contains the information which variable with which
#' expression is computed at the moment.Used for computation with by variables.
#' @param index Index of the current variable expression. Used for computation with
#' by variables.
#' @param monitor_df Data frame which stores the monitoring values.
#'
#' @return
#' Returns a list containing a formatted Excel workbook as well as the monitoring
#' data frame.
#'
#' @noRd
format_freq_excel <- function(wb,
                              freq_tab,
                              variables,
                              formats,
                              by,
                              titles,
                              footnotes,
                              style,
                              output,
                              by_info = NULL,
                              index   = NULL,
                              monitor_df){
    # Generate a separate frequency table for each variable
    for (variable in variables){
        monitor_df <- monitor_df |> monitor_start("Excel prepare (freq)", "Format freq")

        # Compute additional stats
        var_tab <- compute_cumulative(freq_tab, variable, formats)

        # Round values according to style options
        var_tab <- var_tab |>
            collapse::fmutate(var_sum     = round(var_sum,     style[["number_formats"]][["sum_decimals"]]),
                              var_cum_sum = round(var_cum_sum, style[["number_formats"]][["sum_decimals"]]),
                              pct_group   = round(pct_group,   style[["number_formats"]][["pct_decimals"]]),
                              var_cum_pct = round(var_cum_pct, style[["number_formats"]][["pct_decimals"]]))

        # If no weight variable was applied
        if (identical(var_tab[["var_sum"]], var_tab[["var_freq"]])){
            freq_columns <- c("var_freq", "pct_group", "var_cum_freq",
                              "var_cum_pct")

            freq_col_labels <- c("freq", "percent", "cumulative freq", "cumulative percent")

            by_info_col <- style[["start_column"]] + 6
        }
        # If weight variable was applied
        else{
            freq_columns <- c("var_sum", "pct_group", "var_cum_sum",
                              "var_cum_pct", "var_freq", "var_cum_freq")

            freq_col_labels <- c("sum", "percent", "cumulative sum",
                                 "cumulative percent", "freq", "cumulative freq")

            by_info_col <- style[["start_column"]] + 8
        }

        # If multilabel, remove automatically computed total observation.
        # Only do this step with no by variables here otherwise the separate
        # by function down below will handle this message, so that it appears only
        # once and not for each loop.
        if (is_multilabel(formats, variable)){
            if (length(by) == 0){
                message(" ~ NOTE: The format for variable '", variable, "' is a multilabel.
         In this case cumulative results aren't computed properly.")

                var_tab <- var_tab |>
                    collapse::fsubset(TYPE != "total")
            }
        }

        var_tab <- var_tab |>
            keep("fused_vars", freq_columns, order_vars = TRUE)

        names(var_tab) <- c(variable, freq_col_labels)

        # Get table ranges
        freq_ranges <- get_freq_tab_ranges(var_tab, titles, footnotes, style)

        # If function is called with by variables the sheet names have to be differentiated
        # and by info has to be written besides the table.
        if (!is.null(by_info)){
            wb$add_worksheet(paste0(variable, index),
                                    grid_lines = style[["grid_lines"]])
            wb$add_data(x         = by_info,
                        start_col = by_info_col,
                        start_row = style[["start_row"]])
        }
        else{
            wb$add_worksheet(variable, grid_lines = style[["grid_lines"]])
        }

        # Format titles and footnotes if there are any
        monitor_df <- monitor_df |> monitor_next("Excel titles/footnotes (freq)", "Format freq")
        wb <- wb |>
            format_titles_foot_excel(titles, footnotes, freq_ranges, style, output)

        # Add table data and format according to style options
        monitor_df <- monitor_df |> monitor_next("Excel data (freq)", "Format freq")

        wb$add_data(x          = var_tab,
                    start_col  = style[["start_column"]],
                    start_row  = freq_ranges[["header.row"]],
                    na.strings = style[["na_symbol"]])

        # Only do the formatting when user specified it. With the excel_nostyle
        # option this whole part gets omitted to get a very quick unformatted
        # excel output.
        if (output == "excel"){
            monitor_df <- monitor_df |> monitor_next("Excel cell styles (freq)", "Format freq")
            wb <- wb |> handle_cell_styles(freq_ranges, style)

            # Set up inner table number formats
            col_index <- 1
            col_types <- c("sum", "pct", "sum", "pct", "freq", "freq")

            if (!"sum" %in% names(var_tab)){
                col_types <- c("freq", "pct", "freq", "pct")
            }

            for (type in col_types){
                wb$add_cell_style(dims                = freq_ranges[[paste0("freq_col_ranges", col_index)]],
                                  apply_number_format = TRUE,
                                  num_fmt_id          = wb$styles_mgr$get_numfmt_id(paste0(type, "_numfmt")))

                col_index <- col_index + 1
            }

            # Adjust table dimensions
            monitor_df <- monitor_df |> monitor_next("Excel widths/heights (freq)", "Format freq")
            wb <- wb |> handle_col_row_dimensions(freq_ranges,
                                                  ncol(var_tab) + (style[["start_column"]] - 1),
                                                  nrow(var_tab) + (style[["start_row"]] - 1),
                                                  style)

            wb <- wb |> handle_auto_dimensions(freq_ranges,
                                               style)

            wb <- wb |> handle_header_table_dim(freq_ranges,
                                                style)

            wb$add_ignore_error(dims = freq_ranges[["cat_col_range"]], number_stored_as_text = TRUE)

            wb$add_named_region(dims = freq_ranges[["whole_tab_range"]], name = "table", local_sheet = TRUE)
            wb$add_named_region(dims = freq_ranges[["table_range"]],     name = "data",  local_sheet = TRUE)
        }

        monitor_df <- monitor_df |> monitor_end()
    }

    # Return workbook
    list(wb, monitor_df)
}

###############################################################################
# Format grouped by tables for console output
###############################################################################
#' Format Frequency Table Output with by Variables (Text Based)
#'
#' @description
#' Format a frequency table for each variable which shows the counts, sums and
#' percentages per variable expression.
#'
#' @param mean_tab The data frame which contains the information for this summarising
#' table.
#' @param freq_tab The data frame which contains the information for this frequency
#' table.
#' @param variables The variables that should appear in the table.
#' @param mean_columns The column names from the data frames that should appear
#' in the table.
#' @param formats A list in which is specified which formats should be applied to
#' which variable.
#' @param by Separate the output by the expressions of the provided variables.
#' @param titles Character vector of titles to display above the table.
#' @param footnotes Character vector of footnotes to display under the table.
#' @param na.rm If TRUE removes all NA values from the tabulation.
#'
#' @return
#' Returns a character vector with all formatted rows for the tables.
#'
#' @noRd
format_by_text <- function(mean_tab,
                           freq_tab,
                           variables,
                           mean_columns,
                           formats,
                           by,
                           titles,
                           footnotes,
                           na.rm){
    # Print message if multilabels are applied
    for (variable in variables){
        if (is_multilabel(formats, variable)){
            message(" ~ NOTE: The format for variable '", variable, "' is a multilabel.\n",
                    "         In this case cumulative results aren't computed properly.")
        }
    }

    complete_tabs <- c()

    # Loop through all by variables
    for (by_var in by){
        # Select by variables one by one
        mean_by <- mean_tab |>
            collapse::fsubset(mean_tab[["TYPE"]] == by_var)

        freq_by <- freq_tab |>
            collapse::fsubset(freq_tab[["BY"]] == "total" | freq_tab[["BY"]] == by_var)

        # Extract unique values
        if (anyNA(freq_by[["by_vars"]])){
            values <- c(unique(stats::na.omit(freq_by[["by_vars"]]))[-1], NA)
        }
        else{
            values <- unique(freq_by[["by_vars"]])
        }

        # Loop through all unique values to generate frequency tables per expression
        for (value in values){
            # In case NAs are removed
            if (is.na(value) && na.rm){
                next
            }

            # Put additional header together with the information which by variable
            # and which value is currently filtered.
            header <- paste0("| ", by_var, " = ", value, " |")

            complete_header <- c("\n", strrep("-", nchar(header)),
                                 header,
                                 strrep("-", nchar(header)))

            # Filter table by current by variable and value
            if (!is.na(value)){
                mean_temp <- mean_by |>
                    collapse::fsubset(mean_by[["by_vars"]] == value)

                freq_temp <- freq_by |>
                    collapse::fsubset(freq_by[["by_vars"]] == "total" | freq_by[["by_vars"]] == value)
            }
            else{
                mean_temp <- mean_by |>
                    collapse::fsubset(is.na(mean_by[["by_vars"]]))

                freq_temp <- freq_by |>
                    collapse::fsubset(freq_by[["by_vars"]] == "total" | is.na(freq_by[["by_vars"]]))
            }

            # Generate mean table as normal but base is filtered data frame
            current_mean <- format_mean_text(mean_temp,
                                             variables,
                                             mean_columns)

            # Generate frequency tables as normal but base is filtered data frame
            current_freq <- format_freq_text(freq_temp,
                                             variables,
                                             formats,
                                             by,
                                             titles,
                                             footnotes)

            # Output formatted result
            complete_tabs <- c(complete_tabs, complete_header, current_mean, current_freq)
        }
    }

    # Output formatted result
    complete_tabs
}

###############################################################################
# Format grouped by tables for excel output
###############################################################################
#' Format Frequency Table Output with by Variables (Excel Based)
#'
#' @description
#' Format a frequency table for each variable which shows the counts, sums and
#' percentages per variable expression.
#'
#' @param mean_tab The data frame which contains the information for this summarising
#' table.
#' @param freq_tab The data frame which contains the information for this frequency
#' table.
#' @param variables The variables that should appear in the table.
#' @param mean_columns The column names from the data frames that should appear
#' in the table.
#' @param formats A list in which is specified which formats should be applied to
#' which variable.
#' @param by Separate the output by the expressions of the provided variables.
#' @param titles Character vector of titles to display above the table.
#' @param footnotes Character vector of footnotes to display under the table.
#' @param style A list containing the styling elements.
#' @param output Determines whether to style the output or to just quickly paste
#' the data.
#' @param na.rm If TRUE removes all NA values from the tabulation.
#' @param monitor_df Data frame which stores the monitoring values.
#'
#' @return
#' Returns a list containing a formatted Excel workbook as well as the monitoring
#' data frame.
#'
#' @noRd
format_by_excel <- function(mean_tab,
                            freq_tab,
                            variables,
                            mean_columns,
                            formats,
                            by,
                            titles,
                            footnotes,
                            style,
                            output,
                            na.rm,
                            wb,
                            monitor_df){
    monitor_df <- monitor_df |> monitor_start("Excel prepare (by)", "Format by")

    # Print message if multilabels are applied
    for (variable in variables){
        if (is_multilabel(formats, variable)){
            message(" ~ NOTE: The format for variable '", variable, "' is a multilabel.\n",
                    "         In this case cumulative results aren't computed properly.")
        }
    }

    index <- 1

    # Loop through all by variables
    for (by_var in by){
        monitor_df <- monitor_df |> monitor_next(paste0("Excel prepare (", by_var, ")"), "Format by")

        # Select by variables one by one
        mean_by <- mean_tab |>
            collapse::fsubset(mean_tab[["TYPE"]] == by_var)

        freq_by <- freq_tab |>
            collapse::fsubset(freq_tab[["BY"]] == "total" | freq_tab[["BY"]] == by_var)

        # Extract unique values
        if (anyNA(freq_by[["by_vars"]])){
            values <- c(unique(stats::na.omit(freq_by[["by_vars"]]))[-1], NA)
        }
        else{
            values <- unique(freq_by[["by_vars"]])
        }

        # Loop through all unique values to generate frequency tables per expression
        for (value in values){
            # In case NAs are removed
            if (is.na(value) && na.rm){
                next
            }

            message("   + ", paste0(by_var, " = ", value))

            monitor_df <- monitor_df |> monitor_next(paste0("Excel mean (", by_var, "_", value, ")"), "Format by")

            # Put additional by info together with the information which by variable
            # and which value is currently filtered.
            by_info <- paste0(by_var, " = ", value)

            # Filter table by current by variable and value
            if (!is.na(value)){
                mean_temp <- mean_by |>
                    collapse::fsubset(mean_by[["by_vars"]] == value)

                freq_temp <- freq_by |>
                    collapse::fsubset(freq_by[["by_vars"]] == "total" | freq_by[["by_vars"]] == value)
            }
            else{
                mean_temp <- mean_by |>
                    collapse::fsubset(is.na(mean_by[["by_vars"]]))

                freq_temp <- freq_by |>
                    collapse::fsubset(freq_by[["by_vars"]] == "total" | is.na(freq_by[["by_vars"]]))
            }

            # Generate mean table as normal but base is filtered data frame
            if (nrow(mean_temp) > 0){
                wb_list <- format_mean_excel(mean_temp,
                                             mean_columns,
                                             style,
                                             output,
                                             wb,
                                             index,
                                             NULL)

                wb_list[[1]]$add_data(x         = by_info,
                                      start_col = style[["start_column"]] + 8,
                                      start_row = style[["start_row"]])
            }
            else{
                wb_list      <- list()
                wb_list[[1]] <- wb
            }

            monitor_df <- monitor_df |> monitor_next(paste0("Excel freq (", by_var, "_", value, ")"), "Format by")

            # Generate frequency tables as normal but base is filtered data frame
            wb_list <- format_freq_excel(wb_list[[1]],
                                         freq_temp,
                                         variables,
                                         formats,
                                         by,
                                         titles,
                                         footnotes,
                                         style,
                                         output,
                                         by_info,
                                         index,
                                         NULL)

            index <- index + 1

            wb <- wb_list[[1]]
        }
    }

    monitor_df <- monitor_df |> monitor_end()

    # Return workbook
    list(wb, monitor_df)
}
Any scripts or data that you put into this service are public.
qol documentation built on Dec. 14, 2025, 1:06 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
qol
Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs

R/frequencies.R
In qol: Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs

Defines functions format_by_excel format_by_text format_freq_excel compute_cumulative format_freq_text format_mean_excel format_mean_text get_column_width format_number frequencies

Documented in frequencies

Try the qol package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

qol Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs

R/frequencies.R In qol: Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs

Defines functions format_by_excel format_by_text format_freq_excel compute_cumulative format_freq_text format_mean_excel format_mean_text get_column_width format_number frequencies

Documented in frequencies

Try the qol package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

qol
Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs

R/frequencies.R
In qol: Powerful 'SAS' Inspired Concepts for more Efficient Bigger Outputs