tab_many: Many cross-tables as one, with color helpers

View source: R/tab.R

tab_many    R Documentation

Description

A full-featured function to create, manipulate and format many cross-tables as one, using colors to make the printed table easier to read (in the R terminal, or exported to Excel with tab_xl). Since objects of class tabxplor_tab are also of class tibble, you can then use dplyr verbs such as select, arrange, filter or mutate to modify the result.

Only breaks for attractions/over-representations (in green) should be given, as a vector of positive doubles, with length between 1 and 5. Breaks for aversions/under-representations (in orange/red) will simply be the opposite.

Usage

tab_many(
  data,
  row_vars,
  col_vars,
  tab_vars,
  wt,
  pct = "no",
  color = "no",
  OR = "no",
  chi2 = FALSE,
  na = "keep",
  levels = "all",
  na_drop_all,
  cleannames = NULL,
  compact = NULL,
  other_if_less_than = 0,
  other_level = "Others",
  ref = "auto",
  ref2 = "first",
  comp = "tab",
  ci = "no",
  conf_level = 0.95,
  method_cell = "wilson",
  method_diff = "ac",
  totaltab = "line",
  totaltab_name = "Ensemble",
  totrow = TRUE,
  totcol = "last",
  total_names = "Total",
  add_n = TRUE,
  add_pct = FALSE,
  digits = 0,
  subtext = "",
  filter
)

tab_get_vars(tabs, vars = c("row_var", "col_vars", "tab_vars"))

is_tab(x)

set_color_style(
  type = c("text", "bg"),
  theme = NULL,
  html_24_bit = c("blue_red", "green_red", "no"),
  custom_palette = NULL
)

get_color_style(
  mode = c("crayon", "color_code"),
  type = NULL,
  theme = NULL,
  html_24_bit = NULL
)

set_color_breaks(pct_breaks, mean_breaks, contrib_breaks)

get_color_breaks(brk, type = c("positive", "all"))

Arguments

data

A data frame.

row_vars

The row variable, which will be printed with one level per line. If numeric, it will be converted to factor. If more than one row_var is provided, a different table is made for each of them.

col_vars

<tidy-select> One column is printed for each level of each column variable. For numeric variables means are calculated, in a single column. To pass many variables you may use syntax col_vars = c(col_var1, col_var2, ...).

tab_vars

<tidy-select> One subtable is made for each combination of levels of the tab variables. To pass many variables you may use syntax tab_vars = c(tab_var1, tab_var2, ...). All tab variables are converted to factor. Leave empty to make a simple table.

wt

A weight variable, of class numeric. Leave empty for unweighted results.

pct

The type of percentages to calculate :

  • "row": row percentages.

  • "col": column percentages.

  • "all": frequencies for each subtable/group, if there is tab_vars.

  • "all_tabs": frequencies for the whole (set of) table(s).

The argument is vectorised over both row_vars and col_vars, so you can write, for example: pct = list(row_var1 = list("row", "col", "col"), row_var2 = list("col", "row", "row"))

color

The type of colors to print, as a single string. Vectorised over row_vars.

  • "no": by default, no colors are printed.

  • "diff": color percentages and means based on cells differences from totals (or from first cells when ref = "first").

  • "diff_ci": color pct and means based on cells differences from totals or first cells, removing coloring when the confidence interval of this difference is higher than the difference itself.

  • "after_ci": same as "diff_ci", but the confidence interval is subtracted from the difference first.

  • "contrib": color cells based on their contribution to variance (except mean columns, from numeric variables).

  • "OR": for pct == "col" or pct == "row", color based on odds ratios (or relative risks ratios)

  • "auto": frequencies (pct = "all", pct = "all_tabs") and counts are colored with "contrib". When ci = "diff", row and col percentages are colored with "after_ci" ; otherwise they are colored with "diff".

OR

With pct = "row" or pct = "col", calculate and print odds ratios (for binary variables) or relative risks ratios (for variables with 3 levels or more).

  • "no": by default, no OR are calculated.

  • "OR": print OR (instead of percentages).

  • "OR_pct": print OR, with percentages in brackets.
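
To make the odds ratio output more concrete, here is a minimal base R sketch of the computation for a binary column variable. The counts and the helper name odds_ratio are made up for illustration, and the sketch assumes the first row is the reference. It is not a description of how tab_many computes its result internally.

```r
# Hypothetical 2 x 2 counts: rows are groups, columns are a binary outcome.
counts <- matrix(c(30, 70,
                   10, 90),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))

# Odds of the first column within each row, divided by the odds
# of a reference row (hypothetical helper, not part of tabxplor).
odds_ratio <- function(counts, ref_row = 1) {
  odds <- counts[, 1] / counts[, 2]
  odds / odds[ref_row]
}

or <- odds_ratio(counts, ref_row = 1)
or
```

The reference row always gets a ratio of 1; the other rows read as "how many times higher (or lower) the odds are than in the reference".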

chi2

Set to TRUE to calculate Chi2 summaries with tab_chi2. Useful to print metadata, and to color cells based on their contribution to variance (color = "contrib"). Vectorised over row_vars.

na

The policy to adopt with missing values. It must be a single string.

  • na = "keep": by default, prints NAs as an explicit "NA" level.

  • na = "drop": removes NA levels before making each table (tabs made with different column variables may have a different number of observations, and won't exactly have the same total columns).

  • "drop_all": remove NA's for all variables before making the tables.

levels

The levels of col_vars to keep (for more complex selections use dplyr::select). The argument is vectorised over col_vars.

  • "all": by default, all levels are kept.

  • "first": only keep the first level of each col_vars

  • "auto": keep the first level when col_var has only two levels, keep all levels otherwise

na_drop_all

<tidy-select> Removes all observations with a NA in any of the chosen variables, for all tables (tabs for each column variable will have the same number of observations).

cleannames

Set to TRUE to clean level names, by removing prefix numbers like "1-", and text in parentheses. All data formatting arguments are passed to tab_prepare.
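
As a rough illustration of the kind of cleaning described above, the following base R sketch strips a numeric prefix and parenthesised text from level names. The function name clean_level_names and both regular expressions are assumptions made for this example; the patterns actually used by tab_prepare may differ.

```r
# Hypothetical helper: the patterns below are illustrative guesses.
clean_level_names <- function(x) {
  x <- gsub("^[0-9]+-[[:space:]]*", "", x)        # drop a leading "1-", "12-", ...
  x <- gsub("[[:space:]]*\\([^)]*\\)", "", x)     # drop "(...)" and the space before it
  trimws(x)
}

cleaned <- clean_level_names(c("1-Married (legally)", "2-Never married", "Divorced"))
cleaned
```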

compact

With several row_vars, set to TRUE to bind all tables in a single tabxplor_tab. If not provided, the value of getOption("tabxplor.compact") is taken (FALSE by default). Set options(tabxplor.compact = TRUE) to make this the default behaviour for all tables (but beware, because it can break existing code).

other_if_less_than

When set to a positive integer, levels with a count below that number are merged into an "Others" level.

other_level

The name of the "Other" level, as a single string.
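
The merging of rare levels can be pictured with a small base R sketch. The helper lump_rare_levels is hypothetical and only mirrors the behaviour described for other_if_less_than and other_level; it is not the package's own code and ignores weights.

```r
# Hypothetical helper: merge levels whose count is below min_count.
lump_rare_levels <- function(x, min_count, other_level = "Others") {
  x <- as.factor(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  # Giving several levels the same name merges them into one level.
  levels(x)[levels(x) %in% rare] <- other_level
  x
}

x <- c("a", "a", "a", "b", "c", "c", "c", "d")
lumped <- lump_rare_levels(x, min_count = 2)
table(lumped)
```

Here "b" and "d" each appear once, so both end up in the "Others" level.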

ref

The reference cell to calculate differences and ratios (used to print colors) :

  • "auto": by default, cell difference from the corresponding total (rows or cols depending on pct = "row" or pct = "col") is used for diff; cell ratio from the first line (or col) is used for OR (odds ratio/relative risks ratio).

  • "tot": totals are always used.

  • "first": calculate cell difference or ratio from the first cell of the row or column (useful to color temporal developments).

  • n: when ref is an integer, the nth row (or column) is used for comparison.

  • "regex": when ref is a string, it is used as a regular expression, to match with the names of the rows (or columns). Be precise enough to match only one column or row, otherwise you get a warning message.

  • "no": do not use a reference and do not calculate differences, which saves calculation time.

ref2

A second reference cell is needed to calculate odds ratios (or relative risks ratios). The first cell of the row or column is used by default. See ref above for the full list of possible values.

comp

The comparison level : by subtables/groups, or for the whole table. Vectorised over row_vars.

  • "tab": by default, contributions to variance, row differences from totals/first cells, and row confidence intervals for these differences, are calculated for each tab_vars group.

  • "all": compare cells to the general total line (provided there is a total table with a total row), or with the reference line of the total table when ref = "first", an integer or a regular expression.

ci

The type of confidence intervals to calculate, passed to tab_ci. Vectorised over row_vars.

  • "cell": absolute confidence intervals of cells percentages.

  • "diff": confidence intervals of the difference between a cell and the relative total cell (or relative first cell when ref = "first").

  • "auto": ci = "diff" for means and row/col percentages, ci = "cell" for frequencies ("all", "all_tabs").

By default, for percentages, Wilson's method is used with ci = "cell", and Wald's method with Agresti and Caffo's adjustment is used with ci = "diff". Means use the classic method. This can be changed with method_cell and method_diff. By default, with ci = "cell", the result is printed in the [inf;sup] form. Set options("tabxplor.ci_print" = "moe") to print pct +- moe instead.
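
For readers who want to see what the two default methods compute, here is a base R sketch of the textbook Wilson score interval and of the Agresti-Caffo interval for a difference of two proportions. The function names wilson_ci and agresti_caffo_ci and the counts are invented for this example. The package itself refers to BinomCI and BinomDiffCI for the actual implementations, and weighting is ignored here.

```r
# Wilson score interval for one proportion x / n.
wilson_ci <- function(x, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  p <- x / n
  denom  <- 1 + z^2 / n
  center <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = center - half, upper = center + half)
}

# Agresti-Caffo interval: a Wald interval computed after adding
# one success and one failure to each of the two samples.
agresti_caffo_ci <- function(x1, n1, x2, n2, conf_level = 0.95) {
  z  <- qnorm(1 - (1 - conf_level) / 2)
  p1 <- (x1 + 1) / (n1 + 2)
  p2 <- (x2 + 1) / (n2 + 2)
  se <- sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
  d  <- p1 - p2
  c(lower = d - z * se, upper = d + z * se)
}

w  <- wilson_ci(30, 100)                  # cell interval, as with ci = "cell"
ac <- agresti_caffo_ci(30, 100, 45, 120)  # difference interval, as with ci = "diff"
w
ac
```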

conf_level

The confidence level, as a single numeric between 0 and 1. Defaults to 0.95 (95%).

method_cell

Character string specifying which method to use with percentages for ci = "cell". This can be one out of: "wald", "wilson", "wilsoncc", "agresti-coull", "jeffreys", "modified wilson", "modified jeffreys", "clopper-pearson", "arcsine", "logit", "witting", "pratt", "midp", "lik" and "blaker". Defaults to "wilson". See BinomCI.

method_diff

Character string specifying which method to use with percentages for ci = "diff". This can be one out of: "wald", "waldcc", "ac", "score", "scorecc", "mn", "mee", "blj", "ha", "hal", "jp". Defaults to "ac", Wald interval with the adjustment according to Agresti, Caffo for difference in proportions and independent samples. See BinomDiffCI.

totaltab

The total table, if there are subtables/groups (i.e. when tab_vars is provided). Vectorised over row_vars.

  • "line": by default, add a general total line (necessary for calculations with comp = "all")

  • "table": add a complete total table (i.e. row_var by col_vars without tab_vars).

  • "no": do not draw any total table.

totaltab_name

The name of the total table, as a single string.

totrow

By default, total rows are printed. Set to FALSE to remove them (after calculations if needed). Vectorised over row_vars.

totcol

The policy with total columns. Vectorised over col_vars.

  • "last": by default, only prints a total column for the last column variable (of class factor, not numeric).

  • "each": print a total column for each column variable.

  • "no": remove all total columns (after calculations if needed).

total_names

The names of the totals, as a character vector of length one or two. Use syntax of type c("Total row", "Total column") to set different names for rows and cols.

add_n

For pct = "row" or pct = "col", set to FALSE not to add another column or row with unweighted counts (n).

add_pct

Set to TRUE to add a column with the frequencies of the row variable (for pct = "row") or a row with the frequencies of the column variable (for pct = "col").

digits

The number of digits to print, as a single integer, or an integer vector the same length as col_vars. The argument is vectorised over col_vars.

subtext

A character vector to print rows of legend under the table.

filter

A dplyr::filter to apply to the data frame first, as a single string (which will be converted to code, i.e. to a call). Useful when printing multiple tabs with tibble::tribble, to use different filters for similar tables or simply to make the field of observation more visible in the code.
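
The idea of a filter given as a string and converted to a call can be sketched in base R as follows. The data frame and the use of str2lang are assumptions for illustration only; how tab_many performs the conversion internally is not stated in this page.

```r
# Toy data, invented for the example.
df <- data.frame(year = c(1998, 2000, 2005, 2012), n = 1:4)

# The filter arrives as a single string...
filter_string <- "year %in% 2000:2010"

# ...is turned into an unevaluated call...
condition <- str2lang(filter_string)

# ...and is then evaluated with the data frame columns in scope.
keep <- eval(condition, envir = df)
filtered <- df[keep, ]
filtered
```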

tabs

A tibble of class tab, made with tab, tab_many or tab_plain.

vars

In tab_get_vars, a character vector containing the wanted vars names: "row_var", "col_vars" or "tab_vars".

x

An object to test with is_tab.

type

For get_color_breaks, defaults to "positive", which only prints breaks for positive spreads. Set to "all" to get breaks for negative spreads as well.

theme

For set_color_style and get_color_style, whether your console or html table background is "light" or "dark". Defaults to the RStudio theme.

html_24_bit

Use 24-bit color palettes for html tables: set to "green_red" or "blue_red". Only with mode = "color_code" (not mode = "crayon") and theme = "light". Defaults to getOption("tabxplor.color_html_24_bit").

custom_palette

Possibility to provide custom color styles, as a character vector of 10 html color codes (the first five for over-represented numbers, the last five for under-represented ones). The result is saved to options("tabxplor.color_style"). To discard, relaunch the function with custom_palette = NULL.

mode

By default, get_color_style returns a list of crayon coloring functions. Set to "color_code" to return html color codes.

pct_breaks

If they are to be changed, the breaks used for percentages. Defaults to c(0.05, 0.1, 0.2, 2, 0.3): the first color is used when the pct of a cell is +5% above the pct of the related total; the second when it is +10% above; the third +20% above; the fourth when it is 2 times the total; the fifth +30% above. A break greater than 1 is read as a ratio rather than a difference. The opposite applies to cells inferior to the total (without the *2 rule). With color = "after_ci", the first break is subtracted from all breaks (default becomes c(0, 0.05, 0.15, 2, 0.25): +0%, +5%, +15%, *2, +25%).
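
A simplified base R sketch of how difference breaks could map to color levels is given below. It is an illustration under stated assumptions: only difference breaks are used (the ratio break greater than 1 is left out), a break is treated as reached when the absolute difference is greater than or equal to it, and the helper name color_level is hypothetical.

```r
# Difference breaks only; the ratio rule (breaks > 1) is deliberately omitted.
pct_breaks <- c(0.05, 0.1, 0.2, 0.3)

# Signed color level: 0 = no color, positive = over-representation,
# negative = under-representation (the breaks are mirrored).
color_level <- function(diff, breaks) {
  sign(diff) * findInterval(abs(diff), breaks)
}

# Differences between cell pct and total pct, invented for the example.
diffs <- c(0.02, 0.07, 0.25, -0.12, -0.4)
lv <- color_level(diffs, pct_breaks)
lv

# With color = "after_ci", the first break is subtracted from all breaks.
after_ci_breaks <- pct_breaks - pct_breaks[1]
after_ci_breaks
```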

mean_breaks

If they are to be changed, the breaks used for means. Defaults to c(1.15, 1.5, 2, 4): the first color is used when the mean of a cell is more than 1.15 times the mean of the related total row; the second when it is more than 1.5 times; etc. The opposite applies to cells inferior to the total. With color = "after_ci", all breaks are divided by the first break (default becomes approximately c(1, 1.3, 1.7, 3.5)).

contrib_breaks

If they are to be changed, the breaks used for contributions to variance. Default to c(1, 2, 5, 10) : first color used when the contribution of a cell is superior to the mean contribution ; second color used when it is superior to 2 times the mean contribution ; etc. The global color (for example green or red/orange) is given by the sign of the spread.
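
The following base R sketch shows one way to read these breaks, assuming that "contribution to variance" means the usual cell contribution to the chi-squared statistic, (observed - expected)^2 / expected. That reading, the counts and the variable names are assumptions for illustration, not a statement of the package's internals.

```r
# Invented counts: 3 rows by 2 columns.
observed <- matrix(c(30, 20, 10,
                     40, 25, 35),
                   nrow = 3,
                   dimnames = list(row = c("r1", "r2", "r3"),
                                   col = c("c1", "c2")))

# Expected counts under independence, and per-cell contributions.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
contrib  <- (observed - expected)^2 / expected

# Contributions relative to the mean contribution, cut with the breaks.
contrib_breaks <- c(1, 2, 5, 10)
relative <- contrib / mean(contrib)
level    <- findInterval(relative, contrib_breaks)

# The sign of the spread gives the color family (over or under-represented).
signed_level <- sign(observed - expected) * level
signed_level
```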

brk

When missing, return all color breaks. Specify to return a given color break, among "pct", "mean", "contrib", "pct_ci" and "mean_ci".

Value

A tibble of class tab, possibly with colored reading helpers. When there are two row_vars or more, a list of tibbles of class tab. All non-text columns are of class fmt, storing all the data necessary to print formats and colors. Columns with row_var and tab_vars are of class factor: every added factor will be considered as a tab_vars and used for grouping. To add text columns without using them in calculations, be sure they are of class character.

A list with the variables names.

A single logical.

Set global options "tabxplor.color_style_type" and "tabxplor.color_style_theme", used when printing tab objects.

A vector of crayon color functions, or a vector of color html codes.

Sets the global option "tabxplor.color_breaks" as a list of double vectors, and also returns it invisibly.

The color breaks as a double vector, or list of double vectors.

Functions

  • tab_get_vars(): Get the variables names of a tabxplor tab

  • is_tab(): a test function for class tabxplor_tab

  • set_color_style(): define the color style used to print tab.

  • get_color_style(): get color styles as crayon functions or html codes.

  • set_color_breaks(): set the breaks used to print colors

  • get_color_breaks(): get the breaks currently used to print colors

Examples

# Make a summary table with many col_vars, showing only one specific level :

library(dplyr)
first_lvs <- c("Married", "$25000 or more", "Strong republican", "Protestant")
data <- forcats::gss_cat %>% mutate(across(
  where(is.factor),
  ~ forcats::fct_relevel(., first_lvs[first_lvs %in% levels(.)])
))
tab_many(data, race, c(marital, rincome, partyid, relig, age, tvhours),
         levels = "first", pct = "row", chi2 = TRUE, color = "auto")


# Can be used with pmap and tribble to program several tables with different parameters
#  all at once, in a readable way:

library(purrr)
library(tibble)
pmap(
  tribble(
    ~row_var, ~col_vars       , ~pct , ~filter              , ~subtext               ,
    "race"  , "marital"       , "row", NULL                 , "Source: GSS 2000-2014",
    "relig" , c("race", "age"), "row", "year %in% 2000:2010", "Source: GSS 2000-2010",
    NA_character_, "race"     , "no" , NULL                 , "Source: GSS 2000-2014",
  ),
  .f = tab_many,
  data = forcats::gss_cat, color = "auto", chi2 = TRUE)

set_color_style(type = "bg")
set_color_breaks(
  pct_breaks = c(0.05, 0.15, 0.3),
  mean_breaks = c(1.15, 2, 4),
  contrib_breaks = c(1, 2, 5)
)

BriceNocenti/tablr documentation built on April 12, 2025, 12:56 a.m.