desctable: Create Publication-Ready Descriptive Statistics Tables
In summata: Publication-Ready Summary Tables and Forest Plots

Description

Generates comprehensive descriptive statistics tables with automatic variable type detection, group comparisons, and appropriate statistical testing. This function is designed to create "Table 1"-style summaries commonly used in clinical and epidemiological research, with full support for continuous, categorical, and time-to-event variables.

Usage

desctable(
  data,
  by = NULL,
  variables,
  stats_continuous = c("median_iqr"),
  stats_categorical = "n_percent",
  digits = 1,
  p_digits = 3,
  conf_level = 0.95,
  p_per_stat = FALSE,
  na_include = FALSE,
  na_label = "Unknown",
  na_percent = FALSE,
  test = TRUE,
  test_continuous = "auto",
  test_categorical = "auto",
  total = TRUE,
  total_label = "Total",
  labels = NULL,
  number_format = NULL,
  ...
)

Arguments

`data`	Data frame or data.table containing the dataset to summarize. Automatically converted to a data.table for efficient processing.
`by`	Character string specifying the column name of the grouping variable for stratified analysis (e.g., treatment arm, exposure status). When `NULL` (default), produces overall summaries only without group comparisons or statistical tests.
`variables`	Character vector of variable names to summarize. Can include standard column names for continuous or categorical variables, and survival expressions using `Surv()` syntax (e.g., `"Surv(os_months, os_status)"`). Variables are processed in the order provided.
`stats_continuous`	Character vector specifying which statistics to compute for continuous variables. Multiple values create separate rows for each variable. Options: `"mean_sd"` - Mean ± standard deviation; `"median_iqr"` - Median [interquartile range]; `"median_range"` - Median (minimum-maximum); `"range"` - Minimum-maximum only. Default is `"median_iqr"`.
`stats_categorical`	Character string specifying the format for categorical variable summaries: `"n"` - Count only; `"percent"` - Percentage only; `"n_percent"` - Count (percentage) [default].
`digits`	Integer specifying the number of decimal places for continuous statistics. Default is 1.
`p_digits`	Integer specifying the number of decimal places for p-values. Values smaller than `10^(-p_digits)` are displayed as `"< 0.001"` (for `p_digits = 3`), `"< 0.0001"` (for `p_digits = 4`), etc. Default is 3.
`conf_level`	Numeric confidence level for confidence intervals in survival variable summaries (median survival time with CI). Must be between 0 and 1. Default is 0.95 (95% confidence intervals).
`p_per_stat`	Logical. If `TRUE`, displays p-values on each row (per statistic) rather than only on the first row of each variable. Useful when different statistics within a variable warrant separate significance testing. Default is `FALSE`.
`na_include`	Logical. If `TRUE`, missing values (NAs) are displayed as a separate category/row for each variable. If `FALSE`, missing values are silently excluded from calculations. Default is `FALSE`.
`na_label`	Character string used to label the missing values row when `na_include = TRUE`. Default is `"Unknown"`.
`na_percent`	Logical. Controls how percentages are calculated for categorical variables when `na_include = TRUE`. If `TRUE`, percentages include NAs in the denominator (all categories, including the missing category, sum to 100%). If `FALSE`, percentages exclude NAs from the denominator (non-missing categories sum to 100%, with missing shown separately). Only affects categorical variables. Default is `FALSE`.
`test`	Logical. If `TRUE`, performs appropriate statistical tests for group comparisons and adds a p-value column. Requires `by` to be specified. Tests are automatically selected based on variable type and test parameters. Default is `TRUE`.
`test_continuous`	Character string specifying the statistical test for continuous variables: `"auto"` - Automatic selection: t-test/ANOVA for means, Wilcoxon/Kruskal-Wallis for medians [default]; `"t"` - Independent samples t-test (2 groups only); `"aov"` - One-way ANOVA (2+ groups); `"wrs"` - Wilcoxon rank-sum test (2 groups only); `"kwt"` - Kruskal-Wallis test (2+ groups).
`test_categorical`	Character string specifying the statistical test for categorical variables: `"auto"` - Automatic selection: Fisher exact test if any expected cell frequency < 5, otherwise χ² test [default]; `"fisher"` - Fisher exact test; `"chisq"` - χ² test.
`total`	Logical or character string controlling the total column: `TRUE` or `"first"` - Include total column as first column after Variable/Group [default]; `"last"` - Include total column as last column before p-value; `FALSE` - Exclude total column.
`total_label`	Character string for the total column header. Default is `"Total"`.
`labels`	Named character vector or list providing custom display labels for variables. Names should match variable names (or `Surv()` expressions); values are the display labels. Variables not in `labels` use their original names. The grouping variable specified in `by` can also be labeled. Default is `NULL`.
`number_format`	Character string or two-element character vector controlling thousand and decimal separators in formatted output. Named presets: `"us"` - Comma thousands, period decimal: `1,234.56` [default]; `"eu"` - Period thousands, comma decimal: `1.234,56`; `"space"` - Thin-space thousands, period decimal: `1 234.56` (SI/ISO 31-0); `"none"` - No thousands separator: `1234.56`. Alternatively, provide a custom two-element vector `c(big.mark, decimal.mark)`, e.g., `c("'", ".")` for Swiss-style `1'234.56`. When `NULL` (default), uses `getOption("summata.number_format", "us")`. Set the global option once per session to avoid passing this argument repeatedly: `options(summata.number_format = "eu")`.
`...`	Additional arguments passed to the underlying statistical test functions (e.g., `var.equal = TRUE` for t-tests, `simulate.p.value = TRUE` for Fisher test).
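
The threshold rule described for `p_digits` can be sketched in base R. The helper `format_p()` below is hypothetical and is not part of summata; it only illustrates the documented behavior that values below `10^(-p_digits)` are shown as a "less than" string, and the package's internal formatting may differ.

```r
# Hypothetical helper (not a summata function): mimics the documented
# p_digits rule using base R only.
format_p <- function(p, p_digits = 3) {
  threshold <- 10^(-p_digits)
  if (p < threshold) {
    # Below the threshold: report "< threshold" instead of a rounded zero
    paste0("< ", formatC(threshold, format = "f", digits = p_digits))
  } else {
    formatC(p, format = "f", digits = p_digits)
  }
}

format_p(0.0004)                # "< 0.001"
format_p(0.0004, p_digits = 4)  # "0.0004"
format_p(0.0312)                # "0.031"
```

Raising `p_digits` therefore both adds decimal places and lowers the point at which a p-value collapses to the "less than" form.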

Details

Variable Type Detection:

The function automatically detects variable types and applies appropriate summaries:

Continuous: Numeric variables (integer or double) receive statistics specified in stats_continuous
Categorical: Character, factor, or logical variables receive frequency counts and percentages
Time-to-Event: Variables specified as Surv(time, event) display median survival with confidence intervals (level controlled by conf_level)
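
The detection rules above can be approximated with a few base R class checks. This is a minimal sketch under stated assumptions: `detect_type()` and the `toy` data frame are hypothetical, and summata's internal detection logic may differ (for example, in how it treats other classes such as dates).

```r
# Hypothetical classifier (not a summata function) following the rules
# listed above: Surv() expressions, then numeric, then categorical classes.
detect_type <- function(data, variable) {
  if (grepl("^Surv\\(", variable)) {
    return("time-to-event")
  }
  x <- data[[variable]]
  if (is.numeric(x)) {
    "continuous"
  } else if (is.character(x) || is.factor(x) || is.logical(x)) {
    "categorical"
  } else {
    "unsupported"
  }
}

# Toy data invented for illustration
toy <- data.frame(
  age = c(54, 61, 47),
  sex = factor(c("F", "M", "F")),
  smoker = c(TRUE, FALSE, NA),
  site = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

vars <- c("age", "sex", "smoker", "site", "Surv(os_months, os_status)")
types <- vapply(vars, function(v) detect_type(toy, v), character(1))
print(types)
```

One practical consequence of class-based detection is that a categorical variable coded as integers (such as 0/1) would be summarized as continuous unless it is converted to a factor first.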

Statistical Testing:

When test = TRUE and by is specified:

Continuous with "auto": Parametric tests (t-test, ANOVA) for mean-based statistics; non-parametric tests (Wilcoxon, Kruskal-Wallis) for median-based statistics
Categorical with "auto": Fisher exact test when any expected cell frequency < 5; χ² test otherwise
Survival: Log-rank test for comparing survival curves
Range statistics: No p-value computed (ranges are descriptive)
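
The "auto" rule for categorical variables can be illustrated with base R's `chisq.test()` and `fisher.test()`. The function `choose_categorical_test()` and the toy vectors are hypothetical and serve only as a sketch of the expected-frequency rule; summata's own implementation may compute or pass options differently.

```r
# Hypothetical helper (not a summata function): picks the test from the
# expected cell frequencies, as described for test_categorical = "auto".
choose_categorical_test <- function(x, group) {
  tab <- table(x, group)
  # Expected counts under independence; warnings about small counts are
  # suppressed because small counts are exactly what is being checked.
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (any(expected < 5)) {
    list(test = "fisher", p = fisher.test(tab)$p.value)
  } else {
    list(test = "chisq", p = chisq.test(tab)$p.value)
  }
}

# Toy data invented for illustration: two arms of 20 subjects each
group  <- rep(c("A", "B"), each = 20)
common <- rep(c("yes", "no"), times = 20)    # expected count of 10 per cell
rare   <- c(rep("yes", 3), rep("no", 37))    # expected "yes" count of 1.5 per arm

choose_categorical_test(common, group)$test  # "chisq"
choose_categorical_test(rare, group)$test    # "fisher"
```

Note that base R's `chisq.test()` applies a continuity correction to 2x2 tables by default; whether summata keeps that default is not stated here.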

Missing Data Handling:

Missing values are handled differently by variable type:

Continuous: NAs excluded from calculations; optionally shown as count when na_include = TRUE
Categorical: NAs can be included as a category when na_include = TRUE. The na_percent parameter controls whether percentages are calculated with or without NAs in the denominator
Survival: NAs in time or event excluded from analysis
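
The two denominators controlled by `na_percent` can be compared on a small vector. This sketch uses base R only; the `smoking` values are invented for illustration and the code is an analogue of the documented behavior, not summata's implementation.

```r
# Toy categorical variable with two missing values (invented data)
smoking <- c("never", "never", "current", "former", NA, "never", NA, "current")

# Analogue of na_percent = TRUE: NAs counted in the denominator,
# so all rows including the missing row sum to 100%.
counts_incl <- table(smoking, useNA = "ifany")
pct_incl <- 100 * counts_incl / length(smoking)

# Analogue of na_percent = FALSE: NAs dropped from the denominator,
# so the non-missing categories sum to 100%.
counts_excl <- table(smoking)
pct_excl <- 100 * counts_excl / sum(!is.na(smoking))

round(pct_incl, 1)  # current 25.0, former 12.5, never 37.5, NA 25.0
round(pct_excl, 1)  # current 33.3, former 16.7, never 50.0
```

The choice matters most when missingness differs between groups, because the same category count yields different percentages under the two denominators.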

Formatting Conventions:

All numeric output respects the number_format parameter. Separators within ranges and confidence intervals adapt automatically to avoid ambiguity:

Mean ± SD: "45.2 ± 12.3" (US) or "45,2 ± 12,3" (EU)
Median [IQR]: "38.0 [28.0-52.0]" (US) or "38,0 [28,0–52,0]" (EU, en-dash separator)
Range: "18.0-75.0" (positive, US), "-5.0 to 10.0" (when bounds are negative)
Survival: "24.5 (21.2-28.9)" (US) or "24,5 (21,2-28,9)" (EU)
Counts ≥ 1000: "1,234" (US) or "1.234" (EU)
p-values: "< 0.001" (US) or "< 0,001" (EU)
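
The separator conventions above map naturally onto the `big.mark` and `decimal.mark` arguments of base R's `formatC()`. The helpers `format_number()` and `format_range()` below are hypothetical and only sketch the idea; summata's internal formatting, including its choice of range separator in each locale, may differ.

```r
# Hypothetical helpers (not summata functions), base R only.
format_number <- function(x, digits = 1, big.mark = ",", decimal.mark = ".") {
  formatC(x, format = "f", digits = digits,
          big.mark = big.mark, decimal.mark = decimal.mark)
}

# Switch from a hyphen to " to " when a bound is negative, so the minus
# sign cannot be confused with the range separator.
format_range <- function(lo, hi, ...) {
  sep <- if (lo < 0 || hi < 0) " to " else "-"
  paste0(format_number(lo, ...), sep, format_number(hi, ...))
}

format_number(1234.56, digits = 2)                                      # "1,234.56"
format_number(1234.56, digits = 2, big.mark = ".", decimal.mark = ",")  # "1.234,56"
format_number(1234, digits = 0)                                         # "1,234"
format_range(18, 75)                                                    # "18.0-75.0"
format_range(-5, 10)                                                    # "-5.0 to 10.0"
```

A custom `number_format` vector such as `c("'", ".")` corresponds to passing `big.mark = "'"` and `decimal.mark = "."` in this sketch.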

Value

A data.table with S3 class "desctable" containing formatted descriptive statistics. The table structure includes:

Variable: Variable name or label (from labels)
Group: For continuous variables: statistic type (e.g., "Mean ± SD", "Median [IQR]"). For categorical variables: category level. Empty for variable name rows.
Total: Statistics for the total sample (if total = TRUE)
Group columns: Statistics for each group level (when by is specified). Column names match group levels.
p-value: Formatted p-values from statistical tests (when test = TRUE and by is specified)

The first row always shows sample sizes for each column. All numeric output (counts, statistics, p-values) respects the number_format setting for locale-appropriate formatting.

The returned object includes the following attributes accessible via attr():

raw_data: A data.table containing unformatted numeric values suitable for further statistical analysis or custom formatting. Includes additional columns for standard deviations, quartiles, etc.
by_variable: The grouping variable name used (value of by)
variables: The variables analyzed (value of variables)
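
The attribute mechanism can be tried without the package by attaching the same attribute names to a mock object. Everything in this sketch is an assumption made for illustration: `mock` is a plain data frame standing in for a real result, and the column names and values inside its `raw_data` are invented rather than taken from summata.

```r
# Mock stand-in for a desctable() result (the real object is a data.table
# built by summata; its raw_data columns may be named differently).
mock <- data.frame(
  Variable = c("N", "age"),
  Group = c("", "Median [IQR]"),
  Total = c("200", "58.0 [49.0-66.0]"),
  stringsAsFactors = FALSE
)
attr(mock, "raw_data") <- data.frame(variable = "age", median = 58, q1 = 49, q3 = 66)
attr(mock, "by_variable") <- "treatment"
attr(mock, "variables") <- "age"
class(mock) <- c("desctable", class(mock))

# Work with the numeric values instead of parsing formatted strings
raw <- attr(mock, "raw_data")
iqr_width <- raw$q3 - raw$q1
iqr_width                  # 17
attr(mock, "by_variable")  # "treatment"
```

Reading numbers from `raw_data` avoids re-parsing display strings, which would otherwise depend on the `digits` and `number_format` settings used when the table was built.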

Examples

# Load example clinical trial data
data(clintrial)

# Example 1: Basic descriptive table without grouping
desctable(clintrial,
        variables = c("age", "sex", "bmi"))



# Example 2: Grouped comparison with default tests
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "race", "bmi"))

# Example 3: Customize continuous statistics
desctable(clintrial,
        by = "treatment",
        variables = c("age", "bmi", "creatinine"),
        stats_continuous = c("median_iqr", "range"))

# Example 4: Change categorical display format
desctable(clintrial,
        by = "treatment",
        variables = c("sex", "race", "smoking"),
        stats_categorical = "n")  # Show counts only

# Example 5: Include missing values
desctable(clintrial,
        by = "treatment",
        variables = c("age", "smoking", "hypertension"),
        na_include = TRUE,
        na_label = "Missing")

# Example 6: Disable statistical testing
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "bmi"),
        test = FALSE)

# Example 7: Force specific tests
desctable(clintrial,
        by = "surgery",
        variables = c("age", "sex"),
        test_continuous = "t",      # t-test instead of auto
        test_categorical = "fisher") # Fisher test instead of auto

# Example 8: Adjust decimal places
desctable(clintrial,
        by = "treatment",
        variables = c("age", "bmi"),
        digits = 2,    # 2 decimals for continuous
        p_digits = 4)  # 4 decimals for p-values

# Example 9: Custom variable labels
labels <- c(
    age = "Age (years)",
    sex = "Sex",
    bmi = "Body Mass Index (kg/m\u00b2)",
    treatment = "Treatment Arm"
)

desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "bmi"),
        labels = labels)

# Example 10: Position total column last
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex"),
        total = "last")

# Example 11: Exclude total column
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex"),
        total = FALSE)

# Example 12: Survival analysis
desctable(clintrial,
        by = "treatment",
        variables = "Surv(os_months, os_status)")

# Example 13: Multiple survival endpoints
desctable(clintrial,
        by = "treatment",
        variables = c(
            "Surv(pfs_months, pfs_status)",
            "Surv(os_months, os_status)"
        ),
        labels = c(
            "Surv(pfs_months, pfs_status)" = "Progression-Free Survival",
            "Surv(os_months, os_status)" = "Overall Survival"
        ))

# Example 14: Mixed variable types
desctable(clintrial,
        by = "treatment",
        variables = c(
            "age", "sex", "race",           # Demographics
            "bmi", "creatinine",            # Labs
            "smoking", "hypertension",      # Risk factors
            "Surv(os_months, os_status)"    # Survival
        ))

# Example 15: Three or more groups
desctable(clintrial,
        by = "stage",  # Assuming stage has 3+ levels
        variables = c("age", "sex", "bmi"))
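# With 3+ groups, "auto" uses Kruskal-Wallis for the default median-based
# statistics (ANOVA for mean-based ones) and chi-squared or Fisher exact
# for categorical variables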

# Example 16: Access raw unformatted data
result <- desctable(clintrial,
                  by = "treatment",
                  variables = c("age", "bmi"))
raw_data <- attr(result, "raw_data")
print(raw_data)
# Raw data includes unformatted numbers, SDs, quartiles, etc.

# Example 17: Check which grouping variable was used
result <- desctable(clintrial,
                  by = "treatment",
                  variables = c("age", "sex"))
attr(result, "by_variable")  # "treatment"

# Example 18: NA percentage calculation options
# Include NAs in percentage denominator (all sum to 100%)
desctable(clintrial,
        by = "treatment",
        variables = "smoking",
        na_include = TRUE,
        na_percent = TRUE)

# Exclude NAs from denominator (non-missing sum to 100%)
desctable(clintrial,
        by = "treatment",
        variables = "smoking",
        na_include = TRUE,
        na_percent = FALSE)

# Example 19: Passing additional test arguments
# Equal variance t-test
desctable(clintrial,
        by = "sex",
        variables = "age",
        test_continuous = "t",
        var.equal = TRUE)

# Example 20: European number formatting
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "bmi"),
        number_format = "eu")

# Example 21: Complete Table 1 for publication
table1 <- desctable(
    data = clintrial,
    by = "treatment",
    variables = c(
        "age", "sex", "race", "ethnicity", "bmi",
        "smoking", "hypertension", "diabetes",
        "ecog", "creatinine", "hemoglobin",
        "site", "stage", "grade",
        "Surv(os_months, os_status)"
    ),
    labels = clintrial_labels,
    stats_continuous = c("median_iqr", "range"),
    total = TRUE,
    na_include = FALSE
)
print(table1)

summata documentation built on May 7, 2026, 5:07 p.m.