desctable: Create Publication-Ready Descriptive Statistics Tables

desctable    R Documentation

Description

Generates comprehensive descriptive statistics tables with automatic variable type detection, group comparisons, and appropriate statistical testing. This function is designed to create "Table 1"-style summaries commonly used in clinical and epidemiological research, with full support for continuous, categorical, and time-to-event variables.

Usage

desctable(
  data,
  by = NULL,
  variables,
  stats_continuous = c("median_iqr"),
  stats_categorical = "n_percent",
  digits = 1,
  p_digits = 3,
  conf_level = 0.95,
  p_per_stat = FALSE,
  na_include = FALSE,
  na_label = "Unknown",
  na_percent = FALSE,
  test = TRUE,
  test_continuous = "auto",
  test_categorical = "auto",
  total = TRUE,
  total_label = "Total",
  labels = NULL,
  number_format = NULL,
  ...
)

Arguments

data

Data frame or data.table containing the dataset to summarize. Automatically converted to a data.table for efficient processing.

by

Character string specifying the column name of the grouping variable for stratified analysis (e.g., treatment arm, exposure status). When NULL (default), produces overall summaries only without group comparisons or statistical tests.

variables

Character vector of variable names to summarize. Can include standard column names for continuous or categorical variables, and survival expressions using Surv() syntax (e.g., "Surv(os_months, os_status)"). Variables are processed in the order provided.

stats_continuous

Character vector specifying which statistics to compute for continuous variables. Multiple values create separate rows for each variable. Options:

  • "mean_sd" - Mean ± standard deviation

  • "median_iqr" - Median [interquartile range]

  • "median_range" - Median (minimum-maximum)

  • "range" - Minimum-maximum only

Default is "median_iqr".
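To make the four options concrete, the base-R sketch below computes each summary for a small vector. The helper summarise_continuous() and its exact output strings are hypothetical and only illustrate the idea; desctable()'s internal quantile method and formatting may differ.

```r
# Hypothetical helper for illustration only; not part of summata.
summarise_continuous <- function(x, digits = 1) {
    x <- x[!is.na(x)]                                  # drop NAs before summarising
    f <- function(v) formatC(v, format = "f", digits = digits)
    q <- quantile(x, c(0.25, 0.75), names = FALSE)     # base R default (type 7) quartiles
    c(
        mean_sd      = paste0(f(mean(x)), " \u00b1 ", f(sd(x))),
        median_iqr   = paste0(f(median(x)), " [", f(q[1]), "-", f(q[2]), "]"),
        median_range = paste0(f(median(x)), " (", f(min(x)), "-", f(max(x)), ")"),
        range        = paste0(f(min(x)), "-", f(max(x)))
    )
}

age <- c(34, 41, 47, 52, 58, 63, NA)
res <- summarise_continuous(age)
print(res)
```

Passing several option names to stats_continuous corresponds to reporting several of these strings, one row each, for the same variable.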

stats_categorical

Character string specifying the format for categorical variable summaries:

  • "n" - Count only

  • "percent" - Percentage only

  • "n_percent" - Count (percentage) [default]

digits

Integer specifying the number of decimal places for continuous statistics. Default is 1.

p_digits

Integer specifying the number of decimal places for p-values. Values smaller than 10^(-p_digits) are displayed as "< 0.001" (for p_digits = 3), "< 0.0001" (for p_digits = 4), etc. Default is 3.
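The thresholding rule can be sketched in a few lines of base R. The helper format_p() is a hypothetical name used here for illustration; it mirrors the documented behavior but is not the package's own code.

```r
# Hypothetical helper mirroring the documented rule; not summata's own code.
format_p <- function(p, p_digits = 3) {
    threshold <- 10^(-p_digits)
    ifelse(
        p < threshold,
        # Below the threshold: report "< threshold" instead of a rounded zero
        paste0("< ", formatC(threshold, format = "f", digits = p_digits)),
        formatC(p, format = "f", digits = p_digits)
    )
}

p3 <- format_p(c(0.04217, 0.00004))        # "0.042"  "< 0.001"
p4 <- format_p(0.00004, p_digits = 4)      # "< 0.0001"
print(p3)
print(p4)
```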

conf_level

Numeric confidence level for confidence intervals in survival variable summaries (median survival time with CI). Must be between 0 and 1. Default is 0.95 (95% confidence intervals).

p_per_stat

Logical. If TRUE, displays p-values on each row (per statistic) rather than only on the first row of each variable. Useful when different statistics within a variable warrant separate significance testing. Default is FALSE.

na_include

Logical. If TRUE, missing values (NAs) are displayed as a separate category/row for each variable. If FALSE, missing values are silently excluded from calculations. Default is FALSE.

na_label

Character string used to label the missing values row when na_include = TRUE. Default is "Unknown".

na_percent

Logical. Controls how percentages are calculated for categorical variables when na_include = TRUE:

  • If TRUE, percentages include NAs in the denominator (all categories sum to 100%)

  • If FALSE, percentages exclude NAs from the denominator (non-missing categories sum to 100%, missing shown separately)

Only affects categorical variables. Default is FALSE.
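The difference between the two denominators is easiest to see on a toy vector. The following base-R sketch uses made-up data and does not call desctable(); it only reproduces the arithmetic described above.

```r
# Toy categorical variable with two missing values (made-up data)
smoking <- c("Never", "Never", "Former", "Current", NA, NA, "Never", "Former")

counts_with_na <- table(smoking, useNA = "ifany")   # NA counted as its own level
counts_no_na   <- table(smoking)                    # NA dropped

# na_percent = TRUE: NAs stay in the denominator, so all rows sum to 100%
pct_with_na <- 100 * counts_with_na / length(smoking)

# na_percent = FALSE: NAs leave the denominator, so non-missing rows sum to 100%
pct_without_na <- 100 * counts_no_na / sum(!is.na(smoking))

print(round(pct_with_na, 1))
print(round(pct_without_na, 1))
```

Here "Never" is 37.5% of all 8 observations but 50% of the 6 non-missing ones.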

test

Logical. If TRUE, performs appropriate statistical tests for group comparisons and adds a p-value column. Only takes effect when by is specified; with by = NULL, no tests are run. Tests are automatically selected based on variable type and the test_continuous and test_categorical parameters. Default is TRUE.

test_continuous

Character string specifying the statistical test for continuous variables:

  • "auto" - Automatic selection: t-test/ANOVA for means, Wilcoxon/Kruskal-Wallis for medians [default]

  • "t" - Independent samples t-test (2 groups only)

  • "aov" - One-way ANOVA (2+ groups)

  • "wrs" - Wilcoxon rank-sum test (2 groups only)

  • "kwt" - Kruskal-Wallis test (2+ groups)
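One plausible reading of the "auto" rule is sketched below in base R. The helper pick_continuous_test() is hypothetical, and the assumption that two groups map to the two-sample tests while three or more map to ANOVA/Kruskal-Wallis is inferred from the option descriptions rather than taken from the package source.

```r
# Hypothetical helper sketching the documented "auto" rule; not summata's own code.
pick_continuous_test <- function(stat, n_groups) {
    if (stat == "range") return(NA_character_)   # ranges are descriptive: no test
    if (stat == "mean_sd") {
        if (n_groups == 2) "t" else "aov"        # parametric for mean-based statistics
    } else {
        if (n_groups == 2) "wrs" else "kwt"      # rank-based for median-based statistics
    }
}

print(pick_continuous_test("mean_sd", 2))
print(pick_continuous_test("median_iqr", 3))

# The underlying base-R tests on a toy three-group data set
set.seed(1)
d <- data.frame(y = rnorm(30), g = gl(3, 10))
p_kw  <- kruskal.test(y ~ g, data = d)$p.value
p_aov <- summary(aov(y ~ g, data = d))[[1]][["Pr(>F)"]][1]
print(c(kruskal = p_kw, anova = p_aov))
```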

test_categorical

Character string specifying the statistical test for categorical variables:

  • "auto" - Automatic selection: Fisher exact test if any expected cell frequency < 5, otherwise χ² test [default]

  • "fisher" - Fisher exact test

  • "chisq" - χ² test
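The expected-frequency rule behind "auto" can be reproduced with base R, where chisq.test() exposes the expected counts. The helper pick_categorical_test() is a hypothetical name for this sketch, not a summata function.

```r
# Hypothetical helper sketching the documented "auto" rule; not summata's own code.
pick_categorical_test <- function(tab) {
    # chisq.test() warns on sparse tables; only its expected counts are needed here
    expected <- suppressWarnings(chisq.test(tab)$expected)
    if (any(expected < 5)) "fisher" else "chisq"
}

sparse <- matrix(c(3, 1, 2, 6), nrow = 2)        # smallest expected count is below 5
dense  <- matrix(c(30, 25, 20, 35), nrow = 2)    # all expected counts are 25 or 30

choice_sparse <- pick_categorical_test(sparse)
choice_dense  <- pick_categorical_test(dense)
print(c(sparse = choice_sparse, dense = choice_dense))

# The tests that would then be run
print(fisher.test(sparse)$p.value)
print(chisq.test(dense)$p.value)
```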

total

Logical or character string controlling the total column:

  • TRUE or "first" - Include total column as first column after Variable/Group [default]

  • "last" - Include total column as last column before p-value

  • FALSE - Exclude total column

total_label

Character string for the total column header. Default is "Total".

labels

Named character vector or list providing custom display labels for variables. Names should match variable names (or Surv() expressions), values are the display labels. Variables not in labels use their original names. Can also label the grouping variable specified in by. Default is NULL.

number_format

Character string or two-element character vector controlling thousand and decimal separators in formatted output. Named presets:

  • "us" - Comma thousands, period decimal: 1,234.56 [default]

  • "eu" - Period thousands, comma decimal: 1.234,56

  • "space" - Thin-space thousands, period decimal: 1 234.56 (SI/ISO 31-0)

  • "none" - No thousands separator: 1234.56

Or provide a custom two-element vector c(big.mark, decimal.mark), e.g., c("'", ".") for Swiss-style: 1'234.56.

When NULL (default), uses getOption("summata.number_format", "us"). Set the global option once per session to avoid passing this argument repeatedly:

    options(summata.number_format = "eu")
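As a rough sketch of how the presets relate to separator pairs, the base-R snippet below maps each preset name onto the big.mark and decimal.mark arguments of formatC(). The helper format_number() and its digits default are hypothetical; the package's own formatting code may work differently.

```r
# Hypothetical helper: maps the documented presets onto base R's formatC().
format_number <- function(x, number_format = "us", digits = 2) {
    presets <- list(
        us    = c(",", "."),
        eu    = c(".", ","),
        space = c("\u2009", "."),    # thin space as thousands separator
        none  = c("", ".")
    )
    # A two-element vector is treated as c(big.mark, decimal.mark)
    marks <- if (length(number_format) == 2) number_format else presets[[number_format]]
    formatC(x, format = "f", digits = digits,
            big.mark = marks[1], decimal.mark = marks[2])
}

us    <- format_number(1234.56)                  # "1,234.56"
eu    <- format_number(1234.56, "eu")            # "1.234,56"
swiss <- format_number(1234.56, c("'", "."))     # "1'234.56"
print(c(us, eu, swiss))
```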
  
...

Additional arguments passed to the underlying statistical test functions (e.g., var.equal = TRUE for t-tests, simulate.p.value = TRUE for Fisher test).

Details

Variable Type Detection:

The function automatically detects variable types and applies appropriate summaries:

  • Continuous: Numeric variables (integer or double) receive statistics specified in stats_continuous

  • Categorical: Character, factor, or logical variables receive frequency counts and percentages

  • Time-to-Event: Variables specified as Surv(time, event) display median survival with confidence intervals (level controlled by conf_level)
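The detection rule above amounts to a dispatch on column class, with Surv() expressions recognized from the variable string. The sketch below is a hypothetical base-R illustration of that rule using a made-up data frame; it is not the package's implementation.

```r
# Hypothetical helper sketching the documented detection rule; not summata's own code.
detect_type <- function(data, variable) {
    if (grepl("^Surv\\(", variable)) return("time-to-event")   # survival expression
    x <- data[[variable]]
    if (is.numeric(x)) return("continuous")                     # integer or double
    if (is.character(x) || is.factor(x) || is.logical(x)) return("categorical")
    stop("Unsupported column type for '", variable, "'")
}

# Made-up data for illustration
toy <- data.frame(
    age       = c(54, 61, 47),
    sex       = factor(c("F", "M", "F")),
    diabetes  = c(TRUE, FALSE, NA),
    os_months = c(12.1, 30.4, 8.7),
    os_status = c(1, 0, 1)
)

vars  <- c("age", "sex", "diabetes", "Surv(os_months, os_status)")
types <- vapply(vars, detect_type, character(1), data = toy)
print(types)
```

One practical consequence: a categorical variable stored as integer codes (for example 0/1) would be treated as continuous under this rule, so converting it to a factor first is the safer choice.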

Statistical Testing:

When test = TRUE and by is specified:

  • Continuous with "auto": Parametric tests (t-test, ANOVA) for mean-based statistics; non-parametric tests (Wilcoxon, Kruskal-Wallis) for median-based statistics

  • Categorical with "auto": Fisher exact test when any expected cell frequency < 5; χ² test otherwise

  • Survival: Log-rank test for comparing survival curves

  • Range statistics: No p-value computed (ranges are descriptive)

Missing Data Handling:

Missing values are handled differently by variable type:

  • Continuous: NAs excluded from calculations; optionally shown as count when na_include = TRUE

  • Categorical: NAs can be included as a category when na_include = TRUE. The na_percent parameter controls whether percentages are calculated with or without NAs in the denominator

  • Survival: NAs in time or event excluded from analysis

Formatting Conventions:

All numeric output respects the number_format parameter. Separators within ranges and confidence intervals adapt automatically to avoid ambiguity:

  • Mean ± SD: "45.2 ± 12.3" (US) or "45,2 ± 12,3" (EU)

  • Median [IQR]: "38.0 [28.0-52.0]" (US) or "38,0 [28,0-52,0]" (EU, en-dash separator)

  • Range: "18.0-75.0" (positive, US), "-5.0 to 10.0" (when bounds are negative)

  • Survival: "24.5 (21.2-28.9)" (US) or "24,5 (21,2-28,9)" (EU)

  • Counts ≥ 1000: "1,234" (US) or "1.234" (EU)

  • p-values: "< 0.001" (US) or "< 0,001" (EU)
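The range convention, where a hyphen would be confused with a minus sign, can be sketched as follows. The helper format_range() is hypothetical and covers only the US-style cases listed above.

```r
# Hypothetical helper illustrating the separator rule for ranges; not summata's own code.
format_range <- function(lo, hi, digits = 1) {
    f <- function(v) formatC(v, format = "f", digits = digits)
    # A hyphen next to a negative bound is ambiguous, so switch to " to "
    sep <- if (lo < 0 || hi < 0) " to " else "-"
    paste0(f(lo), sep, f(hi))
}

r_pos <- format_range(18, 75)    # "18.0-75.0"
r_neg <- format_range(-5, 10)    # "-5.0 to 10.0"
print(c(r_pos, r_neg))
```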

Value

A data.table with S3 class "desctable" containing formatted descriptive statistics. The table structure includes:

Variable

Variable name or label (from labels)

Group

For continuous variables: statistic type (e.g., "Mean ± SD", "Median [IQR]"). For categorical variables: category level. Empty for variable name rows.

Total

Statistics for the total sample (if total = TRUE)

Group columns

Statistics for each group level (when by is specified). Column names match group levels.

p-value

Formatted p-values from statistical tests (when test = TRUE and by is specified)

The first row always shows sample sizes for each column. All numeric output (counts, statistics, p-values) respects the number_format setting for locale-appropriate formatting.

The returned object includes the following attributes accessible via attr():

raw_data

A data.table containing unformatted numeric values suitable for further statistical analysis or custom formatting. Includes additional columns for standard deviations, quartiles, etc.

by_variable

The grouping variable name used (value of by)

variables

The variables analyzed (value of variables)

See Also

survtable for detailed survival summary tables, fit for regression modeling, table2pdf for PDF export, table2docx for Word export, table2html for HTML export

Other descriptive functions: print.survtable(), survtable()

Examples

# Load example clinical trial data
data(clintrial)

# Example 1: Basic descriptive table without grouping
desctable(clintrial,
        variables = c("age", "sex", "bmi"))



# Example 2: Grouped comparison with default tests
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "race", "bmi"))

# Example 3: Customize continuous statistics
desctable(clintrial,
        by = "treatment",
        variables = c("age", "bmi", "creatinine"),
        stats_continuous = c("median_iqr", "range"))

# Example 4: Change categorical display format
desctable(clintrial,
        by = "treatment",
        variables = c("sex", "race", "smoking"),
        stats_categorical = "n")  # Show counts only

# Example 5: Include missing values
desctable(clintrial,
        by = "treatment",
        variables = c("age", "smoking", "hypertension"),
        na_include = TRUE,
        na_label = "Missing")

# Example 6: Disable statistical testing
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "bmi"),
        test = FALSE)

# Example 7: Force specific tests
desctable(clintrial,
        by = "surgery",
        variables = c("age", "sex"),
        test_continuous = "t",      # t-test instead of auto
        test_categorical = "fisher") # Fisher test instead of auto

# Example 8: Adjust decimal places
desctable(clintrial,
        by = "treatment",
        variables = c("age", "bmi"),
        digits = 2,    # 2 decimals for continuous
        p_digits = 4)  # 4 decimals for p-values

# Example 9: Custom variable labels
labels <- c(
    age = "Age (years)",
    sex = "Sex",
    bmi = "Body Mass Index (kg/m\u00b2)",
    treatment = "Treatment Arm"
)

desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "bmi"),
        labels = labels)

# Example 10: Position total column last
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex"),
        total = "last")

# Example 11: Exclude total column
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex"),
        total = FALSE)

# Example 12: Survival analysis
desctable(clintrial,
        by = "treatment",
        variables = "Surv(os_months, os_status)")

# Example 13: Multiple survival endpoints
desctable(clintrial,
        by = "treatment",
        variables = c(
            "Surv(pfs_months, pfs_status)",
            "Surv(os_months, os_status)"
        ),
        labels = c(
            "Surv(pfs_months, pfs_status)" = "Progression-Free Survival",
            "Surv(os_months, os_status)" = "Overall Survival"
        ))

# Example 14: Mixed variable types
desctable(clintrial,
        by = "treatment",
        variables = c(
            "age", "sex", "race",           # Demographics
            "bmi", "creatinine",            # Labs
            "smoking", "hypertension",      # Risk factors
            "Surv(os_months, os_status)"    # Survival
        ))

# Example 15: Three or more groups
desctable(clintrial,
        by = "stage",  # Assuming stage has 3+ levels
        variables = c("age", "sex", "bmi"))
# With 3+ groups, "auto" uses ANOVA or Kruskal-Wallis (depending on the statistic)
# and chi-squared or Fisher exact (depending on expected cell counts)

# Example 16: Access raw unformatted data
result <- desctable(clintrial,
                  by = "treatment",
                  variables = c("age", "bmi"))
raw_data <- attr(result, "raw_data")
print(raw_data)
# Raw data includes unformatted numbers, SDs, quartiles, etc.

# Example 17: Check which grouping variable was used
result <- desctable(clintrial,
                  by = "treatment",
                  variables = c("age", "sex"))
attr(result, "by_variable")  # "treatment"

# Example 18: NA percentage calculation options
# Include NAs in percentage denominator (all sum to 100%)
desctable(clintrial,
        by = "treatment",
        variables = "smoking",
        na_include = TRUE,
        na_percent = TRUE)

# Exclude NAs from denominator (non-missing sum to 100%)
desctable(clintrial,
        by = "treatment",
        variables = "smoking",
        na_include = TRUE,
        na_percent = FALSE)

# Example 19: Passing additional test arguments
# Equal variance t-test
desctable(clintrial,
        by = "sex",
        variables = "age",
        test_continuous = "t",
        var.equal = TRUE)

# Example 20: European number formatting
desctable(clintrial,
        by = "treatment",
        variables = c("age", "sex", "bmi"),
        number_format = "eu")

# Example 21: Complete Table 1 for publication
# (assumes clintrial_labels, a named vector of display labels, is available in the session)
table1 <- desctable(
    data = clintrial,
    by = "treatment",
    variables = c(
        "age", "sex", "race", "ethnicity", "bmi",
        "smoking", "hypertension", "diabetes",
        "ecog", "creatinine", "hemoglobin",
        "site", "stage", "grade",
        "Surv(os_months, os_status)"
    ),
    labels = clintrial_labels,
    stats_continuous = c("median_iqr", "range"),
    total = TRUE,
    na_include = FALSE
)
print(table1)




summata documentation built on May 7, 2026, 5:07 p.m.