Introduction to summarytabl

Overview

summarytabl is an R package designed to simplify the creation of summary tables for different types of data. It provides a set of functions that help you quickly describe categorical variables, multiple response and ordinal variables, and continuous variables.

Each function is clearly prefixed based on the type of data it summarizes, making it easy to identify and apply the right tool for your analysis.

Use these functions to summarize binary and nominal variables: cat_tbl() and cat_group_tbl().

These functions are ideal for summarizing binary, ordinal, and Likert-scale variables in which respondents select one response per statement, question, or item: select_tbl() and select_group_tbl().

For interval and ratio-level variables, use mean_tbl() and mean_group_tbl().

All functions work with data frames and tibbles, and each returns a tibble as output.

This document is organized into three sections, each focusing on a different set of functions for summarizing a specific type of variable.

To begin working with summarytabl, load the package:

library(summarytabl)

Keep reading to learn more about how each function works, or jump to the section that matches the type of variable or data you're working with.

Working with categorical variables

Let's explore how to use cat_tbl() and cat_group_tbl() to summarize categorical variables. We'll begin by summarizing a single categorical variable, race, from the nlsy dataset.

cat_tbl(data = nlsy, var = "race")

The function returns a tibble with three columns by default.

You can exclude certain values and eliminate missing values from the data using the ignore and na.rm arguments, respectively.

cat_tbl(data = nlsy, 
        var = "race",
        ignore = "Hispanic",
        na.rm = TRUE)

Suppose we want to create a contingency table to summarize two categorical variables. We can do this using the cat_group_tbl() function. In this example, we summarize race by bthwht. Before applying cat_group_tbl(), we'll recode the values of bthwht, changing 0 to regular_birthweight and 1 to low_birthweight.

nlsy_cross_tab <- 
  nlsy |>
  dplyr::select(c(race, bthwht)) |>
  dplyr::mutate(bthwht = ifelse(bthwht == 0, "regular_birthweight", "low_birthweight")) 

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht")

The function returns a tibble with four columns by default.

To pivot the output to the wide format, set pivot = "wider".

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider")
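
As a rough illustration of what the long and wide layouts look like, here is a minimal base R sketch on a small made-up data frame. It does not use summarytabl and is not a description of how cat_group_tbl() works internally; the toy data and the column names produced by reshape() are assumptions for illustration only.

```r
# Toy data (hypothetical values, not taken from nlsy)
toy <- data.frame(
  race   = c("Black", "Black", "Hispanic"),
  bthwht = c("low", "regular", "regular")
)

# Long layout: one row per combination of race and bthwht
long <- as.data.frame(
  table(race = toy$race, bthwht = toy$bthwht),
  responseName = "count"
)

# Wide layout: one row per race, one count column per bthwht value
wide <- reshape(long, idvar = "race", timevar = "bthwht", direction = "wide")

print(long)
print(wide)
```

The long layout is usually easier to filter and plot, while the wide layout reads more like a conventional contingency table.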

To display only percentages, set only = "percent". You can also control how those percentages are calculated and displayed using the margins argument.

# Default: percentages across the full table sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider",
              only = "percent")

# Rowwise: percentages sum to one across columns within each row
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "rows",
              pivot = "wider",
              only = "percent")

# Columnwise: percentages within each column sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "columns",
              pivot = "wider",
              only = "percent")
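
To make the three ways of computing percentages concrete, the following base R sketch reproduces the arithmetic with table() and prop.table() on a small made-up data frame. It is only an illustration of the idea behind the margins argument, not summarytabl's implementation, and the toy values are assumptions.

```r
# Toy data (hypothetical values, not taken from nlsy)
toy <- data.frame(
  race   = c("Black", "Black", "Hispanic", "Hispanic", "Hispanic", "Black"),
  bthwht = c("low", "regular", "regular", "regular", "low", "regular")
)

counts <- table(toy$race, toy$bthwht)

# Whole table: all cells together sum to one
all_pct <- prop.table(counts)

# Rowwise: cells within each row sum to one
row_pct <- prop.table(counts, margin = 1)

# Columnwise: cells within each column sum to one
col_pct <- prop.table(counts, margin = 2)

print(all_pct)
print(row_pct)
print(col_pct)
```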

Sometimes you may want to exclude specific values from your analysis. To do this, pass a named vector or list to the ignore argument, where each name identifies the variable (row_var or col_var) and each element gives the value to exclude. In the example below, the "Non-Black,Non-Hispanic" category is excluded from the race variable (i.e., row_var), and na.rm.row_var is set to TRUE so that NAs are not returned in the final table.

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = c(race = "Non-Black,Non-Hispanic"))

When you need to exclude more than one value from row_var or col_var, use a named list. In the example below, both the "Non-Black,Non-Hispanic" and "Hispanic" categories are excluded from the race variable.

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")))
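
One way to picture why removing NAs matters here is to assume that ignored values are treated as missing. That assumption is only suggested by the need for na.rm.row_var above and is not confirmed from the package source. Under it, a base R sketch on made-up data looks like this:

```r
# Toy data (hypothetical values)
toy <- data.frame(
  race   = c("Black", "Hispanic", "Non-Black,Non-Hispanic", "Black"),
  bthwht = c("low_birthweight", "regular_birthweight",
             "regular_birthweight", "regular_birthweight"),
  stringsAsFactors = FALSE
)

# Assumption: ignored values are recoded to NA before tabulating
ignored <- c("Non-Black,Non-Hispanic", "Hispanic")
toy$race[toy$race %in% ignored] <- NA

# Keeping NAs yields an extra <NA> row; dropping them leaves only "Black"
with_na    <- table(toy$race, toy$bthwht, useNA = "ifany")
without_na <- table(toy$race, toy$bthwht)

print(with_na)
print(without_na)
```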

Working with multiple response and ordinal variables

Next, let's explore how to use the select_tbl() and select_group_tbl() functions to summarize multiple response and ordinal variables. These variables are commonly used in survey research, psychology, and the health sciences. Examples include symptom checklists, scales such as a depression index with multiple items, and questions that allow respondents to select all choices that apply to them.

The depressive dataset contains eight variables that share the same variable stem: dep, with each one representing a different item used to measure depression.

names(depressive)

Using the select_tbl() function, we can summarize participants' responses to these items by showing how many respondents chose each answer option (i.e., value) for every variable.

select_tbl(data = depressive, var_stem = "dep")

Alternatively, you can choose to summarize specific variables by passing their names to the var_stem argument and setting the var_input argument to "name".

select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name")

By default, missing values are removed using listwise deletion. To switch to pairwise deletion instead, set na_removal = "pairwise".

select_tbl(data = depressive, 
           var_stem = "dep",
           na_removal = "pairwise")
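
The difference between the two deletion strategies can be sketched in base R on a small made-up data frame. This only illustrates the general concepts of listwise and pairwise deletion; it does not call summarytabl, and the toy values are assumptions.

```r
# Toy data (hypothetical values): each variable has one missing response
toy <- data.frame(
  dep_1 = c(1, 2, NA, 3),
  dep_2 = c(1, NA, 2, 3)
)

# Listwise: drop every row that has a missing value in any variable
listwise_rows <- toy[complete.cases(toy), ]
n_listwise <- vapply(listwise_rows, function(col) sum(!is.na(col)), integer(1))

# Pairwise: each variable keeps all of its own non-missing values
n_pairwise <- vapply(toy, function(col) sum(!is.na(col)), integer(1))

print(n_listwise)
print(n_pairwise)
```

Listwise deletion gives every variable the same set of respondents, at the cost of discarding partial responses; pairwise deletion keeps more data, but each variable may then be based on a different number of respondents.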

To display the output in the wide format, set pivot = "wider".

select_tbl(data = depressive, 
           var_stem = "dep",
           na_removal = "pairwise",
           pivot = "wider")

It's common practice to group multiple response or ordinal variables by another variable. This type of descriptive analysis allows for meaningful comparisons across different segments of your dataset. With select_group_tbl(), you can create a summary table for multiple response and ordinal variables, grouped either by another variable in your dataset or by matching a pattern in the variable names. For example, we often want to summarize survey responses by race.

First, recode the race variable and the values for each of the eight depressive index variables in the depressive dataset, replacing numeric categories with descriptive string labels for easier interpretation.

dep_recoded <- 
  depressive |>
  dplyr::mutate(
    race = dplyr::case_match(.x = race,
                             1 ~ "Hispanic", 
                             2 ~ "Black", 
                             3 ~ "Non-Black/Non-Hispanic",
                             .default = NA)
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("dep"),
      .fns = ~ dplyr::case_when(.x == 1 ~ "often", 
                                .x == 2 ~ "sometimes", 
                                .x == 3 ~ "hardly ever")
    ))
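
If you prefer not to depend on dplyr for this step, a similar recoding can be sketched in base R with a lookup vector. The snippet below works on a made-up numeric vector rather than the depressive dataset, and the assumption that codes run from 1 to 3 mirrors the case_when() call above.

```r
# Hypothetical item responses coded 1-3, with one missing value
codes <- c(1, 2, 3, NA, 2)

# Position in the lookup vector corresponds to the numeric code
labels <- c("often", "sometimes", "hardly ever")

# Indexing by code returns the label; NA codes stay NA
recoded <- labels[codes]

print(recoded)
```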

Next, use the select_group_tbl() function to summarize responses for all eight variables by race:

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race")

As with select_tbl(), setting the pivot argument to "wider" reshapes the table into the wide format, while using "pairwise" for the na_removal argument ensures missing values are addressed through pairwise deletion.

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")

The ignore argument can be used to exclude specific values from analysis. In the example below, the value often is removed from all eight depression index variables, and the Non-Black/Non-Hispanic category is excluded from the race variable.

select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider",
                 ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))

When group_type is set to "variable" (the default), the margins argument controls how percentages are calculated and presented.

# Default: percentages across each variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")

# Rowwise: for each value of the variable, the percentages 
# across all levels of the grouping variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "rows",
                 na_removal = "pairwise",
                 pivot = "wider")

# Columnwise: for each level of the grouping variable, 
# the percentages across all values of the variable sum 
# to one.
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "columns",
                 na_removal = "pairwise",
                 pivot = "wider")

Another way to use select_group_tbl() is to summarize responses that match a specific pattern, such as survey waves or time points. To enable this feature, set group_type = "pattern" and provide the desired pattern in the group argument. For example, the stem_social_psych dataset contains variables that capture student responses about their sense of belonging in the STEM community at two distinct time points: "w1" and "w2". You can summarize these responses using a pattern-based approach, where the time points (e.g., "w1" and "w2") serve as grouping variables.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern")
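
To see what the regular expression in the group argument picks out, here is a small base R sketch that applies the same pattern to a vector of variable names. The names belong_belongStem_w1 and belong_belongStem_w2 come from this document; the id entry is a hypothetical extra column. How select_group_tbl() formats the matched text in its output is not shown here.

```r
# Variable names: two from the document plus one hypothetical non-matching name
vars <- c("belong_belongStem_w1", "belong_belongStem_w2", "id")

# Keep only the names that start with the stem
stem_vars <- vars[startsWith(vars, "belong_belong")]

# Extract the part of each name that matches the pattern "_w\\d"
wave <- regmatches(stem_vars, regexpr("_w\\d", stem_vars))

print(stem_vars)
print(wave)
```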

Use the group_name argument to assign a descriptive name to the column containing the matched pattern values.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")

You can also include variable labels in your summary table by using the var_labels argument.

select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 var_labels = c(
                   belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
                   belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
                 ))

Finally, use the only argument to choose what information to return.

# Default: counts and percentages
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")

# Counts only
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 only = "count")

# Percentages only
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 only = "percent")

Working with continuous variables

Finally, let’s look at how to use the mean_tbl() and mean_group_tbl() functions to summarize continuous variables. The mean_tbl() function allows you to generate descriptive statistics for either a set of continuous variables that share a common stem or for individual continuous variables. The resulting summary table includes key metrics such as the variable's mean, standard deviation, minimum value, maximum value, and the count of non-missing observations for each variable.

The sdoh dataset contains six variables describing characteristics of health care facilities, all of which begin with the prefix HHC_PCT. Using the mean_tbl() function, you can generate summary statistics for these variables:

mean_tbl(data = sdoh, var_stem = "HHC_PCT")
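
For reference, the statistics listed above can be computed by hand in base R. The sketch below uses a made-up numeric vector and plain column names chosen for illustration; it is not the package's implementation and makes no claim about the exact column names mean_tbl() returns.

```r
# Hypothetical continuous variable with one missing value
x <- c(10, 20, NA, 40)

# Work only with the non-missing observations
observed <- x[!is.na(x)]

stats <- data.frame(
  mean = mean(observed),
  sd   = sd(observed),
  min  = min(observed),
  max  = max(observed),
  nobs = length(observed)   # count of non-missing observations
)

print(stats)
```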

Alternatively, if you want to generate summary statistics for only a subset of those variables, you can specify their names directly in the var_stem argument and set var_input = "name" to indicate you're referencing variable names rather than a shared stem.

mean_tbl(
  data = sdoh,
  var_stem = c("HHC_PCT_HHA_PHYS_THERAPY",
               "HHC_PCT_HHA_OCC_THERAPY",
               "HHC_PCT_HHA_SPEECH"),
  var_input = "name"
)

You can also specify how missing values are removed, using the na_removal argument.

# Default listwise removal
mean_tbl(data = sdoh, var_stem = "HHC_PCT")

# Pairwise removal
mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise")

Consider adding variable labels using the var_labels argument to help make the variable names easier to interpret.

mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise",
         var_labels = c(
           HHC_PCT_HHA_NURSING="% agencies offering nursing care services",
           HHC_PCT_HHA_PHYS_THERAPY="% agencies offering physical therapy services",
           HHC_PCT_HHA_OCC_THERAPY="% agencies offering occupational therapy services",
           HHC_PCT_HHA_SPEECH="% agencies offering speech pathology services",
           HHC_PCT_HHA_MEDICAL="% agencies offering medical social services",
           HHC_PCT_HHA_AIDE="% agencies offering home health aide services"
         ))

Similar to working with multiple response variables, it's common practice to group continuous variables by another variable to enable meaningful comparisons across different segments of a dataset. The mean_group_tbl() function facilitates this type of descriptive analysis by generating summary statistics for continuous variables, grouped either by a specific variable in the dataset or by matching patterns in variable names. For example, it's often useful to present summary statistics by demographic categories such as region, gender, age, or race.

mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               group_type = "variable")

You can control which values to exclude and how missing data is handled using the ignore and na_removal arguments. To specify values to ignore, use a named vector or list, where each name corresponds to a variable stem or specific variable name.

# Default listwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

# Pairwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

# Pairwise removal excluding several values from the same stem 
# or group variable.
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = list(HHC_PCT = 0, REGION = c("Northeast", "South")))
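
One base R detail is worth keeping in mind when choosing between a named vector and a named list: c() coerces mixed types to a single common type, whereas list() preserves each element's type and allows several values per name. The sketch below shows only this base R behavior; whether it affects how summarytabl matches ignored values is not stated in this document.

```r
# Named vector: the number 0 is coerced to the string "0"
ignore_vec <- c(HHC_PCT = 0, REGION = "Northeast")

# Named list: 0 stays numeric, and REGION can hold several values
ignore_list <- list(HHC_PCT = 0, REGION = c("Northeast", "South"))

print(class(ignore_vec))
print(class(ignore_list$HHC_PCT))
print(length(ignore_list$REGION))
```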

Another way to use mean_group_tbl() is to summarize responses based on a shared pattern, such as survey time points. To enable this feature, set group_type = "pattern" and specify the desired pattern in the group argument.

Consider a dataset compiled by researchers examining how many symptoms participants reported they'd had after a long illness. In this (fictitious) dataset, responses are collected at three time points: "t1" (baseline), "t2" (6-month follow-up), and "t3" (one-year follow-up). Using a pattern-based approach, you can group variables by these time points to generate summary statistics for each phase of data collection.

In the example below, we first create the symptoms_data dataset and then use the mean_group_tbl() function to generate summary statistics for variables that begin with the prefix symptoms and contain a substring matching the pattern "_t\\d" (an underscore, the letter "t", and a single digit), which marks the time point. The ignore argument is also used to exclude the value -999 from the analysis.

set.seed(0803)
symptoms_data <-
  data.frame(
    symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
    symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
    symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
  )

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               ignore = c(symptoms = -999))
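
The reason for excluding a sentinel value such as -999 becomes obvious once you compute a mean by hand. The base R sketch below uses a tiny made-up data frame (not symptoms_data) and treats -999 as missing before averaging each time point. Treating ignored values as missing is an assumption made for illustration, not a statement about the package's internals.

```r
# Toy data (hypothetical values) with -999 as a "no answer" code
toy <- data.frame(
  symptoms_t1 = c(2, -999, 4),
  symptoms_t2 = c(NA, 6, -999)
)

# Means with the sentinel left in are badly distorted
raw_means <- colMeans(toy, na.rm = TRUE)

# Recode -999 to NA in every column, then average the remaining values
cleaned <- toy
cleaned[] <- lapply(cleaned, function(col) replace(col, col %in% -999, NA))
clean_means <- colMeans(cleaned, na.rm = TRUE)

print(raw_means)
print(clean_means)
```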

To make your output easier to understand, use the group_name argument to add a label to the column that shows grouping values or matched patterns. You can also use the var_labels argument to display descriptive labels for each variable.

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999), 
               var_labels = c(symptoms_t1 = "# of symptoms at baseline",
                              symptoms_t2 = "# of symptoms at 6 months follow up",
                              symptoms_t3 = "# of symptoms at one-year follow up"))

Finally, you can choose what information to return using the only argument.

# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999))

# Means and non-missing observations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "nobs"))

# Means and standard deviations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "sd"))


summarytabl documentation built on Nov. 6, 2025, 5:07 p.m.