summarytabl is an R package designed to simplify the creation of summary tables for different types of data. It provides a set of functions that help you quickly describe categorical variables, multiple response and ordinal variables, and continuous variables.
Each function is clearly prefixed based on the type of data it summarizes, making it easy to identify and apply the right tool for your analysis.
Use these functions to summarize binary and nominal variables:
cat_tbl() creates a summary table for a categorical variable.
cat_group_tbl() summarizes two categorical variables.
These functions are ideal for summarizing binary, ordinal, and Likert-scale variables in which respondents select one response per statement, question, or item:
select_tbl() summarizes multiple response and ordinal variables.
select_group_tbl() summarizes multiple response and ordinal variables by a group or pattern.
For interval and ratio-level variables, use:
mean_tbl() generates summary statistics for continuous variables.
mean_group_tbl() generates summary statistics for continuous variables by group or pattern.
All functions work with data frames and tibbles, and each returns a tibble as output.
This document is organized into three sections, each focusing on a different set of functions for summarizing a specific type of variable.
To begin working with summarytabl, load the package:
library(summarytabl)
Keep reading to learn more about how each function works, or jump to the section that matches the type of variable or data you're working with.
Let's explore how to use cat_tbl() and cat_group_tbl() to summarize categorical variables. We'll begin by summarizing a single categorical variable, race, from the nlsy dataset.
cat_tbl(data = nlsy, var = "race")
The function returns a tibble with three columns by default:
race: the name of the variable being summarized
count: the number of observations in each category of race
percent: the percentage of observations in each category of race, calculated relative to the total
You can exclude certain values and eliminate missing values from the data using the ignore and na.rm arguments, respectively.
cat_tbl(data = nlsy, var = "race", ignore = "Hispanic", na.rm = TRUE)
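To make the count and percent columns concrete, here is a minimal base R sketch of the same idea. It is not the package's implementation; the toy vector, the "Other" category, and the object names are assumptions made for illustration.

```r
# Hypothetical toy data; this is not the nlsy dataset shipped with the package.
race <- c("Black", "Hispanic", "Black", NA, "Other", "Black")

# Drop missing values and one ignored category before tabulating.
kept <- race[!is.na(race) & race != "Other"]

# Count each remaining category and express it as a share of the total.
counts <- table(kept)
summary_df <- data.frame(
  race = names(counts),
  count = as.vector(counts),
  percent = as.vector(counts) / sum(counts)
)
summary_df
```

In this sketch the shares are proportions that sum to one over the values that remain after exclusion.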
Suppose we want to create a contingency table to summarize two categorical variables. We can do this using the cat_group_tbl() function. In this example, we summarize race by bthwht. Before applying cat_group_tbl(), we'll recode the values of bthwht, changing 0 to regular_birthweight and 1 to low_birthweight.
# Keep the two variables of interest and recode bthwht (0 = regular, 1 = low)
nlsy_cross_tab <-
  nlsy |>
  dplyr::select(c(race, bthwht)) |>
  dplyr::mutate(bthwht = ifelse(bthwht == 0, "regular_birthweight", "low_birthweight"))

cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht")
The function returns a tibble with four columns by default:
race: the name of the row_var variable
bthwht: the name of the col_var variable
count: the number of observations for each combination of race and bthwht categories
percent: the percentage of observations for each combination of race and bthwht categories, calculated relative to the total
To pivot the output to the wide format, set pivot = "wider".
cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht", pivot = "wider")
To display only percentages, set only = "percent". You can also control how those percentages are calculated and displayed using the margins argument.
# Default: percentages across the full table sum to one
cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht",
              pivot = "wider", only = "percent")

# Rowwise: percentages sum to one across columns within each row
cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht",
              margins = "rows", pivot = "wider", only = "percent")

# Columnwise: percentages within each column sum to one
cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht",
              margins = "columns", pivot = "wider", only = "percent")
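The three ways of computing percentages can be illustrated with base R's prop.table(). This is only a sketch of the arithmetic on a hypothetical 2 x 2 table of counts, not a description of how cat_group_tbl() works internally.

```r
# Hypothetical counts: rows are race categories, columns are birthweight groups.
counts <- matrix(
  c(10, 30,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(race = c("A", "B"), bthwht = c("low", "regular"))
)

overall <- prop.table(counts)              # whole table sums to one
by_row  <- prop.table(counts, margin = 1)  # each row sums to one
by_col  <- prop.table(counts, margin = 2)  # each column sums to one

by_row
```

The same cell can therefore carry three different values depending on which total it is divided by, which is why the choice of margin matters when reading the table.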
Sometimes, you may want to exclude specific values from your analysis. To do this, use a named vector or list to specify which values to exclude from the row_var and col_var variables. In the example below, the Non-Black, Non-Hispanic category (stored as "Non-Black,Non-Hispanic") is excluded from the race variable (i.e., row_var), and na.rm.row_var is set to TRUE so that NAs do not appear in the final table.
cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht", na.rm.row_var = TRUE, ignore = c(race = "Non-Black,Non-Hispanic"))
When you need to exclude more than one value from row_var or col_var, use a named list. In the example below, both the Non-Black, Non-Hispanic and Hispanic categories are excluded from the race variable.
cat_group_tbl(data = nlsy_cross_tab, row_var = "race", col_var = "bthwht", na.rm.row_var = TRUE, ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")))
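One reason a list is the natural container for several values under one name is how base R names the elements of a vector. The short sketch below shows the difference; the object names as_vector and as_list are illustrative only.

```r
# Combining two values under one name in c() renames the elements.
as_vector <- c(race = c("Non-Black,Non-Hispanic", "Hispanic"))
names(as_vector)  # "race1" "race2"

# A list keeps a single element called "race" that holds both values.
as_list <- list(race = c("Non-Black,Non-Hispanic", "Hispanic"))
names(as_list)    # "race"
```

With the vector form, neither element is named exactly "race", so a lookup by variable name would no longer find them.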
Next, let's explore how to use the select_tbl() and select_group_tbl() functions to summarize multiple response and ordinal variables. Such variables are commonly used in survey research, psychology, and health sciences. Examples include symptom checklists, scales like a depression index with multiple items, or questions allowing respondents to select all choices that apply to them.
The depressive dataset contains eight variables that share the same variable stem: dep, with each one representing a different item used to measure depression.
names(depressive)
Using the select_tbl() function, we can summarize participants' responses to these items by showing how many respondents chose each answer option (i.e., value) for every variable.
select_tbl(data = depressive, var_stem = "dep")
Alternatively, you can choose to summarize specific variables by passing their names to the var_stem argument and setting the var_input argument to "name".
select_tbl(data = depressive, var_stem = c("dep_1", "dep_4", "dep_6"), var_input = "name")
By default, missing values are removed using listwise deletion. To switch to pairwise deletion instead, set na_removal = "pairwise".
select_tbl(data = depressive, var_stem = "dep", na_removal = "pairwise")
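The difference between the two deletion strategies is easiest to see on a small example. The base R sketch below uses a hypothetical two-item data frame; it illustrates the general concepts and is not taken from the package's source.

```r
# Hypothetical items with one missing value each, in different rows.
items <- data.frame(
  dep_1 = c(1, 2, NA, 3, 1),
  dep_2 = c(2, NA, 1, 3, 2)
)

# Listwise: keep only rows that are complete across all items.
listwise <- items[complete.cases(items), ]
listwise_n <- sapply(listwise, function(x) sum(!is.na(x)))

# Pairwise: each item keeps all of its own non-missing values.
pairwise_n <- sapply(items, function(x) sum(!is.na(x)))

listwise_n
pairwise_n
```

Listwise deletion gives every item the same number of observations, while pairwise deletion retains more data at the cost of item-specific sample sizes.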
To display the output in the wide format, set pivot = "wider".
select_tbl(data = depressive, var_stem = "dep", na_removal = "pairwise", pivot = "wider")
It's common practice to group multiple response or ordinal variables by another variable. This type of descriptive analysis allows for meaningful comparisons across different segments of your dataset. With select_group_tbl(), you can create a summary table for multiple response and ordinal variables, grouped either by another variable in your dataset or by matching a pattern in the variable names. For example, we often want to summarize survey responses by race.
First, recode the race variable and the values for each of the eight depressive index variables in the depressive dataset, replacing numeric categories with descriptive string labels for easier interpretation.
dep_recoded <-
  depressive |>
  dplyr::mutate(
    race = dplyr::case_match(.x = race,
                             1 ~ "Hispanic",
                             2 ~ "Black",
                             3 ~ "Non-Black/Non-Hispanic",
                             .default = NA)
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("dep"),
      .fns = ~ dplyr::case_when(.x == 1 ~ "often",
                                .x == 2 ~ "sometimes",
                                .x == 3 ~ "hardly ever")
    ))
Next, use the select_group_tbl() function to summarize responses for all eight variables by race:
select_group_tbl(data = dep_recoded, var_stem = "dep", group = "race")
As with select_tbl(), setting the pivot argument to "wider" reshapes the table into the wide format, while using "pairwise" for the na_removal argument ensures missing values are addressed through pairwise deletion.
select_group_tbl(data = dep_recoded, var_stem = "dep", group = "race", na_removal = "pairwise", pivot = "wider")
The ignore argument can be used to exclude specific values from analysis. In the example below, the value often is removed from all eight depression index variables, and the Non-Black/Non-Hispanic category is excluded from the race variable.
select_group_tbl(data = dep_recoded, var_stem = "dep", group = "race", na_removal = "pairwise", pivot = "wider", ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))
When group_type is set to variable (the default), the margins argument controls how percentages are calculated and presented.
# Default: percentages across each variable sum to one
select_group_tbl(data = dep_recoded, var_stem = "dep", group = "race",
                 na_removal = "pairwise", pivot = "wider")

# Rowwise: for each value of the variable, the percentages
# across all levels of the grouping variable sum to one
select_group_tbl(data = dep_recoded, var_stem = "dep", group = "race",
                 margins = "rows", na_removal = "pairwise", pivot = "wider")

# Columnwise: for each level of the grouping variable,
# the percentages across all values of the variable sum
# to one
select_group_tbl(data = dep_recoded, var_stem = "dep", group = "race",
                 margins = "columns", na_removal = "pairwise", pivot = "wider")
Another way to use select_group_tbl() is to summarize responses that match a specific pattern, such as survey waves or time points. To enable this feature, set group_type = "pattern" and provide the desired pattern in the group argument. For example, the stem_social_psych dataset contains variables that capture student responses about their sense of belonging in the STEM community at two distinct time points: "w1" and "w2". You can summarize these responses using a pattern-based approach, where the time points (e.g., "w1" and "w2") serve as grouping variables.
select_group_tbl(data = stem_social_psych, var_stem = "belong_belong", group = "_w\\d", group_type = "pattern")
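To see what the regular expression "_w\\d" picks out, the base R sketch below applies it to a few variable names. The two belong_belongStem names follow the naming used in this dataset, "id" is a hypothetical extra column, and the matching logic is an assumption about the general approach rather than the package's actual code.

```r
# Variable names: two wave-specific items plus one hypothetical unrelated column.
vars <- c("belong_belongStem_w1", "belong_belongStem_w2", "id")

# Keep names that start with the stem and contain the wave pattern.
has_stem <- startsWith(vars, "belong_belong")
matched <- vars[has_stem & grepl("_w\\d", vars)]

# Extract the matched substring, which can then serve as the grouping value.
waves <- regmatches(matched, regexpr("_w\\d", matched))

# Group the variable names by the wave they belong to.
split(matched, waves)
```

The pattern matches an underscore, the letter "w", and a single digit, so each variable is assigned to the wave encoded in its name.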
Use the group_name argument to assign a descriptive name to the column containing the matched pattern values.
select_group_tbl(data = stem_social_psych, var_stem = "belong_belong", group = "_w\\d", group_type = "pattern", group_name = "wave")
You can also include variable labels in your summary table by using the var_labels argument.
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 var_labels = c(
                   belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
                   belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
                 ))
Finally, use the only argument to choose what information to return.
# Default: counts and percentages
select_group_tbl(data = stem_social_psych, var_stem = "belong_belong",
                 group = "_w\\d", group_type = "pattern", group_name = "wave")

# Counts only
select_group_tbl(data = stem_social_psych, var_stem = "belong_belong",
                 group = "_w\\d", group_type = "pattern", group_name = "wave",
                 only = "count")

# Percentages only
select_group_tbl(data = stem_social_psych, var_stem = "belong_belong",
                 group = "_w\\d", group_type = "pattern", group_name = "wave",
                 only = "percent")
Finally, let’s look at how to use the mean_tbl() and mean_group_tbl() functions to summarize continuous variables. The mean_tbl() function allows you to generate descriptive statistics for either a set of continuous variables that share a common stem or for individual continuous variables. The resulting summary table includes key metrics such as the variable's mean, standard deviation, minimum value, maximum value, and the count of non-missing observations for each variable.
The sdoh dataset contains six variables describing characteristics of health care facilities, all of which begin with the prefix HHC_PCT. Using the mean_tbl() function, you can generate summary statistics for these variables:
mean_tbl(data = sdoh, var_stem = "HHC_PCT")
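For readers who want to see the five statistics spelled out, here is a minimal base R sketch that computes them per variable on a hypothetical data frame. The helper describe() and the columns x1 and x2 are invented for illustration, and missing values are dropped per variable, which corresponds to a pairwise-style calculation rather than the listwise default.

```r
# Hypothetical continuous variables; x1 has one missing value.
scores <- data.frame(
  x1 = c(10, 20, NA, 40),
  x2 = c(1, 2, 3, 4)
)

# Illustrative helper: summary statistics on the non-missing values of one variable.
describe <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x), nobs = length(x))
}

# One row per variable, one column per statistic.
stats <- t(sapply(scores, describe))
stats
```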
Alternatively, if you want to generate summary statistics for only a subset of those variables, you can specify their names directly in the var_stem argument and set var_input = "name" to indicate you're referencing variable names rather than a shared stem.
mean_tbl(
  data = sdoh,
  var_stem = c("HHC_PCT_HHA_PHYS_THERAPY",
               "HHC_PCT_HHA_OCC_THERAPY",
               "HHC_PCT_HHA_SPEECH"),
  var_input = "name"
)
You can also specify how missing values are removed, using the na_removal argument.
# Default listwise removal
mean_tbl(data = sdoh, var_stem = "HHC_PCT")

# Pairwise removal
mean_tbl(data = sdoh, var_stem = "HHC_PCT", na_removal = "pairwise")
Consider adding variable labels using the var_labels argument to help make the variable names easier to interpret.
mean_tbl(data = sdoh,
         var_stem = "HHC_PCT",
         na_removal = "pairwise",
         var_labels = c(
           HHC_PCT_HHA_NURSING = "% agencies offering nursing care services",
           HHC_PCT_HHA_PHYS_THERAPY = "% agencies offering physical therapy services",
           HHC_PCT_HHA_OCC_THERAPY = "% agencies offering occupational therapy services",
           HHC_PCT_HHA_SPEECH = "% agencies offering speech pathology services",
           HHC_PCT_HHA_MEDICAL = "% agencies offering medical social services",
           HHC_PCT_HHA_AIDE = "% agencies offering home health aide services"
         ))
Similar to working with multiple response variables, it's common practice to group continuous variables by another variable to enable meaningful comparisons across different segments of a dataset. The mean_group_tbl() function facilitates this type of descriptive analysis by generating summary statistics for continuous variables, grouped either by a specific variable in the dataset or by matching patterns in variable names. For example, it's often useful to present summary statistics by demographic categories such as region, gender, age, or race.
mean_group_tbl(data = sdoh, var_stem = "HHC_PCT", group = "REGION", group_type = "variable")
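The underlying idea of a grouped summary can be sketched in base R with aggregate(). The toy data frame, its column names, and the region labels below are hypothetical and unrelated to the sdoh dataset.

```r
# Hypothetical data: one continuous variable measured in two regions.
toy <- data.frame(
  region = c("North", "North", "South", "South"),
  pct = c(10, 30, 20, 60)
)

# Mean of pct within each region.
group_means <- aggregate(pct ~ region, data = toy, FUN = mean)
group_means
```

Each row of the result describes one group, which is the same shape of output a grouped summary table aims for, extended to several statistics and variables.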
You can control which values to exclude and how missing data is handled using the ignore and na_removal arguments. To specify values to ignore, use a named vector or list, where each name corresponds to a variable stem or specific variable name.
# Default listwise removal
mean_group_tbl(data = sdoh, var_stem = "HHC_PCT", group = "REGION",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

# Pairwise removal
mean_group_tbl(data = sdoh, var_stem = "HHC_PCT", group = "REGION",
               na_removal = "pairwise",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

# Pairwise removal excluding several values from the same stem
# or group variable
mean_group_tbl(data = sdoh, var_stem = "HHC_PCT", group = "REGION",
               na_removal = "pairwise",
               ignore = list(HHC_PCT = 0, REGION = c("Northeast", "South")))
Another way to use mean_group_tbl() is to summarize responses based on a shared pattern, such as survey time points. To enable this feature, set group_type = "pattern" and specify the desired pattern in the group argument.
Consider a dataset compiled by researchers examining how many symptoms participants reported they'd had after a long illness. In this (fictitious) dataset, responses are collected at three time points: "t1" (baseline), "t2" (6-month follow-up), and "t3" (one-year follow-up). Using a pattern-based approach, you can group variables by these time points to generate summary statistics for each phase of data collection.
In the example below, we first create the symptoms_data dataset and then use the mean_group_tbl() function to generate summary statistics for variables that begin with the prefix symptoms and contain a substring matching the pattern "_t\\d", an underscore followed by the letter "t" and a single digit, indicating different time points. The ignore argument is also used to exclude the value -999 from the analysis.
set.seed(0803)

symptoms_data <-
  data.frame(
    symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
    symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
    symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
  )

mean_group_tbl(data = symptoms_data,
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               ignore = c(symptoms = -999))
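Why a sentinel value such as -999 needs to be excluded becomes clear when it is treated as missing before averaging. The base R sketch below rebuilds the same simulated data and recodes -999 to NA; it assumes that ignoring a value amounts to treating it as missing, and it drops missing values per variable, so its numbers need not match the listwise default of mean_group_tbl().

```r
set.seed(803)

# Same simulated data as in the example above.
symptoms_data <- data.frame(
  symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
  symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
  symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
)

# Recode the sentinel value to NA in every column (%in% never returns NA).
cleaned <- as.data.frame(
  lapply(symptoms_data, function(x) replace(x, x %in% -999, NA))
)

# Mean number of symptoms at each time point, using available values only.
means_by_time <- colMeans(cleaned, na.rm = TRUE)
means_by_time
```

Left in place, a single -999 would pull a mean far below the valid range of 0 to 10, so excluding it is essential for interpretable results.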
To make your output easier to understand, use the group_name argument to add a label to the column that shows grouping values or matched patterns. You can also use the var_labels argument to display descriptive labels for each variable.
mean_group_tbl(data = symptoms_data,
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               var_labels = c(symptoms_t1 = "# of symptoms at baseline",
                              symptoms_t2 = "# of symptoms at 6 months follow up",
                              symptoms_t3 = "# of symptoms at one-year follow up"))
Finally, you can choose what information to return using the only argument.
# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_group_tbl(data = symptoms_data, var_stem = "symptoms",
               group = "_t\\d", group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999))

# Means and non-missing observations only
mean_group_tbl(data = symptoms_data, var_stem = "symptoms",
               group = "_t\\d", group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "nobs"))

# Means and standard deviations only
mean_group_tbl(data = symptoms_data, var_stem = "symptoms",
               group = "_t\\d", group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "sd"))