Comparison with dplyr Tabulation
In rtables: Reporting Tables


## Introduction

In this vignette, we discuss the similarities and differences between `dplyr` and `rtables`.

Much of the `rtables` framework focuses on tabulation/summarizing of data and then the visualization of the table. In this vignette, we focus on summarizing data using `dplyr` and contrast it to `rtables`. We won't pay attention to the table visualization/markup and just derive the cell content.

Using `dplyr` to summarize data and `gt` to visualize the table works well when the tabulation is relatively simple. However, there are tables, such as the one created in the [`introduction`](https://insightsengineering.github.io/rtables/latest-tag/articles/introduction.html) vignette, that take some effort to create with `dplyr`. Part of the effort is due to the fact that `dplyr` stores the table data in `data.frame`s or `tibble`s, which are not the most natural way to represent a table, as we will show in this vignette.

If you know a more elegant way of deriving the table content with `dplyr` please let us know and we will update the vignette.


```r
library(rtables)
library(dplyr)
```

Here is the table and data used in the `introduction` vignette:

```r
n <- 400

set.seed(1)

df <- tibble(
  arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
  country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
  handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
  age = rchisq(n, 30) + 10
) %>% mutate(
  weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
)

lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
```

## Getting Started

We will start by deriving the first data cell on row 3 (note that rows 1 and 2 hold content cells; see the `introduction` vignette). Cell 3,1 contains the mean age of left-handed female Canadians in "Arm A":

```r
mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female" & df$handed == "Left"])
```

or with `dplyr`:

```r
df %>%
  filter(country == "CAN", arm == "Arm A", gender == "Female", handed == "Left") %>%
  summarise(mean_age = mean(age))
```

Further, `dplyr` gives us other verbs to easily get the average age of left-handed Canadians for each group defined by the 4 columns:

```r
df %>%
  group_by(arm, gender) %>%
  filter(country == "CAN", handed == "Left") %>%
  summarise(mean_age = mean(age))
```
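As a side note on the grouped summary above, the order of `filter()` and `group_by()` does not change the cell values here, and `summarise()` accepts a `.groups` argument (assuming dplyr 1.0.0 or later) that controls the grouping of the result. The sketch below illustrates this on a small hand-made tibble; `toy` and its values are made up for illustration and are not the vignette's `df`.

```r
library(dplyr)

# Hypothetical toy data (made-up values), standing in for the vignette's `df`
toy <- tibble(
  arm     = c("Arm A", "Arm A", "Arm B", "Arm B", "Arm A", "Arm B"),
  gender  = c("Female", "Male", "Female", "Male", "Female", "Male"),
  country = c("CAN", "CAN", "CAN", "USA", "CAN", "CAN"),
  handed  = c("Left", "Left", "Left", "Left", "Right", "Left"),
  age     = c(30, 40, 50, 60, 70, 80)
)

# Group first, then filter (the order used in the text)
group_then_filter <- toy %>%
  group_by(arm, gender) %>%
  filter(country == "CAN", handed == "Left") %>%
  summarise(mean_age = mean(age), .groups = "drop")

# Filter first, then group
filter_then_group <- toy %>%
  filter(country == "CAN", handed == "Left") %>%
  group_by(arm, gender) %>%
  summarise(mean_age = mean(age), .groups = "drop")

# Both orders yield the same cell values
stopifnot(isTRUE(all.equal(group_then_filter$mean_age, filter_then_group$mean_age)))
```

With `.groups = "drop"` the result is an ungrouped tibble, which avoids surprises when further verbs are chained onto it.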

We can go further and get all of the average age cell values with:

```r
average_age <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarise(mean_age = mean(age))

average_age
```
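A grouped summary like this one is in long form, with one row per cell. To bring it closer to the layout of a table, one option is to reshape it so that the column-defining variables become columns. The sketch below assumes the `tidyr` package, which the vignette itself does not use, and works on a small made-up tibble (`toy`) rather than on `df`.

```r
library(dplyr)
library(tidyr) # assumption: tidyr is not used elsewhere in this vignette

# Hypothetical toy data (made-up values)
toy <- tibble(
  arm     = c("Arm A", "Arm A", "Arm B", "Arm B", "Arm A", "Arm B"),
  gender  = c("Female", "Male", "Female", "Male", "Female", "Male"),
  country = c("CAN", "CAN", "CAN", "CAN", "USA", "USA"),
  age     = c(30, 40, 50, 60, 35, 45)
)

# Long form: one row per (arm, gender, country) cell
toy_long <- toy %>%
  group_by(arm, gender, country) %>%
  summarise(mean_age = mean(age), .groups = "drop")

# Wide form: one row per country, one column per arm-gender pair;
# combinations that do not occur in the data become NA
toy_wide <- toy_long %>%
  pivot_wider(names_from = c(arm, gender), values_from = mean_age)

toy_wide
```

The nested column structure (gender within arm) is flattened into column names such as `Arm A_Female`, which is one illustration of why a data frame is not the most natural container for a table.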

In `rtables` syntax, we need the following code to get to the same content:

```r
lyt <- basic_table() %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  split_rows_by("handed") %>%
  analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
```

As mentioned in the introduction to this vignette, please ignore the difference in arranging and formatting the data: it is possible to condense the `rtable` more, and it is possible to make the `tibble` look more like the reference table using the `gt` R package.

In terms of tabulation, `rtables` arguably did not add much over `dplyr` for this example.

## Content Information

Unlike in `rtables`, the different levels of summarization are discrete computations in `dplyr`, which we will then need to combine.

We first focus on the count and percentage information for handedness within each country (for each arm-gender pair), along with the analysis row mean values:

```r
c_h_df <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarize(mean = mean(age), c_h_count = n()) %>%
  # we need the sum below to *not* be by country, so that we're dividing by the column counts
  ungroup(country) %>%
  # `handed` was dropped by summarize() and `country` by ungroup(), so the
  # remaining groups are the arm-gender columns and we can calculate percent now:
  mutate(n_col = sum(c_h_count), c_h_percent = c_h_count / n_col)
c_h_df
```

This data frame has 16 rows (cells), like the `average_age` data frame defined above. Next, we will derive the group information for countries:

```r
c_df <- df %>%
  group_by(arm, gender, country) %>%
  summarize(c_count = n()) %>%
  # summarize() has dropped the `country` grouping, so sum() now runs over
  # each arm-gender column and we can calculate percent:
  mutate(n_col = sum(c_count), c_percent = c_count / n_col)
c_df
```
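Both summaries rely on how `dplyr` removes grouping variables: by default `summarize()` drops the last grouping variable, and `ungroup()` can remove named ones (assuming dplyr 1.0.0 or later). The following sketch makes those steps visible with `group_vars()` on a small made-up tibble; `toy` and the column names `n_cell` and `pct` are illustrative only.

```r
library(dplyr)

# Hypothetical toy data (made-up values)
toy <- tibble(
  arm     = c("A", "A", "A", "A", "B", "B"),
  country = c("CAN", "CAN", "CAN", "USA", "CAN", "USA"),
  handed  = c("Left", "Left", "Right", "Left", "Left", "Right")
)

counts <- toy %>%
  group_by(arm, country, handed) %>%
  summarise(n_cell = n(), .groups = "drop_last") # same as the default here

# summarise() peeled off the innermost grouping variable, `handed`
group_vars(counts)

# Remove `country` as well, so sums run over each arm (the "column")
by_arm <- counts %>% ungroup(country)
group_vars(by_arm)

# Percentages relative to the column count
pct <- by_arm %>%
  mutate(n_col = sum(n_cell), pct = n_cell / n_col) %>%
  ungroup()
pct
```

Checking `group_vars()` before a grouped `mutate()` is a cheap way to confirm which denominator `sum()` will actually compute.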

Finally, we `left_join()` the two levels of summary to get a `data.frame` containing the full set of values which make up the body of our table (note, however, that they are not in the same order):

```r
# joins on all shared columns: arm, gender, country and n_col
full_dplyr <- left_join(c_h_df, c_df) %>% ungroup()
```

Alternatively, we could calculate only the counts in `c_h_df`, and use `mutate()` after the `left_join()` to divide the counts by the `n_col` values, which are more naturally calculated within `c_df`. This would simplify the creation of `c_h_df` somewhat by not requiring the explicit `ungroup()`, but it prevents each level of summarization from being a self-contained set of computations.
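The alternative described above could look roughly like the following sketch. It uses a small made-up tibble (`toy`) with a single column variable instead of the vignette's `df`, and the object names `c_h_counts`, `c_counts` and `full_toy` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (made-up values)
toy <- tibble(
  arm     = c("A", "A", "A", "A", "B", "B"),
  country = c("CAN", "CAN", "CAN", "USA", "CAN", "USA"),
  handed  = c("Left", "Left", "Right", "Left", "Left", "Right")
)

# Inner level: counts only, no percentages yet
c_h_counts <- toy %>%
  count(arm, country, handed, name = "c_h_count")

# Outer level: counts, column totals and percentages
c_counts <- toy %>%
  count(arm, country, name = "c_count") %>%
  group_by(arm) %>%
  mutate(n_col = sum(c_count), c_percent = c_count / n_col) %>%
  ungroup()

# Join with explicit keys, then derive the inner percentages from n_col
full_toy <- c_h_counts %>%
  left_join(c_counts, by = c("arm", "country")) %>%
  mutate(c_h_percent = c_h_count / n_col)

full_toy
```

Note that the country-level values (`c_count`, `c_percent`) are repeated on every handedness row of the same country, and that the inner percentages now depend on a column computed in the other summary.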

The `rtables` call in contrast is:

```r
lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
```

We can now spot check that the values are the same:

```r
frm_rtables_h <- cell_values(
  tbl,
  rowpath = c("country", "CAN", "handed", "Right", "@content"),
  colpath = c("arm", "Arm B", "gender", "Female")
)[[1]]
frm_rtables_h

frm_dplyr_h <- full_dplyr %>%
  filter(country == "CAN" & handed == "Right" & arm == "Arm B" & gender == "Female") %>%
  select(c_h_count, c_h_percent)

frm_dplyr_h

frm_rtables_c <- cell_values(
  tbl,
  rowpath = c("country", "CAN", "@content"),
  colpath = c("arm", "Arm A", "gender", "Male")
)[[1]]

frm_rtables_c

frm_dplyr_c <- full_dplyr %>%
  filter(country == "CAN" & arm == "Arm A" & gender == "Male") %>%
  select(c_count, c_percent)

frm_dplyr_c

stopifnot(isTRUE(all.equal(frm_rtables_h, unname(unlist(frm_dplyr_h)))))
stopifnot(isTRUE(all.equal(frm_rtables_c, unname(unlist(frm_dplyr_c[1, ])))))
```
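The same kind of spot check can be extended to every handedness content cell by looping over the rows of the `dplyr` summary. The sketch below is only an illustration under stated assumptions: it builds its own simulated data (`toy`, with a single column split by `arm`) rather than reusing `df`, and the object names are hypothetical. The `"@content"` row paths follow the pattern used in the spot check above.

```r
library(rtables)
library(dplyr)

# Hypothetical simulated data, simpler than the vignette's `df`
set.seed(42)
n <- 200
toy <- tibble(
  arm     = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE)),
  country = factor(sample(c("CAN", "USA"), n, replace = TRUE)),
  handed  = factor(sample(c("Left", "Right"), n, replace = TRUE)),
  age     = rchisq(n, 30) + 10
)

toy_tbl <- basic_table() %>%
  split_cols_by("arm") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x") %>%
  build_table(toy)

# Count and percentage (relative to the arm total) for each cell
toy_cells <- toy %>%
  count(arm, country, handed, name = "cell_count") %>%
  group_by(arm) %>%
  mutate(cell_percent = cell_count / sum(cell_count)) %>%
  ungroup()

# Compare each dplyr row with the matching rtables content cell
n_checked <- 0
for (i in seq_len(nrow(toy_cells))) {
  from_rtables <- cell_values(
    toy_tbl,
    rowpath = c(
      "country", as.character(toy_cells$country[i]),
      "handed", as.character(toy_cells$handed[i]), "@content"
    ),
    colpath = c("arm", as.character(toy_cells$arm[i]))
  )[[1]]
  from_dplyr <- c(toy_cells$cell_count[i], toy_cells$cell_percent[i])
  stopifnot(isTRUE(all.equal(from_rtables, from_dplyr)))
  n_checked <- n_checked + 1
}
n_checked
```

The loop only visits combinations that occur in the data, because `count()` does not return rows for empty groups by default.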

Further, for this particular table, the `rtables` syntax is hopefully also a bit more straightforward for deriving the cell values than `dplyr`.

## Summary

In this vignette we learned that:

- many tables are quite easily created with `dplyr` and `data.frame` or `tibble` as data structure
- `dplyr` keeps simple things simple
- if tables have group summaries, then information has to be repeated in the data frame
- `rtables` streamlines the construction of complex tables

We recommend that you continue reading the `clinical_trials` vignette, where we create a number of more advanced tables using layouts.