## Introduction

In this vignette we will introduce some theory behind using layouts for table creation. Much of the theory also holds true when using other table packages. For this vignette we will use the following packages:

```r
library(dplyr)
library(tibble)
library(rtables)
```
The data we use is the following, created with random number generators:
```r
add_subgroup <- function(x) paste0(tolower(x), sample(1:3, length(x), TRUE))

set.seed(1)

df <- tibble(
  x = rnorm(100),
  c1 = factor(sample(c("A", "B", "C"), 100, replace = TRUE), levels = c("A", "B", "C")),
  r1 = factor(sample(c("U", "V", "W"), 100, replace = TRUE), levels = c("U", "V", "W"))
) %>%
  mutate(
    c2 = add_subgroup(c1),
    r2 = add_subgroup(r1),
    y = as.numeric(2 * as.numeric(c1) - 3 * as.numeric(r1))
  ) %>%
  select(c1, c2, r1, r2, x, y)

df
```
Let's look at a table that has 3 columns and 3 rows. Each row represents a different analysis (functions `foo`, `bar`, `zoo` that return an `rcell()` object):
```
               A           B           C
------------------------------------------------
foo_label   foo(df_A)   foo(df_B)   foo(df_C)
bar_label   bar(df_A)   bar(df_B)   bar(df_C)
zoo_label   zoo(df_A)   zoo(df_B)   zoo(df_C)
```
The data passed to each analysis function is the subset of `df` defined by the respective column:
```r
df_A <- df %>% filter(c1 == "A")
df_B <- df %>% filter(c1 == "B")
df_C <- df %>% filter(c1 == "C")
```
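As a plain base R sketch of this pattern (independent of `rtables`, and using a small made-up data frame `toy` rather than the vignette's `df`), the nine cell values can be computed by splitting on the column variable and applying each analysis function to every subset. The names `toy`, `by_col`, `funs`, and `cells` are assumptions made for illustration only.

```r
# Hypothetical toy data: 10 observations per column level
set.seed(1)
toy <- data.frame(
  x = rnorm(30),
  c1 = factor(rep(c("A", "B", "C"), each = 10))
)

# One data frame per column of the table (df_A, df_B, df_C in the text)
by_col <- split(toy, toy$c1)

# One analysis function per row of the table; each takes a data frame
funs <- list(
  foo_label = function(d) prod(d$x),
  bar_label = function(d) sum(d$x),
  zoo_label = function(d) mean(d$x)
)

# Rows are analyses, columns are the levels of c1
cells <- sapply(by_col, function(d) sapply(funs, function(f) f(d)))
cells
```

This produces only the raw numbers; formatting, labels, and the table structure are what the layout adds on top.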
Let's do this on the concrete data with `analyze()`:
```r
foo <- prod
bar <- sum
zoo <- mean

lyt <- basic_table() %>%
  split_cols_by("c1") %>%
  analyze("x", function(df) foo(df$x), var_labels = "foo label", format = "xx.xx") %>%
  analyze("x", function(df) bar(df$x), var_labels = "bar label", format = "xx.xx") %>%
  analyze("x", function(df) zoo(df$x), var_labels = "zoo label", format = "xx.xx")

tbl <- build_table(lyt, df)
tbl
```
or if we wanted the `x` variable instead of the data frame:
```
               A          B          C
------------------------------------------------
foo_label   foo(x_A)   foo(x_B)   foo(x_C)
bar_label   bar(x_A)   bar(x_B)   bar(x_C)
zoo_label   zoo(x_A)   zoo(x_B)   zoo(x_C)
```
where:
```r
x_A <- df_A$x
x_B <- df_B$x
x_C <- df_C$x
```
The function passed to `afun` is evaluated using argument matching. If `afun` has an argument `x`, the analysis variable specified in `vars` in `analyze()` is passed to the function, and if `afun` has an argument `df`, then a subset of the dataset is passed to `afun`:
```r
lyt2 <- basic_table() %>%
  split_cols_by("c1") %>%
  analyze("x", foo, var_labels = "foo label", format = "xx.xx") %>%
  analyze("x", bar, var_labels = "bar label", format = "xx.xx") %>%
  analyze("x", zoo, var_labels = "zoo label", format = "xx.xx")

tbl2 <- build_table(lyt2, df)
tbl2
```
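To illustrate the idea of choosing the input by argument name, here is a simplified base R sketch. It is not the `rtables` implementation: the helper `call_afun()` is hypothetical, and it assumes the decision is made from the name of the first formal argument only.

```r
# Hypothetical helper: pass the data frame if the first argument is named
# "df", otherwise pass the analysis variable as a vector.
call_afun <- function(afun, df, var) {
  # args() makes formals() work for primitives such as sum() as well
  first_arg <- names(formals(args(afun)))[1]
  if (identical(first_arg, "df")) {
    afun(df)
  } else {
    afun(df[[var]])
  }
}

toy <- data.frame(x = c(1, 2, 3, 4), g = c("a", "a", "b", "b"))

call_afun(function(x) mean(x), toy, "x")   # receives the vector toy$x
call_afun(function(df) nrow(df), toy, "x") # receives the whole data frame
call_afun(sum, toy, "x")                   # first formal is "...", so the vector
```

The sketch also shows why functions such as `sum` or `mean` can be passed directly as `afun`: their first argument is not named `df`, so they receive the variable.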
Note that a function can also return multiple rows with `in_rows()`:
```r
lyt3 <- basic_table() %>%
  split_cols_by("c1") %>%
  analyze("x", function(x) {
    in_rows(
      "row 1" = rcell(mean(x), format = "xx.xx"),
      "row 2" = rcell(sd(x), format = "xx.xxx")
    )
  }, var_labels = "foo label") %>%
  analyze("x", function(x) {
    in_rows(
      "more rows 1" = rcell(median(x), format = "xx.x"),
      "even more rows 1" = rcell(IQR(x), format = "xx.xx")
    )
  }, var_labels = "bar label", format = "xx.xx")

tbl3 <- build_table(lyt3, df)
tbl3
```
This is the approach we recommend for specifying the row names explicitly.
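A smaller, self-contained sketch of the same idea: a single analysis function that names each of its rows through `in_rows()`. The data frame `toy` and the function `s_summary` are made up for this example, and the exact printed output may vary between `rtables` versions.

```r
library(rtables)

# Hypothetical toy data with two observations per column level
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  c1 = factor(c("A", "A", "B", "B", "C", "C"))
)

# Each named argument of in_rows() becomes one row with that row name
s_summary <- function(x) {
  in_rows(
    "mean" = rcell(mean(x), format = "xx.xx"),
    "sd" = rcell(sd(x), format = "xx.xx"),
    "n" = rcell(length(x), format = "xx")
  )
}

lyt_toy <- basic_table() %>%
  split_cols_by("c1") %>%
  analyze("x", s_summary)

tbl_toy <- build_table(lyt_toy, toy)
tbl_toy

# The explicitly chosen row names are visible on the built table
row.names(tbl_toy)
```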
Let's say we would like to create the following table:
```
         A           B           C
--------------------------------------
U   foo(df_UA)  foo(df_UB)  foo(df_UC)
V   foo(df_VA)  foo(df_VB)  foo(df_VC)
W   foo(df_WA)  foo(df_WB)  foo(df_WC)
```
where `df_*` are subsets of `df` as follows:
```r
df_UA <- df %>% filter(r1 == "U", c1 == "A")
df_VA <- df %>% filter(r1 == "V", c1 == "A")
df_WA <- df %>% filter(r1 == "W", c1 == "A")
df_UB <- df %>% filter(r1 == "U", c1 == "B")
df_VB <- df %>% filter(r1 == "V", c1 == "B")
df_WB <- df %>% filter(r1 == "W", c1 == "B")
df_UC <- df %>% filter(r1 == "U", c1 == "C")
df_VC <- df %>% filter(r1 == "V", c1 == "C")
df_WC <- df %>% filter(r1 == "W", c1 == "C")
```
Further note that the `df_*` objects are of the same class as `df`, i.e. `tibble`s. Hence `foo` aggregates the subset of our data to a cell value.
Given a function `foo` (ignore the `...` for now):
```r
foo <- function(df, labelstr = "", ...) {
  paste(dim(df), collapse = " x ")
}
```
we can start calculating the cell values individually:
```r
foo(df_UA)
foo(df_VA)
foo(df_WA)
foo(df_UB)
foo(df_VB)
foo(df_WB)
foo(df_UC)
foo(df_VC)
foo(df_WC)
```
Now we are still missing the table structure:
```r
matrix(
  list(
    foo(df_UA), foo(df_VA), foo(df_WA),
    foo(df_UB), foo(df_VB), foo(df_WB),
    foo(df_UC), foo(df_VC), foo(df_WC)
  ),
  byrow = FALSE, ncol = 3
)
```
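The nine subsets and the matrix do not have to be written out by hand. As a base R sketch, `tapply()` over the row indices can build the same row-by-column grid in one step. The balanced data frame `toy` and the name `cells` are assumptions for this example.

```r
# Hypothetical balanced toy data: 2 observations per (r1, c1) combination
toy <- expand.grid(
  r1 = c("U", "V", "W"),
  c1 = c("A", "B", "C"),
  rep = 1:2
)
toy$x <- seq_len(nrow(toy))

# Same aggregation as in the text: describe the dimensions of the subset
foo <- function(df, labelstr = "", ...) {
  paste(dim(df), collapse = " x ")
}

# Group the row indices by r1 (rows) and c1 (columns), then apply foo
# to the matching subset of the data frame
cells <- tapply(
  seq_len(nrow(toy)),
  list(toy$r1, toy$c1),
  function(i) foo(toy[i, , drop = FALSE])
)
cells
```

Every cell here is built from two rows and four columns of `toy`, so every entry reads `"2 x 4"`.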
In `rtables` this type of tabulation is done with layouts:
```r
lyt4 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  analyze("x", foo)

tbl4 <- build_table(lyt4, df)
tbl4
```
or, if we did not want to see the `foo` label, we would have to use:
```r
lyt5 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  summarize_row_groups(cfun = foo, format = "xx")

tbl5 <- build_table(lyt5, df)
tbl5
```
but now the row labels have disappeared. This is because `cfun` needs to define its row label. So let's redefine `foo`:
```r
foo <- function(df, labelstr) {
  rcell(paste(dim(df), collapse = " x "), format = "xx", label = labelstr)
}

lyt6 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  summarize_row_groups(cfun = foo)

tbl6 <- build_table(lyt6, df)
tbl6
```
Now let's calculate the mean of `df$y` for pattern I:
```r
foo <- function(df, labelstr) {
  rcell(mean(df$y), label = labelstr, format = "xx.xx")
}

lyt7 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  summarize_row_groups(cfun = foo)

tbl7 <- build_table(lyt7, df)
tbl7
```
Note that `foo` has the variable information hard-coded in the function body. Let's try some alternatives, returning to `analyze()`:
```r
lyt8 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  analyze("y", afun = mean)

tbl8 <- build_table(lyt8, df)
tbl8
```
Note that the subset of the `y` variable is passed as the `x` argument to `mean()`. We could also get the `data.frame` instead of the variable:
```r
lyt9 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  analyze("y", afun = function(df) mean(df$y))

tbl9 <- build_table(lyt9, df)
tbl9
```
which is in contrast to:
```r
lyt10 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  analyze("y", afun = function(x) mean(x))

tbl10 <- build_table(lyt10, df)
tbl10
```
where the function receives the subset of `y`.
Pattern I is an interesting one as we can add more row structure (with further splits). Consider the following table:
```
             A          B          C
--------------------------------------
U
  u1      foo(<>)    foo(<>)    foo(<>)
  u2      foo(<>)    foo(<>)    foo(<>)
  u3      foo(<>)    foo(<>)    foo(<>)
V
  v1      foo(<>)    foo(<>)    foo(<>)
  v2      foo(<>)    foo(<>)    foo(<>)
  v3      foo(<>)    foo(<>)    foo(<>)
W
  w1      foo(<>)    foo(<>)    foo(<>)
  w2      foo(<>)    foo(<>)    foo(<>)
  w3      foo(<>)    foo(<>)    foo(<>)
```
where `<>` stands for the subset of the data behind the cell. So for the cell `U > u1, A` we would have the subset:
```r
df %>% filter(r1 == "U", r2 == "u1", c1 == "A")
```
and so on. We can get this table as follows:
```r
lyt11 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  split_rows_by("r2") %>%
  summarize_row_groups(cfun = function(df, labelstr) {
    rcell(mean(df$x), format = "xx.xx", label = paste("mean x for", labelstr))
  })

tbl11 <- build_table(lyt11, df)
tbl11
```
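As a cross-check outside `rtables`, the per-cell means of such a nested table can be reproduced in base R with `aggregate()`, grouping by both row variables and the column variable. The data frame `toy` and the names `cell_means` and `one_cell` are made up for this sketch.

```r
# Hypothetical toy data: one r1 level, two r2 levels, two columns,
# two observations per cell
toy <- expand.grid(
  c1 = c("A", "B"),
  r2 = c("u1", "u2"),
  rep = 1:2,
  stringsAsFactors = FALSE
)
toy$r1 <- "U"
toy$x <- seq_len(nrow(toy))

# One row per (r1, r2, c1) combination, i.e. one row per table cell
cell_means <- aggregate(x ~ r1 + r2 + c1, data = toy, FUN = mean)
cell_means

# The same value for a single cell, computed from its explicit subset
one_cell <- subset(toy, r1 == "U" & r2 == "u1" & c1 == "A")
mean(one_cell$x)
```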
If we instead wanted to calculate two summaries per row split:
```r
s_mean_sd <- function(x) {
  in_rows("mean (sd)" = rcell(c(mean(x), sd(x)), format = "xx.xx (xx.xx)"))
}

s_range <- function(x) {
  in_rows("range" = rcell(range(x), format = "xx.xx - xx.xx"))
}

lyt12 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  split_rows_by("r2") %>%
  analyze("x", s_mean_sd, show_labels = "hidden") %>%
  analyze("x", s_range, show_labels = "hidden")

tbl12 <- build_table(lyt12, df)
tbl12
```
This table has the following structure:
```
                    A              B              C
---------------------------------------------------------
U
  u1
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  u2
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  u3
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
V
  v1
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  v2
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  v3
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
W
  w1
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  w2
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  w3
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
```
The rows `U`, `u1`, `u2`, ..., `W`, `w1`, `w2`, `w3` are label rows and the other rows (with `mean_sd` and `range`) are data rows. Currently we do not have content rows in the table. Content rows summarize the data defined by their splitting (e.g. `V > v1, B`). So if we wanted to add content rows at the `r2` split level, we would get:
```
                    A              B              C
---------------------------------------------------------
U
  u1          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  u2          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  u3          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
V
  v1          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  v2          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  v3          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
W
  w1          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  w2          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  w3          s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
```
where `s_cfun_2` is the content function, which returns either one row via `rcell()` or multiple rows via `in_rows()`. The data represented by `<>` for a content row is the same data as for its descendants, i.e. for the `U > u1, A` content row cell it is `df %>% filter(r1 == "U", r2 == "u1", c1 == "A")`. Note that content functions (`cfun`) operate only on data frames and not on vectors/variables, so they must take the `df` argument. Further, a `cfun` must also have the `labelstr` argument, which receives the label of the current split level. This way, the `cfun` can define its own row name. To get the table above we can use the layout framework as follows:
```r
s_mean_sd <- function(x) {
  in_rows("mean (sd)" = rcell(c(mean(x), sd(x)), format = "xx.xx (xx.xx)"))
}

s_range <- function(x) {
  in_rows("range" = rcell(range(x), format = "xx.xx - xx.xx"))
}

s_cfun_2 <- function(df, labelstr) {
  rcell(nrow(df), format = "xx", label = paste(labelstr, "(n)"))
}

lyt13 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  split_rows_by("r2") %>%
  summarize_row_groups(cfun = s_cfun_2) %>%
  analyze("x", s_mean_sd, show_labels = "hidden") %>%
  analyze("x", s_range, show_labels = "hidden")

tbl13 <- build_table(lyt13, df)
tbl13
```
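To see what the `(df, labelstr)` contract means in isolation, the following base R sketch mimics how a content function could be called once per split level, with the level's label passed as `labelstr`. It only imitates the calling convention and does not use `rtables`. The plain-text variant of `s_cfun_2` and the names `toy` and `content` are assumptions.

```r
# Plain-text stand-in for a content function: it receives the subset as a
# data frame and the label of the split level it summarizes
s_cfun_2 <- function(df, labelstr) {
  paste0(labelstr, " (n): ", nrow(df))
}

# Hypothetical toy data with three split levels of different sizes
toy <- data.frame(
  r1 = c("U", "U", "V", "W", "W", "W"),
  x = 1:6
)

# Call the content function once per level of r1
content <- vapply(
  split(toy, toy$r1),
  function(d) s_cfun_2(d, labelstr = as.character(d$r1[1])),
  character(1)
)
content
```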
In the same manner, if we want content rows for the `r1` split we can do it as follows:
```r
lyt14 <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  summarize_row_groups(cfun = s_cfun_2) %>%
  split_rows_by("r2") %>%
  summarize_row_groups(cfun = s_cfun_2) %>%
  analyze("x", s_mean_sd, show_labels = "hidden") %>%
  analyze("x", s_range, show_labels = "hidden")

tbl14 <- build_table(lyt14, df)
tbl14
```
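One way to check which rows of a built table are label, content, or data rows is to inspect its row metadata. The sketch below assumes that `make_row_df()` returns a data frame with `label` and `node_class` columns, as in recent `rtables` versions. The toy data and the object names are made up.

```r
library(rtables)

# Hypothetical toy data: two row groups, two columns
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  c1 = factor(rep(c("A", "B"), times = 4)),
  r1 = factor(rep(c("U", "V"), each = 4))
)

lyt_toy <- basic_table() %>%
  split_cols_by("c1") %>%
  split_rows_by("r1") %>%
  # content row: one summary per r1 group
  summarize_row_groups(cfun = function(df, labelstr) {
    rcell(nrow(df), format = "xx", label = paste(labelstr, "(n)"))
  }) %>%
  # data row: one analysis result per r1 group
  analyze("x", function(x) in_rows("mean" = rcell(mean(x), format = "xx.xx")))

tbl_toy <- build_table(lyt_toy, toy)

# One metadata row per table row, including its row class
row_info <- make_row_df(tbl_toy)
row_info[, c("label", "node_class")]
```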
In pagination, content rows and label rows are repeated if a page break falls among the descendants of a content row. So, for example, if we were to split the following table at `***`:
```
                    A              B              C
---------------------------------------------------------
U
  u1 (n)      s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
***
    range     s_range(<>)    s_range(<>)    s_range(<>)
  u2 (n)      s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
```
Then we would get the following two tables:
```
                    A              B              C
---------------------------------------------------------
U
  u1 (n)      s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
```
and
```
                    A              B              C
---------------------------------------------------------
U
  u1 (n)      s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
  u2 (n)      s_cfun_2(<>)   s_cfun_2(<>)   s_cfun_2(<>)
    mean_sd   s_mean_sd(<>)  s_mean_sd(<>)  s_mean_sd(<>)
    range     s_range(<>)    s_range(<>)    s_range(<>)
```
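The repetition rule can be sketched in base R on a flat list of rows with nesting depths. This is only an illustration of the rule described above, not how `rtables` paginates. The `rows` data frame, the `depth` encoding, and the helper `ancestors()` are hypothetical.

```r
# Rows of the example table with their nesting depth
rows <- data.frame(
  label = c("U", "u1 (n)", "mean_sd", "range", "u2 (n)", "mean_sd", "range"),
  depth = c(0, 1, 2, 2, 1, 2, 2)
)

# Hypothetical helper: indices of the label/content rows above row i,
# found by walking upwards and keeping each row that is less deeply nested
# than anything kept so far
ancestors <- function(depth, i) {
  out <- integer(0)
  target <- depth[i]
  for (j in rev(seq_len(i - 1))) {
    if (depth[j] < target) {
      out <- c(j, out)
      target <- depth[j]
    }
  }
  out
}

split_after <- 3  # page break after the first mean_sd row (the *** position)

page1 <- rows[seq_len(split_after), ]
first_on_page2 <- split_after + 1
page2 <- rows[c(ancestors(rows$depth, first_on_page2),
                first_on_page2:nrow(rows)), ]

page1$label
page2$label
```

The second page starts with the repeated `U` and `u1 (n)` rows before continuing with `range`, matching the two tables shown above.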
Let's consider the following tabulation pattern:
```
               A          B          C
------------------------------------------------
label 1     foo(x_A)   bar(x_B)   zoo(x_C)
label 2     foo(x_A)   bar(x_B)   zoo(x_C)
label 3     foo(x_A)   bar(x_B)   zoo(x_C)
```
We will discuss that in a future release of `rtables`.
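Purely to show what this pattern computes, here is a base R sketch in which each column applies its own function to its own subset. It is not an `rtables` layout and makes no claim about how the package will support this. The names `toy`, `col_funs`, `x_by_col`, and `row_values` are assumptions.

```r
# Hypothetical toy data with two observations per column level
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  c1 = factor(c("A", "A", "B", "B", "C", "C"))
)

# One function per column: foo for A, bar for B, zoo for C
col_funs <- list(A = prod, B = sum, C = mean)

# One vector of x values per column (x_A, x_B, x_C in the text)
x_by_col <- split(toy$x, toy$c1)

# Pair each column's function with that column's data
row_values <- mapply(function(f, v) f(v), col_funs, x_by_col)
row_values
```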