knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Before starting this tutorial, I want to thank the developers of the table1 and flextable packages.
Really, all credit should go to the teams maintaining these two packages.
Full disclosure: my function merely provides an easy-to-use wrapper around their packages to produce
a beautiful, publication-ready bivariate comparison table (Table 1).
Creating a descriptive sociodemographic table can be a tedious and repetitive task.
Almost every published article has such a table (usually the first, hence "Table 1"). However,
I found myself frustrated by the amount of work it took to create those tables,
especially because the task is so repetitive, and I always believed that
there had to be an easy solution for this in R. Recently I discovered the
table1::table1() function, which made life much easier.
I extended the example given by Benjamin Rich and combined it
with the ability of flextable::flextable() to get it nicely formatted into Word.
So what am I talking about, you might think? See for yourself below; it just requires one function call!
    # Only 2 mandatory arguments: a formula specifying the variables to be used,
    # and the data (see also table1::table1())
    flex_table1(
      str_formula = "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis",
      data = breast_cancer_modified,
      table_caption = c(
        "Table 1",
        "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
      )
    )
    library(datscience)
    # Additionally load Tidyverse for Data Wrangling
    library(dplyr)
    library(labelled)
    library(flextable)

    path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv"
    breast_cancer <- read.csv(path)

    breast_cancer_modified <- breast_cancer |>
      mutate(Diagnosis = factor(Classification,
        labels = c("Healthy Controls", "Breast Cancer Patients")
      )) |>
      select(-Classification) |>
      mutate(random_Smoking = sample(c("Smoker", "Non-Smoker", "Occasionally"),
        size = nrow(breast_cancer),
        replace = TRUE,
        prob = c(0.49, 0.49, 0.02)
      )) |>
      # Convert the Randomly Generated Variable into a Factor
      mutate(random_Smoking = factor(random_Smoking))

    var_label(breast_cancer_modified$random_Smoking) <- "Fictional Smoking Status"

    ft <- flex_table1(
      str_formula = "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis",
      data = breast_cancer_modified
      # table_caption = c(
      #   "Table 1",
      #   "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
      # )
    )
    ft |> set_table_properties(layout = "autofit", width = 1)
In published articles, this kind of table (above) is usually the first table (hence the name Table 1). The example above shows the descriptive statistics of two subgroups of the sample (i.e., a bivariate descriptive comparison). The function also automatically conducts group comparisons and, if desired, corrects the obtained p-values.
≤ 1). See also the "Details" section in the documentation of flex_table1()
on why the function defaults to no correction for multiple
comparisons and uses the $N-1$ version of the $\chi^2$ test for cases with
an expected cell count above 1 (instead of Fisher's exact test for
expected cell counts below 5).
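To make the test mentioned above more concrete, here is a minimal base R sketch of the $N-1$ $\chi^2$ test for a 2 x 2 table. It multiplies the uncorrected Pearson statistic by $(N-1)/N$. The helper n_minus_1_chisq() and the example counts are hypothetical and serve illustration only; this is not necessarily how flex_table1() computes the test internally.

```r
# Hypothetical helper, not part of datscience: "N - 1" chi-squared test
# for a 2 x 2 contingency table.
n_minus_1_chisq <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  # Uncorrected Pearson chi-squared test (no Yates continuity correction).
  # Warnings about small expected counts are suppressed because the
  # smallest expected count is returned explicitly below.
  pearson <- suppressWarnings(stats::chisq.test(tab, correct = FALSE))
  statistic <- unname(pearson$statistic) * (n - 1) / n
  df <- unname(pearson$parameter)
  list(
    statistic = statistic,
    df = df,
    p_value = stats::pchisq(statistic, df = df, lower.tail = FALSE),
    min_expected = min(pearson$expected)
  )
}

# Made-up counts: rows = group, columns = smoking status
tab <- matrix(
  c(12, 5, 7, 9),
  nrow = 2,
  dimnames = list(
    Group = c("Healthy Controls", "Patients"),
    Smoking = c("Smoker", "Non-Smoker")
  )
)

res <- n_minus_1_chisq(tab)
res$min_expected # smallest expected cell count
res$p_value      # p-value of the N - 1 chi-squared test
```

Because the statistic is scaled down by $(N-1)/N$, the resulting p-value is slightly larger than that of the uncorrected Pearson test.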
I will show you how to create such a table and get it into Word in just a few simple steps.
So how do we get there? Just follow these few easy steps.
We will primarily use the datscience package, as well as a bit of dplyr and labelled, to
prepare the data. However, please note that the function I am showcasing here benefits heavily
from the two packages table1 and flextable (as described above).
If you want to customize the appearance
of your table, I recommend checking out the sources above. The function and vignette
provided here merely serve as a convenience, making the creation of a bivariate
comparison table (Table 1) a less time-consuming task.
Note: If you want to further customize a table created with datscience::flex_table1(),
you can additionally use the functions provided by flextable, because flex_table1()
simply returns a flextable object that you can modify as you desire.
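As a minimal sketch of such a customization, the following uses a plain flextable built from a built-in data set as a stand-in for the object returned by flex_table1() (an assumption made so that the example runs without datscience). The same flextable functions can be applied to the real table.

```r
library(flextable)

# Stand-in for the object returned by flex_table1(): any flextable works here
ft <- flextable(head(mtcars[, 1:4]))

ft_custom <- ft |>
  # Bold header row
  bold(part = "header") |>
  # Smaller font in the body
  fontsize(size = 9, part = "body") |>
  # Center all cells
  align(align = "center", part = "all") |>
  # Add a note below the table
  add_footer_lines("Note. Illustrative data only.")

# The result is still a flextable and can be modified or saved further
class(ft_custom)
```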
    library(datscience)
    # Additionally load dplyr and labelled for data wrangling
    library(dplyr)
    library(labelled)
    # flextable provides flextable() and set_table_properties(), which are used below
    library(flextable)
    # Set a seed because we will generate random data with sample() below
    set.seed(123)
The data we are using for this tutorial [Patricio, 2018] is publicly available on the homepage of the UCI Machine Learning Repository.
To read in the data, we can simply use the base R function read.csv().
    # Load the data
    path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv"
    breast_cancer <- read.csv(path)
Now, for demonstration purposes, we will slightly modify the original data set.
We will start by converting the column Classification
to a factor and assigning the correct labels, obtained from the UCI repository:
Labels: 1 = Healthy controls, 2 = Patients
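The mapping from codes to labels can be illustrated with a small base R sketch. The vector classification below is made up and merely stands in for the Classification column; the levels are spelled out explicitly so that each code is unambiguously tied to its label.

```r
# Made-up codes standing in for the Classification column
classification <- c(1, 2, 2, 1, 2)

diagnosis <- factor(
  classification,
  levels = c(1, 2),
  labels = c("Healthy Controls", "Breast Cancer Patients")
)

# Cross-tabulate codes and labels to verify the mapping
table(classification, diagnosis)
```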
Additionally, we will generate a new fictional variable called random_Smoking
with three factor levels: c("Smoker", "Non-Smoker", "Occasionally").
Please note that this variable is not part of the original data; it was fabricated solely to illustrate the utility of the flex_table1()
function for comparing factors.
    # Modify the Data
    breast_cancer_modified <- breast_cancer |>
      # Create a Factor Variable from Classification Column
      # And add Labels to the Factor Levels
      mutate(Diagnosis = factor(Classification,
        labels = c("Healthy Controls", "Breast Cancer Patients")
      )) |>
      # Remove the old Variable
      select(-Classification) |>
      # Generate a New Fictional Categorical Variable: random_Smoking
      mutate(random_Smoking = sample(c("Smoker", "Non-Smoker", "Occasionally"),
        size = nrow(breast_cancer),
        replace = TRUE,
        prob = c(0.49, 0.49, 0.02)
      )) |>
      # Convert the Randomly Generated Variable into a Factor
      mutate(random_Smoking = factor(random_Smoking))

    # Giving the new fictional column a name to be shown in the table
    var_label(breast_cancer_modified$random_Smoking) <- "Fictional Smoking Status"
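One detail worth keeping in mind: because the level "Occasionally" is drawn with a probability of only 0.02, it may not occur at all in a small sample, and factor() would then omit it from the levels. The following base R sketch shows how passing the levels explicitly keeps all three levels, and how set.seed() makes the draw reproducible. The helper draw_smoking() and the sample size of 100 are assumptions made for illustration.

```r
smoking_levels <- c("Smoker", "Non-Smoker", "Occasionally")

# Hypothetical helper: draw n fictional smoking statuses
draw_smoking <- function(n) {
  sample(smoking_levels, size = n, replace = TRUE, prob = c(0.49, 0.49, 0.02))
}

# Same seed, same draw
set.seed(123)
first_draw <- draw_smoking(100)
set.seed(123)
second_draw <- draw_smoking(100)
identical(first_draw, second_draw)

# Explicit levels keep rare categories even if they were never drawn
smoking <- factor(first_draw, levels = smoking_levels)
table(smoking)
```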
Let's take a quick look at the data to get an overview.
This also gives us the opportunity to showcase another
useful function of the datscience package: format_flextable().
    # Determine the number of decimal places by convention
    nod <- get_number_of_decimals(nrow(breast_cancer_modified))
    # head() returns six rows by default, so request five to match the caption
    head(breast_cancer_modified, n = 5) |>
      # Round numeric columns
      mutate(across(where(is.numeric), ~ round(.x, nod))) |>
      flextable() |>
      format_flextable(table_caption = c("Table 2", "First Cases (N = 5) of the Cancer Patients Data Set"))
Additionally, let's inspect distributional parameters and descriptive statistics
with psych::describe() and format the output
again with datscience::format_flextable(). All outputs generated here can also be
conveniently saved to Word with datscience::save_flextable().
Descriptive Statistics and Distribution
    breast_cancer_modified |>
      # Remove categorical variables
      select(where(is.numeric)) |>
      # Get descriptives of metric variables
      psych::describe() |>
      as.data.frame() |>
      round(nod) |>
      # Add row names (names of the variables) as a separate column
      tibble::rownames_to_column(var = "Variable") |>
      select(-vars) |>
      # Generate flextable
      flextable() |>
      # Format flextable
      format_flextable(table_caption = c("Table 3", "Complete Sample - Numeric Variables: Distribution and Descriptive Stat."))
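A flextable can also be written to Word with flextable's own save_as_docx(). The sketch below is an alternative to datscience::save_flextable(), whose arguments are not shown in this tutorial; the example table, its caption, and the temporary file path are assumptions for illustration.

```r
library(flextable)

# Small example table standing in for the descriptive table above
ft <- flextable(head(iris)) |>
  set_caption("Example caption") |>
  autofit()

# Write to a temporary .docx file; replace this with a path of your choice
out_file <- tempfile(fileext = ".docx")
save_as_docx(ft, path = out_file)

# Confirm that the document was created
file.exists(out_file)
```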
The function datscience::flex_table1() provides a convenient API for creating Table 1
with a bivariate group comparison. There are only two mandatory arguments:
data: the data set, here data = breast_cancer_modified
str_formula: a formula passed as a string. It starts with "~", followed by the variables to be shown in the rows, joined by plus signs (e.g., "Age + BMI"). Lastly, the groups to be compared are specified after a vertical bar (e.g., "| Diagnosis").
For more details, please refer to the table1 documentation. Let's say that for our cancer sample we want to know whether the groups differ in the following variables: "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis"
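When many variables are involved, the formula string can also be assembled programmatically. The following base R sketch builds the string from a vector of row variables and a grouping variable; the names row_vars and group_var are chosen for illustration only, and the spacing of the result differs slightly from the hand-written string above.

```r
row_vars <- c("random_Smoking", "Glucose", "Insulin", "Leptin", "Age", "BMI")
group_var <- "Diagnosis"

# "~ var1 + var2 + ... | group"
str_formula <- paste0("~ ", paste(row_vars, collapse = " + "), " | ", group_var)
str_formula

# Check that the string parses as a valid one-sided formula
parsed <- stats::as.formula(str_formula)
all.vars(parsed)
```

The resulting string can then be passed as str_formula, assuming the function parses it as a regular formula in the same way table1 does.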
And that's it. In the function call below, we included the argument ref_correction = TRUE,
which marks p-values where the respective correction was applied. TRUE
is also the default value, however; you can change this behavior by passing FALSE.
Creation of the Table
    flex_table1(
      str_formula = "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis",
      data = breast_cancer_modified,
      table_caption = c(
        "Table 4",
        "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
      ),
      ref_correction = TRUE
    ) |>
      # Only for markdown do we additionally need to fix the autofit property; this step is not
      # needed when saving directly to Word, e.g. with datscience::save_flextable()
      set_table_properties(layout = "autofit", width = 1)
Please note that, as of version 0.2.3 of datscience, one can also compare multiple groups with the flex_table1() function (see also News.md).
[Patricio, 2018] Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Gomes, M., Seiça, R., & Caramelo, F. (2018). Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18(1)