% Easy Demographics Tables
% Toby Johnson Toby.x.Johnson@gsk.com
% r format(Sys.time(), "%d %B %Y")
suppressPackageStartupMessages(library(knitr))
A common component of both clinical and statistical genetics analyses is the generation of "demographics tables". From a statistical or programming perspective, such tables are nothing more than summaries of (typically several) ``demographic'' variables, summarized for one or more groups of subjects. A classic use case is summarising baseline characteristics by treatment arm in a randomized clinical study. However, under the general definition used here, "demographics tables" could summarize analysis endpoints, explanatory variables such as genotypes, covariates, or indeed any other variable, by any set of groups. The groups may be disjoint or may be overlapping, for example groups by genotype at a variant of interest, or one group for all subjects in a clinical study and a second group for the subset of subjects included in a genetic analysis.
The demographics
functions in the gtx package aim to make
generation of demographics tables as easy as possible, given the
following assumptions:
Surv
class for survival endpointAll the demographics
functions output character arrays, which are
easily read in a fixed width font, and which may be easily inserted
into dynamically generated documents using e.g. knitr::kable
or
xtable::xtable
.
Under these assumtions, with analysis variables in a data frame x
,
generating demographics by levels of the TRTGRP
variable in x
can be done as easily as:
demographics(x, by = "TRTGRP")
For a more complicated example, generating demographics by two overlapping groups that are defined on the fly, could be done quite easily using:
demographics(x, by = list("All subjects" = rep(TRUE, nrow(x)), Genotyped = x$ID %in% genotypes$ID))
For numerical summaries, different styles and decimal precisions can be specified in a straightforward and intuitive way, both globally (here for all numeric variables) and for specific variables (here for AGE):
demographics(x, by = list("All subjects" = rep(TRUE, nrow(x)), Genotyped = x$ID %in% genotypes$ID), style = list(numeric = "Median (IQR)", AGE = "Median (Range)"), digits = list(numeric = 2, AGE = 0))
The following sections describe in more detail how to use the demographics
functions.
Specifying the demographic variables to be summarized is easy: Put
them all in a dataframe and pass this as the first argument to
demographics()
. To summarize a restricted set of variables, the
data frame can be subsetted using base R functions subset
or [
before passing to demographics()
. If a single variable is to be
summarized, it is generally better to pass it to demographics()
as a
single column data frame (subset with drop=FALSE
).
The behaviour of demographics()
depends on the class(es) of its
arguments. Currently, meaningful summaries are only produced for
classes data.frame
, logical
, factor
, numeric
and Surv
. In
particular, no meaningful summary is produced for arguments of
character
class. Therefore, variables in the input data frame must
be converted or coerced to appropriate classes.
An option clinical.descriptors
can be used to provide a named
list of descriptive names to substitute for the actual
variable names.
An example of preparing suitable input, using the aoex1
example
dataset:
suppressPackageStartupMessages(library(gtx)) options(clinical.usubjid = "USUBJID") suppressPackageStartupMessages(library(survival)) data(aoex1) aoex1 <- within(aoex1, { SEX <- factor(SEX) # convert to factor Male <- SEX == "Male" # logical OS <- Surv(SRVDY*12/365.25, SRVCFLCD) # survival, in months Genotyped <- !is.na(rs123456) # logical }) options(clinical.descriptors = list(SEX = "Gender", OS = "Overall survival, Months", AGE = "Age, Years", bmi = "BMI, kg/m2"))
The examples above illustrate the two principal ways that the by
argument can be used, either as a character vector or as a list of
logical vectors:
by
is a character vector, it is interpreted as names of columns
in x
. These columns must be either of logical or factor
vectors. Each logical vector is used to define exactly one group.
Each factor vector is used to define one group per level of the
factor.by
is a list of logical vectors, each is used to define exactly
one group. This second approach allows group membership to be
determined without constructing additional columns in x
and also
allows finer control of the group titles.The following demonstrations all use the aoex1
dataset as modified above.
The knitr::kable
function converts the character matrix produced by
demographics()
into a markdown table that is rendered nicely to
appear in this vignette.
In the first demonstration, we take a subset of rows in aoex1
for
subjects in study AO000002 only, and a subset of columns for the
variables we want to summarize. We define groups by TRTGRP
which is
one of the factors in the data frame. Note that because TRTGRP
is
used to define groups it is automatically excluded from the variables
actually summarized. This demonstration also illustrates a difference
between the summaries produced for factors (such as SEX
) and for
logicals (such as Male
).
kable(demographics(subset(aoex1, STUDYID == "AO000002", select = c("SEX", "AGE", "Male", "bmi", "TRTGRP")), by = "TRTGRP"))
The summary style can be chosen by passing a named list as the
optional style
argument, where the names of the list must match the
name of the variable for which the style is to be used (here bmi
).
The style can be an arbitrary character string, and is converted to
produce the summary by simple text substitution. Currently, the
following case sensitive patterns are substituted with numeric data:
"Mean", "Median", "SD", "Q1", "Q3", "Min", "Max", "IQR", "Range".
Note that the style string is also used as part of the row label.
Numerical precision can be controlled in a similar fashion by passing
a named list as the options digits
argument.
kable(demographics(subset(aoex1, STUDYID == "AO000002", select = c("SEX", "AGE", "bmi", "TRTGRP")), by = "TRTGRP", style = list(bmi = "Median (Range)"), digits = list(bmi = 1)))
For tabulated count data (objects of class factor
), the style string
substitutes case sensitively the patterns "N", "pc0", "pc1", and
"pc2" and "pc", which are respectively the count, percentages to 0, 1
and 2 decimals, and percentage to digits
decimals. The default
style is "N (pc0%)". When the row label is constructed from the style
string, any "pc[012]%" patterns are changed to "%". The style
commonly used in clinical statistics (without any percent sign in the
data columns) is like "N (pc1)".
kable(demographics(subset(aoex1, STUDYID == "AO000002", select = c("SEX", "AGE", "bmi", "TRTGRP")), by = "TRTGRP", style = list(SEX = "N (pc1%)")))
The style
argument can also include elements named for classes of
variables, namely logical, factor, numeric and Surv. If a variable
being sumamrized does not match style
but does match by class, the
specified style will be used. The digits
argument works similarly.
Hence, in the following, the "Mean (IQR)" style and 0 decimal precision
will be used for all numeric variables except bmi
, for which the
"Median (IQR)" style and 1 decimal precision will be used.
kable(demographics(subset(aoex1, STUDYID == "AO000002", select = c("SEX", "AGE", "bmi", "TRTGRP")), by = "TRTGRP", style = list(bmi = "Median (IQR)", numeric = "Mean (IQR)"), digits = list(bmi = 1, numeric = 0)))
Multiple variables in the data frame can be used to construct
subgroups, and they can be a mixture of factors (such as STUDYID
)
and logicals (such as Genotyped
). Note that factors generate one
group per level in the factor, and logicals generate exactly one group
corresponding to subjects for which the logical is true.
kable(demographics(subset(aoex1, select = c("bmi", "TRTGRP", "OS", "STUDYID", "Genotyped")), by = c("STUDYID", "Genotyped")))
Subgroups can also be constructed as a named list of logical vectors.
One advantage of this method is that the group names that appear in
the column names are constructed using the names of the list. This
allows finer control than the "VARIABLE=value" column names generated
in the previous examples. The names have to be quoted if they contain
spaces. This demonstration includes a deliberately empty group, which
is automatically suppressed in the output (unless
options(demographics.omit0groups = FALSE)
is used).
bygroups <- list("All subjects" = rep(TRUE, nrow(aoex1)), "Study 1" = aoex1$STUDYID == "AO000001", "Study 2" = aoex1$STUDYID == "AO000002", "Men from Mars" = rep(FALSE, nrow(aoex1))) kable(demographics(subset(aoex1, select = c("bmi", "TRTGRP")), by = bygroups))
If no subgroups are specified deliberately, or if subgroups are
improperly specified (names of variables not present in the first
argument, logicals of the wrong length, etc.), then a warning message
is issued and a failsafe group called "All subjects" is used. This
demonstration also illustrates demographics()
called on the whole
aoex1
dataframe.
kable(demographics(aoex1, by = "MisTypEdNAme"))
Variables of class Surv
are handled in essentially a similar way to
numeric variables. A style argument can be used, and the following
case sensitive patterns are substituted with numeric data: "Events",
"Median", "95% LCL", "95% UCL", "95% CI". Note that some of these
patterns contain spaces. This demonstration also illustrates that
because styles work by text substitution, it is easy to change the
style to use square brackets instead of parantheses.
kable(demographics(subset(aoex1, select = c("bmi", "TRTGRP", "OS", "STUDYID", "Genotyped")), by = bygroups, style = list(Surv = "Median [95% CI]", numeric = "Mean [Range]")))
demographics
is a generic function and uses S3 method dispatching.
All the examples and demonstrations above use the
demographics.data.frame()
method. This method does some checking
and conversion of the by
argument, finds the most relevant element
in the optional style
argument, and then calls the generic
demographics
for each variable in the data frame in turn.
Therefore, the user can in principle define new classes and write
demographics
methods for those classes.
The demographics
functions have some behaviours that are controlled
by options that can be set by the user. Currently, these control
whether rows where all groups have zero count are omitted
(demographics.omit0rows, default TRUE), and whether groups with no
subjects are omitted (demographics.omit0groups, default TRUE).
In the future it is planned to provide automatic collapsing of levels for factors with many levels.
demographics
functions?Some of what the demographics
functions do is provided by existing
functions such as by
, aggregate
and summary.data.frame
.
However, none of these functions can do in one or two lines what the
demographics
functions do, especially with regard to producing near
publication ready output. In particular, the demographics
functions
allow fine user control of how numerical data are summarised, and can
summarise survival endpoints in the manner widely used in the clinical
literature.
No attempt has currently been made to accomodate the finer aspects of publication readiness. Typically, some fine adjustment of spacing will be needed, but this adjustment will depend on factors such as the font used in the final publication, and therefore is difficult to implement within the R package.
This document is written in R Markdown, using the Pandoc extended and revised version of Markdown. To make persistent edits you need to edit the source (.Rmd) file, not any other files (such as the .md, .html or .docx file).
To compile this document as a .docx (document), run the following commands in R:
library(knitr) knit("demographics.Rmd")` pandoc("demographics.md", format = "docx")
To compile this document as a .html (web page), run the following commands in R:
library(knitr) knit2html("demographics.Rmd")
This document was complied on r date()
using r R.version.string
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.