library(knitr)
knitr::opts_chunk$set(comment="", message=FALSE, warning=FALSE, fig.align="center", fig.height=10, fig.width=10, tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=1200))
options(width=150)
library(kableExtra)
tibble class data sets, possibly imported by readr, etc., or to be used by dplyr packages, are supported.
A new function called descrTable has been implemented to build descriptive tables in a single step.
export2md, which exports descriptive tables to R-markdown documents, has been improved and now supports stratified tables for HTML.
A new function called strataTable has been implemented to build descriptive tables by strata (values or levels of a variable).
Date variables are treated as continuous non-normal, reporting medians and quartiles and using non-parametric tests, but they are now printed as dates.
A var.equal argument has been added to descrTable. It allows different variances to be considered when comparing means between more than two groups.
The compareGroups package [@Subirana2014] allows users to create tables displaying results of univariate analyses, stratified or not by categorical variable groupings.
Tables can easily be exported to CSV, LaTeX, HTML, PDF, Word or Excel, or inserted in R-markdown files to generate reports automatically.
This package can be used from the R prompt or from a user-friendly graphical user interface for users who are not familiar with R.
The compareGroups package is available on the CRAN repository. To load the package from the R prompt, enter library(compareGroups).
This document provides an overview of the usage of the compareGroups package with real examples, using both the R syntax and the graphical user interface. It is structured as follows:
The compareGroups package has three functions:
compareGroups creates an object of class compareGroups. This object can be:
createTable creates an object of class createTable. This object can be:
export2csv, export2html, export2latex, export2pdf, export2md, export2word and export2xls will export results to CSV, HTML, LaTeX, PDF, Markdown, Word or Excel, respectively.
Figure 1 shows how the package is structured in terms of functions, classes and methods.
Since version 4.0, a new function called descrTable has been implemented, which is a shortcut for compareGroups followed by createTable, i.e. step 1 and step 2 in a single step (see section 4.2.5).
To illustrate how this package works we sampled 85% of the data from the participants in the PREDIMED study [@PREDIMED]. PREDIMED is a multicenter trial in Spain in which participants who were at high cardiovascular risk, but with no cardiovascular disease at enrolment, were randomly assigned to one of three diets: a Mediterranean diet supplemented with extra-virgin olive oil (MedDiet+VOO), a Mediterranean diet supplemented with mixed nuts (MedDiet+Nuts), or a control diet (advice to reduce dietary fat). Participants received quarterly individual and group educational sessions and, depending on group assignment, free provision of extra-virgin olive oil, mixed nuts, or small non-food gifts. The primary end point was the rate of major cardiovascular events (myocardial infarction, stroke, or death from cardiovascular causes).
First of all, load the PREDIMED data by typing data(predimed).
Variables and labels in this data frame are:
dicc <- data.frame(
  "Name"=I(names(predimed)),
  "Label"=I(unlist(lapply(predimed, attr, which="label", exact=TRUE))),
  "Codes"=I(unlist(lapply(predimed, function(x) paste(levels(x),collapse="; "))))
)
dicc$Codes <- sub(">=","$\\\\geq$",dicc$Codes)
kable(dicc, align=rep("l",4), row.names=FALSE, format = "html")
It is important to note that compareGroups is not intended to perform quality control of the data. Other useful packages such as r2lh [@r2lh] are available for this purpose.
It is strongly recommended that the data.frame contain only the variables to be analyzed; the ones not needed in the present analysis should be removed from the list.
The nature of the variables to be analyzed should be known, or at least which variables are to be used as categorical. It is important to code categorical variables as factors; the order of their levels is meaningful in this package.
To label the variables, set the "label" attribute of each of them. The tables of results will contain the variable labels (by default).
A variable of class
Surv must be created to deal with time-to-event variables (i.e., time to Cardiovascular event/censored in our example):
predimed$tmain <- with(predimed, Surv(toevent, event == 'Yes'))
attr(predimed$tmain,"label") <- "AMI, stroke, or CV Death"
Note that tmain is created as the time to cardiovascular event, taking censoring into account (i.e., it is of class Surv).
compareGroups is the main function and performs all the calculations. Its results must be stored in an object. Later, applying the function createTable (Section 4.2) to this object will create tables of the analysis results.
For example, to perform a univariate analysis with the predimed data between group ("response" variable) and all other variables ("explanatory" variables), this formula is required:
compareGroups(group ~ . , data=predimed)
If only a dot occurs on the right side of the ~, all variables in the data frame will be used.
To remove the variables toevent and event from the analysis:
compareGroups(group ~ . -toevent - event, data=predimed)
To select some explanatory variables (e.g., age, sex, smoke, waist and hormo) and store the results in an object of class compareGroups:
res<-compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed)
res
Note: Although we have full data (n=6324) for Age, Sex and Waist circumference, there are some missing data in Hormone-replacement therapy (probably male participants).
Diet groups show some differences in Smoking and Hormone-replacement therapy, although these do not reach statistical significance (p-value=0.714 and 0.859, respectively); Age, Sex and Waist circumference are clearly different.
Age and Waist circumference have been treated as continuous and normally distributed; Sex, Smoking and Hormone-replacement therapy as categorical.
No filters have been used (e.g., selecting only treated patients); therefore, the selection column lists "ALL" (for all variables).
To perform the analysis in a subset of participants (e.g., "female" participants):
compareGroups(group ~ age + smoke + waist + hormo, data=predimed, subset = sex=='Female')
Note that only results for female participants are shown.
To subset specific variable/s (e.g., hormo and waist):
compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed, selec = list(hormo= sex=="Female", waist = waist>20 ))
Combinations are also allowed, e.g.:
compareGroups(group ~ age + smoke + waist + hormo, data=predimed, selec = list(waist= !is.na(hormo)), subset = sex=="Female")
A variable can appear twice in the formula, e.g.:
compareGroups(group ~ age + sex + bmi + bmi + waist + hormo, data=predimed, selec = list(bmi.1=!is.na(hormo)))
In this case results for bmi will be reported for all participants (n= 6324) and also for only those with no missing in Hormone-replacement therapy (!is.na(hormo)). Note that "bmi.1" in the
selec argument refers to the second time that bmi appears in the formula.
By default continuous variables are analyzed as normal-distributed. When a table is built (see
createTable function, Section 4.2), continuous variables will be described with mean and standard deviation. To change default options, e.g., "waist" used as non-normal distributed:
compareGroups(group ~ age + smoke + waist + hormo, data=predimed, method = c(waist=2))
Note that "continuous non-normal" is shown in the method column for the variable Waist circumference.
Possible values in the method statement are:
1: forces analysis as normal-distributed
2: forces analysis as continuous non-normal
3: forces analysis as categorical
NA: performs a Shapiro-Wilk test to decide between normal or non-normal
If the method argument is stated as NA for a variable, then a Shapiro-Wilk test for normality is used to decide whether the variable is normally or non-normally distributed. To change the significance threshold:
compareGroups(group ~ age + smoke + waist + hormo, data=predimed, method = c(waist=NA), alpha= 0.01)
According to the Shapiro-Wilk test, with the cutpoint set at the 0.01 level, "Waist circumference" departed significantly from the normal distribution and therefore the method for this variable will be "continuous non-normal".
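The decision rule described for method = NA can be sketched in base R. This is only an illustration, assuming the choice reduces to comparing the Shapiro-Wilk p-value with alpha; the helper choose_method and the simulated vector skewed are hypothetical and are not part of compareGroups.

```r
# Hypothetical helper (not part of compareGroups): mimic the method = NA rule
# by comparing the Shapiro-Wilk p-value against the alpha threshold.
choose_method <- function(x, alpha = 0.05) {
  p <- shapiro.test(x)$p.value
  if (p < alpha) "continuous non-normal" else "continuous normal"
}

set.seed(123)
skewed <- rexp(200)  # simulated, clearly right-skewed values (not PREDIMED data)

choose_method(skewed, alpha = 0.01)
```

A stricter alpha (such as 0.01) makes it harder to reject normality, so fewer variables end up being treated as non-normal.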
All non-factor variables are considered continuous. An exception is made (by default) for those that have fewer than 5 different values. This threshold can be changed with the min.dis statement:
predimed$age7gr<-as.integer(cut(predimed$age, breaks=c(-Inf,55,60,65,70,75,80,Inf), right=TRUE))
compareGroups(group ~ age7gr, data=predimed, method = c(age7gr=NA))
compareGroups(group ~ age7gr, data=predimed, method = c(age7gr=NA), min.dis=8)
To avoid errors, the maximum number of categories for the response variable is set at 5 in this example (default value). If this variable has more than 5 different values, the function compareGroups returns an error message. For example:
compareGroups(age7gr ~ sex + bmi + waist , data=predimed)
The default setting can be changed with the max.ylev statement:
compareGroups(age7gr ~ sex + bmi + waist, data=predimed, max.ylev=7)
Similarly, by default there is a limit for the maximum number of levels for an explanatory variable. If this level is exceeded, the variable is removed from the analysis and a warning message is printed:
compareGroups(group ~ sex + age7gr, method = c(age7gr=3), data=predimed, max.xlev=5)
Although the options described in this section correspond to the compareGroups function, the results of changing or setting them won't be visible until the table is created with the createTable function (explained later).
include.label: By default the variable labels are shown in the output (if there is no label the name will be printed). Changing the statement include.label from "= TRUE" (default) to "= FALSE" will cause variable names to be printed instead.
compareGroups(group ~ age + smoke + waist + hormo, data=predimed, include.label= FALSE)
Q1, Q3: When the method for a variable is stated as "2" (i.e., to be analyzed as continuous non-normal; see section 4.1.3), by default the median and quartiles 1 and 3 will be shown in the final results, after applying the function createTable (see Section 4.2).
resu1<-compareGroups(group ~ age + waist, data=predimed, method = c(waist=2))
createTable(resu1)
Note: percentiles 25 and 75 are calculated for "Waist circumference".
To get instead percentile 2.5% and 97.5%:
resu2<-compareGroups(group ~ age + smoke + waist + hormo, data=predimed, method = c(waist=2), Q1=0.025, Q3=0.975)
createTable(resu2)
Note: percentiles 2.5% and 97.5% are calculated for "Waist circumference".
To get minimum and maximum:
compareGroups(group ~ age + smoke + waist + hormo, data=predimed, method = c(waist=2), Q1=0, Q3=1)
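The effect of the Q1 and Q3 settings can be previewed with base R quantiles. This is a minimal sketch, assuming Q1 and Q3 are evaluated as sample quantiles at the stated probabilities; the exact quantile algorithm used inside compareGroups may differ from the base R default, and waist_sim is simulated rather than PREDIMED data.

```r
set.seed(1)
waist_sim <- rnorm(100, mean = 100, sd = 10)  # simulated values, not PREDIMED data

# Default setting (Q1 = 0.25, Q3 = 0.75): first and third quartiles
quantile(waist_sim, probs = c(0.25, 0.75))

# Q1 = 0.025 and Q3 = 0.975: percentiles 2.5% and 97.5%
quantile(waist_sim, probs = c(0.025, 0.975))

# Q1 = 0 and Q3 = 1: the minimum and the maximum
extremes <- quantile(waist_sim, probs = c(0, 1))
extremes
```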
simplify: Sometimes a categorical variable has no individuals for a specific group. For example, smoker has 3 levels. As an example and to illustrate this problem, we have created a new variable smk with a new category ("Unknown"):
predimed$smk<-predimed$smoke
levels(predimed$smk)<- c("Never smoker", "Current or former < 1y", "Never or former >= 1y", "Unknown")
attr(predimed$smk,"label")<-"Smoking 4 cat."
cbind(table(predimed$smk))
Note that this new category ("Unknown") has no individuals:
compareGroups(group ~ age + smk + waist + hormo, data=predimed)
Note that a "Warning" message is printed related to the problem with smk.
To avoid using empty categories,
simplify must be stated as
TRUE (Default value).
compareGroups(group ~ age + smk, data=predimed, simplify=FALSE)
Note that a "warning" message is shown and no p-values are calculated for "Smoking".
Applying the summary function to an object of class compareGroups will obtain a more detailed output:
res<-compareGroups(group ~ age + sex + smoke + waist + hormo, method = c(waist=2), data=predimed)
summary(res[c(1, 2, 4)])
Note that because only variables 1, 2 and 4 are selected, only results for Age, Sex and Waist circumference are shown. Age is summarized by the mean and the standard deviation, Sex by frequencies and percentage, and Waist circumference (method =2) by the median and quartiles.
Variables can be plotted to see their distribution. Plots differ according to whether the variable is continuous or categorical. Plots can be seen on-screen or saved in different formats (BMP, JPG, PNG, TIF or PDF). To specify the format use the argument 'type'.
plot(res[c(1,2)], file="./figures/univar/", type="png")
Plots can also be drawn according to the grouping variable. In this case only a boxplot is shown for continuous variables:
plot(res[c(1,2)], bivar=TRUE, file="./figures/bivar/", type="png")
The object from
compareGroups can later be updated. For example:
res<-compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed)
res
res is updated using:
res<-update(res, . ~. - sex + bmi + toevent, subset = sex=='Female', method = c(waist=2, toevent=2), selec = list(bmi=!is.na(hormo)))
res
Note that "Sex" is removed as an explanatory variable but used as a filter, subsetting only "Female" participants. Variable "Waist circumference" has been changed to "continuous non-normal". Two new variables have been added: Body mass index and Follow-up (stated continuous non-normal). For Body mass index, only data of participants with non-missing values in Hormone-replacement therapy are shown.
Since version 3.0, there is a new function called
getResults to retrieve some specific results computed by
compareGroups, such as p-values, descriptives (means, proportions, ...), etc.
For example, it may be interesting to recover the p-values for each variable as a vector to further manipulate it in R, such as adjusting for multiple comparisons with p.adjust. For example, let's take the example data from the SNPassoc package, which contains information on dozens of SNPs (genetic variants) from a sample of cases and controls. In this case we analyze five of them:
library(SNPassoc)
data(SNPs)
tab <- createTable(compareGroups(casco ~ snp10001 + snp10002 + snp10005 + snp10008 + snp10009, SNPs))
pvals <- getResults(tab, "p.overall")
p.adjust(pvals, method = "BH")
When the response variable is binary, the Odds Ratio (OR) can be printed in the final table. If the response variable is time-to-event (see Section 3.1), the Hazard Ratio (HR) can be printed instead.
ref: This statement can be used to change the reference category:
res1<-compareGroups(htn ~ age + sex + bmi + smoke, data=predimed, ref=1)
createTable(res1, show.ratio=TRUE)
Note that for categorical explanatory variables the reference category is the first one. The reference category can also be stated for each variable:
res2<-compareGroups(htn ~ age + sex + bmi + smoke, data=predimed, ref=c(smoke=1, sex=2))
createTable(res2, show.ratio=TRUE)
Note that the reference category for Smoking status is the first and for Sex the second.
ref.no: Similarly to the ref argument, ref.no is used to state "no" as the reference category for all variables with this category:
res<-compareGroups(htn ~ age + sex + bmi + hormo + hyperchol, data=predimed, ref.no='NO')
createTable(res, show.ratio=TRUE)
Note: "no", "No" or "NO" will produce the same results; the coding is not case sensitive.
fact.ratio: By default OR or HR for continuous variables are calculated for each unit increase. This can be changed with the fact.ratio argument:
res<-compareGroups(htn ~ age + bmi, data=predimed)
createTable(res, show.ratio=TRUE)
Here the OR is for the increase of one unit for Age and Body mass index.
res<-compareGroups(htn ~ age + bmi, data=predimed, fact.ratio= c(age=10, bmi=2))
createTable(res, show.ratio=TRUE)
Here the OR is for the increase of 10 years for Age and 2 units for "Body mass index".
ref.y: By default when OR or HR are calculated, the reference category for the response variable is the first. The reference category can be changed using the ref.y argument:
res<-compareGroups(htn ~ age + sex + bmi + hyperchol, data=predimed)
createTable(res, show.ratio=TRUE)
Note: This output shows the OR of having hypertension. Therefore, "Non-hypertension" is the reference category.
res<-compareGroups(htn ~ age + sex + bmi + hyperchol, data=predimed, ref.y=2)
createTable(res, show.ratio=TRUE)
Note: This output shows the OR of having No hypertension, and 'Hypertension' is now the reference category.
When the response variable is of class
Surv, the bivariate
plot function returns a Kaplan-Meier figure if the explanatory variable is categorical. For continuous variables the function returns a line for each individual, ending with a circle for censored and with a plus sign for uncensored.
plot(compareGroups(tmain ~ sex, data=predimed), bivar=TRUE, file="./figures/bivarsurv/", type="png")
plot(compareGroups(tmain ~ age, data=predimed), bivar=TRUE, file="./figures/bivarsurv/", type="png")
When a variable of class
Surv (see Section 3.1) is used as explanatory it will be described with the probability of event, computed by Kaplan-Meier, up to a stated time.
timemax: By default probability is calculated at the median of the follow-up period. The timemax option allows us to change the time at which the probability is calculated.
res<-compareGroups(sex ~ age + tmain, timemax=c(tmain=3), data=predimed)
res
tmain is calculated at 3 years (see section 3.1).
The plot function applied to a variable of class Surv returns a Kaplan-Meier figure. The figure can be stratified by the grouping variable.
plot(res, file="./figures/univar/", type="png")
plot(res, bivar=TRUE, file="./figures/bivar/", type="png")
The createTable function, applied to an object of class compareGroups, returns tables with descriptives that can be displayed on-screen or exported to CSV, LaTeX, HTML, Word or Excel.
res<-compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed, selec = list(hormo=sex=="Female"))
restab<-createTable(res)
Two tables are created with the createTable function: one with the descriptives and the other with the available data. Printing a createTable object shows one or both tables on the R console.
Note that the option "descr" prints the descriptives table, while the option "avail" prints the available data, as well as methods and selections.
By default, only the descriptives table is shown. Stating "both" in the which.table argument prints both tables.
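The three which.table options can be compared side by side. The sketch below assumes the compareGroups package is installed and reloads a fresh copy of predimed so that it runs on its own.

```r
library(compareGroups)
data(predimed)

res <- compareGroups(group ~ age + sex + smoke + waist + hormo, data = predimed,
                     selec = list(hormo = sex == "Female"))
restab <- createTable(res)

print(restab, which.table = "descr")  # descriptives only (the default)
print(restab, which.table = "avail")  # available data, methods and selections
print(restab, which.table = "both")   # both tables
```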
hide: If the explanatory variable is dichotomous, one of the categories is often hidden in the results displayed (i.e., once the percentage of male participants is shown, the percentage of female participants is implied). To hide some category, e.g., "Male":
update(restab, hide = c(sex="Male"))
Note that the percentage of males is hidden.
hide.no: Similarly, as explained above, if the category "no" is to be hidden for all variables:
res<-compareGroups(group ~ age + sex + htn + diab, data=predimed)
createTable(res, hide.no='no', hide = c(sex="Male"))
Note: "no", "No" or "NO" will produce the same results; the coding is not case sensitive.
digits: The number of digits that appear in the results can be changed, e.g.:
createTable(res, digits= c(age=2, sex = 3))
Note that mean and standard deviation have two decimal places for age, while percentage in sex has been set to three decimal places.
type: By default categorical variables are summarized by frequencies and percentages. This can be changed with the type argument: with type=1 only percentages are displayed, and with type=3 only frequencies are displayed. Value 2 or NA returns the same results, i.e., the default option.
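A short sketch of the type options, assuming the compareGroups package is installed; the selection of variables is only illustrative.

```r
library(compareGroups)
data(predimed)

res <- compareGroups(group ~ age + sex + smoke, data = predimed)

tab_perc <- createTable(res, type = 1)  # only percentages
tab_freq <- createTable(res, type = 3)  # only frequencies
tab_both <- createTable(res, type = 2)  # frequencies and percentages (default)

tab_perc
tab_freq
```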
show.n: If the show.n option is set to TRUE, a column with the available data for each variable appears in the results.
show.descr: If the show.descr argument is set to FALSE, only p-values are displayed.
show.all: If the show.all argument is set to TRUE, a column is displayed with descriptives for all data.
show.p.overall: If the show.p.overall argument is set to FALSE, p-values are omitted from the table.
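The column options show.n, show.descr, show.all and show.p.overall can be tried one at a time as follows. This is a minimal sketch assuming the compareGroups package is installed; the variables chosen are only illustrative.

```r
library(compareGroups)
data(predimed)

res <- compareGroups(group ~ age + sex + smoke + waist, data = predimed)

tab_n     <- createTable(res, show.n = TRUE)           # adds the available-n column
tab_ponly <- createTable(res, show.descr = FALSE)      # p-values only
tab_all   <- createTable(res, show.all = TRUE)         # adds a column for all data
tab_nop   <- createTable(res, show.p.overall = FALSE)  # drops the overall p-value

tab_n
tab_all
```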
show.p.trend: If the response variable has more than two categories, a p-value for trend can be calculated. Results are displayed if the show.p.trend argument is set to TRUE.
Note: The p-value for trend is computed from the Pearson test when row-variable is normal and from the Spearman test when it is continuous non-normal. If row-variable is of class
Surv, the test score is computed from a Cox model where the grouping variable is introduced as an integer variable predictor. If the row-variable is categorical, the p-value for trend is computed as
show.p.mul: For a response variable with more than two categories, pairwise comparison p-values, corrected for multiple comparisons, can be calculated. Results are displayed if the show.p.mul argument is set to TRUE.
Note: Tukey method is used when explanatory variable is normal-distributed and Benjamini & Hochberg [@BH] method otherwise.
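Both options can be requested on the three-level diet group variable. The sketch assumes the compareGroups package is installed; the variables chosen are only illustrative.

```r
library(compareGroups)
data(predimed)

# 'group' has three categories, so trend and pairwise p-values are available
res <- compareGroups(group ~ age + sex + smoke + waist, data = predimed)

tab_trend <- createTable(res, show.p.trend = TRUE)  # adds the p-value for trend
tab_mul   <- createTable(res, show.p.mul = TRUE)    # adds pairwise p-values

tab_trend
tab_mul
```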
show.ratio: If the response variable is dichotomous or has been defined as class Surv (see Section 3.1), Odds Ratios and Hazard Ratios can be displayed in the results by stating TRUE at the show.ratio option:
createTable(update(res, subset= group!="Control"), show.ratio=TRUE)
Note that category "Control diet" of the response variable has been omitted in order to have only two categories (i.e., a dichotomous variable). No Odds Ratios would be calculated if response variable has more than two categories.
Note that when response variable is of class
Surv, Hazard Ratios are calculated instead of Odds Ratios.
createTable(compareGroups(tmain ~ group + age + sex, data=predimed), show.ratio=TRUE)
digits.ratio: The number of decimal places for Odds/Hazard ratios can be changed with the digits.ratio argument:
createTable(compareGroups(tmain ~ group + age + sex, data=predimed), show.ratio=TRUE, digits.ratio= 3)
header.labels: Changes some key table headers, such as p.overall. Note that this is done when printing the table, by changing the argument in the print function and not in the createTable function. This argument is also present in the other functions that export the table to PDF, plain text, etc.
tab<-createTable(compareGroups(tmain ~ group + age + sex, data=predimed), show.all = TRUE)
print(tab, header.labels = c("p.overall" = "p-value", "all" = "All"))
Tables made with the same response variable can be combined by row:
restab1 <- createTable(compareGroups(group ~ age + sex, data=predimed))
restab2 <- createTable(compareGroups(group ~ bmi + smoke, data=predimed))
rbind("Non-modifiable risk factors"=restab1, "Modifiable risk factors"=restab2)
Note how variables are grouped under "Non-modifiable" and "Modifiable" risk factors because of an epigraph defined in the rbind command in the example.
The resulting object is of class rbind.createTable, which can be subset but not updated. It inherits the class createTable. Therefore, columns and other arguments from the createTable function cannot be modified.
To select only Age and Smoking:
x <- rbind("Non-modifiable"=restab1,"Modifiable"=restab2)
x[c(1,4)]
To change the order:
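Reordering uses the same "[" subsetting with the row-variables listed in the desired order. The following is a minimal sketch, assuming the compareGroups package is installed; it rebuilds the combined table so that it runs on its own.

```r
library(compareGroups)
data(predimed)

restab1 <- createTable(compareGroups(group ~ age + sex, data = predimed))
restab2 <- createTable(compareGroups(group ~ bmi + smoke, data = predimed))
x <- rbind("Non-modifiable" = restab1, "Modifiable" = restab2)

# Variables are indexed in the order they appear: 1 = age, 2 = sex, 3 = bmi, 4 = smoke
x_rev <- x[c(4, 3, 2, 1)]
x_rev
```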
Columns from tables built with the same explanatory and response variables but done with a different subset (i.e. "ALL", "Male" and "Female", strata) can be combined:
res<-compareGroups(group ~ age + smoke + bmi + htn , data=predimed)
alltab <- createTable(res, show.p.overall = FALSE)
femaletab <- createTable(update(res,subset=sex=='Female'), show.p.overall = FALSE)
maletab <- createTable(update(res,subset=sex=='Male'), show.p.overall = FALSE)
With the argument
caption set to
NULL no name is displayed for columns.
By default the name of the table is displayed for each set of columns.
NOTE: The resulting object is of class
cbind.createTable and inherits also the class
createTable. This cannot be updated. It can be nicely printed on the R console and also exported to LaTeX but it cannot be exported to CSV or HTML.
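The column-wise combination described above can be sketched with cbind. The example assumes the compareGroups package is installed and rebuilds the three tables so that it runs on its own; the column names "ALL", "FEMALE" and "MALE" are illustrative choices.

```r
library(compareGroups)
data(predimed)

res <- compareGroups(group ~ age + smoke + bmi + htn, data = predimed)
alltab    <- createTable(res, show.p.overall = FALSE)
femaletab <- createTable(update(res, subset = sex == "Female"), show.p.overall = FALSE)
maletab   <- createTable(update(res, subset = sex == "Male"), show.p.overall = FALSE)

# caption = NULL: no name is displayed above each set of columns
cbind("ALL" = alltab, "FEMALE" = femaletab, "MALE" = maletab, caption = NULL)

# Default: the name of each table is displayed above its set of columns
named <- cbind("ALL" = alltab, "FEMALE" = femaletab, "MALE" = maletab)
named
```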
Since version 4.0, the function strataTable is available to build tables within strata defined by the values or levels of a variable. Notice that the syntax is much simpler than using the cbind method. For example, to compute descriptives by groups, stratified by gender:
res <- compareGroups(group ~ . -sex, predimed)
restab <- createTable(res, hide.no="no")
Then apply the strataTable function to the table:
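A minimal sketch of the stratification step, assuming the compareGroups package is installed. A reduced set of explanatory variables is used here only to keep the output short.

```r
library(compareGroups)
data(predimed)

res <- compareGroups(group ~ age + bmi + smoke + htn, data = predimed)
restab <- createTable(res, hide.no = "no")

# One set of columns per level of sex
strat <- strataTable(restab, "sex")
strat
```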
In this section some other
createTable options and methods are discussed:
With the which.table argument it can be changed: "avail" returns the available data and "both" returns both tables:
print(createTable(compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed)), which.table='both')
If the nmax argument is set to FALSE, the total maximum "n" in the available data is omitted from the first row.
print(createTable(compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed)), nmax=FALSE)
summary: returns the same table as that generated with the print method setting which.table="avail":
summary(createTable(compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed)))
update: An object of class createTable can be updated:
res<-compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed)
restab<-createTable(res, type=1, show.ratio=TRUE )
restab
update(restab, show.n=TRUE)
In just one statement it is possible to update both the compareGroups object and the createTable object:
update(restab, x = update(res, subset=c(sex=='Female')), show.n=TRUE)
Note that the compareGroups object (res) is updated, selecting only "Female" participants, and the createTable class object (restab) is updated to add a column with the maximum available data for each explanatory variable.
subsetting: Objects from the createTable function can also be subsetted using "[":
createTable(compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed))
createTable(compareGroups(group ~ age + sex + bmi, data=predimed))[1:2, ]
Making use of descrTable, the user can build a descriptive table in one step. This function takes all the arguments of the compareGroups function plus the ones from the createTable function.
The result is the same as if the user first called compareGroups and then createTable. Therefore one can use the same methods and functions available for createTable objects (subsetting, plotting, printing, exporting, etc.)
To describe all variables from predimed, just type descrTable(predimed).
To describe some variables, make use of the formula argument (just as when using compareGroups):
descrTable(~ age + sex, predimed)
To hide the "no" category from yes-no variables, use the hide.no argument from createTable, e.g. descrTable(predimed, hide.no="no").
To report descriptives by group as well as the descriptives of the entire cohort:
descrTable(group ~ ., predimed, hide.no="no", show.all=TRUE)
Or you can select individuals using the subset argument.
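Selecting individuals in descrTable follows the same convention as the subset argument of compareGroups. A minimal sketch, assuming the compareGroups package is installed; the cutoff age > 65 is an arbitrary illustrative choice.

```r
library(compareGroups)
data(predimed)

# Descriptives by diet group, restricted to participants older than 65
tab65 <- descrTable(group ~ age + sex + bmi, predimed, subset = age > 65)
tab65
```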
Tables can be exported to CSV, HTML, LaTeX, PDF, Markdown, Word or Excel:
export2csv(restab, file='table1.csv'), exports to CSV format
export2html(restab, file='table1.html'), exports to HTML format
export2latex(restab, file='table1.tex'), exports to LaTeX format (to be included in Sweave documents R chunks)
export2pdf(restab, file='table1.pdf'), exports to PDF format
export2md(restab, file='table1.md'), to be included inside Markdown documents R chunks
export2word(restab, file='table1.docx'), exports to Word format
export2xls(restab, file='table1.xlsx'), exports to Excel format
Note that, since version 3.0, it is necessary to write the extension of the file.
which.table: By default only the table with the descriptives is exported. This can be changed with the which.table argument: "avail" exports only available data and "both" exports both tables.
nmax: By default a first row with the maximum "n" for available data (i.e. the number of participants minus the least missing data) is exported. If the nmax argument is set to FALSE, this first row is omitted.
sep: Only relevant when the table is exported to CSV. Stating, for example, sep = ";", the table will be exported to CSV with columns separated by ";".
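The sep option can be checked by exporting to a temporary file and reading the first lines back. This sketch assumes the compareGroups package is installed; the use of tempdir() is only so that the example leaves no file in the working directory.

```r
library(compareGroups)
data(predimed)

restab <- createTable(compareGroups(group ~ age + sex + smoke, data = predimed))

# Write the descriptives table with ";" as the column separator
out <- file.path(tempdir(), "table1.csv")
export2csv(restab, file = out, sep = ";")

# Inspect the first lines of the exported file
head(readLines(out))
```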
A special case of exporting is when tables are exported to LaTeX. The function export2latex returns an object with the tex code as a character that can be changed in the R session.
file: If the file argument in export2latex is missing, the code is printed in the R console. This can be useful when R code is inserted in a LaTeX document chunk to be processed with Sweave.
restab<-createTable(compareGroups(group ~ age + sex + smoke + waist + hormo, data=predimed))
export2latex(restab)
size: The font size of exported tables can be changed by this argument. Possible values are "tiny", "scriptsize", "footnotesize", "small", "normalsize", "large", "Large", "LARGE","huge", "Huge" or "same". Default is "same", which means that font size of the table is the same as specified in the main LaTeX document where the table will be inserted.
caption: The table caption for descriptives table and available data table. If
which.table is set to "both" the first element of "caption" will be assigned to descriptives table and the second to available data table. If it is set to "", no caption is inserted. Default value is
NULL, which writes "Summary descriptives table by groups of 'y'" for descriptives table and "Available data by groups of 'y'" for the available data table.
loc.caption: Table caption location. Possible values are "top" or "bottom". Default value is "top".
label: Used to cite tables in a LaTeX document. If
which.table is set to "both" the first element of "label" will be assigned to the descriptives table and the second to the available data table. Default value is
NULL, which assigns no label to the table/s.
landscape: The table is placed horizontally. This option is especially useful when the table contains many columns and/or they are too wide to be placed vertically.
Since version 4.0, export2md supports cbind.createTable class objects, i.e. exporting stratified descriptive tables. Also, nicer and more customizable tables can be produced making use of the kableExtra package (such as size, striped rows, etc.).
Below are some examples of exporting to HTML.
First create the descriptive table of PREDIMED variables by intervention groups.
res <- compareGroups(group ~ ., predimed)
restab <- createTable(res, hide.no="no")
export2md(restab, strip=TRUE, first.strip=TRUE)
resMales <- compareGroups(group ~ . -sex, predimed, subset=sex=="Male", simplify=FALSE)
resFemales <- compareGroups(group ~ . -sex, predimed, subset=sex=="Female", simplify=FALSE)
restabMales <- createTable(resMales, hide.no="no")
restabFemales <- createTable(resFemales, hide.no="no")
restab <- cbind("Males"=restabMales, "Females"=restabFemales)
Since version 2.0 of the compareGroups package, there is a function called report which automatically generates a PDF document with the "descriptive" table as well as the corresponding "available" table. In addition, plots of all analysed variables are shown.
To make it easier to navigate through the document, an index with hyperlinks is inserted in the document.
See the help file of this function, where you can find an example with the REGICOR data (the other example data set contained in the compareGroups package).
# to know more about report function
?report
# info about REGICOR data set
?regicor
Also, you can use the function radiograph, which dumps the raw values to a plain text file. This may be useful to identify possible wrong codes or non-valid values in the data set.
Many times, it is important to be aware of the missingness contained in each variable, possibly by groups.
Although the "available" table shows the number of non-missing values for each row-variable and in each group, it would be desirable to test whether the frequency of non-available data is different between groups.
For this purpose, a new function called missingTable has been implemented in the compareGroups package. This function applies to both compareGroups and createTable class objects. This last option is useful when the table is already created. To illustrate it, we will use the REGICOR data set, comparing missing rates of all variables by year:
# from a compareGroups object
data(regicor)
res <- compareGroups(year ~ .-id, regicor)
missingTable(res)
# or from createTable objects
restab <- createTable(res, hide.no = 'no')
missingTable(restab)
An NA value of a categorical variable may mean something different from just "not available". For example, patients admitted for "Coronary Acute Syndrome" with NA in "ST elevation" may have a higher risk of in-hospital death than those with available data, i.e. "ST elevation" yes or no. If values of this kind are entered in the data set as NA, they are removed from the analysis. To save the user from recoding NA as a new category for every categorical variable, an argument called include.miss has been implemented in the compareGroups function, which does this automatically. Let's see an example with all variables from the REGICOR data set by cardiovascular event.
# first create time-to-cardiovascular event
regicor$tcv <- with(regicor, Surv(tocv, cv == 'Yes'))
# create the table
res <- compareGroups(tcv ~ . -id-tocv-cv-todeath-death, regicor, include.miss = TRUE)
restab <- createTable(res, hide.no = 'no')
restab
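For comparison, the manual recoding that this argument spares the user can be sketched in base R. The variable st_elev below is hypothetical, and the sketch only shows what "NA as its own category" means; it is not the package's internal implementation.

```r
# Hypothetical categorical variable with missing values
st_elev <- factor(c("Yes", "No", NA, "No", NA, "Yes", "No"))

# By default, NA is not a level and is dropped from the frequency table
table(st_elev)

# addNA() turns NA into an explicit level, so it is counted as a category
st_elev_miss <- addNA(st_elev)
table(st_elev_miss)

# Optionally give the new level a readable label
levels(st_elev_miss)[is.na(levels(st_elev_miss))] <- "Missing"
table(st_elev_miss)
```

Doing this by hand for every categorical variable is tedious, which is the motivation for handling it through a single argument.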
In version 2.0 of compareGroups, it is possible to analyse genetic data, more specifically Single Nucleotide Polymorphisms (SNPs), using the function compareSNPs. This function takes advantage of the SNPassoc [@SNPassoc] and HardyWeinberg [@HW] packages to perform quality control of genetic data, displaying the Minor Allele Frequencies, Missingness, Hardy-Weinberg Equilibrium, etc. of the whole data set or by groups. When groups are considered, it also performs a test to check whether missingness rates are the same among groups.
Below, we illustrate this with an example using a data set from the SNPassoc package.
First of all, load the SNPs data from SNPassoc and look at the first rows. Notice how the SNPs are coded, i.e. by their alleles. The allele separator can be any character; if one is used, it must be specified in the sep argument of the compareSNPs function (type ?compareSNPs for more details).
In this data frame there are some genetic and non-genetic data. Genetic variables are those whose names begin with "snp". If we want to summarize the first three SNPs by case control status:
res <- compareSNPs(casco ~ snp10001 + snp10002 + snp10003, data = SNPs)
res
Note that all variables specified on the right-hand side of the formula must be SNPs, i.e. variables whose levels or codes can be interpreted as genotypes (see the setupSNPs function from the SNPassoc package for more information).
Separate summary tables are displayed for cases and controls, and the last table corresponds to the missingness test comparing non-available rates among groups.
If you want to summarize SNPs in the whole data set, without separating by groups, leave the left-hand side of the formula blank, as in the compareGroups function. In this case, a single table is displayed and no missingness test is performed.
res <- compareSNPs(~ snp10001 + snp10002 + snp10003, data = SNPs)
res
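To make two of the reported quantities concrete, the following base R sketch computes the missingness and the minor allele frequency for a single SNP coded by its alleles. The genotype vector geno and the "/" separator are made up for illustration, and this is only a sketch of the definitions, not the code that compareSNPs or SNPassoc run.

```r
# Hypothetical genotypes for one SNP, alleles separated by "/"
geno <- c("A/A", "A/G", "G/G", "A/A", NA, "A/G", "A/A", "A/G", NA, "A/A")

# Missingness: proportion of individuals without a genotype
miss_rate <- mean(is.na(geno))

# Split the observed genotypes into single alleles and count them
alleles <- unlist(strsplit(geno[!is.na(geno)], split = "/", fixed = TRUE))
allele_freq <- prop.table(table(alleles))
allele_freq

# Minor allele frequency: frequency of the least common allele
maf <- min(allele_freq)

c(missingness = miss_rate, MAF = maf)
```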
Once the compareGroups package is loaded, a Graphical User Interface (GUI) is displayed in response to typing cGroupsGUI(predimed). The GUI is meant to make it feasible for users who are unfamiliar with R to construct bivariate tables.
Note that, since version 3.0, it is necessary to specify an existing data.frame as input. So, for example, you can load the PREDIMED data by typing data(predimed) before calling the GUI.
In this section we illustrate, step by step, how to construct a bivariate table containing descriptives by groups from the predimed data using the GUI:
export2md(createTable(
  compareGroups(group ~ age + sex + smoke + bmi + waist + wth + htn + diab +
                  hyperchol + famhist + hormo + p14 + toevent + event,
                data = predimed),
  hide.no = "No", hide = c(sex = "Male")))
R format, CSV plain text file or a data.frame already existing in the Workspace. By default, the predimed example data is loaded when the GUI is opened.
R console. The table can also be exported to the file formats listed.
For a case-control study, it may be necessary to report the Odds Ratio between cases and controls for each variable. The table below contains Odds Ratios for each row-variable by hypertension status.
export2md(createTable(
  compareGroups(htn ~ age + sex + smoke + bmi + waist + wth + diab +
                  hyperchol + famhist + hormo + p14 + toevent + event,
                data = predimed),
  hide.no = "No", hide = c(sex = "Male"),
  show.ratio = TRUE, show.descr = FALSE))
To build this table, as illustrated in the screens below, you would select the htn variable (Hypertension status) as the factor variable, choose the "no" category in the "reference" pull-down menu, and mark "Show odds/hazard ratio" in the "Report Options" menu before exporting the table.
In a cohort study, it may be more informative to compute hazard ratios, taking into account the time-to-event.
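As a reminder of what an Odds Ratio with a reference category expresses, here is a base R sketch on simulated data. The variables diab and htn are generated for illustration and do not come from PREDIMED, and the Wald interval shown is one common choice; compareGroups may compute its estimates and intervals differently.

```r
# Simulated case-control style data (hypothetical variables)
set.seed(42)
n <- 500
diab <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.6, 0.4)),
               levels = c("No", "Yes"))
# Hypertension is made more frequent among diabetics
p_htn <- ifelse(diab == "Yes", 0.9, 0.7)
htn <- factor(ifelse(runif(n) < p_htn, "Yes", "No"), levels = c("No", "Yes"))

# Odds ratio from the 2 x 2 table, with "No" as the reference category
tab <- table(diab, htn)
or_tab <- (tab["Yes", "Yes"] * tab["No", "No"]) /
  (tab["Yes", "No"] * tab["No", "Yes"])

# The same odds ratio from a logistic regression, with a Wald 95% interval
fit <- glm(htn ~ diab, family = binomial)
est <- summary(fit)$coefficients["diabYes", ]
or_glm <- exp(est[["Estimate"]])
ci_glm <- exp(est[["Estimate"]] + c(-1, 1) * qnorm(0.975) * est[["Std. Error"]])

c(OR_table = or_tab, OR_glm = or_glm, lower = ci_glm[1], upper = ci_glm[2])
```

Changing the reference category (for example with relevel) inverts the odds ratio, which is why the reference level has to be chosen explicitly.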
export2md(createTable(compareGroups(tmain ~ group + age + sex, data=predimed), show.ratio=TRUE))
To generate this table, select the toevent and event variables, indicating the time-to-event and the status, respectively, and select the event category for the status variable. Finally, as for Odds Ratios, mark "Show odds/hazard ratio" in the "Report Options" menu before exporting the table.
To return to the R console, just close the GUI window.
Since version 2.1, the compareGroups package incorporates a Web User Interface (WUI) based on shiny, an R package available on the CRAN repository [@Shiny], to facilitate the use of the package for non-R users.
This application includes almost all the options available in the "type on" (R prompt) version. Also, thanks to the shiny package, the user can see the results almost instantaneously when setting the included variables, the groups, the number of decimals, etc. ("reactivity"; see the shiny manual and examples). This makes it quick to modify and customize the descriptive table before saving it in the desired format.
In the following subsections, we list and describe all the options available in the Shiny-compareGroups application, and we illustrate how it works with a real example.
Once the WUI is called, a web browser is launched with two main parts. The left-hand side holds the control menus: reading the data frame from different file formats, selecting the list of variables to be analysed, specifying the number of digits of the descriptives or p-values displayed in the bivariate table, and many other options. The right-hand side contains the results, i.e. a table with a basic summary of all variables, the bivariate table or some plots.
An exhaustive list describing all the options follows:
A panel with all the options to select the data set and the variables to be analysed. Different file formats are allowed; in the present version these are: SPSS, plain text (txt or csv), Excel (2000 or 2007) or R (.rda or RData). It is also possible to select example data already present in the compareGroups package (REGICOR and PREDIMED) and SNPs from SNPassoc.
A panel with options controlling different aspects of the bivariate table:
Type: Each variable can be analysed as normal, non-normal or categorical. In the first case, mean and standard deviation are displayed; in the second, medians and quantiles; and in the last, frequencies and proportions.
Response: Control menu to select which variable indicates the group. Three choices are possible:
Hide: For categorical variables, select the category you want to be hidden. Additionally, in the "hide no" text input you can type the category name that represents "no"; the category with that name is then hidden for all binary variables.
Subset: A global subset affecting all analysed variables can be typed. You can also select a subgroup of individuals for each variable. In either case, a logical expression in the R language must be typed:
| Operator | R |
Note that to indicate a category, you must type the number that appears in the "VALUES" table on the right side of the application instead of its name. For example, for gender, type 1 instead of "male" and 2 instead of "female", if "male" is the first category and "female" is the second.
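The kind of logical expression meant here can be tried out in plain R first. The sketch below uses a small made-up data frame (dat, with hypothetical variables age, sex and bmi) and refers to categories by their position, as described above. How the application evaluates the typed expression internally is not shown in this document, so this only illustrates the expressions themselves.

```r
# Hypothetical data set; 'sex' is a factor whose first level is "male"
dat <- data.frame(
  age = c(45, 62, 58, 71, 39, 66),
  sex = factor(c("male", "female", "female", "male", "female", "male"),
               levels = c("male", "female")),
  bmi = c(24.1, 31.5, 27.8, 29.9, 22.4, 33.0)
)

# Categories are referred to by their position (1 = "male", 2 = "female"),
# so the factor is compared through its integer codes
sex_code <- as.integer(dat$sex)

# "and": males older than 60
older_males <- dat[dat$age > 60 & sex_code == 1, ]
older_males

# "or": females, or anyone with a BMI of 30 or more
females_or_obese <- dat[sex_code == 2 | dat$bmi >= 30, ]
females_or_obese

# "not": everyone except those younger than 50
not_young <- dat[!(dat$age < 50), ]
not_young
```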
In this panel, the user can select which information must be displayed in the resulting bivariate table, and in which format, etc.
Show: What is displayed in the bivariate table:
Format: The user can specify how to display the mean and standard deviation, the quantiles and the frequencies.
Decimals: The number of decimals to be displayed in the bivariate table, for descriptives, p-values and OR / HR.
Labels: This tab allows changing the "key" headers of the descriptive table, such as "ALL", "p-value", etc.
A panel with the options to save the bivariate table in different formats: PDF, CSV, TXT, HTML, Word (.docx) or Excel (.xlsx).
If PDF format is selected, size can be specified as well as landscape format. These options may be useful when the table is big or contains lots of columns.
Explanation of both the R package and its Web User Interface (WUI) version. It also contains the security rules for the data sent by the user when using the WUI remotely (from
VALUES: Using this tab, the user can take a look at the data contained in the data set.
Summary: A table with the name, label and a basic summary of all the variables contained in the data set.
Values: A table with the raw values in the data set. By default only the first ten rows are displayed, but the user can set the number of rows to be shown and navigate through the data set.
TABLE: The bivariate table containing the descriptives, p-values, etc. for the selected variables. Additionally, pressing the "info" button displays the "info table", which contains, for each described variable, how many observations are available for each group, whether it is treated as normal, non-normal or categorical, which criteria, if any, have been used to select a subset of individuals and, if the Odds Ratio or Hazard Ratio is computed, the factor by which the variable has been multiplied.
PLOT: Univariate or bivariate (taking the groups into account, where applicable) plots of the selected variable.
SNPs: Descriptives of SNPs (Single Nucleotide Polymorphisms) with appropriate statistics and tests for these genetic variants. To perform this analysis, only SNP variables can be selected. A factor response to display genotype frequencies by groups is permitted, but a time-to-event response variable is not.
In this section we illustrate how to analyse a data set.
To use the WUI locally (and not on a remote server), first load the compareGroups package and then call the function that launches the WUI.
In this example we will load the PREDIMED data set from the PREDIMED study [@PREDIMED]. This data set is already available in the compareGroups package.
After the data have been loaded successfully, the "Step 2. Select variable" panel opens automatically. Using this panel, we will select all variables except "event" and "toevent" (the time-to-event). These two will represent the response and must be removed from the row-variables.
Since PREDIMED is a longitudinal study whose main goal is to check whether the Mediterranean diet is related to a lower incidence of cardiovascular disease, we will take the time to cardiovascular event as the response. To do so, we will select the "Survival" response type, setting "toevent" as the time variable and "event" as the indicator variable (taking "yes" as the case code).
Next, decide which continuous variables should be treated as normal and which should not. By default, compareGroups performs a Shapiro-Wilk normality test to decide which variables are normal, but we may want to check the normality assumption graphically.
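Outside the WUI, the same two checks can be reproduced in base R with shapiro.test and a normal Q-Q plot. The sketch below runs on simulated variables (x_norm and x_skew are made up), so it only shows how to read the test and the plot; the decision rule that compareGroups applies to the test result is not reproduced here.

```r
# Simulated variables (hypothetical): one roughly normal, one right-skewed
set.seed(123)
x_norm <- rnorm(200, mean = 120, sd = 15)
x_skew <- rlnorm(200, meanlog = 0, sdlog = 1)

# Shapiro-Wilk test: small p-values suggest a departure from normality
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)
c(normal = sw_norm$p.value, skewed = sw_skew$p.value)

# Graphical check: points far from the line suggest non-normality
op <- par(mfrow = c(1, 2))
qqnorm(x_norm, main = "Roughly normal"); qqline(x_norm)
qqnorm(x_skew, main = "Right-skewed"); qqline(x_skew)
par(op)
```

With large samples the test can flag departures that are too small to matter in practice, which is one reason a graphical check is a useful complement.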
To see the output, click on the TABLE tab on the right panel.
Pressing the "View options" button shows a slider to customize the font size of the bivariate table, as well as an "info" button. Pressing the "info" button displays, in a modal window, a table containing information about the number of available data by variable and group, the type of variable (normal, non-normal or categorical), etc.
Finally, once the table contains all the desired figures and in the appropriate format, it can be downloaded and stored in different formats:
a. PDF: a LaTeX compiler such as MiKTeX must be installed to build the PDF document with the descriptive table. The user can select the font size and whether the table must be placed vertically or horizontally (landscape option).
b. CSV: a plain text file with columns separated by commas or semicolons. For Windows users, this format is useful since it can be opened by Excel.
c. HTML: a web browser is opened and the table can be easily copied and pasted to Word, for instance.
d. TXT: a plain text file which can be opened by any text editor and which contains the table as printed in the R console, with a nice format. Once the "download" button is pressed, the file is automatically stored on your PC/Mac.
e. Word: A Word document file (either 2000, 2003 or 2010 version) is created.
f. Excel: An Excel sheet file (either 2000, 2003 or 2010 version) is created.