In gcastella/statTools: Easier statistical analyses

library(knitr)
library(statTools)
opts_chunk$set(collapse = T, comment = "#>", eval = FALSE)

When performing a statistical analysis most of the times differences between two or more groups need to be assessed. bivarTable function is thought to address this issue constructing a table with the results from the comparison. It is particularly useful when searching for confounding variables or simply assessing differences in the outcomes, adjusting to avoid confusion.

The aim is to provide a useful function in R to allow statisticians to perform this kind of analyses quicker and easier, saving a lot of time. However, bivarTable is very flexible and can assess different hypothesis at a time, allowing different arguments to customize the tables.

The package statTools also includes an export method, that exports the created tables to latex (via xtable), xlsx (via openxlsx) or csv.

The structure

bivarTable constructs a matrix with the results of comparing variables by groups. The levels of the grouping variable will be displayed in columns, and in the rows all the variables in which the differences will be assessed. An extra column called All will be always displayed, with the overall description of the sample. All this structure is summarized with a formula (argument X), where the LHS indicates the variable displayed in the columns (i.e. grouping variable) and the RHS all the variables in the rows. To sum up:

Grouping Variable ~ Variable 1 + Variable 2 + ... + Variable n

or equivalently,

Columns ~ Row 1 + Row 2 + ... + Row n

Notice that ONLY one term is allowed in the LHS and interaction terms are not handled (only additive terms). In addition, since the names of the terms of the formula are assumed to be columns in the argument data, the dot . can be used to add all other variables in the data not mentioned in the formula. You can make use also of the -operator to remove terms from the formula. When LHS is not provided, only a table with the overall description is produced. The following code, depicts the structure explained, using the iris dataset as example.

bivarTable(Species ~ ., iris)

An unquoted matrix is printed (internally returns a list where the first argument is the table printed), with the totals and the levels of Species (LHS) as columns and all the remaining variables in iris as rows. Notice also that:

N's in every group are provided, and percentage as well in the row names.
The percentage in the overall column indicates that 100\% of the values of grouping variable are non-missing.
Each cell has the mean (SD) or absolute frequency (relative frequency * 100) for quantitative and qualitative variables, respectively.
For the qualitative variables, each percentage is calculated from the total in the column.
There's an argument which was not explained yet. Pattience, we will get there soon.

This is one of the most basic tables that can be generated with bivartable. Other arguments than can control the output produced by default:

To change the percentages for qualitative variables (rows to sum 100\% instead of columns) use margin = 1 instead of the default margin = 2. It is passed to prop.table. If it is a vector 1:2 or 2:1, both percentages are calculated.
To display more decimal places, use rounding. By default only two decimals are printed: rounding = 2. This argument reads in fact the element rounding from options.
By default, binary factors in the rows are displayed as one row (usually the second, except the first one has level "Yes", "1", or something similar) and giving the label of the row displayed in brackets. Set the argument condense.binary.factors to FALSE to display both levels (rows).
By default, the grouping variable is transformed to factor using factor function. This removes unused categories. To change that, use drop.y = FALSE.
The same is done for the rows when are factors. To change that set drop.x to FALSE. These agruments can change the number of rows and columns and must be used carefully because they can lead (and avoid) to errors.
show.quantiles is passed to function descr_var(). If FALSE no quantiles are calculated. If the quantiles are required, show.quantiles will be passed to function quantile as argument probs.

Most of these arguments can be set by default throughout all the session via options(argument = <value>). We will add some qualitative variables to the iris dataset to better illustrate the effect of some of these arguments:

iris2 <- iris
iris2$Sepal.Total.CAT2 <- cut(iris$Sepal.Width + iris$Sepal.Length, breaks = 2)
iris2$Sepal.Total.CAT3 <- cut(iris$Sepal.Width + iris$Sepal.Length, breaks = 3)
iris2$Sepal.Total.CAT5 <- cut(iris$Sepal.Width + iris$Sepal.Length, breaks = 5)

bivarTable(Species ~ . + log(Sepal.Length), iris2, 
           rounding = 1, margin = 2, condense.binary.factors = TRUE)
bivarTable(Species ~ . + log(Sepal.Length), iris2, margin = 1, condense.binary.factors = FALSE)
bivarTable(Species ~ . + log(Sepal.Length), iris2, margin = 1, condense.binary.factors = FALSE, 
           show.quantiles = c(0, 0.5, 1))

Notice that the formula allows transformations of the variables like logarithm in this case.

Assessing differences

Once the basic structure of the table is set, extra columns can be added with the p-values resulting from tests assessing differences between groups in each of the row variables. Argument test controls this issue. By default, test = "none", no tests are performed. However, there are other options for this argument: "parametric", "non-parametric" and "both". These options will perform the tests specified in the elements "parametric.tests" and "non.parametric.tests" from options. One parametric test and one non-parametric test are defined for every possible scenario respectively.

Quantitative row variable vs binary grouping variable: t-test (t.test) or Mann-Whitney test (wilcox.test).
Quantitative row variable vs non-binary grouping variable: ANOVA (oneway.test) or Kruskal-Wallis test (kruskal.test).
Qualitative row variable: Chi-squared test (chisq.test) or Fisher test (fisher.test).

When test = "both", both parametric and non-parametric tests are performed. Here you have an example of the use of both tests: two new columns are added.

bivarTable(Species ~ . + log(Sepal.Length), iris2, margin = 1,  # structure
           test = "both")                                       # tests

I like personally to use one line for the structure and options (two lines if many options are needed), another line for the tests. The other arguments left to explain are fit.model, outcome, FUN.model and ..., wich would be specified in new lines. These ones are explained in the next section.

Moreover, all the p-values will be rounded using the function given the first element in options()$p.value.rounding. The other elements will be arguments passed to the first one called by do.call. The default function used for this is pvalue, which rounds significant p-values to the first significant digit and non-significant p-values to two decimal places. See ?pvalue.

Fitting models

If wanted, models can be fitted for every row variable with the grouping variable. The argument fit.model is a NAMED list of formulas that are passed to update.formula. For every formula in this list, an extra columnt will be added with the p-values of the significance of the first variable in the model (not counting the intercept), using drop1. To specify this formulas, make use of the . to refer either the row variable or the grouping variable. The argument outcome, controls whether the output variable is the one in the columns or in the rows in all the models performed (1 means the row variable is the outcome and 2 for the columns). Depending on that, a linear model is performed if the output is quantitative or a factor with more than two levels (using the order). If the output variable is a binary factor, a logistic regression model is fitted instead.

For instance, we could add the p-value of the model adjusted by Petal.Length for every variable in iris2 (except for the one we are adjusting by) by doing:

bivarTable(Species ~ . - Petal.Length, iris2,
           test = "non-par", 
           fit.model = list(model1 = ~ . + Petal.Length),
           output = 1)

Always write the dot first, because the p-value provided would come from that variable. In this case, the dot is exchangable for Species, because the output are the variables in the rows in each model, so Species is always the input. However, in the next example it is not.

bivarTable(Species ~ . - Petal.Length, iris2,
           test = "non-par", 
           fit.model = list(adjusted = ~ . + Petal.Length,
                            interaction = ~ . * Petal.Length),
           output = 2)
bivarTable(Species ~ . - Petal.Length, iris2,
           test = "non-par", 
           fit.model = list(adjusted = ~ . + Petal.Length,
                            interaction = ~ . * Petal.Length),
           output = 2, pvalue.model = FALSE)

In this example, another model with the interaction is fitted, so two new columns are added. However, the p-value from the variable in the row is the only one that is provided by default. If you want to print the p-value from the interaction you need to use the FUN.model argument which is explained in the next section.

Note that in the second call, the argument pvalue.model controls whether to show or not the drop1 p-values from every model.

Adding model outputs

Once a model if fitted, it can be used as the argument to many functions. These functions have to be specified in the argument FUN.model. FUN.model a list whose elements are vectors of characters with the names of the functions to be applied to every model. The name of these elements indicates which of the models in fit.model will be used. The model will be passed as the first argument to every function. You can use your own defined functions or the ones already in this package like:

To display coefficients and SD's use getBetaSd.
To display p-values of the coefficients use getPval.
To display OR and 95\% CI use getORCI.
To display Nagelkerke's r-squared use nagelkerke or adjNagelkerke for the adjusted.
To display goodness of fit p-value use GoF.

When you want to display the results for every level of the row variable in its own row, use the vertical version of the functions: verticalgetBetaSd, verticalgetPval and verticalgetORCI.

To change the arguments of these functions, add an argument (in ...) with the same name of the model in which this functions will be applied. This argument will be a list whose elements will be lists with the arguments to pass to every function. The names of these lists indicates the function whose arguments belong to. For instance, let's write the code for displaying the AIC of the interaction model in the last example.

bivarTable(Species ~ . - Petal.Length, iris2,
           test = "non-par", 
           fit.model = list(adjusted = ~ . + Petal.Length,      # model adjusted by Petal.Length
                            interaction = ~ . * Petal.Length),  # model with the interaction
           output = 1,                                          # row variables are the outputs

           FUN.model = list(interaction = "AIC"),               # for the 'interaction' model 
                                                                # use 'AIC' with default args
           )

bivarTable(Species ~ . - Petal.Length, iris2,
           test = "non-par", 
           fit.model = list(adjusted = ~ . + Petal.Length,      # model adjusted by Petal.Length
                            interaction = ~ . * Petal.Length),  # model with the interaction
           output = 1,                                          # row variables are the outputs

           FUN.model = list(interaction = c("AIC", "BIC"),      # for the 'interaction' model 
                            adjusted = "AIC"),                  # use function 'AIC' and 'BIC'
                                                                # just 'AIC' for the 'adjusted'

           interaction = list(AIC = list(k = 10))               # for the function 'AIC' from
                                                                # the model 'interaction' set arg
                                                                # k=10 and default args for 'BIC'
           )

The second call, returns the AIC from the model interaction using k = 10 and the BIC with default parameters. From the model adjusted returns the AIC with default arguments.

Export and saving

export method allows to export and save bivarTable objects to xlsx with a fancy style. You need openxlsx for this.It allows to export or save in latex, using xtable, or csv. For xlsx automatically opens the file with EXCEL.exe.

bvt <- bivarTable(Species ~ . - Petal.Length, iris2,
           test = "non-par", 
           fit.model = list(adjusted = ~ . + Petal.Length,
                            interaction = ~ . * Petal.Length),
           output = 2)
export(bvt, type = "xlsx")
export(bvt, file = "myBivarTable.xlsx")