Tables in R Markdown

knitr::opts_chunk$set(echo = TRUE)
options(width=60)
if (!requireNamespace("rmarkdown") || !rmarkdown::pandoc_available("1.12.3")) {
  warning("This vignette requires pandoc version 1.12.3; code will not run in older versions.")
  knitr::opts_chunk$set(eval = FALSE)
}

Introduction

This vignette was built using tables version `r packageDescription("tables")$Version`. It is intended to show the same content as the tables.pdf vignette that was written in Sweave, but with R Markdown source code. This has allowed a few simplifications; see Section \@ref(sec:knitr) for a description of them.

It is a short introduction to the tables package. Inspired by my 20-year-old memories of SAS PROC TABULATE, I decided to write a simple utility to create nice-looking tables in Sweave documents. (It now also works in R Markdown documents, as this vignette illustrates.) For example, we might display summaries of some of Fisher's iris data using the code

library(tables)
table_options(knit_print = FALSE)
tabular( (Species + 1) ~ (n=1) + Format(digits=2)*
         (Sepal.Length + Sepal.Width)*(mean + sd), data=iris )

You can also pass the output through the toLatex() function to produce \LaTeX\ output, which when processed by pdflatex will produce the following table:

toLatex(
  tabular( (Species + 1) ~ (n=1) + Format(digits=2)*
           (Sepal.Length + Sepal.Width)*(mean + sd), data=iris )
  )

However, if you are using rmarkdown or knitr (as this document does), toLatex() is not necessary. Just execute

table_options(knit_print = TRUE)

at the start of your document, and conversion to \LaTeX\ will be done automatically when needed.

If you prefer the style of table that the \LaTeX\ booktabs package [@booktabs] produces, you can choose that style instead. I mostly like it, so I have used

booktabs()

for the rest of this document. This gives

saved.options <- table_options()
invisible(booktabs())
tabular( (Species + 1) ~ (n=1) + Format(digits=2)*
         (Sepal.Length + Sepal.Width)*(mean + sd), data=iris )

Details on booktabs() are given in section \@ref(sec:booktabs) below.

There are also the toHTML() function and the html.tabular() method for the Hmisc::html() generic; both produce output in HTML format. Finally, see section \@ref(sec:csv) for other output formats.

The idea of a table in the tables package is a rectangular array of values, with each row and column labelled, and possibly with groups of rows and groups of columns also labelled. These arrays are specified by "table formulas".

Table formulas are R formula objects, with the rows of the table described before the tilde ("~"), and the columns after. Each of those is an expression containing "*", "+", "=", as well as functions, function calls and variables, and parentheses for grouping. There are also various directives included in the formula, entered as "pseudo-functions", i.e. expressions that look like function calls but which are interpreted by the tabular() function.

For example, in the formula

(Species + 1) ~ (n=1) + Format(digits=2)*
         (Sepal.Length + Sepal.Width)*(mean + sd)

the rows are given by \verb!(Species + 1)!. The summation here is interpreted as concatenation, i.e. this says rows for Species should be followed by rows for 1.

In the iris dataframe, Species is a factor, so the rows for it correspond to its levels.

The 1 is a place-holder, which in this context will mean "all groups".

The columns in the table are defined by

(n=1) + Format(digits=2)*(Sepal.Length + Sepal.Width)*(mean + sd)

Again, summation corresponds to concatenation, so the first column corresponds to (n=1). This is another use of the placeholder, but this time it is labelled as n. Since we haven't specified any other statistic to use, the first column contains the counts of values in the dataframe in each category.

The second term in the column formula is a product of three factors. The first, Format(digits=2), is a pseudo-function to set the format for all of the entries to come. (For more on formats, see section \@ref(sec:formats) below.) The second factor, (Sepal.Length + Sepal.Width), is a concatenation of two variables. Both of these variables are numeric vectors in iris, and they each become the variable to be analyzed, in turn. The last factor, (mean + sd) names two R functions. These are assumed to be functions that operate on a vector and produce a single value, as mean and sd do. The values in the table will be the results of applying those functions to the two different variables and the subsets of the dataset.
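
To make that last step concrete, the following base R sketch recomputes two columns of the table by hand. It only illustrates what the cells contain and is not how tabular() is implemented; the names n_by_species and mean_sl are made up for this example.

```r
# Counts per species plus the overall count: the (n=1) column.
n_by_species <- c(table(iris$Species), All = nrow(iris))

# mean() applied to Sepal.Length within each species, then to all rows:
# the Sepal.Length / mean column.
mean_sl <- c(tapply(iris$Sepal.Length, iris$Species, mean),
             All = mean(iris$Sepal.Length))

n_by_species
round(mean_sl, 2)
```

Each entry is one function applied to one variable on one subset of the rows, which is the pattern the table formula describes.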

Reference

For the examples below we use the following definitions:

set.seed(100)
X <- rnorm(10)
X
A <- sample(letters[1:2], 10, replace=TRUE)
A
F <- factor(A)
F

Function syntax

tabular()

tabular(table, ...)
tabular.default(table, ...)
tabular.formula(table, data=parent.frame(), n, suppressLabels=0, ...) 

The tabular function is a generic function. The default method uses as.formula() to try to convert the table argument to a formula, then passes it and all the other arguments to the tabular.formula() method, which does most of the work. That method has 4 arguments plus ..., but usually only the first two are used, and a warning is issued if anything is passed in the ... arguments.

The value returned is a list-mode matrix corresponding to the entries in the table, with a number of attributes to help with formatting. See the ?tabular help page for more details.
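
As a quick way to see this structure, the sketch below builds a small table and inspects it with ordinary base R tools. The object name tab is arbitrary, and the set of attribute names it prints may differ between versions of the package.

```r
library(tables)

tab <- tabular((Species + 1) ~ (n = 1) + Sepal.Length * (mean + sd),
               data = iris)

class(tab)               # the S3 class used for dispatch
dim(tab)                 # rows and columns of the body of the table
names(attributes(tab))   # the extra attributes that support formatting
```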

format(), print(), toLatex() {#sec:formatsyntax}

format(x, digits=4, justification="n", ...) 
print(x, ...)
toLatex(x, file="", options=NULL,  ...)

The tables package provides methods for the format(), print() and utils::toLatex() generics. The arguments are:

as.matrix(), write.csv.tabular(), write.table.tabular() {#sec:csv}

as.matrix(x, format = TRUE, 
    rowLabels = TRUE, colLabels = TRUE, justification = "n", ...)
write.csv.tabular(x, file = "", justification = "n", row.names=FALSE, 
    write.options=list(), ...)
write.table.tabular(x, file="", 
    justification = "n", row.names=FALSE, col.names=FALSE,
    write.options=list(), ...) 

These functions export tables for further computations. The arguments are:
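
A minimal sketch of exporting a table, using only the signatures shown above with their default arguments; the file is written to a temporary location chosen for this example.

```r
library(tables)

tab <- tabular(Species ~ Sepal.Length * (mean + sd), data = iris)

# With the defaults, the cells are formatted and the row and column
# labels are included in the matrix.
m <- as.matrix(tab)
m

# Write the same table to a CSV file and look at the raw lines.
csv_file <- tempfile(fileext = ".csv")
write.csv.tabular(tab, file = csv_file)
readLines(csv_file)
```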

as.tabular()

as.tabular(x, ...)
as.tabular.default(x, like=NULL, ...)
as.tabular.data.frame(x, ...)

These functions create tables from existing matrices or dataframes of values. The dimnames of the input are used to construct default row and column names. If more elaborate labelling is wanted, use a tabular object as the like argument. The labelling for like will be used on the newly constructed result.

table_options(), booktabs() {#sec:booktabs}

The table_options() function sets a number of formatting defaults for the toLatex() method:

The defaults are

saved.options

Some options only apply to HTML output; see the help page ?table_options for details.

If you are using the \LaTeX\ booktabs package, the booktabs() function will set different options. Currently those are:

table_options()[c("toprule", "midrule", "bottomrule", "titlerule")]
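
If the booktabs style is wanted for only part of a document, one possible pattern is to take a snapshot of the options first and put it back afterwards. This sketch assumes that every entry returned by table_options() can be passed back to it as a name = value argument; the variable name old is arbitrary.

```r
library(tables)

old <- table_options()        # snapshot of the current settings
invisible(booktabs())         # switch to the booktabs rule settings

# The four rule options while the booktabs style is active.
table_options()[c("toprule", "midrule", "bottomrule", "titlerule")]

# Put the earlier settings back, one name = value pair per option.
invisible(do.call(table_options, old))
```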

The earlier table of iris data was produced using


We can use the doXXXX options to insert raw \LaTeX\ into a table:

toLatex(tabular(Species ~ (n=1) + Format(digits=2)*
         (Sepal.Length + Sepal.Width)*(mean + sd), data=iris),
      options=list(doFooter=FALSE, doEnd=FALSE))
cat("\\ \\\\ \\multicolumn{6}{l}{
\\textit{Overall, we see the following: }} \\\\
\\ \\\\")
toLatex(tabular(1 ~ (n=1) + Format(digits=2)*
         (Sepal.Length + Sepal.Width)*(mean + sd), data=iris),
      options=list(doBegin=FALSE, doHeader=FALSE))

Note that we need explicit toLatex() calls to access these options; in turn, that means the knitr chunk options require results = "asis".

latexNumeric() {#sec:latexNumeric}

latexNumeric(chars, minus = TRUE, leftpad = TRUE, rightpad=TRUE, 
                        mathmode = TRUE)

The latexNumeric() function converts character representations of numbers into a format suitable for display in \LaTeX\ documents. There are two goals:

The arguments are:

Operators

$e_1 + e_2$

Summing two expressions indicates that they should be displayed in sequence. For rows, this means $e_1$ will be displayed just above $e_2$; for columns, $e_1$ will be just to the left of $e_2$.

Example:

tabular(F + 1 ~ 1)

$e_1 * e_2$

Multiplying two expressions means that each element of $e_1$ will be applied to each element of $e_2$. If $e_1$ is a factor, then $e_2$ will be displayed for each element of it. NB: $*$ has higher precedence than $+$ and evaluation proceeds from left to right. The expression $(e_1 + e_2) * (e_3 + e_4)$ is equivalent to $e_1 * e_3 + e_1 * e_4 + e_2 * e_3 + e_2 * e_4$.

Example:

tabular( X*F*(mean + sd) ~ 1 )
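
The distributive rule can be checked by writing the same table in compact and in expanded form. In this sketch Y is a second numeric vector invented for the example, and compact and expanded are arbitrary names.

```r
library(tables)

set.seed(100)
X <- rnorm(10)
Y <- rnorm(10)   # hypothetical second variable, not used elsewhere

compact  <- tabular((X + Y) * (mean + sd) ~ 1)
expanded <- tabular(X * mean + X * sd + Y * mean + Y * sd ~ 1)

dim(compact)
dim(expanded)
```

Both formulas describe four rows (two variables by two statistics) and a single column.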

$e_1 \sim e_2$

The tilde separates row specifications from column specifications, but otherwise acts the same as $*$, i.e. each row value applies to each column.

Example:

tabular( X*F ~ mean + sd )

$e_1 = e_2$

The operator $=$ is used to set the name of $e_2$ to a displayed version of $e_1$. It is an abbreviation for Heading($e_1$)*$e_2$. NB: because $=$ has lower operator precedence than any other operator, we usually put parentheses around these expressions, i.e. $(e_1 = e_2)$.

Example: F is renamed to Newname.

tabular( X*(Newname=F) ~ mean + sd )

Terms in Formulas

R parses table formulas into sums, products, and bindings separated by the tilde formula operator. What comes between the operators are other expressions. Other than the pseudo-functions described in section \ref{sec:pseudo}, these are evaluated and the actions depend on the type of the resulting value.

Closures or other functions {#sec:closures}

If the expression evaluates to a function (e.g. it is the name of a function), then that function becomes the summary statistic to be displayed. The summary statistic should take a vector of values as input, and return a single value (either numeric, character, or some other simple printable value). If no summary function is specified, the default is length, to count the length of the vector being passed.

Note that only one summary function can be specified for any cell in the table or an error will be reported.

Example: mean and sd are specified functions; n is the renamed default statistic.

tabular( (F+1) ~ (n=1) + X*(mean + sd) )
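
Any function that maps a vector to a single value can play the same role as mean and sd. As an illustration, CV below is a user-defined coefficient of variation; it is not part of the tables package.

```r
library(tables)

# Coefficient of variation: takes a numeric vector, returns one number.
CV <- function(x) sd(x) / mean(x)

tabular(Species ~ Sepal.Length * (mean + sd + CV), data = iris)
```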

Factors

If the expression evaluates to a factor, the dataset is broken up into subgroups according to the levels of the factor. Most of the examples above have shown this for the factor F, but this can also be used to display complete datasets:

Example: creating a factor to show all data. Use the identity function to display the values in each cell.

tabular( (i = factor(seq_along(X)))  ~ 
       Heading()*identity*(X+A + 
              (F = as.character(F) ) ) )

Logical vectors

If the expression evaluates to a logical vector, it is used to subset the data.

Example: creating subsets on the fly.

tabular( (X > 0) + (X < 0)  + 1
    ~ ((n = 1) + X*(mean + sd)) )
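
The first row of that table can be reproduced by hand with ordinary subsetting, which may help when checking a formula. This is a base R sketch of the idea rather than the package's own code; pos is a name chosen for the example.

```r
set.seed(100)
X <- rnorm(10)

pos <- X > 0   # the logical vector from the first row term

# Count, mean and standard deviation of the selected values only.
c(n = sum(pos), mean = mean(X[pos]), sd = sd(X[pos]))
```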

Language Expressions

If the expression evaluates to a language object, e.g. the result of quote() or substitute(), then it will be replaced in the table formula by its result. This allows complicated table formulas to be saved and re-used. For examples, see section \ref{sec:tableformulas}.
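
A small sketch of this kind of re-use: a fragment of a table formula is stored with quote() and then referred to by name in two tables. The name mean_sd is made up for the example.

```r
library(tables)

mean_sd <- quote(mean + sd)   # a saved piece of a table formula

tab1 <- tabular(Species ~ Sepal.Length * mean_sd, data = iris)
tab2 <- tabular(Species ~ Petal.Length * mean_sd, data = iris)
tab1
tab2
```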

Other vectors {#sec:othervectors}

If the expression evaluates to something other than the above, then it is assumed to be a vector of values to be summarized in the table. If you would like to summarize a factor or logical vector, wrap it in I() to prevent special handling.

Note that the following must all be true, or an error will be reported:

Example: treating a logical vector as values.

tabular( I(X > 0) + I(X < 0)
    ~ ((n=1) + mean + sd) )

"Pseudo-functions" {#sec:pseudo}

Several directives to tables may be embedded in the table formula. This is done using "pseudo-functions". Syntactically they look like function calls, but reserved names are used. In most cases, their action applies to later factors in the term in which they appear. For example,

X*Justify(r)*(Y + Format(digits=2)*Z) + A

will apply the Justify(r) directive to both Y and Z, but the Format(digits=2) directive will only apply to Z, and neither will apply to A.

Format() {#sec:formats}

By default tables formats each column using the standard format() function, with arguments taken from the format.tabular() call (see section \ref{sec:formatsyntax}).

The Format() pseudo-function does two things: it changes the formatting, and it specifies that all values it applies to will be formatted together. The "call" to Format looks like a call to format, but without specifying the argument x. When tabular() formats the output it will construct x from the entries in the table governed by the Format() specification.

Example: The mean and standard deviation are both governed by the same format, so they are displayed with the same number of decimal places, chosen so that the smallest values (the means) show two significant digits.

tabular( (F+1) ~ (n=1) 
           + Format(digits=2)*X*(mean + sd) )

For customized formatting, an alternate syntax is to pass a function call to Format(), rather than a list of arguments. The function should accept an argument named x (but as with the regular formatting, x should not be included in the formula), to contain the data. It should return a character vector of the same length as x.

Example: Use a custom function and sprintf() to display a standard error in parentheses.

StdErr <- function(x) sd(x)/sqrt(length(x))
fmt <- function(x, digits, ...) {
  s <- format(x, digits=digits, ...)
  is_stderr <- (1:length(s)) > length(s) %/% 2
  s[is_stderr] <- sprintf("$(%s)$", s[is_stderr])
  s[!is_stderr] <- latexNumeric(s[!is_stderr])
  s
}
tabular( Format(fmt(digits=1))*(F+1) ~ X*(mean + StdErr) )
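
A simpler custom formatter can be sketched in the same way. Here fixed2 is a hypothetical helper that always shows two decimal places; the unlist() call is only a precaution in case the values arrive as a list rather than a plain vector.

```r
library(tables)

# Hypothetical formatter: accepts x, returns a character vector
# of the same length with two decimal places.
fixed2 <- function(x, ...) sprintf("%.2f", unlist(x))

tabular(Species ~ Format(fixed2()) * Sepal.Length * (mean + sd),
        data = iris)
```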

Character values in cells in the table are handled specially; see section \ref{sec:formatdetails} below.

.Format()

The pseudo-function .Format() is mainly intended for internal use. It takes a single integer argument, saying that data governed by this call uses the same formatting as the format specification indicated by the integer. In this way entries can be commonly formatted even when they are not contiguous. The integers are assigned sequentially as the format specification is parsed; users will likely need trial and error to find the right value in a complicated table with multiple formats.

Example: Format two separated columns with the same format.

tabular( (F+1) ~ X*(Format(digits=2)*mean 
                    + (n=1) + .Format(1)*sd) )

Heading()

Normally tabular() generates row and column labels by deparsing the expression being tabulated. These can be changed by using the Heading() pseudo-function, which replaces the heading on the next object found. The heading can either be a name or a string in quotes. If the character.only argument is TRUE, the expression will be evaluated to a string which will be used as a heading. \LaTeX\ codes which are not syntactically valid R can be used either in quoted strings or with character.only = TRUE.

If no argument is passed, the next label is suppressed.

There's an optional argument override, which must be either TRUE or FALSE if present. If it is TRUE (or not present), then the heading will override a previously specified heading. If FALSE, it will not. The latter seems likely only to be of use in automatically generated code, and is used in the automatically generated labels for factors.

Another optional argument is nearData. This is used only when two terms in a table are concatenated using +, and they don't have the same number of rows or columns. Under the default TRUE value, the smaller one is moved closer to the data in the table (i.e. to the right for row labels, down for column labels); if FALSE, it is moved in the opposite direction.

Example: Replace F with a Greek $\Phi$, and suppress the label for X.

tabular( (Heading("$\\Phi$")*F+1) ~ (n=1) 
           + Format(digits=2)*Heading()*X*(mean + sd) )

Example: Use nearData = FALSE to push a label away from the data:

tabular( X*F + Heading("near")*X 
        + Heading("far", nearData = FALSE)*X ~ mean + sd )

Justify()

The Justify() pseudo-function is used to specify the text justification of the headers and data values in the table. If called with one argument, that value is used for both labels and data; if called with two arguments, the first is used for the labels, the second for the data. If no Justify() specification is given, the default passed to format(), print() or toLatex() will be used. Values may be specified without quotes if they are legal R names; quoted strings may also be used. (The latter is useful for \LaTeX\ output, for example Justify("r@{}"), to suppress column spacing on the right.)

Example:

tabular( Justify(r)*(F+1) ~ Justify(c)*(n=1) 
   + Justify(c,r)*Format(digits=2)*X*(mean + sd) )

Percent()

The Percent() pseudo-function is used to specify a statistic that depends on other values in the table. It has two optional arguments:

The special syntax Equal(...) will record the expressions in ..., and ignore any factor based subsetting if the factor does not appear among the expressions. Similarly Unequal(...) will use values which differ in any of the expressions in ... from the values in the current cell. (In fact, the mechanism is more general. The expressions in Equal(...) or Unequal(...) are deparsed and treated as strings. Any logical vector elsewhere in the table may be labelled with a string using the labelSubset function and those labels will be respected. Unlabelled logical vectors in the table formula will always be used for subsetting.)

If a logical vector is given, it is used to select which values form the denominator. Anything else is just passed to fn as given.

- fn=percent This is the function which actually does the computation. The default definition is function(x, y) 100*length(x) /length(y), giving the percentage count, but any other two argument function could be used.

These two examples are different ways of producing the same table:

tabular( (Factor(gear, "Gears") + 1)
          *((n=1) + Percent() 
            + (RowPct=Percent("row")) 
            + (ColPct=Percent("col"))) 
         ~ (Factor(carb, "Carburetors") + 1)
          *Format(digits=1), data=mtcars )
tabular( (Factor(gear, "Gears") + 1)
          *((n=1) + Percent() 
            + (RowPct=Percent(Equal(gear)))  # Equal, not "row"
            + (ColPct=Percent(Equal(carb)))) # Equal, not "col"
         ~ (Factor(carb, "Carburetors") + 1)
          *Format(digits=1), data=mtcars )
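
To see what a single row percentage amounts to, the default computation quoted above can be applied by hand to one cell. This is a base R sketch; pct_count, in_row and in_cell are names invented for the example, and mpg is used only as a convenient column to count.

```r
# Same definition as the default described above: the count in the cell
# as a percentage of the count in the denominator.
pct_count <- function(x, y) 100 * length(x) / length(y)

in_row  <- mtcars$gear == 3            # denominator: the whole gear = 3 row
in_cell <- in_row & mtcars$carb == 1   # the cell gear = 3, carb = 1

row_pct <- pct_count(mtcars$mpg[in_cell], mtcars$mpg[in_row])
row_pct
```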

Arguments()

The Arguments() pseudo-function is an exception to the rule that pseudo-functions apply to later factors in the table. What it does is to specify (additional) arguments to the summary function (see section \ref{sec:closures}). For example, the weighted.mean() function takes two arguments: x and w. To use it in a table, you would specify the values to use as x via the usual mechanism for the analysis variable (section \ref{sec:othervectors}), and include a term Arguments(w=weights) either before or after it. The function will be called as weighted.mean(x[subset], w=weights[subset]), where subset is a logical vector indicating which rows of data belong in the current cell.

It is actually a little more complicated than described above. The arguments to Arguments() are evaluated in full, then only those which are of length n are subsetted. And if no analysis variable has been specified, but Arguments() has been, then the function will be called without the x[subset] argument. Finally, the Arguments() entry will not create a heading.

For example:

# This is the example from the weighted.mean help page
wt <- c(5,  5,  4,  1)/15
x <- c(3.7,3.3,3.5,2.8)
gp <- c(1,1,2,2)
tabular( (Factor(gp) + 1) 
                ~ weighted.mean*x*Arguments(w = wt) )

The same table (without the x heading) can be produced using

tabular( (Factor(gp) + 1) 
                ~ Arguments(x, w = wt)*weighted.mean )

The order of the weighted.mean and Arguments() factors makes no difference.
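
The values in these tables can be verified by making the calls of the form weighted.mean(x[subset], w=weights[subset]) directly. The sketch below does this in base R; by_group and overall are names chosen for the example.

```r
wt <- c(5, 5, 4, 1) / 15
x  <- c(3.7, 3.3, 3.5, 2.8)
gp <- c(1, 1, 2, 2)

# For each group, subset both the values and the weights.
by_group <- sapply(split(seq_along(x), gp),
                   function(i) weighted.mean(x[i], w = wt[i]))
overall  <- weighted.mean(x, w = wt)

c(by_group, All = overall)
```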

DropEmpty()

DropEmpty() indicates that cells (or whole rows or columns of the table) should be dropped if they contain no observations. This will prevent ugly results like NA or NaN from showing up in the table.

This pseudo-function takes two optional arguments, which (with default value c("row", "col", "cell")) and empty (with default value "").

If the which argument contains "row", then any row in the table in which all cells are empty will be dropped. Similarly, if it contains "col", empty columns will be dropped. If it contains "cell", then cells in rows and columns that are not dropped will be set to the empty string.

For example, without using DropEmpty(), this table is ugly:

set.seed(730)
df <- data.frame(Label = LETTERS[1:9], 
         Group = rep(letters[1:3], each=3), 
         Value = rnorm(9), 
         stringsAsFactors = TRUE)
tabular( Label ~ Group*Value*mean, 
        data = df[1:6,])

This looks much better:

tabular( Label ~ Group*Value*mean*
            DropEmpty(empty="."), 
        data = df[1:6,])
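
The ugly entries come from applying a summary function to a group with no observations. Assuming an empty cell passes a zero-length vector to the function, the effect is easy to reproduce in base R:

```r
# A cell with no observations corresponds to an empty vector.
empty_cell <- numeric(0)

mean(empty_cell)     # NaN
length(empty_cell)   # 0
```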

Formula Functions {#sec:tableformulas}

Currently several examples of formula functions are provided. Not all are particularly robust; e.g. Hline() only works for \LaTeX\ output and must be in a particular position in the formula. Users can provide their own as well. Such functions should return a language object, which will be substituted into the formula in place of the formula function call.

All()

This function expands all the columns from a dataframe into separate variables in the table. It has syntax

All(df, numeric=TRUE, character=FALSE, logical=FALSE, 
        factor=FALSE, complex=FALSE, raw=FALSE, other=FALSE,
        texify=getOption("tables.texify", FALSE))

The arguments are

If functions are given for any of the selection arguments, the columns will be transformed according to the specified function before inclusion. For example, using factor=as.character will convert factors into character vectors in the table.

Example: Show the means of the numeric columns in the iris data.

tabular( Species ~ Heading()*mean*All(iris), data=iris)

AllObs(), RowNum()

The AllObs() function displays all of the observations in a dataset. It does this by creating a factor with a different level for each observation, and a summary statistic function which just displays the observation. It works with DropEmpty() to drop rows (or columns) from the table if they correspond to non-existent observations. For example,

df <- mtcars[1:10,]
tabular(Factor(cyl)*Factor(gear)*AllObs(df) ~ 
               rownames(df) + mpg, data=df)

Often (as with the mtcars dataset) the full dataset takes a lot of space to display. In that case, it can be displayed in multiple columns using a combination of the AllObs() and RowNum() functions. Because this affects both rows and columns in the resulting table, the code is a little unusual. You would normally compute the RowNum() formula function outside the call to tabular(), and include it in the row specification wrapped in I() and in the column specification in the within argument to AllObs(). For example,

rownum <- with(mtcars, RowNum(list(cyl, gear)))
tabular(Factor(cyl)*Factor(gear)*I(rownum) ~
        mpg * AllObs(mtcars, within = list(cyl, gear, rownum)), 
        data=mtcars)

Despite its name, RowNum can be used to specify columns instead of rows, for a column-major display. In this case, its perrow argument should be interpreted as "per column". For example,

rownum <- with(mtcars, RowNum(list(cyl, gear), perrow = 2))
tabular(Factor(cyl)*Factor(gear)*
           AllObs(mtcars, within = list(cyl, gear, rownum)) ~
               mpg * I(rownum), 
        data=mtcars)

Hline()

This function produces horizontal lines in the table. It only works for \LaTeX\ output, and must be the first factor in a term in the table formula. It has syntax

Hline(columns)

The argument is

Example:

tabular( Species + Hline(2:5) + 1 
                         ~ Heading()*mean*All(iris), data=iris)

Literal() {#sec:Literal}

This function inserts literal text as a label. It has syntax

Literal(x)

The single argument is the text to insert. It is used by the Hline() function to insert the text.

PlusMinus()

This function produces table entries like $x \pm y$ with an optional header. It has syntax

PlusMinus(x, y, head, xhead, yhead, digits=2, ...)

The arguments are

Example: Display mean $\pm$ standard error.

StdErr <- function(x) sd(x)/sqrt(length(x))
tabular( (Species+1) ~ All(iris)*
          PlusMinus(mean, StdErr, digits=1), data=iris )

Paste()

This function produces table entries made up of multiple values. It has syntax

Paste(..., head, digits=2, justify="c", prefix="", sep="",
      postfix="")

The arguments are

Example: Display a confidence interval.

lcl <- function(x) mean(x) - qt(0.975, df=length(x)-1)*StdErr(x)
ucl <- function(x) mean(x) + qt(0.975, df=length(x)-1)*StdErr(x)
tabular( (Species+1) ~ All(iris)*
          Paste(lcl, ucl, digits=2, 
                head="95\\% CI", sep=",", prefix="[",
                postfix="]"), 
          data=iris )

Factor(), RowFactor() and Multicolumn()

\label{sec:RowFactor}

The Factor() function converts its argument into a factor, but keeps the original name for a column heading. RowFactor() is designed to be used only for \LaTeX\ output: it produces multiple rows the way a factor does, but with more flexibility in the formatting. The Multicolumn() function is also designed for \LaTeX\ output: it displays factor levels in the style where the level is displayed across multiple columns on its own line.

They have syntax

Factor(x, name, levelnames, texify=getOption("tables.texify", FALSE))
RowFactor(x, name, levelnames, spacing=3, space=1, 
                    nopagebreak="\\nopagebreak", texify=getOption("tables.texify", FALSE))
Multicolumn(x, name, levelnames, width=2, first=1, justify="l",
                    texify=getOption("tables.texify", FALSE))

The arguments are

Example: Show the first 15 lines of the iris dataset, in groups of 5 lines.

subset <- 1:15
tabular( RowFactor(subset, "$i$", spacing=5)  ~ 
       All(iris[subset,], factor=as.character)*Heading()*identity )

To add extra space after each high level group in a multi-way classification, use spacing = 1. For example:

set.seed(1000)
dat <- expand.grid(Block=1:3, Treatment=LETTERS[1:2], 
                                Subset=letters[1:2])
dat$Response <- rnorm(12)
toLatex( tabular( RowFactor(Block, spacing=1)
                * RowFactor(Treatment, spacing=1, space=0.5)
                * Factor(Subset)
                ~ Response*Heading()*identity, data=dat),
                options=list(rowlabeljustification="c") )

For longer tables, the "longtable" environment allows the table to cross page boundaries. Using this is more complicated, as in the example below. The toprule setting inserts the caption as well as the top rule, because the longtable package requires it to be within the table. The midrule setting gets the headings to repeat on subsequent pages. (I've done all of this in a way that is compatible with the booktabs style; if you want the default style, use \hline in place of the booktabs \toprule and \midrule macros in the options settings instead.) To avoid extra spacing at the top of those pages, we need to undo the automatic addition of a \verb!\normalbaselineskip! there, and use suppressfirst=FALSE so that the first page doesn't get messed up. Whew!

subset <- 1:50
toLatex( tabular( RowFactor(subset, "$i$", spacing=5, 
                                             suppressfirst=FALSE)  ~ 
       All(iris[subset,], factor=as.character)*Heading()*identity ),
       options = list(tabular="longtable",
          toprule="\\caption{This table crosses page boundaries.}\\\\
              \\toprule",
midrule="\\midrule\\\\[-2\\normalbaselineskip]\\endhead\\hline\\endfoot") )

To suppress the row numbering, use suppress=3 in the call to tabular. (It is 3 because we need to suppress the column heading, the rewritten labels for the rows, and the original labels. Trial and error is the best way to determine this!) Unfortunately, the spacing features of RowFactor() won't work without the row labels.

subset <- 1:10
tabular( Factor(subset)  ~ 
       All(iris[subset,], factor=as.character)*Heading()*identity, 
       suppress=3 )

(It is actually possible to get this to work with RowFactor(), but it is ugly: set the name and level names to "", and set the justification to "l@{}" to suppress the intercolumn spacing. Then the column of row labels will be there, but it will be zero width and invisible.)

RowFactor with spacing > 1 will add the nopagebreak macro at the beginning of each label except the first in the group. This can produce \LaTeX\ errors in any column except the first one. One workaround for this is to post-process the table to move the macro. For example, if tab contains the result of tabular() and \LaTeX\ complains about misplaced \verb!\nopagebreak! macros, this will allow it to be displayed properly:

code <- capture.output( toLatex( tab ) )
code <- sub("^(.*)(\\\\nopagebreak )", "\\2\\1", code)
cat(code, sep = "\n")
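
The effect of the substitution is easiest to see on a single line. The line below is made up for illustration and is not real output from toLatex().

```r
# A made-up table line with the macro in the second column.
line <- "1 & \\nopagebreak setosa & 5.1 \\\\"

# Move the macro (and its trailing space) to the start of the line.
fixed <- sub("^(.*)(\\\\nopagebreak )", "\\2\\1", line)
cat(fixed, "\n")
```

Lines that do not contain the macro are left unchanged, because sub() returns its input when the pattern does not match.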

To get group labels to span multiple columns, the levelnames argument can be used with embedded \LaTeX\ code. For example,

tabular( Multicolumn(Species, width=3, 
            levelnames=paste("\\textit{Iris", levels(Species),"}")) 
            * (mean + sd)  ~ All(iris), data=iris, suppress=1)

Further Details

Formatting {#sec:formatdetails}

As mentioned in \@ref(sec:formats), formatting in tables depends on the standard format() function or other user-selected functions. Here are the details of how it is done.

The format.tabular() method does the first part of the work. First, it constructs the calls to the appropriate formatting functions, and uses them to format all of the non-character entries in the table. The character entries are left as-is, except as described below. This converts the tabular object to a character array.

The procedure goes as follows:

  1. Entries in the table without specified formatting are formatted first, separately by column using the format() function. This is so that entries in a given column will end up with the same character width and (with the default settings) with the same number of decimal places.
  2. Entries in the table with specified formatting are grouped according to the format specification. For example, if two columns both share the same Format(), they will be formatted in a single call. This results in such entries ending up with the same character width and (with the default settings) with the same number of decimal places.
  3. If the toLatex argument is TRUE, any numeric entries are passed to the latexNumeric() function (see \ref{sec:latexNumeric}), which replaces blanks and minus signs with fixed width spaces and \LaTeX\ minus signs so that all entries will display in the same width. This means that numeric values will normally have decimal points aligned, unless the formatting function explicitly removes leading spaces. Non-numeric entries are passed through the Hmisc::latexTranslate function so that special characters are displayed properly.

  4. If the toLatex argument is FALSE, an attempt is made to justify the results using simple ASCII spacing, according to the Justify() specification with the justification argument used as a default.
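
The reason for formatting entries together, as in steps 1 and 2, can be seen with format() alone. In this base R sketch, col stands in for the numeric entries of one column.

```r
col <- c(1.5, 10, 100.25)

together   <- format(col)          # one call for the whole column
separately <- sapply(col, format)  # one call per entry

together     # common width and number of decimal places
separately   # each entry formatted on its own
```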

Note that \LaTeX\ special characters will be escaped in data when toLatex() is called, but row and column headings generated by All(), Factor(), etc. will by default not have the escapes done. Those functions have a texify argument that can be set to TRUE to enable this behaviour (e.g. if the label is not meant to be processed by \LaTeX). For example, with the definition

df <- data.frame(A = factor(c( "$", "\\" ) ), B_label=1:2)

the code

tabular( mean ~ A*B_label, data=df ) 

would fail, as the labels would include the special characters. But this will work, provided the Hmisc package is available:

options(tables.texify = TRUE)
tabular( mean ~ Factor(A)*All(df), data=df )

Use of the texify option requires that the suggested package "Hmisc" be available.

As mentioned above, character values in cells in the table are handled specially. If the default format function (or a custom function named format) is used, those character values are not formatted; they are just copied into the result. (This is so that a column can have mixed numeric and character values, and the numerics are not converted to character before formatting.) If you want to use format on character values, you will need to use a custom formatting function with a different name.

Missing Values

By default, most summary statistics in R return NA if any of the input values are NA, but have ways to treat NA differently. For example, the mean() function has the na.rm argument:

dat <- data.frame( a = c(1, 2, 3, NA), b = 1:4 )
mean(dat$a)
mean(dat$a, na.rm=TRUE)

The tabular() function itself has no way to specify special NA handling, but there are several ways to do this yourself, depending on how you want them handled. To ignore NA values within the column, define a new function which sets the different behaviour. For example,

Mean <- function(x) base::mean(x, na.rm=TRUE)
tabular( Mean ~ a + b, data=dat )

An alternative approach is to use na.omit() to work on a subset of your data which has rows with any missing values removed, e.g.

tabular( mean ~ a + b, data = na.omit(dat) )
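
These two approaches are not equivalent for columns that have no missing values of their own, because na.omit() drops the whole row. A small base R check of the difference, with the dat data frame repeated so that the sketch is self-contained:

```r
dat <- data.frame( a = c(1, 2, 3, NA), b = 1:4 )

# Per-column removal: only the NA in `a` is skipped,
# so all four values of `b` contribute to its mean.
mean_b_all <- mean(dat$b, na.rm = TRUE)

# Row-wise removal: the fourth row is dropped entirely,
# so `b` is averaged over the remaining three values.
mean_b_complete <- mean(na.omit(dat)$b)

c(all = mean_b_all, complete = mean_b_complete)
```

The first mean is 2.5 and the second is 2, so the choice between the approaches matters whenever missingness in one variable should not affect summaries of another.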

A third possibility is to use the complete.cases() function to remove missings only from some columns, e.g.

tabular( 
  Mean ~ (1 + Heading(Complete)*complete.cases(dat)) * (a + b), 
               data=dat )

Missing values in factors are normally ignored, i.e. observations whose value is missing won't match any category. If you would like NA to be used as an additional category, use exclude = NULL in a call to factor() when you create the variable, e.g. compare the following two tables:

A <- factor(dat$a)
tabular( A + 1 ~ (n=1))
A <- factor(dat$a, exclude = NULL)
tabular( A + 1 ~ (n=1) )
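
The difference between the two factors can also be checked directly in base R, without building a table. This is a small sketch using the same values as dat$a:

```r
a <- c(1, 2, 3, NA)

# Default: NA is not a level, so it matches no category.
default_levels <- levels(factor(a))

# exclude = NULL keeps NA as an additional level.
na_levels <- levels(factor(a, exclude = NULL))

default_levels
na_levels

# table() then counts the missing value in its own cell.
table(factor(a, exclude = NULL))
```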

Subsetting and Joining Tables

It is possible to select a subset of a table using the usual R matrix indexing on the table object. For example, this table contains rows with no data in them, and those yield ugly NA and NaN statistics:

set.seed(1206)
q <- data.frame(p = rep(c("A","B"),each=10,len=30),
                           a = rep(c(1,2,3),each=10),id=seq(30),
                           b = round(runif(30,10,20)),
                           c = round(runif(30,40,70)),
        stringsAsFactors = FALSE)
tab <- tabular((Factor(p)*Factor(a)+1) 
                ~ (N = 1) + (b + c)*(mean+sd), data = q)
tab

To omit those rows, use matrix-like subsetting to select the rows where the first column of data (i.e. $N$) is greater than zero:

tab[ tab[,1] > 0, ]

Similarly, cbind() can be used to join tables that have identical row labels, and rbind() can be used to join tables with identical column labels. Thus the top part of the table above could be produced in another way:

formula <- Factor(p)*Factor(a) ~ 
       (N = 1) + (b + c)*(mean+sd)
tab <- NULL
for (sub in c("A", "B")) 
    tab <- rbind(tab, tabular( formula, 
                               data = subset(q, p == sub) ) )
tab

It is also possible to edit the row or column labels after constructing the table. For example,

colLabels(tab)
labs <- colLabels(tab)
labs[1, 2] <- "New label"
colLabels(tab) <- labs
tab

Note that <NA> in the column labels means "same as the label to the left", and in the row labels it means "same as the label above". This is used in constructing multi-column or multi-row labels.
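
To make the convention concrete, the sketch below expands the NA entries of a column label matrix into explicit labels. Both the labs matrix and the fill_left() helper are hypothetical and written only for this illustration; they are not part of the tables package, and a real colLabels() result carries extra attributes that this plain matrix does not.

```r
# Hypothetical label matrix, shaped like a two-level column heading.
labs <- matrix(c("Sepal.Length", NA,   "Sepal.Width", NA,
                 "mean",         "sd", "mean",        "sd"),
               nrow = 2, byrow = TRUE)

# Replace each NA with the nearest non-NA label to its left.
fill_left <- function(m) {
  for (i in seq_len(nrow(m)))
    for (j in seq_len(ncol(m))[-1])
      if (is.na(m[i, j])) m[i, j] <- m[i, j - 1]
  m
}

filled <- fill_left(labs)
filled
```

For row labels, the same idea applies down each column rather than across each row.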

knitr, rmarkdown and kableExtra support {#sec:knitr}

This vignette was originally written many years ago using Sweave, and is still available in that format. Nowadays I would recommend that most users use knitr instead: it is easier and more flexible. The input may be in Noweb syntax very similar to Sweave, or Markdown syntax using the rmarkdown package, as in this file.

One specific advantage of using knitr or rmarkdown is that explicit calls to toLatex() are not needed: by default, tabular objects will print in the appropriate formatting for \LaTeX\ or HTML output.

The kableExtra package may be used to customize displays. For example, the code below causes the table to be full width, and the colour of the 4th column is changed. These features require additional \LaTeX\ packages; see the kableExtra documentation for details.

library(magrittr)
library(kableExtra)
toKable(tab) %>% 
  kable_styling(full_width = TRUE) %>%
  column_spec(4, color = "red")

See the HTML vignette (which is written in rmarkdown) for more discussion and examples.

Captions, labels, etc.

LaTeX breaks the description of tables into two parts: the tabular environment holding the data, and the optional table environment surrounding it, which specifies captions, labels, the placement of the table in the document, etc. The tables package concentrates on the details of the tabular part, because I didn't want to duplicate the myriad options in LaTeX to set up the table wrapper. However, others are not so lazy, and Yihui Xie's knitr package includes the kable() function which does these things. (It is much less flexible about the actual contents, however.) Rather than copying all his code, I have added the latexTable() function. It uses kable() to produce a dummy table, then replaces the tabular part with the result of the tabular() function from this package. For example, this code produces Table \ref{tab:sepals}:

latexTable(tabular((Species + 1) ~ (n=1) + Format(digits=2)*
                   (Sepal.Length + Sepal.Width)*(mean + sd), 
                   data=iris),
           caption = "Iris sepal data", label = "sepals")

which should have floated to the top or bottom of page \pageref{tab:sepals}.

Acknowledgments

I gratefully acknowledge helpful suggestions and hints from Rich Heiberger, Frank Harrell, Dieter Menne, Marius Hofert, Jeff Newmiller and Jeffrey Miller. Hao Zhu was extremely helpful in adding the kableExtra support.

References



tables documentation built on May 3, 2023, 1:15 a.m.