sumtable: Summary Statistics
In vtable: Variable Table for Variable Documentation

The vtable package serves the purpose of outputting automatic variable documentation that can be easily viewed while continuing to work with data.

vtable contains four main functions: vtable() (or vt()), sumtable() (or st()), labeltable(), and dftoHTML()/dftoLaTeX(). This vignette focuses on sumtable().

sumtable() takes a dataset and outputs a formatted summary statistics table. Summary statistics can be for the whole data set at once or by-group.

There are a huge number of R packages that will make a summary statistics table for you. What does vtable::sumtable() bring that isn't already there?

First, like other vtable functions, sumtable() by default prints its results to Viewer (in RStudio) or the browser (elsewhere), making it easy to look at information about your data while continuing to work on it.

Second, sumtable() is designed to have nice defaults and be fast to work with. That's fast in two senses: you should be able to ask for a sumtable and have it be pretty much what you want immediately, and the number of keystrokes should stay low (thus the st() shortcut, and the intent of not having to set a bunch of options).

sumtable() has customization options, but they're certainly not as extensive as with a package like gtsummary or arsenal. Nor should they be! If you want full control over your table, those packages are already great; we don't need another package that does that.

However, if you want the kind of table sumtable() produces (and I think a lot of you do!) then it's perfect and easy. This makes sumtable() very similar in spirit to the summary statistics functionality of stargazer::stargazer(), except with some additional important bonuses, like working with tibbles, factor variables, producing summary statistics by group, and being a summary-statistics-only function so the documentation isn't entwined with a bunch of regression-table functionality.

Like, look at this. Isn't this what you basically already want? After loading the package this took eight keystrokes and no option-setting:

library(vtable)
st(iris)

The `sumtable()` function

sumtable() (or st() for short) syntax follows the following outline:

sumtable(data,
         vars=NA,
         out=NA,
         file=NA,
         summ=NA,
         summ.names=NA,
         add.median=FALSE,
         group=NA,
         group.long=FALSE,
         group.test=FALSE,
         group.weights =NA,
         col.breaks=NA,
         digits=2,
         fixed.digits=FALSE,
         numformat = formatfunc(digits = digits, big.mark = ''),
         skip.format = c('notNA(x)','propNA(x)','countNA(x)'),
         factor.percent=TRUE,
         factor.counts=TRUE,
         factor.numeric=FALSE,
         logical.numeric=FALSE,
         logical.labels=c('No','Yes'),
         labels=NA,
         title='Summary Statistics',
         note = NA, 
         anchor=NA,
         col.width=NA,
         col.align=NA,
         align=NA,
         note.align='l',
         fit.page=NA,
         simple.kable=FALSE,
         obs.function=NA,
         opts=list())

The goal of sumtable() is to take a data set data and output a usually-HTML (but data.frame, kable, csv, and latex options are there too) file with summary statistics for each of the variables in data. There are several options as to how the table will be constructed, and each of these options is explained below. Throughout, the output will be built as kables since this is an RMarkdown document. However, generally you can leave out at its default and it will publish an HTML table to Viewer (in RStudio) or the browser (otherwise). This will also include some additional information about your data that can't be demonstrated in this vignette.

`data` and `vars`

The data argument can take any data.frame, data.table, tibble, or matrix, as long as it has a valid set of variable names stored in the colnames() attribute.

By default, sumtable will include in the summary statistics table every variable in the data set that is (1) numeric, (2) factor, (3) logical, or (4) a character variable with six or fewer unique values (as a factor), and it will include them in the order they're in the data.

You can override that variable list with vars, which is just a vector of variable names to include. You can use this to force sumtable to ignore variables you don't want, or to include variables it doesn't by default.

data(LifeCycleSavings)
st(LifeCycleSavings, vars = c('pop15','pop75'))
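
To see the default selection rule at work, here is a minimal sketch using a made-up data set (toy and its column names are hypothetical, not part of vtable). It asks for the table back as a data.frame with out = 'return' so the included variables can be inspected directly. Per the rule above, the id column should be left out by default because it is a character variable with more than six unique values.

```r
library(vtable)

# Hypothetical toy data with one column of each kind discussed above
set.seed(1)
toy <- data.frame(
  income   = rnorm(20, mean = 50, sd = 10),         # numeric
  region   = factor(rep(c('North', 'South'), 10)),  # factor
  employed = rep(c(TRUE, FALSE, TRUE, TRUE), 5),    # logical
  sector   = rep(c('A', 'B', 'C', 'D'), 5),         # character, 4 unique values
  id       = paste0('person', 1:20),                # character, 20 unique values
  stringsAsFactors = FALSE
)

# Default variable selection, returned as a data.frame
default_tab <- st(toy, out = 'return')
default_tab$Variable

# Name the variables yourself to narrow the table down
picked_tab <- st(toy, vars = c('income', 'region'), out = 'return')
picked_tab$Variable
```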

`out`

The out option determines what will be done with the resulting summary statistics table. There are several options for out:

| Option | Result |
|------------|-----------------------------------------|
| browser | Loads output in web browser. |
| viewer | Loads output in Viewer pane (RStudio only). |
| htmlreturn | Returns HTML code for output file. |
| return | Returns summary table in data.frame format. Depending on options, the data frame may be entirely character variables. |
| csv | Returns summary table in data.frame format and, with a file option, saves that to CSV. |
| kable | Returns a knitr::kable() |
| latex | Returns a LaTeX table. |
| latexpage | Returns an independently-buildable LaTeX document. |

By default, sumtable will select 'viewer' if running in RStudio, and 'browser' otherwise. If it's being built in an RMarkdown document with knitr, it will default to 'kable'. Note that an RMarkdown default to 'kable' will also include some nice formatting, where out = 'kable' directly will give you a more basic kable you can format yourself.

Also be aware that some of these formats, like 'return', do not support multi-column cells, and so instead you'll have headers squished into one cell, with blank cells next to them.

data(LifeCycleSavings)
sumtable(LifeCycleSavings)
sumtab <- sumtable(LifeCycleSavings,out='return')

#I can easily \input this into my LaTeX doc:
st(LifeCycleSavings,out='latex',file='mytable1.tex')
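
The options that return an object instead of opening a window are the easiest to explore from a script. The following is a small sketch (the object names are arbitrary) that requests the same table in three of the formats listed above and looks at what comes back.

```r
library(vtable)
data(LifeCycleSavings)

# 'return' hands back a plain data.frame you can subset or export yourself
tab_df <- sumtable(LifeCycleSavings, out = 'return')
class(tab_df)
head(tab_df, 3)

# 'htmlreturn' hands back the HTML source rather than displaying it
tab_html <- sumtable(LifeCycleSavings, out = 'htmlreturn')

# 'kable' hands back a knitr::kable() object that you can style further
tab_kable <- sumtable(LifeCycleSavings, out = 'kable')
class(tab_kable)
```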

`file`

The file argument will write the summary statistics table to an HTML or LaTeX file and save it. It will automatically append the 'html' or 'tex' filetype if the filename does not include a period.

data(LifeCycleSavings)
st(LifeCycleSavings,file='lifecycle_summary')
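
file also pairs with out = 'csv', as noted in the table of out options. Here is a short sketch, assuming you are happy to write to a temporary file (csv_path is just an illustrative name), that saves the table and reads it back.

```r
library(vtable)
data(LifeCycleSavings)

# Write to a temporary location so nothing is left in the working directory
csv_path <- tempfile(fileext = '.csv')
tab <- st(LifeCycleSavings, out = 'csv', file = csv_path)

# The table is returned as a data.frame and also saved to disk
file.exists(csv_path)

# Read the saved copy back in to inspect it
saved <- read.csv(csv_path, check.names = FALSE)
dim(saved)
```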

`summ` and `summ.names`

summ is the set of summary statistics functions to run and include in the table. It is very flexible, hopefully without being difficult to use.

It takes a character vector in which each element is of the form function(x), where function(x) is any function that takes a vector and returns a single numeric value. For example, summ=c('mean(x)','median(x)','mean(log(x))') would calculate the mean, median, and mean of the log for each variable. summ.names is just the heading-title of the corresponding summ. So in this example that might be summ.names=c('Mean','Median','Mean of Log').

Factor variables largely ignore summ (unless factor.numeric = TRUE) and will just report counts in the first column and means/percentages in the second. You may want to consider this when selecting the order you put your summ in if you have both factor and numeric variables.

summ treats three vtable functions as special: propNA(x), countNA(x), and notNA(x), which give the proportion of NA values, the count of NA values, and the count of non-NA values in the variable, respectively. propNA(x) and countNA(x) are the only functions that include NA values in their calculations.

By default, summ is c('notNA(x)', 'mean(x)', 'sd(x)', 'min(x)', 'pctile(x)[25]', 'pctile(x)[75]', 'max(x)') in 'one-column' tables. If there's more than one column either due to the col.breaks option or the group option, it defaults to c('notNA(x)', 'mean(x)', 'sd(x)'). Alternately, if a given column of variables is entirely made up of factor variables, it defaults to c('notNA(x)','mean(x)'). These precise defaults have corresponding default summ.names. If you set your own summ but not summ.names, it will try to guess the name by taking your function, removing (x), and capitalizing, so 'mean(x)' becomes 'Mean'.

If you want to get complex you can. If there are multiple 'columns' of summary statistics and you want different statistics in each column, make summ and summ.names into a list, where each entry is a character vector of the calculations/names you want in that column.

sumtable(iris,
         summ=c('notNA(x)',
                'mean(x)',
                'median(x)',
                'propNA(x)'))

#Getting complex
st(iris, col.breaks = 4,
   summ = list(
     c('notNA(x)','mean(x)','sd(x^2)','min(x)','max(x)'),
     c('notNA(x)','mean(x)')
   ),
   summ.names = list(
     c('N','Mean','SD of X^2','Min','Max'),
     c('Count','Percent')
   ))
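
Because each summ entry is an expression in x, you can also combine functions inside a single entry. As a sketch, the call below adds a coefficient of variation and a median taken from pctile(x); the choice of statistics and the heading names are only for illustration.

```r
library(vtable)

# Restrict to two numeric columns so every statistic applies to every row
measurements <- iris[, c('Sepal.Length', 'Sepal.Width')]

tab <- st(measurements,
          summ = c('notNA(x)',
                   'mean(x)',
                   'sd(x)/mean(x)',    # coefficient of variation
                   'pctile(x)[50]'),   # 50th percentile, i.e. the median
          summ.names = c('N', 'Mean', 'CV', 'Median'),
          out = 'return')
tab
```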

`group`, `group.long`, and `group.test`

sumtable() allows for the calculation of summary statistics by group.

group is a character string containing the column name in data that you want to calculate summary statistics separately for. At the moment this supports only a single variable, although you can combine multiple variables into a single one yourself before using sumtable.

group.long is a flag for whether you want the different group summary statistics stacked side-by-side (group.long = FALSE), making for easier comparisons, or on top of each other (group.long = TRUE), giving things a bit more room to breathe and allowing space for more statistics. Defaults to FALSE.

group.test, which is only compatible with group.long = FALSE, performs a test of independence between the variable in group and each of the variables in your summary statistics table. Defaults to FALSE. Default group.test = TRUE behavior is to perform a group F-test (with anova(lm())) for numeric variables, and a chi-squared test (with chisq.test) for factor, logical, and character variables, returning results in the format 'Test statistic name = Test statistic^significance stars'. If you want to change any of that, instead of group.test = TRUE, set group.test equal to a named list of options that will be sent to the opts argument of independence.test. See help(independence.test).

Be aware that the table produced with group uses multi-column cells. So it will not look quite as nice when outputting to a format that does not support multi-column cells, like out='return'. Multi-column cells are supported in out='kable' for group, as below, but are not supported on other rows of the kable. Multi-column cells are supported for out='kable' only for HTML and LaTeX output.

st(iris, group = 'Species', group.test = TRUE)

st(iris, group = 'Species', group.long = TRUE)
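
Since group takes a single variable, grouping on two variables means building the combination yourself first. One way to do that, sketched here with mtcars (the cyl_am column is a made-up name), is to paste the two variables together and group on the result.

```r
library(vtable)

# Keep two numeric variables to summarize
cars <- mtcars[, c('mpg', 'hp')]

# Combine cylinder count and transmission type into one grouping variable
# (in mtcars, am == 1 is manual and am == 0 is automatic)
cars$cyl_am <- paste0(mtcars$cyl, ' cyl, ',
                      ifelse(mtcars$am == 1, 'manual', 'automatic'))

# Six groups is a lot to place side by side, so stack them instead
grouped <- st(cars, group = 'cyl_am', group.long = TRUE, out = 'return')
grouped
```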

`group.weights`

This allows you to pass a set of weights for your data (as a vector or as a string column name). HOWEVER, it does not automatically weight all the results. If you leave summ as its default, then it will use weighted.mean(x, w = wts) and weighted.sd(x, w = wts) in place of wherever it would normally have mean(x) and sd(x). Factor proportions are calculated using mean(x), so this is covered. Weights will also be passed to independence.test() if group.test = TRUE, and so tests of independence across groups will be weighted as well.

No other calculations will be automatically weighted. This is really designed to be used with group and group.test = TRUE to create weighted balance tables, which is why it's called group.weights (and to avoid anyone thinking it weights everything, which would be the natural conclusion if it were just called weights).

If you want to use the weights with other functions, you can. You'll just need to specify summ yourself. Just pass summ a string describing a function that takes weights and refer to wts as the weights, e.g. 'weighted.mean(x, w = wts)' for a weighted mean, as above.
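
Here is a rough sketch of both uses. The weights are randomly generated purely for illustration (iris has no sampling weights), and they are passed as a vector with one entry per row.

```r
library(vtable)

# Hypothetical weights, one per row of iris
set.seed(42)
w <- runif(nrow(iris), min = 0.5, max = 2)

# Weighted balance table: with the default summ, means and standard
# deviations are weighted, and so are the tests of independence
balance <- st(iris, group = 'Species', group.weights = w,
              group.test = TRUE, out = 'return')
balance

# Custom summ: refer to the weights as wts inside the string
custom <- st(iris, group = 'Species', group.weights = w,
             summ = c('notNA(x)', 'weighted.mean(x, w = wts)'),
             summ.names = c('N', 'Wt. Mean'),
             out = 'return')
custom
```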

`col.breaks`

Sometimes you don't need all that much information on each variable, but you have a lot of variables and your table gets long! col.breaks will break up the variables in your table into multiple columns, and put them side by side. Also handy if you want to mix numeric and factor variables - put all your factors in a second column to economize on space. Incompatible with group unless group.long = TRUE.

Set col.breaks to be a numeric vector. sumtable() will start a new column after that many variables have been processed.

#Let's put species in a column by itself
#There are five variables here, Species is last,
#so break the column after the first four variables.
st(iris, col.breaks = 4)

#Why not three columns?
sumtable(mtcars, col.breaks = c(4,8))

`digits` and `fixed.digits`

digits indicates how many digits after the decimal place should be displayed. fixed.digits determines whether trailing zeros are maintained. fixed.digits only works if numformat = NA, and will eventually be deprecated for a formatfunc(drop0trailing=TRUE) setting in numformat.

st(iris, digits = 5)

st(iris, digits = 3, fixed.digits = TRUE, numformat = NA)

Other Numerical Formatting Options

Should the numbers in the summary table be formatted in some way? By default, "number of nonmissing observations" values are formatted with notNA() formatting (but specifically, whatever is specified in obs.function), and the rest are not formatted except for having the number of digits set with digits or fixed.digits.

You can specify numformat to set numerical formatting for numeric variables. numformat accepts as an argument functions that accept a number and return a formatted string, as generated by formatfunc() or the label_ functions in the scales package. So, for example, numformat = formatfunc(prefix = '$') would give all your numbers dollar formatting.

You can also use string shorthand as shortcuts for some formatfunc() settings. 'comma' will set big.mark = ',', 'decimal' will set big.mark = '.', decimal.mark = ',', 'percent' will do percentage formatting (with 1 = 100%), and 'A|B' will use 'A' as a prefix and 'B' as a suffix (specifying suffix optional, so numformat = '$' gives '$3'). This will also respect your digits choice (which formatfunc() directly won't do). These can be combined. 'comma$|M' will turn 1000 into $1,000M. Although if you're getting complex you may as well just set formatfunc() yourself.

You can specify different formatting functions for different variables by either specifying a string vector of those shorthands, or a list of functions. You can either provide an unnamed vector/list with length equal to the number of variables in the data, or you can provide a named vector/list that specifies the formatting for specific variables. You can apply a format to all variables not specifically named by making it an unnamed first entry, for example numformat = c('dollar','sharevariable' = 'percent') to make everything dollar-formatted except for 'sharevariable', which becomes percent-formatted.

Note that any functions that match the ones in the skip.format option will not have this formatting applied to them at all. It is not currently possible otherwise to apply two different kinds of formatting to different columns of the sumtable.

st(iris, numformat = c('|cm', 'Sepal.Width' = 'percent'))
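
To make the two styles concrete, the sketch below formats a small made-up data set (spend, revenue, and share are hypothetical names) first with a function built by formatfunc() and then with the string shorthands.

```r
library(vtable)

# formatfunc() builds a function that turns numbers into formatted strings
dollars <- formatfunc(prefix = '$', digits = 2, big.mark = ',')
dollars(c(1234.5, 20))

# Hypothetical data: revenue in dollars, share as a proportion
spend <- data.frame(revenue = c(1200, 35000, 410000),
                    share   = c(0.10, 0.25, 0.65))

# One formatting function applied to every variable
tab_dollars <- st(spend, numformat = dollars, out = 'return')
tab_dollars

# Shorthand: commas everywhere, except percent formatting for share
tab_mixed <- st(spend, numformat = c('comma', 'share' = 'percent'),
                out = 'return')
tab_mixed
```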

Factor, Logical, and Character Display Options

How should factor, logical, and character variables (all of which get turned into factors in the sumtable-making process) be shown on the sumtable()?

In all sumtable()s, there is one row for the name of the factor variable, with the number of nonmissing observations of the variable. Then there's one row for each of the values (for logicals, FALSE and TRUE become 'No' and 'Yes', or pick your own labels with logical.labels), showing the count and percentage (i.e. 50%) of observations with that value.

Set factor.percent = FALSE to report the proportion of observations (.5) instead of the percentage (50%) for the values.

Set factor.counts = FALSE to omit the count for the individual values. So you'll see the number of nonmissing observations for the variable overall, and then just the percentage/proportion for each of the values.

Set factor.numeric = TRUE and/or logical.numeric = TRUE to ignore all this special treatment for factor/logical variables (respectively), and just treat each of the values as numeric binary variables. factor.numeric also covers character variables.

st(iris, factor.percent = FALSE, factor.counts = FALSE)

st(iris, factor.numeric = TRUE)
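
iris has no logical variables, so here is a small sketch with a made-up data set (survey and its columns are hypothetical) showing logical.labels and logical.numeric.

```r
library(vtable)

# Hypothetical survey data with one logical column
survey <- data.frame(age    = c(23, 35, 41, 52, 29, 60),
                     smoker = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE))

# Replace the default 'No'/'Yes' with labels for FALSE and TRUE, in that order
labelled <- st(survey, logical.labels = c('Non-smoker', 'Smoker'),
               out = 'return')
labelled

# Or skip the special treatment and summarize smoker as a 0/1 numeric variable
as_numeric <- st(survey, logical.numeric = TRUE, out = 'return')
as_numeric
```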

`labels`

The labels argument will attach variable labels to the variables in data. If variable labels are embedded in data and those labels are what you want, then set labels = TRUE.
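
As a sketch of the embedded-label case, the code below assumes labels are stored in each column's 'label' attribute, which is the convention used by packages such as haven and Hmisc. The label text here is abbreviated from help(LifeCycleSavings).

```r
library(vtable)
data(LifeCycleSavings)

# Work on a copy and attach a 'label' attribute to every column
savings <- LifeCycleSavings
attr(savings$sr,    'label') <- 'Aggregate personal savings'
attr(savings$pop15, 'label') <- '% of population under 15'
attr(savings$pop75, 'label') <- '% of population over 75'
attr(savings$dpi,   'label') <- 'Real per-capita disposable income'
attr(savings$ddpi,  'label') <- '% growth rate of dpi'

# labels = TRUE asks sumtable to use the embedded labels
embedded <- st(savings, labels = TRUE, out = 'return')
embedded
```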

If you'd like to set your own labels that aren't embedded in the data, there are three formats available:

`labels` as a vector

labels can be set to be a vector of equal length to the number of variables in data (or in vars if that's set), and in the same order. You can use NAs for padding if you only want labels for some variables and just want to use the regular variable names for others.

This option is not recommended if you have set group, as it gets tricky to figure out what order to put the labels in.

#Note that LifeCycleSavings has five variables
data(LifeCycleSavings)
#These variable labels are taken from help(LifeCycleSavings)
labs <- c('numeric aggregate personal savings',
    'numeric % of population under 15',
    'numeric % of population over 75',
    NA,
    'numeric % growth rate of dpi')
sumtable(LifeCycleSavings,labels=labs)

`labels` as a two-column data set

labels can be set to a two-column data set (any type will do) where the first column has the variable names, and the second column has the labels. The column names don't matter.

This approach does not require that every variable name in data has a matching label.

#Note that LifeCycleSavings has five variables
#with names 'sr', 'pop15', 'pop75', 'dpi', and 'ddpi'
labs <- data.frame(nonsensename1 = c('sr', 'pop15', 'pop75'),
nonsensename2 = c('numeric aggregate personal savings',
    'numeric % of population under 15',
    'numeric % of population over 75'))
st(LifeCycleSavings,labels=labs)

`labels` as a one-row data set

labels can be set to a one-row data set in which the column names are the variable names in data and the first row is the variable labels. The labels argument can take any data type including data frame, data table, tibble, or matrix, as long as it has a valid set of variable names stored in the colnames() attribute.

This approach does not require that every variable name in data has a matching label.

labs <- data.frame(sr = 'numeric aggregate personal savings',
    pop15 = 'numeric % of population under 15',
    pop75 = 'numeric % of population over 75')
sumtable(LifeCycleSavings,labels=labs)

`title`, `note`, and `anchor`

title will include a title for your table.

note will add a table note in the last row.

anchor will add an anchor ID (<a name = in HTML or \label{} in LaTeX) to allow other parts of your document to link to it, if you are including your sumtable in a larger document.

title will only show up in output formats with titles. That is, you won't get them with out = 'return'. note and anchor will only show up in formats that support multi-column cells and anchoring, so anchor won't work with out = 'kable' and neither will work with out = 'return' or out = 'csv'.
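
A quick sketch of all three together, using out = 'htmlreturn' so the HTML source can be inspected. The title, note text, and anchor ID are arbitrary examples.

```r
library(vtable)

html <- st(iris,
           title  = 'Iris Measurements',
           note   = 'All measurements are in centimeters.',
           anchor = 'iris-summary',
           out    = 'htmlreturn')

# Check whether the anchor ID made it into the HTML source
grepl('iris-summary', html, fixed = TRUE)
```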

`col.width`

sumtable() will select default column widths by basically just giving the column with the variable name a little more space than the summ-based columns. col.width, as a vector of percentage column widths on the 0-100 scale, will override these defaults.

Doesn't apply to out = 'kable', out = 'csv', or out = 'return'.

#The variable names in this data set are pretty short, and the value labels are
#a little cramped, so let's move that over.
st(LifeCycleSavings,
   col.width=c(9,rep(13,7)))

`col.align`

col.align can be used to adjust text alignment in HTML output. Set to 'left', 'right', or 'center' to align all columns, or give a vector of column alignments to do each column separately.

If you want to get tricky, you can add a semicolon afterwards and keep putting in whatever CSS attributes you want. They will be applied to the whole column.

This option is only for HTML output and will only work with out values of 'browser', 'viewer', or 'htmlreturn'.

st(LifeCycleSavings,col.align = 'right')

`align`, `note.align`, and `fit.page`

These options are used only with LaTeX output (out is 'latex' or 'latexpage').

align and note.align are single strings used for alignment. align will be used as column alignment in standard LaTeX syntax, for example 'lccc' for the left column left-aligned and the other three centered. note.align is an alignment note specifically for any table notes set with note (or significance stars), which enters as part of a \multicolumn argument. These both accept 'p{}' and other LaTeX column types.

align defaults to left-aligned 'Variable' columns and right-aligned everything else. If col.width is specified, align defaults to 'p{}' columns, with widths set by col.width.

fit.page can be used to ensure that the table is a certain width, and will be used as an entry to a \resizebox{}. Set to \\textwidth to set the table to text width, or .9\\textwidth for 90% of the page, and so on, or any recognized width value in LaTeX.

For all of these, be sure to escape special characters, in particular backslashes.

sumtable(iris,align = 'p{.3\\textwidth}ccccccc', fit.page = '\\textwidth', out = 'latex')

`opts`

You can create a named list where the names are the above options and the values are the settings for those options, and input it into sumtable using opts=. This is an easy way to set the same options for many sumtables.
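
For instance, a shared set of options might be sketched like this (house_style is just an illustrative name, and the particular settings are arbitrary):

```r
library(vtable)

# One list of settings, reused across tables
house_style <- list(digits = 3,
                    factor.counts = FALSE,
                    title = 'Descriptive Statistics',
                    out = 'return')

tab_iris <- st(iris, opts = house_style)
tab_cars <- st(mtcars, opts = house_style)

tab_iris
tab_cars
```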

vtable documentation built on Oct. 26, 2023, 5:08 p.m.
