The vtable
package serves the purpose of outputting automatic variable documentation that can be easily viewed while continuing to work with data.
vtable
contains four main functions: vtable()
(or vt()
), sumtable()
(or st()
), labeltable()
, and dftoHTML()
/dftoLaTeX()
. This vignette focuses on sumtable()
.
sumtable()
takes a dataset and outputs a formatted summary statistics table. Summary statistics an be for the whole data set at once or by-group.
There are a huge number of R packages that will make a summary statistics table for you. What does vtable::sumtable()
bring that isn't already there?
First, like other vtable
functions, sumtable()
by default prints its results to Viewer (in RStudio) or the browser (elsewhere), making it easy to look at information about your data while continuing to work on it.
Second, sumtable()
is designed to have nice defaults and be fast to work with. By fast to work with that's both in the sense that you should just be able to ask for a sumtable and have it pretty much be what you want immediately, and also in the sense of trying to keep the number of keystrokes low (thus the st()
shortcut, and the intent of not having to set a bunch of options).
sumtable()
has customization options, but they're certainly not as extensive as with a package like gtsummary
or arsenal
. Nor should they be! If you want full control over your table, those packages are already great, we don't need another package that does that.
However, if you want the kind of table sumtable()
produces (and I think a lot of you do!) then it's perfect and easy. This makes sumtable()
very similar in spirit to the summary statistics functionality of stargazer::stargazer()
, except with some additional important bonuses, like working with tibble
s, factor variables, producing summary statistics by group, and being a summary-statistics-only function so the documentation isn't entwined with a bunch of regression-table functionality.
Like, look at this. Isn't this what you basically already want? After loading the package this took eight keystrokes and no option-setting:
library(vtable) st(iris)
sumtable()
functionsumtable()
(or st()
for short) syntax follows the following outline:
sumtable(data, vars=NA, out=NA, file=NA, summ=NA, summ.names=NA, add.median=FALSE, group=NA, group.long=FALSE, group.test=FALSE, group.weights =NA, col.breaks=NA, digits=2, fixed.digits=FALSE, numformat = formatfunc(digits = digits, big.mark = ''), skip.format = c('notNA(x)','propNA(x)','countNA(x)'), factor.percent=TRUE, factor.counts=TRUE, factor.numeric=FALSE, logical.numeric=FALSE, logical.labels=c('No','Yes'), labels=NA, title='Summary Statistics', note = NA, anchor=NA, col.width=NA, col.align=NA, align=NA, note.align='l', fit.page=NA, simple.kable=FALSE, obs.function=NA) opts=list())
The goal of sumtable()
is to take a data set data
and output a usually-HTML (but data.frame
, kable
, csv
, and latex
options are there too) file with summary statistics for each of the variables in data
. There are several options as to how the table will be constructed, and each of these options are explained below. Throughout, the output will be built as kable
s since this is an RMarkdown document. However, generally you can leave out
at its default and it will publish an HTML table to Viewer (in RStudio) or the browser (otherwise). This will also include some additional information about your data that can't be demonstrated in this vignette:
data
and vars
The data
argument can take any data.frame
, data.table
, tibble
, or matrix
, as long as it has a valid set of variable names stored in the colnames()
attribute.
By default, sumtable
will include in the summary statistics table every variable in the data set that is (1) numeric, (2) factor, (3) logical, or (4) a character variable with six or fewer unique values (as a factor), and it will include them in the order they're in the data.
You can override that variable list with vars
, which is just a vector of variable names to include. You can use this to force sumtable
to ignore variables you don't want, or to include variables it doesn't by default.
data(LifeCycleSavings) st(LifeCycleSavings, vars = c('pop15','pop75'))
out
The out
option determines what will be done with the resulting summary statistics table. There are several options for out
:
| Option | Result |
|------------| -----------------------------------------|
| browser | Loads output in web browser. |
| viewer | Loads output in Viewer pane (RStudio only). |
| htmlreturn | Returns HTML code for output file. |
| return | Returns summary table in data.frame
format. Depending on options, the data frame may be entirely character variables. |
| csv | Returns summary table in data.frame
format and, with a file
option, saves that to CSV. |
| kable | Returns a knitr::kable()
|
| latex | Returns a LaTeX table. |
| latexpage | Returns an independently-buildable LaTeX document. |
By default, sumtable
will select 'viewer' if running in RStudio, and 'browser' otherwise. If it's being built in an RMarkdown document with knitr
, it will default to 'kable'. Note that an RMarkdown default to 'kable' will also include some nice formatting, where out = 'kable'
directly will give you a more basic kable
you can format yourself.
Also be aware that some of these formats, like 'return', do not support multi-column cells, and so instead you'll have headers squished into one cell, with blank cells next to them.
data(LifeCycleSavings) sumtable(LifeCycleSavings) vartable <- vtable(LifeCycleSavings,out='return') #I can easily \input this into my LaTeX doc: vt(LifeCycleSavings,out='latex',file='mytable1.tex')
file
The file
argument will write the variable documentation file to an HTML or LaTeX file and save it. Will automatically append 'html' or 'tex' filetype if the filename does not include a period.
data(LifeCycleSavings) st(LifeCycleSavings,file='lifecycle_summary')
summ
and summ.names
summ
is the set of summary statistics functions to run and include in the table. It is very flexible, hopefully without being difficult to use.
It takes a character vector in which each element is of the form function(x)
, where function(x)
is any function that takes a vector and returns a single numeric value. For example, summ=c('mean(x)','median(x)','mean(log(x))')
would calculate the mean, median, and mean of the log for each variable. summ.names
is just the heading-title of the corresponding summ
. So in this example that might be summ.names=c('Mean','Median','Mean of Log')
.
Factor variables largely ignore summ
(unless factor.numeric = TRUE
) and will just report counts in the first column and means/percentages in the second. You may want to consider this when selecting the order you put your summ
in if you have both factor and numeric variables.
summ
treats as special two vtable
functions: propNA(x)
and countNA(x)
, which give the proportion and count of NA values, and the count of non-NA values in the variable, respectively. These two functions are the only functions that include NA values in their calculations.
By default, summ
is c('notNA(x)', 'mean(x)', 'sd(x)', 'min(x)', 'pctile(x)[25]', 'pctile(x)[75]', 'max(x)')
in 'one-column' tables. If there's more than one column either due to the col.breaks
option or the group
option, it defaults to c('notNA(x)', 'mean(x)', 'sd(x)')
. Alternately, if a given column of variables is entirely made up of factor variables, it defaults to c('notNA(x)','mean(x)')
. These precise defaults have corresponding default summ.names
. If you set your own summ
but not summ.names
, it will try to guess the name by taking your function, removing (x)
, and capitalizing. so 'mean(x)' becomes 'Mean'.
If you want to get complex you can. If there are multiple 'columns' of summary statistics and you want different statistics in each column, make summ
and summ.names
into a list, where each entry is a character vector of the calculations/names you want in that column.
sumtable(iris, summ=c('notNA(x)', 'mean(x)', 'median(x)', 'propNA(x)'))
#Getting complex st(iris, col.breaks = 4, summ = list( c('notNA(x)','mean(x)','sd(x^2)','min(x)','max(x)'), c('notNA(x)','mean(x)') ), summ.names = list( c('N','Mean','SD of X^2','Min','Max'), c('Count','Percent') ))
group
, group.long
, and group.test
sumtable()
allows for the calculation of summary statistics by group.
group
is a character variable containing the column name in data
that you want to calculate summary statistics separately for. At the moment this supports only a single variable, although you can combine multiple variables into a single one yourself before using sumtable
.
group.long
is a flag for whether you want the different group summary statistics stacked side-by-side (group.long = FALSE
), making for easier comparisons, or on top of each other (group.long = TRUE
), giving things a bit more room to breathe and allowing space for more statistics. Defaults to FALSE
.
group.test
, which is only compatible with group.long = FALSE
, performs a test of independence between the variable in group
and each of the variables in your summary statistics table. Defaults to FALSE
. Default group.test = TRUE
behavior is to perform a group F-test (with anova(lm())
) for numeric variables, and a chi-squared test (with chisq.test
) for factor, logical, and character variables, returning results in the format 'Test statistic name = Test statistic^significance stars'. If you want to change any of that, instead of group.test = TRUE
, set group.test
equal to a named list of options that will be send to the opts
argument of independence.test
. See help(independence.test)
.
Be aware that the table produced with group
uses multi-column cells. So it will not look quite as nice when outputting to a format that does not support multi-column cells, like out='return'
. Multi-column cells are supported in out='kable'
for group
, as below, but are not supported on other rows of the kable
. Multi-column cells are supported for out='kable'
only for HTML and LaTeX output.
st(iris, group = 'Species', group.test = TRUE)
st(iris, group = 'Species', group.long = TRUE)
group.weights
This allows you to pass a set of weights for your data (as a vector or as a string column name). HOWEVER, it does not automatically weight all the results. If you leave summ
as its default, then it will use weighted.mean(x, w = wts)
and weighted.sd(x, w = wts)
in place of wherever it would normally have mean(x)
and sd(x)
. Factor proportions are calculated using mean(x)
, so this is covered. Weights will also be passed to independence.test()
if group.test = TRUE
, and so tests of independence across groups will be weighted as well.
No other calculations will be automatically weighted. This is really designed to be used with group
and group.test = TRUE
to create weighted balance tables, which is why it's called group.weights
(and to avoid anyone thinking it weights everything, which would be the natural conclusion if it were just called weights
).
If you want to use the weights with other functions, you can. You'll just need to specify summ
yourself. Just pass summ
a string describing function that takes weights and refer to wts
as the weights, e.g. 'weighted.mean(x, w = wts)'
for a weighted mean, as above.
col.breaks
Sometimes you don't need all that much information on each variable, but you have a lot of variables and your table gets long! col.breaks
will break up the variables in your table into multiple columns, and put them side by side. Also handy if you want to mix numeric and factor variables - put all your factors in a second column to economize on space. Incompatible with group
unless group.long = TRUE
.
Set col.breaks
to be a numeric vector. sumtable()
will start a new column after that many variables have been processed.
#Let's put species in a column by itself #There are five variables here, Species is last, #so break the column after the first four variables. st(iris, col.breaks = 4)
#Why not three columns? sumtable(mtcars, col.breaks = c(4,8))
digits
and fixed.digits
digits
indicates how many digits after the decimal place should be displayed. fixed.digits
determines whether trailing zeros are maintained. fixed.digits
only works if numformat = NA
, and will eventually be deprecated for a formatfunc(drop0trailing=TRUE)
setting in numformat
.
st(iris, digits = 5)
st(iris, digits = 3, fixed.digits = TRUE, numformat = NA)
Should the numbers in the summary table be formatted in some way? By default, "number of nonmissing observations" values are formatted with notNA()
formatting (but specifically, whatever is specified in obs.function
), and the rest are not formatted except for having the number of digits set with digits
or fixed.digits
.
You can specify numformat
to set numerical formatting for numeric variables. numformat
accepts as an argument functions that accept a number and return a formatted string, as generated by formatfunc()
or the label_
functions in the scales package. So, for example, numformat = formatfunc(prefix = '$')
would give all your numbers dollar formatting.
You can also use string shorthand as shortcuts for some formatfunc()
settings. 'comma'
will set big.mark = ','
, 'decimal'
will set big.mark = '.', decimal.mark = ','
, 'percent'
will do percentage formatting (with 1 = 100%), and 'A|B'
will use 'A'
as a prefix and 'B'
as a suffix (specifying suffix optional, so numformat = '$'
gives '$3'
). This will also respect your digits
choice (which formatfunc()
directly won't do). These can be combined. 'comma$|M'
will turn 1000
into $1,000M
. Although if you're getting complex you may as well just set formatfunc()
yourself.
You can specify different formatting functions for different variables by either specifying a string vector of those shorthands, or a list of functions. You can either provide an unnamed vector/list with length equal to the number of variables in the data, or you can provide a named vector/list that specifies the formatting for specific variables. You can apply a format to all variables not specifically named by making it an unnamed first entry, for example numformat = c('dollar','sharevariable' = 'percent')
to make everything dollar-formatted except for 'sharevariable', which becomes percent-formatted.
Note that any functions that match the ones in the skip.format
option will not have this formatting applied to them at all. It is not currently possible otherwise to apply two different kinds of formatting to different columns of the sumtable
.
st(iris, numformat = c('|cm', 'Sepal.Width' = 'percent'))
How should factor, logical, and character variables (all of which get turned into factors in the sumtable
-making process) be shown on the sumtable()
?
In all sumtable()s
, there is one row for the name of the factor variable, with the number of nonmissing observations of the variable. Then there's one row for each of the values (for logicals, FALSE and TRUE become 'No' and 'Yes', or pick your own labels with logical.labels
), showing the count and percentage (i.e. 50%) of observations with that value.
Set factor.percent = FALSE
to report the proportion of observations (.5) instead of the percentage (50%) for the values.
Set factor.counts = FALSE
to omit the count for the individual values. So you'll see the number of nonmissing observations for the variable overall, and then just the percentage/proportion for each of the values.
Set factor.numeric = TRUE
and/or logical.numeric = TRUE
to ignore all this special treatment for factor/logical variables (respectively), and just treat each of the values as numeric binary variables. factor.numeric
also covers character variables.
st(iris, factor.percent = FALSE, factor.counts = FALSE)
st(iris, factor.numeric = TRUE)
labels
The labels
argument will attach variable labels to the variables in data
. If variable labels are embedded in data
and those labels are what you want, then set labels = TRUE
.
If you'd like to set your own labels that aren't embedded in the data, there are three formats available:
labels
as a vectorlabels
can be set to be a vector of equal length to the number of variables in data
(or in vars
if that's set), and in the same order. You can use NA
s for padding if you only want labels for some varibles and just want to use the regular variable names for others.
This option is not recommended if you have set group
, as it gets tricky to figure out what order to put the labels in.
#Note that LifeCycleSavings has five variables data(LifeCycleSavings) #These variable labels are taken from help(LifeCycleSavings) labs <- c('numeric aggregate personal savings', 'numeric % of population under 15', 'numeric % of population over 75', NA, 'numeric % growth rate of dpi') sumtable(LifeCycleSavings,labels=labs)
labels
as a two-column data setlabels
can be set to a two-column data set (any type will do) where the first column has the variable names, and the second column has the labels. The column names don't matter.
This approach does not require that every variable name in data
has a matching label.
#Note that LifeCycleSavings has five variables #with names 'sr', 'pop15', 'pop75', 'dpi', and 'ddpi' labs <- data.frame(nonsensename1 = c('sr', 'pop15', 'pop75'), nonsensename2 = c('numeric aggregate personal savings', 'numeric % of population under 15', 'numeric % of population over 75')) st(LifeCycleSavings,labels=labs)
labels
as a one-row data setlabels
can be set to a one-row data set in which the column names are the variable names in data
and the first row is the variable names. The labels
argument can take any data type including data frame, data table, tibble, or matrix, as long as it has a valid set of variable names stored in the colnames()
attribute.
This approach does not require that every variable name in data
has a matching label.
labs <- data.frame(sr = 'numeric aggregate personal savings', pop15 = 'numeric % of population under 15', pop75 = 'numeric % of population over 75') sumtable(LifeCycleSavings,labels=labs)
title
, note
, and anchor
title
will include a title for your table.
note
will add a table note in the last row.
anchor
will add an anchor ID (<a name =
in HTML or \label{}
in LaTeX) to allow other parts of your document to link to it, if you are including your sumtable
in a larger document.
title
will only show up in output formats with titles. That is, you won't get them with out = 'return'
. note
and anchor
will only show up in formats that support multi-column cells and anchoring, so anchor
won't work with out = 'kable'
and neither will work with out = 'return'
or out = 'csv'
.
col.width
sumtable()
will select default column widths by basically just giving the column with the variable name a little more space than the summ
-based columns. col.width
, as a vector of percentage column widths on the 0-100 scale, will override these defaults.
Doesn't apply to out = 'kable'
, out = 'csv'
, or out = 'return'
.
#The variable names in this data set are pretty short, and the value labels are #a little cramped, so let's move that over. st(LifeCycleSavings, col.width=c(9,rep(13,7)))
col.align
col.align
can be used to adjust text alignment in HTML output. Set to 'left', 'right', or 'center' to align all columns, or give a vector of column alignments to do each column separately.
If you want to get tricky, you can add a semicolon afterwards and keep putting in whatever CSS attributes you want. They will be applied to the whole column.
This option is only for HTML output and will only work with out
values of 'browser', 'viewer', or 'htmlreturn'.
st(LifeCycleSavings,col.align = 'right')
align
, note.align
, and fit.page
These options are used only with LaTeX output (out
is 'latex' or 'latexpage').
align
and note.align
are single strings used for alignment. align
will be used as column alignment in standard LaTeX syntax, for example 'lccc' for the left column left-aligned and the other three centered. note.align
is an alignment note specifically for any table notes set with note
(or significance stars), which enters as part of a \multicolumn
argument. These both accept 'p{}' and other LaTeX column types.
Defaults to left-aligned 'Variable' columns and right-aligned everything else. If col.widths
is specified, align
defaults to 'p{}' columns, with widths set by col.width
.
fit.page
can be used to ensure that the table is a certain width, and will be used as an entry to a \resizebox{}
. Set to \\textwidth
to set the table to text width, or .9\\textwidth
for 90% of the page, and so on, or any recognized width value in LaTeX.
For all of these, be sure to escape special characters, in particular backslashes.
sumtable(iris,align = 'p{.3\\textwidth}ccccccc', fit.page = '\\textwidth', out = 'latex')
opts
You can create a named list where the names are the above options and the values are the settings for those options, and input it into sumtable
using opts=
. This is an easy way to set the same options for many sumtable
s.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.