Creating description statistics
Description
A function that returns descriptive statistics suited for a publication's "Table 1" when you want the statistics split by groups. The function identifies whether the variable is continuous, binary or a factor. The format is inspired by NEJM, Lancet & BMJ.
Usage
getDescriptionStatsBy(x, by, digits = 1, html = TRUE,
numbers_first = TRUE, statistics = FALSE, statistics.sig_lim = 10^-4,
statistics.two_dec_lim = 10^-2, statistics.suppress_warnings = TRUE,
useNA = c("ifany", "no", "always"), useNA.digits = digits,
continuous_fn = describeMean, prop_fn = describeProp,
factor_fn = describeFactors, show_all_values = FALSE, hrzl_prop = FALSE,
add_total_col, total_col_show_perc = TRUE, use_units = FALSE, default_ref,
NEJMstyle = FALSE, percentage_sign = TRUE, header_count, ...)

Arguments
x 
The variable that you want the statistics for 
by 
The variable that you want to split into different columns 
digits 
The number of decimals used 
html 
If HTML compatible output should be used. If 
numbers_first 
Whether the number or the percentage should be presented first. The second is enclosed in parentheses ().
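As a rough illustration of the two orderings, here is a base R sketch. It is not the package's own formatting code; the variable names are made up for this example and the counts come from the unmodified mtcars$am column.

```r
# Hypothetical illustration of the numbers_first idea; not the package's internals
n <- sum(mtcars$am == 1)   # cars with manual transmission (coded 1 in raw mtcars)
total <- nrow(mtcars)      # all cars
pct <- sprintf("%.1f%%", 100 * n / total)

count_first <- sprintf("%d (%s)", n, pct)    # count first, percentage in parentheses
percent_first <- sprintf("%s (%d)", pct, n)  # percentage first, count in parentheses

count_first
percent_first
```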
statistics 
Add statistics: Fisher test for proportions and Wilcoxon test for continuous variables. See the section on customizing statistics below for more options.
statistics.sig_lim 
The significance limit below which the p-value is shown with a < sign, e.g. a p-value of 0.0000312 is shown as < 0.0001 with the default setting.
statistics.two_dec_lim 
The limit for showing two decimals. E.g. the p-value may be 0.056 and we may want to keep the two decimals in order to emphasize the proximity to the almighty 0.05 p-value; for this, set the limit to 10^-2. A value of 0.0056 is then rounded to 0.006, which makes intuitive sense as 0.0056 is well below 0.05 and its exact proximity to 0.05 is therefore less interesting. Disclaimer: The 0.05 limit is really silly and debated; unfortunately it remains a standard and this package tries to adapt to the current standards in order to limit publication-associated issues.
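The interplay between the two limits can be sketched in base R. This is only an illustration under stated assumptions: format_pval_sketch is a hypothetical helper, not a function from the package, and it assumes that "two decimals" means two significant digits at or above the limit and one significant digit below it, which is what the 0.056 and 0.0056 examples above suggest.

```r
# Hypothetical helper for illustration only; not a function from the package
format_pval_sketch <- function(p, sig_lim = 10^-4, two_dec_lim = 10^-2) {
  if (p < sig_lim) {
    # Below the significance limit: report only that p is smaller than the limit
    return(paste("<", format(sig_lim, scientific = FALSE)))
  }
  # Assumption: two significant digits at or above the limit, one below it
  digits <- if (p >= two_dec_lim) 2 else 1
  format(signif(p, digits), scientific = FALSE)
}

format_pval_sketch(0.0000312)  # "< 0.0001"
format_pval_sketch(0.056)      # "0.056"
format_pval_sketch(0.0056)     # "0.006"
```

How the package treats a p-value exactly equal to one of the limits is not stated above; the sketch simply treats the two-decimal limit as inclusive.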
statistics.suppress_warnings 
Hide warnings from the statistics function. 
useNA 
This indicates if missing should be added as a separate
row below all other. See 
useNA.digits 
The number of digits to use for the
missing percentage, defaults to the overall 
continuous_fn 
The method to describe continuous variables. The
default is 
prop_fn 
The method used to describe proportions, see 
factor_fn 
The method used to describe factors, see 
show_all_values 
This is by default false, as for instance if there is
no missing and there is only one variable then it is most sane to only show
one option as the other one will just be a complement to the first. For instance
sex - if you know gender then automatically you know the distribution of the
other sex as it's 100 % - other %. To choose which one you want to show then
set the 
hrzl_prop 
This is default FALSE and indicates that the proportions are to be interpreted in a vertical manner. If we want the data to be horizontal, i.e. the total should be shown and then how these differ in the different groups then set this to TRUE. 
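The difference between the two directions can be illustrated with base R's prop.table(). This is a sketch of the concept only, assuming that "vertical" means percentages that sum to 100 % within each group column and "horizontal" means percentages that sum to 100 % across the groups for each level; it does not call the package.

```r
# Cross-tabulate a factor (rows) against the grouping variable (columns)
tab <- table(gear = mtcars$gear, am = mtcars$am)

# Vertical: each group column sums to 1 (assumed to match hrzl_prop = FALSE)
vertical <- prop.table(tab, margin = 2)

# Horizontal: each level row sums to 1 across the groups (assumed hrzl_prop = TRUE)
horizontal <- prop.table(tab, margin = 1)

round(100 * vertical, 1)
round(100 * horizontal, 1)
```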
add_total_col 
This adds a total column to the resulting table. You can also specify if you want the total column "first" or "last" in the column order. 
total_col_show_perc 
This is by default true but if requested the percentages are suppressed as this sometimes may be confusing. 
use_units 
If the Hmisc package's units() function has been employed
it may be interesting to have a column at the far right that indicates the
unit measurement. If this column is specified then the total column will
appear before the units (if specified as last). You can also set the value to

default_ref 
The default reference, either first, the level name or a number within the levels. If left out it defaults to the first value. 
NEJMstyle 
Adds - no (%) at the end to proportions 
percentage_sign 
If you want to suppress the percentage sign you can set this variable to FALSE. You can also choose something other than the default % by setting this variable.
header_count 
Set to 
... 
Currently only used for generating warnings of deprecated call parameters. 
Value
Returns a vector if vars wasn't specified and it's a continuous or binary statistic. If vars was a matrix then it appends the result to the end of that matrix. If the x variable is a factor then it does not append and you get a warning.
Customizing statistics
You can specify which function to use for the statistic by providing a function that takes two arguments, x and by, and returns a p-value. A few functions are already prepared for this, see getPvalAnova, getPvalChiSq, getPvalFisher, getPvalKruskal, getPvalWilcox.
The default functions used are getPvalFisher and getPvalWilcox (unless the by argument has more than three unique levels where it defaults to getPvalAnova).
If you want the function to select functions depending on the type of input you can provide a list with the names 'continuous', 'proportion', 'factor' and the function will choose accordingly. If you fail to define a certain category it will default to the above.
You can also use a custom function that returns a string with the attribute 'colname' set; this string will be appended to the results instead of the p-value column.
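A custom function of the shape described in this section could look like the following base R sketch. The names pval_ttest and stat_fns are hypothetical and chosen for this example; only the two-argument (x, by) signature and the list names 'continuous' and 'proportion' come from the description above.

```r
# Hypothetical custom p-value function: takes x and by, returns a single p-value
# Assumes `by` has exactly two groups, as required by a two-sample t-test
pval_ttest <- function(x, by) {
  t.test(x ~ by)$p.value
}

# Hypothetical list form: one function per input type, using the names given above
stat_fns <- list(
  continuous = pval_ttest,
  proportion = function(x, by) fisher.test(table(x, by))$p.value
)

# Calling the functions directly to check that they return a single p-value
p_cont <- pval_ttest(mtcars$mpg, mtcars$am)
p_prop <- stat_fns$proportion(mtcars$vs, mtcars$am)
```

Going by the description above, such a function or list would then be supplied through the statistics argument, e.g. statistics = pval_ttest or statistics = stat_fns; any category left out of the list falls back to the defaults.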
See Also
Other descriptive functions: describeFactors, describeMean, describeMedian, describeProp, getPvalWilcox
Examples
# getDescriptionStatsBy() and mergeDesc() come from the Gmisc package
library(Gmisc)
data(mtcars)
# For labelling we use the label()
# function from the Hmisc package
library(Hmisc)
label(mtcars$mpg) <- "Gas"
units(mtcars$mpg) <- "Miles/(US) gallon"
label(mtcars$wt) <- "Weight"
units(mtcars$wt) <- "10<sup>3</sup> kg" # not sure the unit is correct
mtcars$am <- factor(mtcars$am, levels = 0:1, labels = c("Automatic", "Manual"))
label(mtcars$am) <- "Transmission"
mtcars$gear <- factor(mtcars$gear)
label(mtcars$gear) <- "Gears"
# Make up some data for making it slightly more interesting
mtcars$col <- factor(sample(c("red", "black", "silver"),
                            size = NROW(mtcars), replace = TRUE))
label(mtcars$col) <- "Car color"
mergeDesc(getDescriptionStatsBy(mtcars$mpg, mtcars$am,
                                header_count = TRUE,
                                use_units = TRUE),
          getDescriptionStatsBy(mtcars$wt, mtcars$am,
                                header_count = TRUE,
                                use_units = TRUE),
          htmlTable_args = list(caption = "Basic continuous stats from the mtcars dataset"))
tll <- list()
tll[["Gear (3 to 5)"]] <- getDescriptionStatsBy(mtcars$gear, mtcars$am)
tll <- c(tll,
         list(getDescriptionStatsBy(mtcars$col, mtcars$am)))
mergeDesc(tll,
          htmlTable_args = list(caption = "Factored variables"))
tl_no_units <- list()
tl_no_units[["Gas (mile/gallons)"]] <-
  getDescriptionStatsBy(mtcars$mpg, mtcars$am,
                        header_count = TRUE)
tl_no_units[["Weight (10<sup>3</sup> kg)"]] <-
  getDescriptionStatsBy(mtcars$wt, mtcars$am,
                        header_count = TRUE)
mergeDesc(tl_no_units, tll,
          # Remove the formatting for the groups
          htmlTable_args = list(css.rgroup = ""))
# A little more advanced
mtcars$mpg[sample(1:NROW(mtcars), size = 5)] <- NA
getDescriptionStatsBy(mtcars$mpg, mtcars$am, statistics = TRUE)
# Do the horizontal version
getDescriptionStatsBy(mtcars$col, mtcars$am,
                      statistics = TRUE, hrzl_prop = TRUE)
mtcars$wt_with_missing <- mtcars$wt
mtcars$wt_with_missing[sample(1:NROW(mtcars), size = 8)] <- NA
getDescriptionStatsBy(mtcars$wt_with_missing, mtcars$am, statistics = TRUE,
                      hrzl_prop = TRUE, total_col_show_perc = FALSE)
mtcars$col_with_missing <- mtcars$col
mtcars$col_with_missing[sample(1:NROW(mtcars), size = 5)] <- NA
getDescriptionStatsBy(mtcars$col_with_missing, mtcars$am, statistics = TRUE,
                      hrzl_prop = TRUE, total_col_show_perc = FALSE)
## Not run:
## There is also a LaTeX wrapper
tll <- list(
  getDescriptionStatsBy(mtcars$gear, mtcars$am),
  getDescriptionStatsBy(mtcars$col, mtcars$am))
latex(mergeDesc(tll),
      caption = "Factored variables",
      file = "")
## End(Not run)
