#' Creating of description statistics
#'
#' A function that returns a description statistic that can be used
#' for creating a publication "table 1" when you want it by groups.
#' The function identifies if the variable is a continuous, binary
#' or a factored variable. The format is inspired by NEJM, Lancet &
#' BMJ.
#'
#' @section Customizing statistics:
#'
#' You can specify what function that you want for statistic by providing a function
#' that takes two arguments \code{x} and \code{by} and returns a p-value. There are
#' a few functions already prepared for this see \code{\link{getPvalAnova}},
#' \code{\link{getPvalChiSq}}
#' \code{\link{getPvalFisher}}
#' \code{\link{getPvalKruskal}}
#' \code{\link{getPvalWilcox}}.
#' The default functions used are \code{getPvalFisher} and \code{getPvalWilcox} (unless the by
#' argument has more than three unique levels where it defaults to \code{getPvalAnova}).
#'
#' If you want the function to select functions depending on the type of input
#' you can provide a list with the names \code{'continuous'}, \code{'proportion'}, \code{'factor'} and
#' the function will choose accordingly. If you fail to define a certain category
#' it will default to the above.
#'
#' You can also use a custom function that returns a string with the attribute \code{'colname'}
#' set that will be appended to the results instead of the p-value column.
#'
#' @param x If a data.frame it will be used as the data source for the variables in the \code{...} parameter.
#' If it is a single variable it will be the core value that want the statistics for.
#' In the print this is equivalent to the output of this function.
#' @param ... The variables that you want you statistic for. In the print all thes parameters
#' are passed on as [htmlTable::htmlTable] arguments.
#' @param by The variable that you want to split into different columns
#' @param digits The number of decimals used
#' @param digits.nonzero The number of decimals used for values that are close to zero
#' @param html If HTML compatible output should be used. If \code{FALSE}
#' it outputs LaTeX formatting
#' @param NEJMstyle Adds - no (\%) at the end to proportions
#' @param numbers_first If the number should be given or if the percentage
#' should be presented first. The second is encapsulated in parentheses ().
#' @param useNA This indicates if missing should be added as a separate
#' row below all other. See \code{\link[base]{table}} for \code{useNA}-options.
#' \emph{Note:} defaults to ifany and not "no" as \code{\link[base]{table}} does.
#' @param useNA.digits The number of digits to use for the
#' missing percentage, defaults to the overall \code{digits}.
#' @param continuous_fn The method to describe continuous variables. The
#' default is \code{\link{describeMean}}.
#' @param prop_fn The method used to describe proportions, see \code{\link{describeProp}}.
#' @param factor_fn The method used to describe factors, see \code{\link{describeFactors}}.
#' @param statistics Add statistics, fisher test for proportions and Wilcoxon
#' for continuous variables. See details below for more customization.
#' @param statistics.two_dec_lim The limit for showing two decimals. E.g.
#' the p-value may be 0.056 and we may want to keep the two decimals in order
#' to emphasize the proximity to the all-mighty 0.05 p-value and set this to
#' \eqn{10^-2}. This allows that a value of 0.0056 is rounded to 0.006 and this
#' makes intuitive sense as the 0.0056 level as this is well below
#' the 0.05 value and thus not as interesting to know the exact proximity to
#' 0.05. \emph{Disclaimer:} The 0.05-limit is really silly and debated, unfortunately
#' it remains a standard and this package tries to adapt to the current standards in order
#' to limit publication associated issues.
#' @param statistics.sig_lim The significance limit for < sign, i.e. p-value 0.0000312
#' should be < 0.0001 with the default setting.
#' @param statistics.suppress_warnings Hide warnings from the statistics function.
#' @param show_all_values Show all values in proportions. For factors with only two values
#' it is most sane to only show one option as the other one will just be a complement
#' to the first, i.e. we want to convey a proportion. For instance sex - if you know
#' gender then automatically you know the distribution of the other sex as it's 100 \% - other \%.
#' To choose which one you want to show then set the \code{default_ref} parameter.
#' @param hrzl_prop This is default FALSE and indicates
#' that the proportions are to be interpreted in a vertical manner.
#' If we want the data to be horizontal, i.e. the total should be shown
#' and then how these differ in the different groups then set this to TRUE.
#' @param add_total_col This adds a total column to the resulting table.
#' You can also specify if you want the total column "first" or "last"
#' in the column order.
#' @param total_col_show_perc This is by default true but if
#' requested the percentages are suppressed as this sometimes may be confusing.
#' @param use_units If the Hmisc package's units() function has been employed
#' it may be interesting to have a column at the far right that indicates the
#' unit measurement. If this column is specified then the total column will
#' appear before the units (if specified as last). You can also set the value to
#' \code{"name"} and the units will be added to the name as a parenthesis,
#' e.g. Age (years).
#' @param units_column_name The name of the units column. Used if use_units = TRUE
#' @param percentage_sign If you want to suppress the percentage sign you
#' can set this variable to FALSE. You can also choose something else that
#' the default \% if you so wish by setting this variable.
#' @param header_count Set to \code{TRUE} if you want to add a header count,
#' e.g. Smoking; No. 25 observations, where there is a new line after the
#' factor name. If you want a different text for the second line you can
#' specifically use the \code{\link[base]{sprintf}} formatting, e.g. "No. \%s patients".
#' @param missing_value Value that is substituted for empty cells. Defaults to "-"
#' @param names_of_missing Optional character vector containing the names of returned statistics,
#' in case all returned values for a given \code{by} level are missing. Defaults to NULL
#' @param default_ref The default reference when dealing with proportions. When using
#' `dplyr` syntax (`tidyselect`) you can specify a named vector/list for each column name.
#' @return Returns \code{matrix} if a single value was provided, otherwise a \code{list}
#' of matrices with the class \code{"Gmisc_getDescriptionStatsBy"}.
#'
#' @example inst/examples/getDescriptionStatsBy_example.R
#'
#' @family descriptive functions
#'
#' @importFrom Hmisc label label<- units capitalize
#' @importFrom stats na.omit
#' @importFrom methods is
#'
#' @export
getDescriptionStatsBy <- function(x,
...,
by,
digits = 1,
digits.nonzero = NA,
html = TRUE,
numbers_first = TRUE,
statistics = FALSE,
statistics.sig_lim = 10^-4,
statistics.two_dec_lim = 10^-2,
statistics.suppress_warnings = TRUE,
useNA = c("ifany", "no", "always"),
useNA.digits = digits,
continuous_fn = describeMean,
prop_fn = describeProp,
factor_fn = describeFactors,
show_all_values = FALSE,
hrzl_prop = FALSE,
add_total_col,
total_col_show_perc = TRUE,
use_units = FALSE,
units_column_name = "Units",
default_ref = NULL,
NEJMstyle = FALSE,
percentage_sign = TRUE,
header_count = NULL,
missing_value = "-",
names_of_missing = NULL) {
UseMethod("getDescriptionStatsBy")
}
#' @exportS3Method
#' @importFrom rlang expr enquo
#' @importFrom methods formalArgs
getDescriptionStatsBy.data.frame <- function(x,
...,
by,
digits = 1,
digits.nonzero = NA,
html = TRUE,
numbers_first = TRUE,
statistics = FALSE,
statistics.sig_lim = 10^-4,
statistics.two_dec_lim = 10^-2,
statistics.suppress_warnings = TRUE,
useNA = c("ifany", "no", "always"),
useNA.digits = digits,
continuous_fn = describeMean,
prop_fn = describeProp,
factor_fn = describeFactors,
show_all_values = FALSE,
hrzl_prop = FALSE,
add_total_col,
total_col_show_perc = TRUE,
use_units = FALSE,
units_column_name = "Units",
default_ref = NULL,
NEJMstyle = FALSE,
percentage_sign = TRUE,
header_count = NULL,
missing_value = "-",
names_of_missing = NULL) {
Call = match.call()
safeLoadPkg("tidyselect")
vars <- tidyselect::eval_select(rlang::expr(c(...)), x)
if (missing(by)) {
if (length(vars) == 2) {
by_var <- vars[2]
by <- x[[by_var]]
vars <- head(vars, -1)
} else {
stop("When providing a data.frame as input you must specify the by variables")
}
} else {
by_var <- tidyselect::eval_select(rlang::enquo(by), x)
by = x[[by_var]]
}
if (label(by) == "") {
label(by) <- names(by_var)
}
base_call <- as.list(Call)
base_call <- base_call[!names(base_call) %in% c("", "x")]
base_call <- base_call[names(base_call) %in% formalArgs(getDescriptionStatsBy)]
base_call$by = by
get_single_result <- function(var, name) {
base_call$x = x[[var]]
column_name <- colnames(x)[var]
# If the user has set the name in the input parameter then it is
# highly likely that they want to override any label() name
if (column_name != name && name != "" && !is.null(name) || label(base_call$x) == "") {
label(base_call$x) <- name
}
if (column_name %in% names(default_ref)) {
base_call$default_ref <- default_ref[[column_name]]
} else if (!is.null(names(default_ref))) {
base_call$default_ref <- NULL
}
do.call(getDescriptionStatsBy, base_call)
}
if (length(vars) == 1) {
return(get_single_result(var = vars, name = names(vars)))
}
multiple_results <- mapply(FUN = get_single_result,
var = vars,
name = names(vars),
SIMPLIFY = FALSE,
USE.NAMES = FALSE)
structure(multiple_results,
class = c("Gmisc_getDescriptionStatsBy", class(multiple_results)),
multiple = TRUE,
raw_data = x)
}
#' @exportS3Method
getDescriptionStatsBy.default <- function(x,
...,
by,
digits = 1,
digits.nonzero = NA,
html = TRUE,
numbers_first = TRUE,
statistics = FALSE,
statistics.sig_lim = 10^-4,
statistics.two_dec_lim = 10^-2,
statistics.suppress_warnings = TRUE,
useNA = c("ifany", "no", "always"),
useNA.digits = digits,
continuous_fn = describeMean,
prop_fn = describeProp,
factor_fn = describeFactors,
show_all_values = FALSE,
hrzl_prop = FALSE,
add_total_col,
total_col_show_perc = TRUE,
use_units = FALSE,
units_column_name = "Units",
default_ref = NULL,
NEJMstyle = FALSE,
percentage_sign = TRUE,
header_count = NULL,
missing_value = "-",
names_of_missing = NULL) {
vars <- list(...)
if (missing(by)) {
by = vars[[1]]
vars[[1]] <- NULL
}
if (length(vars) > 0) {
stop("As of version 3.0 of Gmisc the htmlTable handles unnamed parameters differently. Try to use named arguments instead of positional.")
}
useNA <- match.arg(useNA)
if (!is.na(digits.nonzero)) {
if (!is.numeric(digits.nonzero) ||
floor(digits.nonzero) != digits.nonzero
) {
stop("The digits.nonzero should be an integer, you provided: ", digits.nonzero)
}
if (digits.nonzero < digits) {
stop("The digits.nonzero must be smaller than digits")
}
}
if (!is.function(statistics)) {
if (is.list(statistics) ||
(statistics != FALSE)) {
if (is.list(statistics)) {
types <- c(
"continuous",
"proportion",
"factor"
)
if (any(!names(statistics) %in% types)) {
stop(
"If you want to provide custom functions for generating statistics",
" you must either provide a function or a list with the elements:",
" '", paste(types, collapse = "', '"), "'"
)
}
if (is.numeric(x) &&
length(unique(x)) != 2) {
statistics <- statistics[["continuous"]]
} else if (length(unique(x)) == 2) {
if ("proportion" %in% names(statistics)) {
statistics <- statistics[["proportion"]]
} else {
statistics <- statistics[["factor"]]
}
} else {
statistics <- statistics[["factor"]]
}
if (is.character(statistics)) {
statistics <- get(statistics)
}
}
if (!is.function(statistics)) {
if (length(unique(x)) == 2) {
statistics <- getPvalFisher
} else if (is.numeric(x)) {
if (length(unique(by)) == 2) {
statistics <- getPvalWilcox
} else {
statistics <- getPvalAnova
}
} else {
statistics <- getPvalFisher
}
}
}
}
# Always have a total column if the description statistics
# are presented in a horizontal fashion
if (missing(add_total_col) &&
hrzl_prop) {
add_total_col <- TRUE
}
if (is.null(x)) {
stop(
"You haven't provided an x-value to do the statistics by.",
" This error is most frequently caused by referencing an old",
" variable name that doesn't exist anymore"
)
}
if (is.null(by)) {
stop(
"You haven't provided a by-value to do the statistics by.",
" This error is most frequently caused by referencing an old",
" variable name that doesn't exist anymore"
)
}
# If there is a label for the variable
# that one should be used otherwise go
# with the name of the variable
if (label(x) == "") {
name <- paste0(deparse(substitute(x)), collapse = "")
} else {
name <- label(x)
}
if (is.logical(x)) {
x <- factor(x, levels = c(TRUE, FALSE))
} else if (is.character(x) && any(table(x, by) == 0)) {
x <- factor(x)
}
# Check missing -
# Send a warning, since the user might be unaware of this
# potentially disturbing fact. The dataset should perhaps by
# subsetted by is.na(by) == FALSE
if (any(is.na(by))) {
warning(
sprintf("Your 'by' variable has %d missing values", sum(is.na(by))),
"\n The corresponding 'x' and 'by' variables are automatically removed"
)
x <- x[!is.na(by)]
if (inherits(x, "factor")) {
x <- factor(x)
}
by <- by[!is.na(by)]
if (inherits(by, "factor")) {
by <- factor(by)
}
}
if (useNA == "ifany" &&
any(is.na(x))) {
useNA <- "always"
}
# If all values are to be shown then simply use
# the factors function, provided describeProp was specified
if (show_all_values & deparse(substitute(prop_fn)) == "describeProp") {
prop_fn <- describeFactors
}
if (is.numeric(x) || is(x, "difftime")) {
t <- prNumericDescs(x = x,
by = by,
hrzl_prop = hrzl_prop,
continuous_fn = continuous_fn,
html = html,
digits = digits,
digits.nonzero = digits.nonzero,
numbers_first = numbers_first,
useNA = useNA,
useNA.digits = useNA.digits,
percentage_sign = percentage_sign,
missing_value = missing_value,
names_of_missing = names_of_missing)
} else if ((!is.factor(x) &&
length(unique(na.omit(x))) == 2) ||
(is.factor(x) &&
length(levels(x)) == 2) &&
hrzl_prop == FALSE) {
t <- prPropDescs(x = x,
by = by,
name = name,
default_ref = default_ref,
prop_fn = prop_fn,
html = html,
digits = digits,
digits.nonzero = digits.nonzero,
numbers_first = numbers_first,
useNA = useNA,
useNA.digits = useNA.digits,
percentage_sign = percentage_sign,
missing_value = missing_value,
names_of_missing = names_of_missing,
NEJMstyle = NEJMstyle)
} else {
# Make sure that the total isn't using proportions (happens with hrzl_prop)
prop_fn <- factor_fn
t <- prFactorDescs(x = x,
by = by,
factor_fn = factor_fn,
hrzl_prop = hrzl_prop,
html = html,
digits = digits,
digits.nonzero = digits.nonzero,
numbers_first = numbers_first,
useNA = useNA,
useNA.digits = useNA.digits,
percentage_sign = percentage_sign,
missing_value = missing_value,
names_of_missing = names_of_missing)
}
# Convert into a matrix
results <- prAddEmptyVals(t, missing_value = missing_value) %>%
unlist() %>%
matrix(ncol = length(t)) %>%
prFixDescRownames(t = t, name = name)
attr(results, "column_names") <- prGetDescHeader(by = by, header_count = header_count, html = html)
attr(results, "label") <- name
if (!missing(add_total_col) && add_total_col != FALSE) {
results %<>% prAddTotalDescColumn(x = x,
by = by,
numbers_first = numbers_first,
total_col_show_perc = total_col_show_perc,
show_all_values = show_all_values,
useNA = useNA,
useNA.digits = useNA.digits,
html = html,
digits = digits,
continuous_fn = continuous_fn,
factor_fn = factor_fn,
prop_fn = prop_fn,
percentage_sign = percentage_sign,
header_count = header_count,
add_total_col = add_total_col,
default_ref = default_ref)
}
if (!isFALSE(use_units)) {
results %<>% prAddDescUnitColumn(x = x,
use_units = use_units,
units_column_name = units_column_name)
}
if (is.function(statistics)) {
results %<>% prAddDescStats(x = x,
by = by,
statistics = statistics,
statistics.suppress_warnings = statistics.suppress_warnings,
statistics.sig_lim = statistics.sig_lim,
statistics.two_dec_lim = statistics.two_dec_lim,
html = html)
}
colnames(results) <- attr(results, "column_names")
# Even if one row has the same name this doesn't matter
# at this stage as it is information that may or may
# not be used later on
label(results) <- attr(results, "label")
structure(results,
class = c("Gmisc_getDescriptionStatsBy", class(results)),
multiple = FALSE,
raw_data = list(x = x, by = by))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.