describe | R Documentation |
Obtain a useful array of common summary statistics for a
vector/variable, with output customized to the class of the variable.
Uses a combination of tidyverse packages and data.table to provide a
user-friendly, pipe-friendly interface while leveraging the
excellent performance of data.table. The ... argument also makes
it easy to obtain summaries split by grouping variables. While
similar functions exist in other packages (e.g. describeBy or skim), this
version provides some of the useful added outputs of the psych package (e.g.
se, skew, and kurtosis for numeric variables) while offering slightly more
concise syntax than skim (e.g. no preceding group_by operation is needed for
group-wise calculations) and still achieving processing times comparable to
the alternatives. To obtain summaries for all variables in a data frame, use
describe_all instead.
describe( data, y = NULL, ..., digits = 3, type = 2, na.rm = TRUE, sep = "_", output = c("tibble", "dt") )
data |
Either a vector or a data frame or tibble containing the vector ("y") to be summarized and any grouping variables. |
y |
If the data object is a data.frame, this is the variable for which you wish to obtain a descriptive summary. You can use either the quoted or unquoted name of the variable, e.g. "y_var" or y_var. |
... |
If the data object is a data.frame, this special argument accepts
any number of unquoted grouping variable names (also present in the data
source) to use for subsetting, separated by commas, e.g. |
digits |
This determines the number of digits used for rounding of numeric outputs. |
type |
For numeric and integer vectors this determines the type of
skewness and kurtosis calculations to perform. See
|
na.rm |
This determines whether missing values (NAs) should be removed before attempting to calculate summary statistics. |
sep |
A character string to use to separate unique values from their counts ("_" by default). Only applicable to factors and character vectors. |
output |
The output format, used for every class of variable: "dt" for data.table or "tibble" for tibble. |
The output varies as a function of the class of the input data/y, referred to as "y" below.
For all input variables, the following are returned (part 1):
the total number of cases
number of complete cases
the number of missing values
the proportion of total cases with missing values
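As a rough illustration of the part 1 measures, the following base R sketch computes them for a single vector. The helper name part1_summary and its element names are hypothetical and are not taken from the package; describe() may label or compute these values differently.

```r
# Hypothetical helper (not part of the package): part 1 measures for a vector
part1_summary <- function(y) {
  cases <- length(y)          # total number of cases
  n_missing <- sum(is.na(y))  # number of missing values
  c(
    cases = cases,
    complete = cases - n_missing,  # number of complete (non-missing) cases
    missing = n_missing,
    p_missing = n_missing / cases  # proportion of total cases that are missing
  )
}

part1_summary(c(1, 2, NA, 4))
```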
In addition to part 1, these measures are provided for dates:
the total number of unique values or levels of y. For dates this tells you how many time points there are
the earliest or minimum date in y
the latest or maximum date in y
In addition to part 1, these measures are provided for factors:
the total number of unique values or levels of y
a logical indicating whether or not y is ordinal
the counts of the top and bottom unique values of y in order of decreasing frequency, formatted as "value_count". If there are more than 4 unique values of y, only the top 2 and bottom 2 unique values are shown, separated by "...". To get counts for all unique values, use counts instead.
In addition to part 1, these measures are provided for character/string vectors:
the total number of unique values or levels of y
the minimum number of characters in the values of y
the maximum number of characters in the values of y
the counts of the top and bottom unique values of y in order of decreasing frequency, formatted as "value_count". If there are more than 4 unique values of y, only the top 2 and bottom 2 unique values are shown, separated by "...". To get counts for all unique values, use counts instead.
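The "value_count" formatting described for factors and character vectors can be sketched in base R roughly as follows. This is only an approximation under stated assumptions: the helper name count_string is hypothetical, and joining the entries with ", " is a guess at the layout rather than the package's documented behaviour.

```r
# Hypothetical helper (not part of the package): build "value_count" strings
# in order of decreasing frequency, keeping only the top 2 and bottom 2
# entries when there are more than 4 unique values.
count_string <- function(y, sep = "_") {
  tab <- sort(table(y), decreasing = TRUE)
  labels <- paste(names(tab), as.integer(tab), sep = sep)
  k <- length(labels)
  if (k > 4) {
    labels <- c(labels[1:2], "...", labels[(k - 1):k])
  }
  # Assumption: entries are joined with ", "; the real output may differ
  paste(labels, collapse = ", ")
}

y <- rep(c("a", "b", "c", "d", "e"), times = c(3, 5, 1, 4, 2))
count_string(y)
```

Changing sep here mirrors the role of the sep argument, which separates each unique value from its count.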
In addition to part 1, these measures are provided for logical vectors:
the total number of y values that are TRUE
the total number of y values that are FALSE
the proportion of y values that are TRUE
In addition to part 1, these measures are provided for numeric variables:
the mean of y
the standard deviation of y
the standard error of the mean of y
the 0th percentile (the minimum) of y
the 25th percentile of y
the 50th percentile (the median) of y
the 75th percentile of y
the 100th percentile (the maximum) of y
the skewness of the distribution of y
the kurtosis of the distribution of y
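For readers who want to see how the numeric measures relate to one another, here is a minimal base R sketch. It assumes that type = 2 corresponds to the "type 2" sample skewness and excess kurtosis adjustments in Joanes and Gill (1998); the helper name numeric_summary and its element names are hypothetical, and the package's own calculations may differ in detail.

```r
# Hypothetical helper (not part of the package): numeric measures for a vector
numeric_summary <- function(y, na.rm = TRUE) {
  if (na.rm) y <- y[!is.na(y)]
  n <- length(y)
  d <- y - mean(y)
  m2 <- mean(d^2)             # second central moment
  g1 <- mean(d^3) / m2^1.5    # moment-based skewness
  g2 <- mean(d^4) / m2^2 - 3  # moment-based excess kurtosis
  q <- quantile(y, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c(
    mean = mean(y),
    sd = sd(y),
    se = sd(y) / sqrt(n),     # standard error of the mean
    p0 = q[1], p25 = q[2], p50 = q[3], p75 = q[4], p100 = q[5],
    # Assumed "type 2" adjustments (Joanes and Gill, 1998); needs n > 3
    skew = g1 * sqrt(n * (n - 1)) / (n - 2),
    kurt = ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))
  )
}

round(numeric_summary(c(1, 2, 3, 4, 5, NA)), digits = 3)
```

The final round() call plays the same role as the digits argument of describe().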
Craig P. Hutton, craig.hutton@gov.bc.ca
Altman, D. G., & Bland, J. M. (2005). Standard deviations and standard errors. BMJ, 331(7521), 903.
Bulmer, M. G. (1979). Principles of statistics. Courier Corporation.
Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.
mean, sd, se, quantile, skewness, kurtosis, counts, counts_tb
describe(data = pdata, y = y1) #no grouping variables, numeric input class
describe(pdata, y1, high_low) #one grouping variable, numeric input class
describe(pdata, g) #factor input class
describe(pdata, even) #logical input class