# descr: Detailed Statistical Description of Data Frame In SebKrantz/collapse: Advanced and Fast Data Transformation


## Detailed Statistical Description of Data Frame

### Description

`descr` offers a fast and detailed description of each variable in a data frame. Since v1.9.0 it fully supports grouped and weighted computations.

### Usage

``````
descr(X, ...)

## Default S3 method:
descr(X, by = NULL, w = NULL, cols = NULL,
Ndistinct = TRUE, higher = TRUE, table = TRUE, sort.table = "freq",
Qprobs = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), Qtype = 7L,
label.attr = "label", stepwise = FALSE, ...)

## S3 method for class 'grouped_df'
descr(X, w = NULL,
Ndistinct = TRUE, higher = TRUE, table = TRUE, sort.table = "freq",
Qprobs = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), Qtype = 7L,
label.attr = "label", stepwise = FALSE, ...)

## S3 method for class 'descr'
as.data.frame(x, ..., gid = "Group")

## S3 method for class 'descr'
print(x, n = 14, perc = TRUE, digits = .op[["digits"]], t.table = TRUE, total = TRUE,
compact = FALSE, summary = !compact, reverse = FALSE, stepwise = FALSE, ...)
``````

### Arguments

`X`

a (grouped) data frame or list of atomic vectors. Atomic vectors, matrices or arrays can be passed but will first be coerced to data frame using `qDF`.

`by`

a factor, `GRP` object, or atomic vector / list of vectors (internally grouped with `GRP`), or a one- or two-sided formula e.g. `~ group1` or `var1 + var2 ~ group1 + group2` to group `X`. See Examples.

`w`

a numeric vector of (non-negative) weights. The default method also supports one-sided formulas, e.g. `~ weightcol` or `~ log(weightcol)`. The `grouped_df` method supports lazy expressions (the same, without `~`). See Examples.

`cols`

select columns to describe using column names, indices, a logical vector or a selector function (e.g. `is.numeric`). Note: `cols` is ignored if a two-sided formula is passed to `by`.

`Ndistinct`

logical. `TRUE` (default) computes the number of distinct values on all variables using `fndistinct`.

`higher`

logical. Argument is passed down to `qsu`: `TRUE` (default) computes the skewness and the kurtosis.

`table`

logical. `TRUE` (default) computes a (sorted) frequency table for all categorical variables (excluding Date variables).

`sort.table`

an integer or character string specifying how the frequency table should be presented:

| Int. | String | Description |
|------|--------|-------------|
| 1 | "value" | sort table by values. |
| 2 | "freq" | sort table by frequencies. |
| 3 | "none" | return table in first-appearance order of values, or levels for factors (most efficient). |

`Qprobs`

double. Probabilities for quantiles to compute on numeric variables, passed down to `.quantile`. If something non-numeric is passed (i.e. `NULL`, `FALSE`, `NA`, `""` etc.), no quantiles are computed.

`Qtype`

integer. Quantile type, 5-9, following Hyndman and Fan (1996), who recommended type 8. The default, 7, is the same as in `quantile`.

`label.attr`

character. The name of a label attribute to display for each variable (if variables are labeled).

`...`

for `descr`: other arguments passed to `qsu.default`. For `[.descr`: variable names or indices passed to `[.list`. The argument is unused in the `print` and `as.data.frame` methods.

`x`

an object of class 'descr'.

`n`

integer. The maximum number of table elements to print for categorical variables. If the number of distinct elements is `<= n`, the whole table is printed. Otherwise the remaining items are summed into an '... %s Others' category.

`perc`

logical. `TRUE` (default) adds percentages to the frequencies in the table for categorical variables, and, if `!is.null(by)`, the percentage of observations in each group.

`digits`

integer. The number of decimals to print in statistics, quantiles and percentage tables.

`t.table`

logical. `TRUE` (default) prints a transposed table.

`total`

logical. `TRUE` (default) adds a 'Total' column for grouped tables (when using `by` argument).

`compact`

logical. `TRUE` combines statistics and quantiles to generate a more compact printout. Especially useful with groups (`by`).

`summary`

logical. `TRUE` (default) computes and displays a summary of the frequencies, if the size of the table for a categorical variable exceeds `n`.

`reverse`

logical. `TRUE` prints contents in reverse order, starting with the last column, so that the dataset can be analyzed by scrolling up the console after calling `descr`.

`stepwise`

logical. `TRUE` prints one variable at a time. The user needs to press [enter] to see the printout for the next variable. If called from `descr`, the computation is also done one variable at a time, and the finished 'descr' object is returned invisibly.

`gid`

character. Name assigned to the group-id column, when describing data by groups.
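
A few of the arguments above can be combined to narrow down what is computed and how it is printed. The following is a minimal sketch, assuming the collapse package (v1.9.0 or later) is installed. It uses the built-in `iris` data, and the object names `d_num` and `d_cat` are chosen here for illustration only.

```r
library(collapse)

# Describe only the numeric columns, with three quantiles and
# without skewness and kurtosis (higher = FALSE)
d_num <- descr(iris, cols = is.numeric,
               Qprobs = c(0.25, 0.5, 0.75), higher = FALSE)
print(d_num)

# Describe only the factor column and sort its frequency table by value
d_cat <- descr(iris, cols = "Species", sort.table = "value")

# Print frequencies without the percentage column
print(d_cat, perc = FALSE)
```

Restricting `cols` and `Qprobs` at the `descr` call reduces the work done, whereas `perc`, `n`, `compact` and similar options only affect how an existing 'descr' object is printed.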

### Details

`descr` was heavily inspired by `Hmisc::describe`, but is much faster and has more advanced statistical capabilities. It is principally a wrapper around `qsu`, `fquantile` (`.quantile`), and `fndistinct` for numeric variables, and computes frequency tables for categorical variables using `qtab`. Date variables are summarized with `fnobs`, `fndistinct` and `frange`.
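
The building blocks named above can also be called directly on a single column. The sketch below assumes collapse v1.9.0 or later is installed. It only illustrates the individual functions and is not a reproduction of how `descr` combines them internally. The object names are chosen for illustration.

```r
library(collapse)

x <- iris$Sepal.Length

# Summary statistics, including skewness and kurtosis
stats <- qsu(x, higher = TRUE)

# Quantiles of type 7, the default also used by stats::quantile
probs <- c(0.25, 0.5, 0.75)
q_fast <- fquantile(x, probs = probs, type = 7L)
q_base <- quantile(x, probs = probs, type = 7)

# Number of distinct values
nd <- fndistinct(x)

# Frequency table for a categorical variable
tab <- qtab(iris$Species)

print(stats)
print(q_fast)
print(nd)
print(tab)
```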

Since v1.9.0 grouped and weighted computations are fully supported. The use of sampling weights will produce a weighted mean, sd, skewness and kurtosis, and weighted quantiles for numeric data. For categorical data, tables will display the sum of weights instead of the frequencies, and percentage tables, as well as the percentage of missing values indicated next to 'Statistics' in print, will be relative to the total sum of weights. All this can be done by groups. Grouped (weighted) quantiles are computed using `BY`.
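
To make the weighted quantities concrete, here is a small base R sketch on hypothetical toy data. It uses only base functions (`weighted.mean`, `tapply`) to show what a weighted mean, a sum-of-weights table and a weighted share of missing values look like. It is an illustration of the concepts, not collapse's implementation.

```r
# Toy data (hypothetical): a numeric and a categorical variable, with weights
y <- c(1, 2, 3, 4, 5, 6)
x <- c("a", "b", "a", NA, "b", "a")
w <- c(2, 1, 3, 4, 1, 1)

# Weighted mean of the numeric variable: sum(y * w) / sum(w)
wmean <- weighted.mean(y, w)

# Sum of weights per category, the weighted analogue of a frequency table
# (tapply drops the observation where x is NA)
wtab <- tapply(w, x, sum)

# Share of missing values in x, relative to the total sum of weights
pct_missing <- 100 * sum(w[is.na(x)]) / sum(w)

print(wmean)
print(wtab)
print(pct_missing)
```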

For larger datasets, calling the `stepwise` option directly from `descr()` is recommended, as precomputing the statistics for all variables before digesting the results can be time consuming.

The list-object returned from `descr` can efficiently be converted to a tidy data frame using the `as.data.frame` method. This representation will not include frequency tables computed for categorical variables.
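
For a grouped description, the `gid` argument names the column that holds the group identifier. A minimal sketch, assuming collapse v1.9.0 or later is installed. The exact set of columns in the result is not asserted here, so inspect it with `head` or `str`.

```r
library(collapse)

# Grouped description of iris by Species, without frequency tables
d <- descr(iris, ~ Species, table = FALSE)

# Convert to a data frame, naming the group-id column "Species"
df <- as.data.frame(d, gid = "Species")

head(df)
```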

### Value

A 2-level nested list-based object of class 'descr'. The list has the same size as the dataset, and contains the statistics computed for each variable, which are themselves stored in a list containing the class, the label, the basic statistics and quantiles / tables computed for the variable (in matrix form).

The object has attributes attached providing the 'name' of the dataset, the number of rows in the dataset ('N'), an attribute 'arstat' indicating whether arrays of statistics were generated by passing arguments (e.g. `pid`) down to `qsu.default`, an attribute 'table' indicating whether `table = TRUE` (i.e. the object could contain tables for categorical variables), and attributes 'groups' and/or 'weights' providing a `GRP` object and/or weight vector for grouped and/or weighted data descriptions.
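
The structure described above can be explored with ordinary list tools. The following sketch assumes collapse v1.9.0 or later is installed and only reads the documented attributes 'N' and 'table'. The names of the inner list elements are not assumed, so `str` is used to show them.

```r
library(collapse)

d <- descr(iris)

# One list element per variable in the dataset
n_vars <- length(d)

# Inner list for the first variable: class, label, statistics, quantiles/table
str(d[[1L]], max.level = 1)

# Documented attributes: number of rows, and whether tables were requested
n_rows <- attr(d, "N")
has_tables <- attr(d, "table")

print(n_vars)
print(n_rows)
print(has_tables)
```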

### See Also

`qsu`, `qtab`, `fquantile`, `pwcor`, Summary Statistics, Fast Statistical Functions, Collapse Overview

### Examples

``````
## Simple Use
descr(iris)
descr(wlddev)
descr(GGDC10S)

# Some useful print options (also try stepwise argument)
print(descr(GGDC10S), reverse = TRUE, t.table = FALSE)
# For bigger data consider: descr(big_data, stepwise = TRUE)

# Generating a data frame
as.data.frame(descr(wlddev, table = FALSE))

## Weighted Descriptions
descr(wlddev, w = ~ replace_na(POP)) # replacing NA's with 0's for fquantile()

## Grouped Descriptions
descr(GGDC10S, ~ Variable)
descr(wlddev, ~ income)
print(descr(wlddev, ~ income), compact = TRUE)

## Grouped & Weighted Descriptions
descr(wlddev, ~ income, w = ~ replace_na(POP))

## Passing Arguments down to qsu.default: for Panel Data Statistics
descr(iris, pid = iris$Species)
descr(wlddev, pid = wlddev$iso3c)

``````

SebKrantz/collapse documentation built on Sept. 16, 2024, 6:28 a.m.