get_stats.data.frame: Statistics of columns

get_stats.data.frame    R Documentation

Description

Takes a data frame and returns a table of statistics with entries for each column.

Usage

## S3 method for class 'data.frame'
get_stats(
  x,
  t_skew = 2,
  t_kurt = 3.5,
  t_avail = 0.65,
  t_zero = 0.5,
  t_unq = 0.5,
  nsignif = 3,
  ...
)

Arguments

x

A data frame with only numeric columns.

t_skew

Absolute skewness threshold. See details.

t_kurt

Kurtosis threshold. See details.

t_avail

Data availability threshold. See details.

t_zero

A threshold between 0 and 1 for flagging indicators with a high proportion of zeroes. See details.

t_unq

A threshold between 0 and 1 for flagging indicators with a low proportion of unique values. See details.

nsignif

Number of significant figures to round the output table to.

...

Arguments passed to or from other methods.

Details

The output table contains the following statistics as columns, with one entry for each column of x:

  • Min: the minimum

  • Max: the maximum

  • Mean: the (arithmetic) mean

  • Median: the median

  • Std: the standard deviation

  • Skew: the skewness

  • Kurt: the kurtosis

  • N.Avail: the number of non-NA values

  • N.NonZero: the number of non-zero values

  • N.Unique: the number of unique values

  • Frc.Avail: the fraction of non-NA values

  • Frc.NonZero: the fraction of non-zero values

  • Frc.Unique: the fraction of unique values

  • Flag.Avail: a data availability flag - columns with Frc.Avail < t_avail will be flagged as "LOW", else "ok".

  • Flag.NonZero: a flag for columns with a high proportion of zeros. Any columns with Frc.NonZero < t_zero are flagged as "LOW", otherwise "ok".

  • Flag.Unique: a unique value flag - any columns with Frc.Unique < t_unq are flagged as "LOW", otherwise "ok".

  • Flag.SkewKurt: a skew and kurtosis flag which is an indication of possible outliers. Any columns with abs(Skew) > t_skew AND Kurt > t_kurt are flagged as "OUT", otherwise "ok".
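The availability, zero and uniqueness flags described above can be sketched in base R as follows. This is only an illustration, not COINr's implementation: flag_columns is a hypothetical helper, and it assumes that all three fractions use the total number of rows as the denominator, which the documentation does not state.

```r
# Hypothetical helper, not part of COINr: reproduces the flag logic described
# above for the availability, non-zero and unique-value columns.
flag_columns <- function(x, t_avail = 0.65, t_zero = 0.5, t_unq = 0.5) {
  stopifnot(is.data.frame(x), all(vapply(x, is.numeric, logical(1))))
  n <- nrow(x)

  # Assumption: every fraction is taken relative to the total number of rows
  frc_avail   <- vapply(x, function(v) sum(!is.na(v)) / n, numeric(1))
  frc_nonzero <- vapply(x, function(v) sum(v != 0, na.rm = TRUE) / n, numeric(1))
  frc_unique  <- vapply(x, function(v) length(unique(v[!is.na(v)])) / n, numeric(1))

  data.frame(
    Frc.Avail    = frc_avail,
    Frc.NonZero  = frc_nonzero,
    Frc.Unique   = frc_unique,
    Flag.Avail   = ifelse(frc_avail < t_avail, "LOW", "ok"),
    Flag.NonZero = ifelse(frc_nonzero < t_zero, "LOW", "ok"),
    Flag.Unique  = ifelse(frc_unique < t_unq, "LOW", "ok"),
    row.names = names(x),
    stringsAsFactors = FALSE
  )
}

flags <- flag_columns(mtcars)
flags[c("mpg", "cyl", "vs"), ]
```

Under these assumptions, binary columns of mtcars such as vs fall below the default t_zero, and columns with few distinct values such as cyl fall below the default t_unq.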

Among other things, this table is intended for checking the basic statistics of each column (indicator) and identifying possible issues, such as low data availability, a high proportion of zeros, or a low proportion of unique values. In addition, the combination of skewness and kurtosis (the Flag.SkewKurt column) is a simple test for possible outliers, which may require treatment using Treat().
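To illustrate why the skewness and kurtosis combination can point to outliers, here is a minimal base R sketch of the Flag.SkewKurt rule. The helper flag_skew_kurt is hypothetical, and it assumes simple moment-based estimators with excess kurtosis; the estimators COINr actually uses may differ (for example by applying sample-size corrections), so values near the thresholds may not match.

```r
# Hypothetical helper, not part of COINr: flags a numeric vector as "OUT" when
# abs(skewness) > t_skew AND kurtosis > t_kurt, otherwise "ok".
flag_skew_kurt <- function(v, t_skew = 2, t_kurt = 3.5) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  m2 <- mean(d^2)                  # second central moment
  skew <- mean(d^3) / m2^1.5       # assumption: moment-based skewness
  kurt <- mean(d^4) / m2^2 - 3     # assumption: excess kurtosis
  if (abs(skew) > t_skew && kurt > t_kurt) "OUT" else "ok"
}

flag_skew_kurt(1:20)             # evenly spread values
flag_skew_kurt(c(1:20, 1000))    # same values plus one extreme point
```

A single extreme value pulls both the skewness and the kurtosis well above the default thresholds, whereas the evenly spread vector stays below them.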

See also vignette("analysis").

Value

A data frame of statistics for each column.

Examples

# stats of mtcars
get_stats(mtcars)
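The thresholds can also be set explicitly. The following sketch uses only the argument names listed under Usage; the threshold values are arbitrary choices for illustration, and the call is guarded so that it is skipped when COINr is not installed.

```r
# Guarded so the snippet does nothing if COINr is not installed
if (requireNamespace("COINr", quietly = TRUE)) {
  # Stricter zero and uniqueness thresholds than the defaults (arbitrary values),
  # with the output rounded to 2 significant figures
  stats_strict <- COINr::get_stats(mtcars, t_zero = 0.6, t_unq = 0.8, nsignif = 2)
  print(stats_strict)
}
```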


COINr documentation built on Oct. 9, 2023, 5:07 p.m.