inspect_num: Summary and comparison of numeric columns

View source: R/inspect_num.R

inspect_num R Documentation

Summary and comparison of numeric columns

Description

For a single dataframe, summarise the numeric columns. If two dataframes are supplied, compare numeric columns appearing in both dataframes. For grouped dataframes, summarise numeric columns separately for each group.

Usage

inspect_num(df1, df2 = NULL, breaks = 20, include_int = TRUE)

Arguments

df1

A dataframe.

df2

An optional second dataframe for comparing numeric columns. Defaults to NULL.

breaks

Integer number of breaks used for histogram bins, passed to graphics::hist() as hist(..., breaks). Defaults to 20. See ?hist for more details.

include_int

Logical flag, whether to include integer columns in numeric summaries. Defaults to TRUE.

Details

For a single dataframe, the tibble returned contains the columns:

  • col_name, a character vector containing the column names in df1

  • min, q1, median, mean, q3, max and sd, the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.

  • pcnt_na, the percentage of values that are missing in each numeric column

  • hist, a named list of tibbles containing the relative frequency of values falling in bins determined by breaks.
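
The columns listed above can be reproduced by hand for a single numeric vector. The following is a minimal base R sketch, not the package's own implementation: summarise_numeric is a hypothetical helper, and the layout of the hist element (lower, upper, prop) is an assumption made for illustration.

```r
# Hypothetical helper, not part of inspectdf: summarise one numeric vector.
summarise_numeric <- function(x, breaks = 20) {
  # Lower quartile, median and upper quartile, ignoring missing values
  qs <- stats::quantile(x, probs = c(0.25, 0.5, 0.75),
                        na.rm = TRUE, names = FALSE)
  # hist() drops non-finite values itself; plot = FALSE returns the bins only
  h <- graphics::hist(x, breaks = breaks, plot = FALSE)
  n_breaks <- length(h$breaks)
  list(
    min     = min(x, na.rm = TRUE),
    q1      = qs[1],
    median  = qs[2],
    mean    = mean(x, na.rm = TRUE),
    q3      = qs[3],
    max     = max(x, na.rm = TRUE),
    sd      = stats::sd(x, na.rm = TRUE),
    pcnt_na = 100 * mean(is.na(x)),
    # Relative frequency per bin (column names are assumed for this sketch)
    hist    = data.frame(
      lower = h$breaks[-n_breaks],
      upper = h$breaks[-1],
      prop  = h$counts / sum(h$counts)
    )
  )
}

x <- c(1, 2, 3, 4, NA)
s <- summarise_numeric(x, breaks = 4)
s$pcnt_na  # one of five values is missing
s$hist
```

Note that when breaks is a single number, hist() treats it as a suggestion, so the number of bins actually returned may differ slightly.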

For a pair of dataframes, the tibble returned contains the columns:

  • col_name, a character vector containing the column names in df1 and df2

  • hist_1, hist_2, a list column for histograms of each of df1 and df2. Where a column appears in both dataframes, the bins used for df1 are reused to calculate histograms for df2.

  • jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in distribution of a pair of binned numeric features. Values near 0 indicate that the distributions agree, while values near 1 indicate that they differ.

  • pval, the p-value corresponding to a null hypothesis test that the true frequencies of histogram bins are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
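
To make the jsd column more concrete, here is a minimal base R sketch of the Jensen-Shannon divergence between two histograms that share bins. It is an illustration only, not the package's internal code: jsd_counts is a hypothetical helper, base-2 logarithms are assumed so that the result lies between 0 and 1, and the way the outer bin edges are widened to cover the second sample is also an assumption.

```r
# Hypothetical helper, not part of inspectdf: JSD between two vectors of
# bin counts computed on the same bins.
jsd_counts <- function(counts_1, counts_2) {
  p <- counts_1 / sum(counts_1)
  q <- counts_2 / sum(counts_2)
  m <- (p + q) / 2
  # Kullback-Leibler divergence in bits; empty bins contribute zero
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

set.seed(1)
x1 <- rnorm(500)
x2 <- rnorm(500, mean = 1)

# Bins are chosen from the first sample, then reused for the second
brk <- graphics::hist(x1, breaks = 20, plot = FALSE)$breaks
# Assumption of this sketch: widen the outer edges so every x2 value is counted
brk[1] <- min(brk[1], min(x2))
brk[length(brk)] <- max(brk[length(brk)], max(x2))

c1 <- graphics::hist(x1, breaks = brk, plot = FALSE)$counts
c2 <- graphics::hist(x2, breaks = brk, plot = FALSE)$counts
jsd_counts(c1, c2)
```

Identical count vectors give a divergence of 0, and histograms with no overlapping bins give 1, which matches the interpretation of the jsd column described above.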

For a grouped dataframe, the tibble returned is as for a single dataframe, but where the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.

Value

A tibble containing statistical summaries of the numeric columns of df1, or comparing the histograms of df1 and df2.

Author(s)

Alastair Rushworth

See Also

show_plot

Examples

# Load dplyr for starwars data & pipe
library(dplyr)

# Single dataframe summary
inspect_num(starwars)

# Paired dataframe comparison
inspect_num(starwars, starwars[1:20, ])

# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_num()

inspectdf documentation built on Aug. 9, 2022, 9:05 a.m.