DescriptiveStats: Automatically Calculates Statistics on any Dataset.

View source: R/DescriptiveStats.R

DescriptiveStats    R Documentation

Automatically Calculates Statistics on any Dataset.

Description

Automatically calculates a wide variety of Statistics on any Dataset (tibble). When calling this function, make sure to assign its output to a variable, e.g.: MyDFStats <- DescriptiveStats(MyDF, TRUE).

Usage

DescriptiveStats(
  VarDF,
  CalculateGraphs,
  IncludeInteger = TRUE,
  RoundAt = 2,
  AbbrevStrLevelsAfterNcount = 5,
  AllHistsOn1Page = TRUE,
  AllBoxplotsOn1Page = FALSE,
  AllBarChartsOn1Page = TRUE,
  DependentVar = NULL,
  ShowGraphs = FALSE,
  BoxplotPointsColourVar = NULL,
  NoPrints = FALSE,
  IsTimeSeries = FALSE,
  GroupBy = NULL,
  TimeFlowVar = NULL,
  HideLegendInPerGroup = FALSE,
  BoxPlotPointSize = 0.4,
  BoxPlotPointAlpha = 0.1,
  SampleIfNRowGT = 10000,
  SeedForSampling = NULL,
  CalcPValues = TRUE,
  SignificanceLevel = 0.01,
  CorrVarOrder = "PCA",
  TimeseriesMaxLag = NULL,
  DatesToNowMinusDate = FALSE,
  DatesToCyclicMonth = FALSE,
  DatesToCyclicDayOfWeek = FALSE,
  DatesToCyclicDayOfMonth = FALSE,
  DatesToCyclicDayOfYear = FALSE,
  DatesToYearCat = FALSE,
  DatesToMonthCat = FALSE,
  DatesToDayOfWeekCat = FALSE,
  DatesToDayOfMonthCat = FALSE,
  DatesToDayOfYearCat = FALSE,
  DatesToHourCat = FALSE,
  DatesToMinuteCat = FALSE,
  ExcludeTaperedAutocor = FALSE,
  MaxTaperedRows = 250,
  VarsToExcludeFromTimeseries = NULL,
  ExludeCovariances = TRUE,
  DateBreaks = NULL,
  DateLabels = NULL,
  DateTextAngle = 0,
  Verbose = NULL
)

Arguments

VarDF

A Tibble (data.frame). This is the Dataset to be analysed. Ensure that Categorical variables are set as factor(), texts as character(), Integers as integer(), and Booleans as logical()
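
As a minimal sketch of the type preparation described above, the following uses base R only. The data frame RawDF and its column names are hypothetical and serve purely as an illustration; they are not part of the package.

```r
# Hypothetical raw data frame; the column names are illustrative only
RawDF <- data.frame(
  Colour = c("Red", "Cyan", "Black"),
  Notes  = c("first", "second", "third"),
  Count  = c(1, 2, 3),
  Flag   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Coerce each column to the type the function expects
RawDF$Colour <- factor(RawDF$Colour)       # categorical   -> factor
RawDF$Notes  <- as.character(RawDF$Notes)  # free text     -> character
RawDF$Count  <- as.integer(RawDF$Count)    # whole numbers -> integer
RawDF$Flag   <- as.logical(RawDF$Flag)     # 0/1 flags     -> logical

# Check the resulting column types
str(RawDF)
```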

CalculateGraphs

Boolean. If FALSE then only Descriptives and Pearson Correlations are calculated

IncludeInteger

Boolean. If FALSE then no statistics will be calculated for Integer variables

RoundAt

Integer. Descriptive Statistics values like Min, Max, etc. will be rounded to this decimal place

AbbrevStrLevelsAfterNcount

Integer. Unique values (levels) of String and Factor variables will be displayed up to this number, e.g. 'Colours: Red (3%), Cyan (0.8%), ..., Black (8%)' for a value of 3

AllHistsOn1Page

Boolean. Only applies if ShowGraphs==TRUE. If TRUE, instead of creating a new plot per histogram, just 1 plot will be created, with all histograms juxtaposed inside

AllBoxplotsOn1Page

Boolean. If TRUE, instead of creating 1 plot per variable, all Boxplots are placed inside 1 plot. However, when the variables differ greatly in value range, the low-range Boxplots are practically invisible

AllBarChartsOn1Page

Boolean. Only applies if ShowGraphs==TRUE. If TRUE, instead of creating a new plot per Bar chart, just 1 plot will be created, with all Bar charts juxtaposed inside

DependentVar

String. If there is a variable to be considered as Target/Dependent, then mention its name here

ShowGraphs

Boolean. If TRUE, certain graphs will be displayed at the time of calculation, before you access them via the variable

BoxplotPointsColourVar

String or Numeric Vector. Either a string indicating the Variable name inside VarDF, or a numerical vector to be used as colour

NoPrints

Boolean. If TRUE, not even the Descriptive Statistics Matrices will be displayed at the time of the calculation. Everything will be accessible from the variable this function's output is assigned to

IsTimeSeries

Boolean. If FALSE then no Timeseries-specific statistics or plots will be calculated, as they would be meaningless if the dataset is not really a time-series

GroupBy

String. If you want to get statistics per group in addition to the general ones, then mention which Column of VarDF should be used as the grouping variable

TimeFlowVar

String or Date/Numeric Vector. The variable name to be used as X-Axis on TimeFlow plots. The Variable corresponding to this string can be Date or Numeric

HideLegendInPerGroup

Boolean. If TRUE, the legend is hidden in the per-group plots. Useful when there are tens of categories in the grouping variable, as the legend then takes up too much space

BoxPlotPointSize

Numeric. How big or small you want the dots on the Boxplots to be. Usually a value between 0.1 and 1. The more rows there are, the smaller this value should be

BoxPlotPointAlpha

Numeric. How transparent you want the dots on the Boxplots to be (lower values are more transparent). Usually a value between 0.1 and 1. The more rows there are, the smaller this value should be

SampleIfNRowGT

Integer. How many rows to keep for the plots. If the dataset has more rows than this, it is subsampled for plotting. The more rows there are, the longer it takes to plot everything, and ggplot is not optimised for big data

SeedForSampling

Integer. Seed for the subsampling procedure. Since subsampling only affects the plots, the value matters little, but setting it makes the plots reproducible
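
The interplay of SampleIfNRowGT and SeedForSampling can be sketched in base R as follows. This is an illustration under assumptions, not the package's own code: the helper SubsampleForPlots and the data frame BigDF are hypothetical names, and the package may draw its sample differently.

```r
# Sketch of seeded subsampling for plotting (hypothetical helper)
SubsampleForPlots <- function(DF, SampleIfNRowGT = 10000, SeedForSampling = NULL) {
  if (nrow(DF) <= SampleIfNRowGT) {
    return(DF)  # small enough: keep every row
  }
  if (!is.null(SeedForSampling)) {
    set.seed(SeedForSampling)  # make the random draw reproducible
  }
  # Draw row indices without replacement and keep the original row order
  KeepRows <- sort(sample.int(nrow(DF), SampleIfNRowGT))
  DF[KeepRows, , drop = FALSE]
}

BigDF  <- data.frame(x = seq_len(50000), y = rnorm(50000))
PlotDF <- SubsampleForPlots(BigDF, SampleIfNRowGT = 10000, SeedForSampling = 42)
nrow(PlotDF)  # 10000
```

Calling the helper twice with the same seed returns the same rows, which is the only practical effect of setting a seed here.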

CalcPValues

Boolean. Whether or not to calculate (and show on plots) the p-values for Pearson and Spearman correlations

SignificanceLevel

Numeric. The Significance Level to be used for the p-values for Pearson and Spearman correlations
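
For orientation, correlation p-values of this kind can be obtained in base R with cor.test(). The sketch below is only an assumption about the general approach and not necessarily how the package computes them; the choice of the mtcars columns mpg and wt is arbitrary.

```r
# Two numeric variables from the built-in mtcars dataset
x <- mtcars$mpg
y <- mtcars$wt
SignificanceLevel <- 0.01

# Pearson and Spearman correlation tests
PearsonTest  <- cor.test(x, y, method = "pearson")
# exact = FALSE uses the asymptotic p-value, which avoids the warning about ties
SpearmanTest <- cor.test(x, y, method = "spearman", exact = FALSE)

# A correlation counts as statistically significant when its p-value
# is below the chosen significance level
PearsonSignificant  <- PearsonTest$p.value  < SignificanceLevel
SpearmanSignificant <- SpearmanTest$p.value < SignificanceLevel

print(c(Pearson = unname(PearsonTest$estimate), Spearman = unname(SpearmanTest$estimate)))
print(c(Pearson = PearsonSignificant, Spearman = SpearmanSignificant))
```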

CorrVarOrder

String. One of: AB - Alphabetical; hclust - Order based on Hierarchical cluster analysis; BEA - Bond Energy Algorithm to maximize the measure of effectiveness (ME); PCA - First principal component or angle on the projection on the first two principal components; TSP - Travelling salesperson solver to maximize ME

TimeseriesMaxLag

Integer. Max lag for Auto Correlation/Covariance plots.
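
To illustrate what a maximum lag means, here is a small base R sketch using stats::acf(). The simulated Series is hypothetical, and the package's own plots may be produced differently.

```r
# Simulated random-walk series (illustrative data only)
set.seed(1)
Series <- cumsum(rnorm(200))
TimeseriesMaxLag <- 20

# Auto-correlation and auto-covariance up to the chosen maximum lag
AutoCorr <- acf(Series, lag.max = TimeseriesMaxLag, plot = FALSE)
AutoCov  <- acf(Series, lag.max = TimeseriesMaxLag, type = "covariance", plot = FALSE)

# One value per lag from 0 to TimeseriesMaxLag
length(AutoCorr$acf)  # 21
```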

DatesToNowMinusDate

Boolean. If TRUE then Dates are transformed into integers reflecting how many seconds have passed between each date's time and now
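
A transformation of this kind can be sketched in base R with difftime(). The variable names are hypothetical, and the package may handle time zones or rounding differently.

```r
# Illustrative date-times (hypothetical data)
Dates <- as.POSIXct(c("2020-01-01 00:00:00", "2021-06-15 12:30:00"), tz = "UTC")

# Seconds elapsed between each date-time and the current time
SecondsSince <- as.integer(as.numeric(difftime(Sys.time(), Dates, units = "secs")))
print(SecondsSince)
```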

DatesToCyclicMonth

Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Month

DatesToCyclicDayOfWeek

Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Day-of-week

DatesToCyclicDayOfMonth

Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Day-of-month

DatesToCyclicDayOfYear

Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Day-of-year
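
The cyclic sin and cos encoding used by the DatesToCyclic* arguments can be sketched as below. This assumes the common formulation sin(2*pi*value/period) and cos(2*pi*value/period); the exact period and offset used by the package are not documented here, so treat the constants as assumptions.

```r
# Illustrative dates (hypothetical data)
Dates <- as.Date(c("2022-01-15", "2022-04-15", "2022-07-15", "2022-12-15"))

# Month as 1..12, mapped onto a circle with period 12
Month    <- as.integer(format(Dates, "%m"))
MonthSin <- sin(2 * pi * Month / 12)
MonthCos <- cos(2 * pi * Month / 12)

# Day-of-week as 1 (Monday) .. 7 (Sunday), mapped onto a circle with period 7
DayOfWeek <- as.integer(format(Dates, "%u"))
DowSin    <- sin(2 * pi * DayOfWeek / 7)
DowCos    <- cos(2 * pi * DayOfWeek / 7)

print(round(cbind(Month, MonthSin, MonthCos), 2))
```

The point of the encoding is that December and January end up close together on the circle, which a plain 1..12 integer would not express.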

DatesToYearCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Year of the date

DatesToMonthCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Month of the date

DatesToDayOfWeekCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Day-of-week of the date

DatesToDayOfMonthCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Day-of-month of the date

DatesToDayOfYearCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Day-of-Year of the date

DatesToHourCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Hour of the date

DatesToMinuteCat

Boolean. If TRUE then Dates are transformed into categorical variables containing the Minute of the date
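
The Dates-to-categorical transformations above can be pictured with base R's format() and factor(). This is a sketch under assumptions: the variable names are hypothetical and the package may label the levels differently (for example with month names instead of numbers).

```r
# Illustrative date-times (hypothetical data)
Stamps <- as.POSIXct(c("2022-03-05 14:07:00", "2021-11-20 09:45:00"), tz = "UTC")

# Extract one component per date-time and store it as a factor
YearCat       <- factor(format(Stamps, "%Y"))  # year
MonthCat      <- factor(format(Stamps, "%m"))  # month, 01..12
DayOfWeekCat  <- factor(format(Stamps, "%u"))  # 1 (Monday) .. 7 (Sunday)
DayOfMonthCat <- factor(format(Stamps, "%d"))  # 01..31
DayOfYearCat  <- factor(format(Stamps, "%j"))  # 001..366
HourCat       <- factor(format(Stamps, "%H"))  # 00..23
MinuteCat     <- factor(format(Stamps, "%M"))  # 00..59

print(levels(YearCat))
```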

ExcludeTaperedAutocor

Boolean. Only counts if IsTimeSeries==TRUE. If TRUE then the TaperedAutocorrelation and TaperedPartialAutocorrelation will not be computed

MaxTaperedRows

Integer. It is probably a good idea not to increase it, as the computation time then becomes excessive

VarsToExcludeFromTimeseries

String Array. The names of the variables to exclude from the Time-series analysis (if any)

ExludeCovariances

Boolean. If TRUE, Cross-Covariance and Auto-Covariance will not be calculated

DateBreaks

String. ggplot2 date_breaks parameter. For example: "1 month"

DateLabels

String. ggplot2 date_labels parameter.

DateTextAngle

Integer. ggplot2 theme angle for the date values in X axis
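
As an illustration of the three ggplot2 settings that DateBreaks, DateLabels and DateTextAngle refer to, the sketch below builds a small date-axis plot directly with ggplot2. The data frame DemoDF and the chosen values are hypothetical; how the package passes these arguments on internally is an assumption.

```r
library(ggplot2)

# Illustrative weekly series (hypothetical data)
DemoDF <- data.frame(
  Date  = seq(as.Date("2022-01-01"), by = "week", length.out = 26),
  Value = seq_len(26)
)

DemoPlot <- ggplot(DemoDF, aes(x = Date, y = Value)) +
  geom_line() +
  # date_breaks and date_labels correspond to DateBreaks and DateLabels
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  # the text angle corresponds to DateTextAngle
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(DemoPlot)
```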

Verbose

Numeric. If there are many columns, calculations can take a long time, so it can be useful to know when each part finishes and perhaps disable some parts

Details

DESCRIPTIVES:
- Numerical Descriptives as a Matrix of Variables' Name, Min, Q1, Mean, Median, Q3, Max, St.Dev., IQR, Observations, NAs
- Categorical Descriptives as a Matrix of Observations, NAs, Number of Unique values, and some of those values followed by their percentage of occurrence

DESCRIPTIVE PLOTS:
- Categorical Distributions as Bar charts
- Numerical Distributions as Boxplots with dots overlaid to show actual concentration; dots' colour can optionally be used to display a 2nd numerical dimension (another variable)
- Numerical Distributions also as Histograms with a Density plot overlaid and the mean value plotted as a vertical red dotted line

CORRELATIONS:
- Correlation Matrix (Pearson and Spearman) where Columns can be optionally reordered by Clustering techniques like PCA
- Correlation p-values Matrix (Pearson and Spearman) to show if the aforementioned correlation is statistically significant at a user-defined significance level
- Correlation Plots where Red colour shows strong positive correlation, all the way to Blue showing strong negative correlation, with non-statistically-significant correlations being crossed out by an 'X'

STATISTICAL INFERENCE: Plotting the variables VS the Dependent Variable
If the Dependent variable is Numerical:
- Plotting the Categorical variables VS the Dependent as Boxplots per the different levels of the Categorical Variable, optionally overlaying a 3rd dimension as coloured dots on the boxplots
- Plotting the Numerical variables VS the Dependent as a scatter plot, optionally overlaying a 3rd dimension as colour for the scatterplot's dots
If the Dependent variable is Categorical:
- Plotting the Categorical variables VS the Dependent as Stacked Bar plots, 1 bar per level of the Dependent variable and 1 colour per level of the Independent one
- Plotting the Numerical variables VS the Dependent as Boxplots per the different levels of the Dependent variable, optionally overlaying a 3rd dimension as coloured dots on the boxplots

TIMESERIES:
- Timeseries visualisation as a scatterplot with a Line chart overlaid, where X-axis is time and Y-axis is the variable's value, optionally overlaying a 3rd dimension as coloured dots on the scatterplot

PER GROUP ANALYSIS:
- Numerical Variables VS the Group's levels plotted as differently coloured Boxplots
- Histograms with Density plots overlaid, juxtaposed per Group's levels as well
- Categorical variables VS the Group's levels displayed as a Stacked Bar Chart
If the Dependent variable is Numerical:
- Categorical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being juxtaposed Boxplots of the Dependent variable's distribution for each Categorical Independent variable's level
- Numerical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being a scatterplot of the Dependent variable VS the Numerical Independent variable
If the Dependent variable is Categorical:
- Categorical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being a Stacked Bar Chart with 1 bar per level of the Dependent variable and 1 colour per level of the Independent one
- Numerical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being juxtaposed Boxplots, 1 box per level of the Dependent variable

Lastly, there is a Per-Group folder in which everything described so far is repeated recursively on the rows corresponding to each level of the Group variable, with the Group variable itself removed.
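
To make the numerical descriptives concrete, the following base R sketch computes the same set of columns for a few mtcars variables. The helper DescribeNumeric is hypothetical and is not the package's implementation; in particular, counting Observations as the non-missing values is an assumption.

```r
# Hypothetical helper: one row of numerical descriptives for a single variable
DescribeNumeric <- function(x, RoundAt = 2) {
  Q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  round(c(
    Min          = min(x, na.rm = TRUE),
    Q1           = Q[1],
    Mean         = mean(x, na.rm = TRUE),
    Median       = median(x, na.rm = TRUE),
    Q3           = Q[2],
    Max          = max(x, na.rm = TRUE),
    StDev        = sd(x, na.rm = TRUE),
    IQR          = Q[2] - Q[1],
    Observations = sum(!is.na(x)),  # assumption: non-missing values only
    NAs          = sum(is.na(x))
  ), RoundAt)
}

# One row per variable, one column per statistic
Descriptives <- t(sapply(mtcars[c("mpg", "hp", "wt")], DescribeNumeric))
print(Descriptives)
```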

Examples

# Load the packages (DescriptiveStatsR provides DescriptiveStats)
library(dplyr)
library(DescriptiveStatsR)

# Load the mtcars dataset and mark the categorical variables as factors
DS <- mtcars %>%
  mutate(vs = as.factor(vs), am = as.factor(am),
         gear = as.factor(gear), carb = as.factor(carb)) %>%
  as_tibble()

# Inspect the dataset
print(DS)

# Create the variable that holds all the Matrices and Plots
MTCarsStats <- DS %>%
  DescriptiveStats(CalculateGraphs = TRUE, DependentVar = "mpg",
                   IsTimeSeries = TRUE, GroupBy = "gear", CorrVarOrder = "PCA")

N1h1l1sT/DescriptiveStatsR documentation built on Dec. 9, 2022, 3:57 a.m.