View source: R/DescriptiveStats.R
DescriptiveStats | R Documentation |
Automatically calculates a wide variety of statistics on any dataset (tibble). When calling it, assign the function's output to a variable, e.g. MyDFStats <- DescriptiveStats(MyDF, TRUE).
DescriptiveStats(
  VarDF,
  CalculateGraphs,
  IncludeInteger = TRUE,
  RoundAt = 2,
  AbbrevStrLevelsAfterNcount = 5,
  AllHistsOn1Page = TRUE,
  AllBoxplotsOn1Page = FALSE,
  AllBarChartsOn1Page = TRUE,
  DependentVar = NULL,
  ShowGraphs = FALSE,
  BoxplotPointsColourVar = NULL,
  NoPrints = FALSE,
  IsTimeSeries = FALSE,
  GroupBy = NULL,
  TimeFlowVar = NULL,
  HideLegendInPerGroup = FALSE,
  BoxPlotPointSize = 0.4,
  BoxPlotPointAlpha = 0.1,
  SampleIfNRowGT = 10000,
  SeedForSampling = NULL,
  CalcPValues = TRUE,
  SignificanceLevel = 0.01,
  CorrVarOrder = "PCA",
  TimeseriesMaxLag = NULL,
  DatesToNowMinusDate = FALSE,
  DatesToCyclicMonth = FALSE,
  DatesToCyclicDayOfWeek = FALSE,
  DatesToCyclicDayOfMonth = FALSE,
  DatesToCyclicDayOfYear = FALSE,
  DatesToYearCat = FALSE,
  DatesToMonthCat = FALSE,
  DatesToDayOfWeekCat = FALSE,
  DatesToDayOfMonthCat = FALSE,
  DatesToDayOfYearCat = FALSE,
  DatesToHourCat = FALSE,
  DatesToMinuteCat = FALSE,
  ExcludeTaperedAutocor = FALSE,
  MaxTaperedRows = 250,
  VarsToExcludeFromTimeseries = NULL,
  ExludeCovariances = TRUE,
  DateBreaks = NULL,
  DateLabels = NULL,
  DateTextAngle = 0,
  Verbose = NULL
)
VarDF |
A Tibble (data.frame). This is the Dataset to be analysed. Ensure that Categorical variables are set as factor(), texts as character(), Integers as integer(), and Booleans as logical() |
CalculateGraphs |
Boolean. If FALSE then only Descriptives and Pearson Correlations are calculated |
IncludeInteger |
Boolean. If FALSE then no statistics will be calculated for Integer variables |
RoundAt |
Integer. Descriptive Statistics values like Min, Max, etc. will be rounded to this decimal place |
AbbrevStrLevelsAfterNcount |
Integer. String and Factor unique values (levels) will be displayed up to this number, i.e. 'Colours: Red (3%), Cyan (0.8%), ..., Black (8%)' for a value of 3 |
AllHistsOn1Page |
Boolean. Applies when ShowGraphs==TRUE. If TRUE, just 1 plot containing all histograms side by side is created, instead of a new plot per histogram |
AllBoxplotsOn1Page |
Boolean. If TRUE, all boxplots are placed inside 1 plot instead of creating 1 plot per variable. However, when the variables' value ranges differ widely, the low-range boxplots are practically invisible |
AllBarChartsOn1Page |
Boolean. Applies when ShowGraphs==TRUE. If TRUE, just 1 plot containing all bar charts side by side is created, instead of a new plot per bar chart |
DependentVar |
String. If there is a variable to be considered as Target/Dependent, then mention its name here |
ShowGraphs |
Boolean. If TRUE, Certain graphs will be displayed at the time of calculation before you can access them via the variable |
BoxplotPointsColourVar |
String or Numeric Vector. Either a string indicating the Variable name inside VarDF, or a numerical vector to be used as colour |
NoPrints |
Boolean. If TRUE, not even the Descriptive Statistics Matrices will be displayed at the time of the calculation. Everything will be accessible from the variable this function's output is assigned to |
IsTimeSeries |
Boolean. If FALSE then no Timeseries specific statistics or plots will be calculated as it would show nonsense if the dataset is not really a time-series |
GroupBy |
String. If you want to get statistics per group in addition to the general ones, then mention which Column of VarDF should be used as the grouping variable |
TimeFlowVar |
String or Date/Numeric Vector. The variable name to be used as X-Axis on TimeFlow plots. The Variable corresponding to this string can be Date or Numeric |
HideLegendInPerGroup |
Boolean. If TRUE, the legend is hidden in the per-group plots. Useful when the grouping variable has tens of categories, in which case the legend takes too much space |
BoxPlotPointSize |
Numeric. How big or small you want the dots on the Boxplots to be. Usually a value between 0.1 and 1. The more rows there are, the smaller this value should be |
BoxPlotPointAlpha |
Numeric. How transparent you want the dots on the Boxplots to be. Usually a value between 0.1 and 1. The more rows there are, the smaller this value should be |
SampleIfNRowGT |
Integer. How many rows to keep for the plots. Datasets with more rows than this are subsampled for plotting, because the more rows there are, the longer it takes to plot everything, and ggplot is not optimised for big data |
SeedForSampling |
Integer. Seed for the subsampling procedure. It has little effect on the results, as subsampling is only used for the plots |
CalcPValues |
Boolean. Whether or not to calculate (and show on plots) the p-values for Pearson and Spearman correlations |
SignificanceLevel |
Numeric. The Significance Level to be used for the p-values for Pearson and Spearman correlations |
CorrVarOrder |
String. How the variables are ordered in the correlation matrix. One of: AB - Alphabetical; hclust - Order based on Hierarchical cluster analysis; BEA - Bond Energy Algorithm to maximize the measure of effectiveness (ME); PCA - First principal component or angle on the projection on the first two principal components; TSP - Travelling salesperson solver to maximize ME |
TimeseriesMaxLag |
Integer. Max lag for Auto Correlation/Covariance plots. |
DatesToNowMinusDate |
Boolean. If TRUE then Dates are transformed into integers reflecting how many seconds have passed since the time on the date |
DatesToCyclicMonth |
Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Month |
DatesToCyclicDayOfWeek |
Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Day-of-week |
DatesToCyclicDayOfMonth |
Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Day-of-month |
DatesToCyclicDayOfYear |
Boolean. If TRUE then Dates are transformed into numeric variables containing the cyclic sin and cos of the Day-of-year |
DatesToYearCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Year of the date |
DatesToMonthCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Month of the date |
DatesToDayOfWeekCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Day-of-week of the date |
DatesToDayOfMonthCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Day-of-month of the date |
DatesToDayOfYearCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Day-of-Year of the date |
DatesToHourCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Hour of the date |
DatesToMinuteCat |
Boolean. If TRUE then Dates are transformed into categorical variables containing the Minute of the date |
ExcludeTaperedAutocor |
Boolean. Only counts if IsTimeSeries==TRUE. If TRUE then the TaperedAutocorrelation and TaperedPartialAutocorrelation will not be computed |
MaxTaperedRows |
Integer. Increasing it is probably not a good idea, as the computation time then becomes excessive |
VarsToExcludeFromTimeseries |
String Array. The names of the variables which we don't want to include in Time-series analysis (if any) |
ExludeCovariances |
Boolean. If TRUE, Cross-Covariance and Auto-Covariance will not be calculated |
DateBreaks |
String. ggplot2 date_breaks parameter. For example: "1 month" |
DateLabels |
String. ggplot2 date_labels parameter. |
DateTextAngle |
Integer. ggplot2 theme angle for the date values in X axis |
Verbose |
Numeric. If there are many columns, calculations can take a long time, so it can be useful to know when each part finishes and perhaps to disable some parts |
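The date transformations listed in the arguments above (the DatesToCyclic... and DatesTo...Cat options, and DatesToNowMinusDate) can be illustrated with a minimal base R sketch. This is not the package's implementation; the helper cyclic_encode and all variable names are hypothetical and only show the general idea. The sin/cos pair places a cyclic value on the unit circle, so that December and January end up as close to each other as January and February.

```r
# Hypothetical helper (not part of the package): encodes a cyclic value,
# e.g. a month in 1..12, as a (sin, cos) point on the unit circle.
cyclic_encode <- function(value, period) {
  angle <- 2 * pi * value / period
  data.frame(sin = sin(angle), cos = cos(angle))
}

dates <- as.Date(c("2021-01-15", "2021-02-15", "2021-07-15", "2021-12-31"))

# Month as a number, then as a sin/cos pair (cf. DatesToCyclicMonth)
month_num <- as.integer(format(dates, "%m"))
month_cyclic <- cyclic_encode(month_num, period = 12)

# Month and year as categorical variables (cf. DatesToMonthCat, DatesToYearCat)
month_cat <- factor(month_num, levels = 1:12)
year_cat <- factor(format(dates, "%Y"))

# Seconds elapsed between each date and now (cf. DatesToNowMinusDate)
seconds_since <- as.numeric(difftime(Sys.time(), as.POSIXct(dates), units = "secs"))

# Euclidean distance between two rows of the cyclic encoding
cyclic_dist <- function(i, j) {
  sqrt(sum((month_cyclic[i, ] - month_cyclic[j, ])^2))
}

print(data.frame(date = dates, round(month_cyclic, 3)))
```

With this encoding, cyclic_dist(4, 1) (December to January) equals cyclic_dist(1, 2) (January to February), which a plain integer month would not give.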
DESCRIPTIVES:
- Numerical Descriptives as a Matrix of Variables' Name, Min, Q1, Mean, Median, Q3, Max, St.Dev., IQR, Observations, NAs
- Categorical Descriptives as a Matrix of Observations, NAs, Number of Unique values, and some of those values followed by their percentage of occurrence

DESCRIPTIVE PLOTS:
- Categorical Distributions as Bar charts
- Numerical Distributions as Boxplots with dots overlaid to show actual concentration; the dots' colour can optionally be used to display a 2nd numerical dimension (another variable)
- Numerical Distributions also as Histograms with a Density plot overlaid and the mean value plotted as a vertical red dotted line

CORRELATIONS:
- Correlation Matrix (Pearson and Spearman) where Columns can optionally be reordered by Clustering techniques like PCA
- Correlation p-values Matrix (Pearson and Spearman) to show whether the aforementioned correlation is statistically significant at a user-defined significance level
- Correlation Plots where Red shows strong positive correlation, all the way to Blue showing strong negative correlation, with non-statistically-significant correlations being crossed out by an 'X'

STATISTICAL INFERENCE:
Plotting the Independent variables VS the Dependent Variable

If the Dependent variable is Numerical
- Plotting the Categorical variables VS the Dependent as Boxplots per the different levels of the Categorical Variable, optionally overlaying a 3rd dimension as coloured dots on the boxplots
- Plotting the Numerical variables VS the Dependent as a scatter plot, optionally overlaying a 3rd dimension as colour for the scatterplot's dots

If the Dependent variable is Categorical
- Plotting the Categorical variables VS the Dependent as Stacked Bar plots, 1 bar per level of the Dependent variable and 1 colour per level of the Independent one
- Plotting the Numerical variables VS the Dependent as Boxplots per the different levels of the Dependent variable, optionally overlaying a 3rd dimension as coloured dots on the boxplots
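The correlation matrices and p-value matrices listed under CORRELATIONS can be approximated with base R alone. The following is a minimal sketch under stated assumptions, not the package's own code: the helper cor_pvalues and the chosen columns are hypothetical, and the significance level simply mirrors the default SignificanceLevel = 0.01.

```r
# A few numeric columns of the built-in mtcars dataset
num_df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
significance_level <- 0.01  # mirrors the default SignificanceLevel

# Hypothetical helper (not part of the package): matrix of correlation-test
# p-values for every pair of columns; the diagonal is left as NA.
cor_pvalues <- function(df, method) {
  vars <- names(df)
  p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i != j) {
        # exact = FALSE avoids the warning Spearman raises when ties are present
        p[i, j] <- cor.test(df[[i]], df[[j]], method = method, exact = FALSE)$p.value
      }
    }
  }
  p
}

pearson_r  <- cor(num_df, method = "pearson")
spearman_r <- cor(num_df, method = "spearman")
pearson_p  <- cor_pvalues(num_df, "pearson")
spearman_p <- cor_pvalues(num_df, "spearman")

# TRUE where a correlation is NOT significant at the chosen level,
# i.e. the cells a correlation plot would cross out
pearson_crossed_out <- pearson_p >= significance_level

print(round(pearson_r, 2))
print(pearson_crossed_out)
```

Computing the p-values pair by pair keeps the sketch dependency-free; a dedicated package would usually do this in one call.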

TIMESERIES:
- Timeseries visualisation as a scatterplot with a Line chart overlaid, where the X-axis is time and the Y-axis is the variable's value, optionally overlaying a 3rd dimension as coloured dots on the scatterplot

PER GROUP ANALYSIS:
- Numerical Variables VS the Group's levels plotted as differently coloured Boxplots
- Histograms with Density plots overlaid, juxtaposed per Group's levels as well
- Categorical variables VS the Group's levels displayed as a Stacked Bar Chart

If the Dependent variable is Numerical
- Categorical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being juxtaposed Boxplots of the Dependent variable's distribution for each Categorical Independent variable's level
- Numerical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being a scatterplot of the Dependent variable VS the Numerical Independent variable

If the Dependent variable is Categorical
- Categorical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being a Stacked Barchart with 1 bar per level of the Dependent variable and 1 colour per level of the Independent one
- Numerical VS Dependent Per Group displayed as 1 plot per Group's level, with each plot being juxtaposed Boxplots, 1 box per level of the Dependent variable

Lastly, there is a Per-Group folder where everything described so far is done again, recursively, for the rows corresponding to each level of the Group, with the Group variable removed.
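The per-group recursion described above (repeat the analysis on the rows of each group level, with the Group variable removed) can be sketched in base R as follows. This is only an illustration of the idea; the helper describe_numeric and the structure of per_group are hypothetical and do not reflect how the package stores its output.

```r
# Hypothetical helper (not part of the package): a few numeric descriptives
# per numeric column, returned as a matrix with one row per variable.
describe_numeric <- function(df) {
  num_cols <- df[vapply(df, is.numeric, logical(1))]
  t(vapply(num_cols, function(x) {
    c(Min = min(x, na.rm = TRUE),
      Mean = mean(x, na.rm = TRUE),
      Median = median(x, na.rm = TRUE),
      Max = max(x, na.rm = TRUE),
      Observations = sum(!is.na(x)),
      NAs = sum(is.na(x)))
  }, numeric(6)))
}

DS <- mtcars
DS$gear <- as.factor(DS$gear)
group_var <- "gear"  # plays the role of the GroupBy argument

# Split the rows by group level, drop the group column, describe each subset
per_group <- lapply(split(DS, DS[[group_var]]), function(sub_df) {
  sub_df[[group_var]] <- NULL
  describe_numeric(sub_df)
})

print(names(per_group))
print(round(per_group[["4"]], 2))
```

Each element of per_group holds the descriptives for one level of the grouping variable, which is the same shape of result as the overall descriptives but restricted to that level's rows.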

#Loading the famous mtcars dataset
library(dplyr)
DS <- mtcars %>%
  mutate(vs = as.factor(vs), am = as.factor(am), gear = as.factor(gear), carb = as.factor(carb)) %>%
  as_tibble()

#Seeing and understanding the Dataset
print(DS)

#Creating the Variable which holds all the Matrices and Plots
MTCarsStats <- DS %>%
  DescriptiveStats(CalculateGraphs = TRUE, DependentVar = "mpg", IsTimeSeries = TRUE, GroupBy = "gear", CorrVarOrder = "PCA")