knitr::opts_chunk$set(
  collapse = TRUE
  , eval = T
  , comment = "#>"
  , out.width = '100%'
  , message = F
  , warning = F
  , echo = T
)
suppressPackageStartupMessages( require(oetteR) )
suppressPackageStartupMessages( require(tidyverse) )

Introduction Group Analysis

When given a dataset, we often a have predifined groups of observations and several variables that we want to explore for relevant variables that describe the differences between those groups. Traditionally we would probably calculate the means of each variable and then sort by difference of the means in percent. Then we would generate histograms of the most promising candidates, check their distributions and then chose an appropriate statistical method to determine whether the difference in mean is of statistical significance. We would also have to use a different methods for numerical and categorical variables. My goal here is to present a workflow that automates all these steps.

f_stat_group_ana

is a function that simply takes a list data_ls generated by f_clean_data() and col_group a character vector that denotes the name of the grouping variable.

It then generates a full html file with interactive histograms of all relevant variables and interactive tables containing summary statistics of all variables. As well as alluvial plots, tabplots and static plots with statistical annotations.

In order to detect statistical relevant variables it uses kruskal and regular anova and performs a shapiro test for normality. Depending on the shapio result it selects either the kruskal or the regular anova p value.

It sorts the relevant variables by percent difference for numerical and percent difference in frequency for categorical variables.

It uses in total a number of functions that are quite usefull on its own, including a decent histogram function f_plot_hist, which we are going to walk through step-bystep.

Sample Data

We are going to use the ISLR::Caravan data set which consists of 86 different variables. We are going to investigate which variables influnce the purchase of a caravan by grouping on variable Purchase.

data_ls = f_clean_data(ISLR::Caravan)

col_group = 'Purchase'

Calculate P values and percent difference

df_anova = f_stat_anova( data_ls, col_group )

df_chi = f_stat_chi_square( data_ls, col_group)

df_anova

df_chi

Combine the results of anova and chisquare

f_stat_combine_anova_with_chi_square() keeps either the regular or the kruskal anova p value depending on the results of the shapiro test (shapiro_stat > 0.9 p_val > 0.05). Then combines the df_anova and the df_chi dataframes into one.

df_stat = f_stat_combine_anova_with_chi_square( df_anova, df_chi ) %>%
  arrange( desc(diff_perc) )

df_stat

Plot histograms

f_plot_hist

this function provides a standard interface for histograms with sensible defaults. It automatically detects whether a variable is numeric or categorical and adds statistical information on its own. See the following examples.

The statistical annotation is added using the ggpubr package. ggpubr::stat_compare_means() can be used to add significance annotation to most plots. Although we have to predefine the values that should be compared. For this I wrote f_plot_generate_comparison_pairs() which automatically looks for combinations that are significantly different using a non-parametric wilcox test.

data_ls2 = f_clean_data(mtcars)
f_plot_hist('disp', data_ls2)
f_plot_hist('disp', data_ls2, add = 'median')
f_plot_hist('disp', data_ls2, add = 'none')
f_plot_hist('disp', data_ls2, y_axis = 'density')
f_plot_hist('cyl', data_ls2 , group = 'gear' )
f_plot_hist('cyl', data_ls2 , group = 'gear', y_axis = 'density' )
f_plot_hist('cyl', data_ls2, y_axis = 'density' )
f_plot_hist('cyl', data_ls2, y_axis = 'count' )
f_plot_hist('disp', data_ls2, graph_type = 'line', group = 'cyl')
f_plot_hist('disp', data_ls2, graph_type = 'bar', group = 'cyl')
f_plot_hist('disp', data_ls2, graph_type = 'violin', group = 'cyl'
             , caption ='caption', title = 'title', subtitle = 'subtitle')

Generating histograms using a pipe

df_stat %>%
  filter( diff_perc >= 30, p_value <= 0.001 ) %>%
  .$variable %>%
  map( f_plot_hist, data_ls, col_group, y_axis = 'density' )

Tables with means/medians and frequencies

f_stat_group_mean_medians() and f_stat_group_counts_percentages() create meaningful tables with minimal input. They actually are recalculating the statistics again. Which is compuationally a bit wasteful but should not take to much extra time. f_datatable_universal creates a nicely formatted interactive table. That can be easily be downloaded as an excel file.

f_stat_group_mean_medians( data_ls, col_group ) %>%
  f_datatable_universal()

f_stat_group_counts_percentages( data_ls, col_group ) %>%
  f_datatable_universal()

Bringing it all together

We can apply all these steps automatically by simply using f_stat_group_ana(). We have the following (and more options) when rendering.

-return_taglist (Instead of rednering the html file we can get a taglist in return which we can integrate into any Rmd document) - alluvial (adds an additional html file with an alluvial plot) - tabplot( adds an additional html file with tabplots) - static_plots( adds an additional html file with static versions of the histograms, all histograms loose the statistical annotations when converted to interactive plotly objects )

The reason why we generate seperate html files for each plot type is that I did not yet manage to write a generic function that is able to print a list of heterogeneous objects inside a Rmd file. So far I am using f_plot_obj_2_html() to generate html files from homogeneous lists.

For demsonstration purposes we will generate 4 html files with rather high p value and percent difference thresholds. We will integrate them as screenshots.

f_stat_group_ana( data_ls, col_group
                  , thresh_p_val = 0.0001, thresh_diff_perc = 30
                  # it ususally makes sense to increase limit the amount of variables in an alluvial plot
                  , alluvial_thresh_p_val = 0.0001, alluvial_thresh_diff_perc = 30
                  , alluvial = T
                  , static_plots = T
                  , tabplot = T 
                  , output_file = 'f_stat_group_ana'
                  , quiet = T )

Screenshot Main File

webshot::webshot( 'f_stat_group_ana.html' )

file.remove('f_stat_group_ana.html')

Screenshot Alluvial File

webshot::webshot( 'f_stat_group_ana_alluvial.html' )

file.remove('f_stat_group_ana_alluvial.html')

Screenshot Tablplot File

webshot::webshot( 'f_stat_group_ana_tabplots.html' )

file.remove('f_stat_group_ana_tabplots.html')

Screenshot Static Plots File

webshot::webshot( 'f_stat_group_ana_stat_plots.html' )

file.remove('f_stat_group_ana_stat_plots.html')

file.remove('webshot.png')


erblast/oetteR documentation built on May 27, 2019, 12:11 p.m.