R/visstat_core.R
In visStatistics: Automated Selection and Visualisation of Statistical Hypothesis Tests

Documented in visstat_core

# MIT License----
# Copyright (c) 2020 Sabine Schilling
# Feedback highly welcome: sabineschilling@gmx.ch

# Header visstat_core -----
#'
#' Automated Visualization of Statistical Hypothesis Testing
#'
#' @description \code{visstat_core()} provides automated selection and visualization 
#' of a statistical hypothesis test between a two vectors in
#' a given \code{data.frame} named \code{dataframe} based on the data's type, 
#' distribution, sample size, and the
#' specified \code{conf.level}.
#' \code{varsample} and \code{varfactor} are \code{character}
#' strings corresponding to the column names of the chosen vectors in \code{dataframe}. 
#' These vectors must be of type \code{integer}, \code{numeric} or \code{factor}.
#' The automatically generated output figures
#' illustrate the selected statistical hypothesis test, display the main test
#' statistics, and include assumption checks and post hoc comparisons when
#' applicable. The primary test results are returned as a list object.
#'
#' @details The decision logic for selecting a statistical test is described below.
#' For more details, please refer to the package's \code{vignette("visstat_coreistics")}.
#' Throughout, data of class \code{numeric} or \code{integer} are referred to as
#' numeric, while data of class \code{factor} are referred to as categorical.
#' The significance level \code{alpha} is defined as one minus the confidence
#' level, given by the argument \code{conf.level}. Assumptions of normality and
#' homoscedasticity are considered met when the corresponding test yields a
#' p-value greater than \code{alpha = 1 - conf.level}.
#' The choice of statistical tests performed by \code{visstat_core()} depends on
#' whether the data are numeric or categorical, the number of levels in the
#' categorical variable, the distribution of the data, and the chosen
#' \code{conf.level}. The function prioritises interpretable visual output and
#' tests that remain valid under their assumptions, following the logic below:
#' 
#' (1) When the response is numerical and the predictor is categorical, tests of
#' central tendency are performed. If the predictor has two levels:
#' \code{t.test()} is used if both groups have more than 30 observations (Lumley
#' et al. (2002) <doi:10.1146/annurev.publhealth.23.100901.140546>). For smaller
#' samples, normality is assessed using \code{shapiro.test()}. If both groups
#' return p-values greater than \code{alpha}, \code{t.test()} is applied;
#' otherwise, \code{wilcox.test()} is used.
#' For predictors with more than two levels, \code{aov()} is initially fitted.
#' Residual normality is tested with \code{shapiro.test()} and \code{ad.test()}.
#' If \code{p > alpha} for either test, normality is assumed. Homogeneity of
#' variance is tested with \code{bartlett.test()}. If \code{p > alpha},
#' \code{aov()} with \code{TukeyHSD()} is used. If \code{p <= alpha},
#' \code{oneway.test()} is applied with \code{TukeyHSD()}. If residuals are not
#' normal, \code{kruskal.test()} with \code{pairwise.wilcox.test()} is used.
#' 
#' (2): When both the response and predictor are numerical, a linear model
#' \code{lm()} is fitted, with residual diagnostics and a confidence band plot.
#' 
#' (3): When both variables are categorical, \code{visstat_core()} uses
#' \code{chisq.test()} or \code{fisher.test()} depending on expected counts,
#' following Cochran's rule (Cochran (1954) <doi:10.2307/3001666>).
#' 
#' Implemented main tests: 
#' 
#' \code{t.test()}, \code{wilcox.test()}, \code{aov()},
#' \code{oneway.test()}, \code{lm()}, \code{kruskal.test()},
#' \code{fisher.test()}, \code{chisq.test()}.
#' 
#' Implemented tests for assumptions:
#' \itemize{
#'   \item Normality: \code{shapiro.test()} and \code{ad.test()}
#'   \item Heteroscedasticity: \code{bartlett.test()}
#' }
#' 
#' Implemented post hoc tests:
#' \itemize{
#'   \item \code{TukeyHSD()} for \code{aov()} and \code{oneway.test()}
#'   \item \code{pairwise.wilcox.test()} for \code{kruskal.test()}
#' }
#' @seealso
#' See also the package's vignette
#' \code{vignette("visStatistics")} for the overview,
#' and the accompanying webpage
#' \url{https://shhschilling.github.io/visStatistics/}.
#'
#' @param dataframe \code{data.frame} with at least two columns.
#' @param varsample \code{character} string matching a column name in 
#'   \code{dataframe}. Interpreted as the response if the referenced column is 
#'   of class \code{numeric} or \code{integer} and the column named by 
#'   \code{varfactor} is of class \code{factor}.
#' @param varfactor \code{character} string matching a column name in 
#'   \code{dataframe}. Interpreted as the grouping variable if the referenced 
#'   column is of class \code{factor} and the column named by \code{varsample} 
#'   is of class \code{numeric} or \code{integer}.
#' @param conf.level Confidence level
#' @param numbers a logical indicating whether to show numbers in mosaic count
#'   plots.
#' @param minpercent number between 0 and 1 indicating minimal fraction of total
#'   count data of a category to be displayed	in mosaic count plots.
#' @param graphicsoutput saves plot(s) of type "png",  "jpg", "tiff" or  "bmp"
#'   in directory specified in \code{plotDirectory}. If graphicsoutput=NULL, no
#'   plots are saved.
#' @param plotName graphical output is stored following the naming convention
#'   "plotName.graphicsoutput" in \code{plotDirectory}. Without specifying this
#'   parameter, plotName is automatically generated following the convention
#'   "statisticalTestName_varsample_varfactor".
#' @param plotDirectory specifies directory, where generated plots are stored.
#'   Default is current working directory.
#' @return \code{list} containing statistics of automatically selected test
#'   meeting assumptions. All values are returned as invisible copies.
#'   Values can be accessed by assigning a return value to \code{visstat_core}.

#' @examples
#' # Welch Two Sample t-test (t.test())
#' visstat_core(mtcars, "mpg", "am")
#'
#' ## Wilcoxon rank sum test (wilcox.test())
#' grades_gender <- data.frame(
#'   Sex = as.factor(c(rep("Girl", 20), rep("Boy", 20))),
#'   Grade = c(
#'     19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
#'     20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.3,
#'     16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
#'     7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6
#'   )
#' )
#' visstat_core(grades_gender, "Grade", "Sex")
#'
#' ## Welch's oneway ANOVA not assuming equal variances (oneway.test())
#' anova_npk <- visstat_core(npk, "yield", "block")
#' anova_npk # prints summary of tests
#'
#' ## Kruskal-Wallis rank sum test (kruskal.test())
#' visstat_core(iris, "Petal.Width", "Species")
#' visstat_core(InsectSprays, "count", "spray")
#'
#' ## Linear regression  (lm())
#' visstat_core(trees, "Girth", "Height", conf.level = 0.99)
#'
#' ## Pearson's Chi-squared test (chisq.test())
#' ### Transform array to data.frame
#' HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))
#' visstat_core(HairEyeColorDataFrame, "Hair", "Eye")
#'
#' ## Fisher's exact test (fisher.test())
#' HairEyeColorMaleFisher <- HairEyeColor[, , 1]
#' ### slicing out a 2 x2 contingency table
#' blackBrownHazelGreen <- HairEyeColorMaleFisher[1:2, 3:4]
#' blackBrownHazelGreen <- counts_to_cases(as.data.frame(blackBrownHazelGreen))
#' fisher_stats <- visstat_core(blackBrownHazelGreen, "Hair", "Eye")
#' fisher_stats # print out summary statistics
#'
#' ## Saving the graphical output in directory "plotDirectory"
#' ## A) Saving graphical output of type "png" in temporary directory tempdir()
#' ##    with default naming convention:
#' visstat_core(blackBrownHazelGreen, "Hair", "Eye",
#'   graphicsoutput = "png",
#'   plotDirectory = tempdir()
#' )
#'
#' ## Remove graphical output from plotDirectory
#' file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
#' file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))
#'
#' ## B) Specifying pdf as output type:
#' visstat_core(iris, "Petal.Width", "Species",
#'   graphicsoutput = "pdf",
#'   plotDirectory = tempdir()
#' )
#'
#' ## Remove graphical output from plotDirectory
#' file.remove(file.path(tempdir(), "kruskal_Petal_Width_Species.pdf"))
#'
#' ## C) Specifying "plotName" overwrites default naming convention
#' visstat_core(iris, "Petal.Width", "Species",
#'   graphicsoutput = "pdf",
#'   plotName = "kruskal_iris", plotDirectory = tempdir()
#' )
#' ## Remove graphical output from plotDirectory
#' file.remove(file.path(tempdir(), "kruskal_iris.pdf"))
#'
#' @import vcd
#' @import Cairo
#' @import graphics
#' @import grDevices
#' @import grid
#' @import multcompView
#' @import stats
#' @import utils
#' @importFrom nortest ad.test
#' @export visstat_core


visstat_core <- function(dataframe,
                    varsample,
                    varfactor,
                    conf.level = 0.95,
                    numbers = TRUE,
                    minpercent = 0.05,
                    graphicsoutput = NULL,
                    plotName = NULL,
                    plotDirectory = getwd()) {
  stopifnot(is.data.frame(dataframe))
  stopifnot(varsample %in% names(dataframe))
  stopifnot(varfactor %in% names(dataframe))
  
  
  
  # store default graphical parameters------
  oldparvisstat_core <- par(no.readonly = TRUE)
  oldparvisstat_core$new <- FALSE # reset the default value
  on.exit(par(oldparvisstat_core))
  
  # Collect plot paths from plot_paths <- c(plot_paths, saveGraphVisstat())
  plot_paths <- character(0)
   
  # Set default values---------------------------
  alpha <- 1 - conf.level
  
  ## Get input variables---------------------------------
  input <-
    get_samples_fact_inputfile(dataframe, varsample, varfactor)
  # out of function get_groups_inputfile
  samples <- input$samples
  fact <- input$fact
  name_of_sample <- input$name_of_sample
  name_of_factor <- input$name_of_factor
  matchingCriteria <- input$matchingCriteria
  
  # dependent on samples, fact, name_of_sample, name_of_factor, conf.level,
  
  typesample <- class(samples)
  typefactor <-
    class(fact) # type of independent variable returned as a character vector
  
  
  # transform independent variable "fact" of class "character" to factor
  if (typefactor == "character") {
    fact <-
      as.factor(fact) # transform  "fact" of class "character" to factor
    typefactor <- class(fact) # store the class of type "factor"
  }
  
  #  Check order 
  
  if ((typefactor == "numeric" || typefactor == "integer") && typesample == "factor") {
    stop("A numeric or integer predictor with a factor response is ignored.")
  }
  
  maxlabels <- length(levels(samples))
  ## Comparison of all  possible combinations of input variables --------------
  ## A) median or mean-----
  # requirement: only two levels of factors
  # if the chosen "sample" is numeric or integer, we can perform parametric tests
  # like the t-test (if the assumption of normal distribution is met )
  # otherwise Wilcoxon test
  
  if ((
    # Wilcoxon or t-test -----
    typesample == "integer" | typesample == "numeric"
  ) &&
  (typefactor == "factor") && nlevels(fact) == 2) {
    # check if there is at least one entry in each group, if not return empty
    twosamples <-
      create_two_samples_vector(samples, fact) # returns list with three entries
    if (length(twosamples) < 3) {
      vis_sample_fact <-
        warning("In each group must be at least one member.")
    } else {
      # t-Test -----
      x1 <- twosamples$sample1
      x2 <- twosamples$sample2
      # The two-sample t-test is robust to non-normality due to
      # the central limit theorem.->
      # Checking for normality of samples not necessary,if sample size roughly >30
      # citation: THE IMPORTANCE OF THE NORMALITY ASSUMPTION IN LARGE PUBLIC
      # HEALTH DATA SETS
      # DOI: 10.1146/annurev.publhealth.23.100901.140546
      #
      # Check normality of both samples with Shapiro -Test-----
      # Check assumptions of Shapiro-Test:length between 3 and 5000,
      # at least one level returns TRUE if size between 3 and 50000
      #
      # There are two different ways to justify the use of the t-test:
      # 1.Data is normally distributed and you have at least two samples per group
      # 2. You have large (N>30)sample sizes in each group.
      
      shapiro_assumptions1 <- check_assumptions_shapiro(x1)
      shapiro_assumptions2 <- check_assumptions_shapiro(x2)
      
      if (shapiro_assumptions1 == TRUE) {
        p1 <- test_norm(twosamples$sample1)
      }
      
      if (shapiro_assumptions2 == TRUE) {
        p2 <- test_norm(twosamples$sample2)
      }
      # Check if normal distributions are given in both samples by Shapiro --
      # Assume normal distributions if the p-value is greater alpha
      # Perform always t-test if both samples are >30
      
      if (length(twosamples$sample1) > 30 &
          length(twosamples$sample2) > 30) {
        openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
        vis_sample_fact <- two_sample_t_test(
          samples,
          fact,
          alternative = c("two.sided"),
          paired = FALSE,
          var.equal = FALSE,
          conf.level = conf.level,
          samplename = name_of_sample,
          factorname = name_of_factor
        )
        
        if (is.null(plotName)) {
          filename <-
            paste("ttest_", name_of_sample, "_", name_of_factor, sep = "")
        } else {
          filename <- plotName
        }
        plot_paths <- c(plot_paths, saveGraphVisstat(filename, type = graphicsoutput, fileDirectory = plotDirectory))
      }
      # 2. If assumptions of t-test are not met: Wilcoxon, else t-test
      else if (!exists("p1") |
               (if (exists("p1")) {
                 p1$p.value < alpha
               } else {
                 FALSE
               }) |
               !exists("p2") |
               (if (exists("p2")) {
                 (p2$p.value < alpha)
               } else {
                 FALSE
               })) {
        # case 1: Wilcoxon-Test:
        # normal distribution not given for n<limit
        openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
        
        vis_sample_fact <- two_sample_wilcoxon_test(
          samples,
          fact,
          alternative = "two.sided",
          conf.level = conf.level,
          notchf = F,
          samplename = varsample,
          # factorname = matchingCriteria,
          factorname = varfactor,
          cex = 1
        )
        
        
        if (is.null(plotName)) {
          filename <-
            paste("wilcoxon-test_",
                  name_of_sample,
                  "_",
                  name_of_factor,
                  sep = "")
        } else {
          filename <- plotName
        }
        
        
        plot_paths <- c(plot_paths, saveGraphVisstat(fileName = filename,
                         type = graphicsoutput,
                         fileDirectory = plotDirectory))
      } else {
        openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
        
        vis_sample_fact <- two_sample_t_test(
          samples,
          fact,
          alternative = "two.sided",
          paired = F,
          var.equal = F,
          conf.level = conf.level,
          samplename = varsample,
          factorname = varfactor
        )
        
        if (is.null(plotName)) {
          filename <-
            paste("ttest_", name_of_sample, "_", name_of_factor, sep = "")
        } else {
          filename <- plotName
        }
        
        plot_paths <- c(plot_paths, saveGraphVisstat(fileName = filename,
                         type = graphicsoutput,
                         fileDirectory = plotDirectory))
      }
      
      # attr(vis_sample_fact, "plot_paths") <- plot_paths
      # class(vis_sample_fact) <- "visstat"
      # return(invisible(vis_sample_fact))
    }
  }
  
  
  
  
  
  ## B) Chi2 and Mosaic-----
  
  if (typefactor == "factor" && typesample == "factor") {
    if (check_assumptions_count_data(samples, fact) == FALSE) {
      vis_sample_fact <-
        makeTable(samples, fact, name_of_sample, name_of_factor)
    } else {
      # Chi^2 Test-----
      openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
      
      vis_chi <-
        vis_chi_squared_test(samples, fact, name_of_sample, "groups")
      if (is.null(plotName)) {
        filename <- paste("chi_squared_or_fisher_",
                          name_of_sample,
                          "_",
                          name_of_factor,
                          sep = "")
      } else {
        filename <- paste(plotName, "_", "chi_squared_or_fisher", sep = "")
      }
      
      plot_paths <- c(plot_paths, saveGraphVisstat(fileName = filename,
                       type = graphicsoutput,
                       fileDirectory = plotDirectory))
      # Mosaic plots -----
      # a) complete labeled mosaic graph
      
      if (maxlabels > 7) {
        numberflag <- F
      } else {
        numberflag <- T
      }
      
      openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
      
      vis_mosaic_res <- vis_mosaic(
        samples,
        fact,
        name_of_sample = name_of_sample,
        name_of_factor = name_of_factor,
        minperc = 0,
        numbers = numberflag
      )
      
      
      
      if (is.null(plotName)) {
        filename <- paste("mosaic_complete_",
                          name_of_sample,
                          "_",
                          name_of_factor,
                          sep = "")
      } else {
        filename <- paste(plotName, "_", "mosaic_complete", sep = "")
      }
      
      
      
      plot_paths <- c(plot_paths, saveGraphVisstat(filename, type = graphicsoutput, fileDirectory = plotDirectory))
      
      # b) reduced plots if number of of levels>7
      # Display only categories with at least minpercent of entries
      
      if (maxlabels > 7) {
        openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
        
        vis_mosaic_res <- vis_mosaic(
          samples,
          fact,
          name_of_sample = name_of_sample,
          name_of_factor = "groups",
          minperc = minpercent,
          numbers = T
        )
        plot_paths <- c(plot_paths, saveGraphVisstat(
          paste(
            "mosaic_reduced_",
            name_of_sample,
            "_",
            name_of_factor,
            sep = ""
          )),
          type = graphicsoutput,
          fileDirectory = plotDirectory
        )
      }
      
      vis_sample_fact <- c(vis_chi, vis_mosaic_res)
    }
  }
  # C) both types numerical: Regression-----
  
  # Both samples and fact of type integer or numeric
  # Regression
  #
  #
  if ((typefactor == "integer" |
       typefactor == "numeric") &&
      (typesample == "integer" | typesample == "numeric")) {
    # samples: independent variable, factor: dependent   variable
    # check normality
    normality_residual_assumptioon <-
      vis_normality_assumptions(samples, fact, conf.level = conf.level)
    
    openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
    
    vis_sample_fact <- vis_regression(
      samples,
      # y: dependent
      fact,
      # x: independent
      name_of_factor = name_of_factor,
      name_of_sample = name_of_sample,
      conf.level = conf.level
    )
    if (is.null(plotName)) {
      filename <-
        paste("regression_", name_of_sample, "_", name_of_factor, sep = "")
    } else {
      filename <- paste(plotName)
    }
    
    
    
    
    plot_paths <- c(plot_paths, saveGraphVisstat(fileName = filename,
                     type = graphicsoutput,
                     fileDirectory = plotDirectory))
  }
  
  # D) more than two comparisons-----
  # A) sample is numeric or integer: ANOVA or Kruskal/Wallis
  
  # excellent tutorial
  # https://www.scribbr.com/statistics/anova-in-r/
  
  
  if (typefactor == "factor" &&
      (typesample == "integer" | typesample == "numeric") &&
      nlevels(fact) > 2) {
    visanova <- vis_anova_assumptions(
      samples,
      fact,
      conf.level = conf.level,
      samplename = varsample,
      factorname = varfactor
    )
    
    
    if (visanova$shapiro_test$p.value > alpha |
        visanova$ad_test$p.value > alpha) {
      openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
      
      vis_sample_fact <- vis_anova(
        samples,
        fact,
        samplename = varsample,
        factorname = varfactor,
        conf.level = conf.level
      )
      
      
      if (is.null(plotName)) {
        filename <-
          paste("anova_", name_of_sample, "_", name_of_factor, sep = "")
      } else {
        filename <- paste(plotName)
      }
      
      
      plot_paths <- c(plot_paths, saveGraphVisstat(fileName = filename,
                       type = graphicsoutput,
                       fileDirectory = plotDirectory))
      
      
      
      
      # if p -values of both Shapiro-Wilk and Kruskall-Wallis-Test are smaller than 0.05, Kruskall-Wallis-Test
    } else {
      openGraphCairo(type = graphicsoutput, fileDirectory = plotDirectory)
      
      vis_sample_fact <- vis_Kruskal_Wallis_clusters(
        samples,
        fact,
        conf.level = conf.level,
        samplename = varsample,
        factorname = varfactor,
        cex = 1,
        notch = F
      )
      
      if (is.null(plotName)) {
        filename <-
          paste("kruskal_", name_of_sample, "_", name_of_factor, sep = "")
      } else {
        filename <- paste(plotName)
      }
      
      
      plot_paths <- c(plot_paths, saveGraphVisstat(fileName = filename,
                       type = graphicsoutput,
                       fileDirectory = plotDirectory))
    }
  }
  
  
  
  # At the very end:
  attr(vis_sample_fact, "plot_paths") <- plot_paths
  class(vis_sample_fact) <- "visstat"
  
  return(invisible(vis_sample_fact))
  
  
}
# End of visstat_core function -------
Any scripts or data that you put into this service are public.
visStatistics documentation built on June 8, 2025, 1:58 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
visStatistics
Automated Selection and Visualisation of Statistical Hypothesis Tests

R/visstat_core.R
In visStatistics: Automated Selection and Visualisation of Statistical Hypothesis Tests

Defines functions visstat_core

Documented in visstat_core

Try the visStatistics package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

visStatistics Automated Selection and Visualisation of Statistical Hypothesis Tests

R/visstat_core.R In visStatistics: Automated Selection and Visualisation of Statistical Hypothesis Tests

Defines functions visstat_core

Documented in visstat_core

Try the visStatistics package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

visStatistics
Automated Selection and Visualisation of Statistical Hypothesis Tests

R/visstat_core.R
In visStatistics: Automated Selection and Visualisation of Statistical Hypothesis Tests