R/normalityAssessment.R
In ufs: A Collection of Utilities

Documented in normalityAssessment

#' normalityAssessment and samplingDistribution
#'
#' normalityAssessment can be used to assess whether a variable and the
#' sampling distribution of its mean have an approximately normal distribution.
#'
#' samplingDistribution is a convenient wrapper for normalityAssessment that
#' makes it easy to quickly generate a sample and sampling distribution from
#' frequencies (or proportions).
#'
#' dataShape computes the skewness and kurtosis.
#'
#'
#' normalityAssessment provides a number of normality tests and draws
#' histograms of the sample data and the sampling distribution of the mean
#' (most statistical tests assume the latter is normal, rather than the first;
#' normality of the sample data guarantees normality of the sampling
#' distribution of the mean, but if the sample size is sufficiently large, the
#' sampling distribution of the mean is approximately normal even when the
#' sample data are not normally distributed). Note that for the sampling
#' distribution, the degrees of freedom are usually so huge that the normality
#' tests, negligible deviations from normality will already result in very
#' small p-values.
#'
#' samplingDistribution makes it easy to quickly assess the distribution of a
#' variables based on frequencies or proportions, and dataShape computes
#' skewness and kurtosis.
#'
#' @aliases normalityAssessment samplingDistribution dataShape
#' @param sampleVector Numeric vector containing the sample data.
#' @param samples Number of samples to use when constructing sampling
#' distribution.
#' @param digits Number of digits to use when printing results.
#' @param samplingDistColor Color to use when drawing the sampling
#' distribution.
#' @param normalColor Color to use when drawing the standard normal curve.
#' @param samplingDistLineSize Size of the line used to draw the sampling
#' distribution.
#' @param normalLineSize Size of the line used to draw the standard normal
#' distribution.
#' @param xLabel.sampleDist Label of x axis of the distribution of the sample.
#' @param yLabel.sampleDist Label of y axis of the distribution of the sample.
#' @param xLabel.samplingDist Label of x axis of the sampling distribution.
#' @param yLabel.samplingDist Label of y axis of the sampling distribution.
#' @param xLabs,yLabs The axis labels for the three plots (should be vectors of
#' three elements; the first specifies the X or Y axis label for the rightmost
#' plot (the histogram), the second for the middle plot (the QQ plot), and the
#' third for the rightmost plot (the box plot).
#' @param popValues The possible values (levels) of the relevant variable. For
#' example, for a dichotomous variable, this can be "c(1:2)" (or "c(1, 2)").
#' Note that samplingDistribution is for manually specifying the frequency
#' distribution (or proportions); if you have a vector with 'raw' data, just
#' call normalityAssessment directly.
#' @param popFrequencies The frequencies corresponding to each value in
#' popValues; must be in the same order! See the examples.
#' @param sampleSize Size of the sample; the sum of the frequencies if not
#' specified.
#' @param na.rm Whether to remove missing data first.
#' @param type Type of skewness and kurtosis to compute; either 1 (g1 and g2),
#' 2 (G1 and G2), or 3 (b1 and b2). See Joanes & Gill (1998) for more
#' information.
#' @param conf.level Confidence of confidence intervals.
#' @param plots Whether to display plots.
#' @param qqCI Whether to show the confidence interval for the QQ plot.
#' @param labelOutliers Whether to label outliers with their row number in the
#' box plot.
#' @param sampleFromPop If true, the sample vector is created by sampling from
#' the population information specified; if false, rep() is used to generate
#' the sample vector. Note that is proportions are supplied in popFrequencies,
#' sampling from the population is necessary!
#' @param sampleSizeOverride Whether to use the sample size of the sample as
#' sample size for the sampling distribution, instead of the sampling
#' distribution size. This makes sense, because otherwise, the sample size and
#' thus sensitivity of the null hypothesis significance tests is a function of
#' the number of samples used to generate the sampling distribution.
#' @param extraNotification Whether to be particularly informative.
#' @param headerPrefix A prefix to insert before the heading (e.g. to use Markdown
#' headings).
#' @param suppressPlot Whether to suppress (`TRUE`) or print (`FALSE`) the plot.
#' @param x The object to print/pander.
#' @param ... Additional arguments are passed on, usually to the default methods.
#' @return
#'
#' An object with several results, the most notably of which are:
#' \item{plot.sampleDist}{Histogram of sample distribution}
#' \item{sw.sampleDist}{Shapiro-Wilk normality test of sample distribution}
#' \item{ad.sampleDist}{Anderson-Darling normality test of sample distribution}
#' \item{ks.sampleDist}{Kolmogorov-Smirnof normality test of sample
#' distribution} \item{kurtosis.sampleDist}{Kurtosis for sample distribution}
#' \item{skewness.sampleDist}{Skewness for sample distribution}
#' \item{plot.samplingDist}{Histogram of sampling distribution}
#' \item{sw.samplingDist}{Shapiro-Wilk normality test of sampling distribution}
#' \item{ad.samplingDist}{Anderson-Darling normality test of sampling
#' distribution} \item{ks.samplingDist}{Kolmogorov-Smirnof normality test of
#' sampling distribution} \item{dataShape.samplingDist}{Skewness and kurtosis
#' for sampling distribution}
#' @keywords utilities
#' @rdname normalityAssessment
#' @examples
#'
#' ### Note: the 'not run' is simply because running takes a lot of time,
#' ###       but these examples are all safe to run!
#' \dontrun{
#'
#' normalityAssessment(rnorm(35));
#'
#' ### Create a distribution of three possible values and
#' ### show the sampling distribution for the mean
#' popValues <- c(1, 2, 3);
#' popFrequencies <- c(20, 50, 30);
#' sampleSize <- 100;
#' samplingDistribution(popValues = popValues,
#'                      popFrequencies = popFrequencies,
#'                      sampleSize = sampleSize);
#'
#' ### Create a very skewed distribution of ten possible values
#' popValues <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
#' popFrequencies <- c(2, 4, 8, 6, 10, 15, 12, 200, 350, 400);
#' samplingDistribution(popValues = popValues,
#'                      popFrequencies = popFrequencies,
#'                      sampleSize = sampleSize, digits=5);
#' }
#'
#' @export normalityAssessment
normalityAssessment <- function(sampleVector, samples = 10000, digits=2,
                                samplingDistColor = "#2222CC",
                                normalColor = "#00CC00",
                                samplingDistLineSize = 2,
                                normalLineSize = 1,
                                xLabel.sampleDist = NULL,
                                yLabel.sampleDist = NULL,
                                xLabel.samplingDist = NULL,
                                yLabel.samplingDist = NULL,
                                sampleSizeOverride = TRUE) {

  ### Create object for returning results
  res <- list(sampleVector.raw = sampleVector,
              sampleVector = sampleVector[stats::complete.cases(sampleVector)],
              sampleSize = length(sampleVector[stats::complete.cases(sampleVector)]),
              samples = samples,
              digits = digits);

  ### Construct temporary dataset for
  ### plotting sample distribution
  normalX <- c(seq(min(res$sampleVector), max(res$sampleVector),
                   by=(max(res$sampleVector) - min(res$sampleVector))/(res$sampleSize-1)));
  normalY <- stats::dnorm(normalX, mean=mean(res$sampleVector),
                          sd=stats::sd(res$sampleVector));
  sampleDistY <- res$sampleVector;
  tempDat <- data.frame(normalX = normalX, normalY = normalY, sampleDist = sampleDistY);
  tempBinWidth <- (max(res$sampleVector) - min(res$sampleVector)) / 30;

  ### Generate labels if these weren't specified
  if (is.null(xLabel.sampleDist)) {
    xLabel.sampleDist <- extractVarName(deparse(substitute(sampleVector)));
  }
  if (is.null(yLabel.sampleDist)) {
    yLabel.sampleDist <- paste0('Frequencies for n=', res$sampleSize);
  }

  ### Plot sample distribution
  res$plot.sampleDist <- normalHist(tempDat$sampleDist,
                                    xLabel=xLabel.sampleDist,
                                    yLabel=yLabel.sampleDist,
                                    distributionColor=samplingDistColor,
                                    normalColor=normalColor,
                                    distributionLineSize=samplingDistLineSize,
                                    normalLineSize=normalLineSize)$plot +
    ggplot2::ggtitle("Sample distribution");

  res$qqPlot.sampleDist <- ggqq(tempDat$sampleDist);

  ### Take 'samples' samples of sampleSize people and store the means
  ### (first generate an empty vector to store the means)
  res$samplingDistribution <- replicate(samples,
                                        mean(sample(res$sampleVector,
                                                    size=res$sampleSize,
                                                    replace=TRUE)));
  # res$samplingDistribution <- c();
  # for (i in 1:samples) {
  #   res$samplingDistribution[i] <- mean(sample(res$sampleVector, size=res$sampleSize,
  #                                              replace=TRUE));
  # }

  ### Construct temporary dataset for
  ### plotting sampling distribution
  normalX <- c(seq(min(res$samplingDistribution), max(res$samplingDistribution),
                   by=(max(res$samplingDistribution) - min(res$samplingDistribution))/(res$samples-1)));
  normalY <- stats::dnorm(normalX, mean=mean(res$samplingDistribution),
                          sd=stats::sd(res$samplingDistribution));
  samplingDistY <- res$samplingDistribution;
  tempDat <- data.frame(normalX = normalX, normalY = normalY, samplingDist = samplingDistY);
  tempBinWidth <- (max(res$samplingDistribution) - min(res$samplingDistribution)) / 30;

  ### Generate labels if these weren't specified
  if (is.null(xLabel.samplingDist)) {
    xLabel.samplingDist <- extractVarName(deparse(substitute(sampleVector)));
  }
  if (is.null(yLabel.samplingDist)) {
    yLabel.samplingDist <- paste0('Frequencies for ', res$samples, ' samples of n=', res$sampleSize);
  }

  ### Plot sampling distribution
  res$plot.samplingDist <- normalHist(tempDat$samplingDist,
                                      xLabel=xLabel.samplingDist,
                                      yLabel=yLabel.samplingDist,
                                      distributionColor=samplingDistColor,
                                      normalColor=normalColor,
                                      distributionLineSize=samplingDistLineSize,
                                      normalLineSize=normalLineSize)$plot +
    ggplot2::ggtitle("Sampling distribution");

  res$qqPlot.samplingDist <- ggqq(tempDat$samplingDist,
                                  sampleSizeOverride = res$sampleSize);

  ### Shapiro Wilk test - if there are more than 5000
  ### datapoints, only use the first 5000 datapoints
  res$sw.sampleDist <- ifelseObj(res$sampleSize > 5000,
                              stats::shapiro.test(res$sampleVector[1:5000]),
                              stats::shapiro.test(res$sampleVector));
  res$sw.samplingDist <- ifelseObj(res$samples > 5000,
                                   stats::shapiro.test(res$samplingDistribution[1:5000]),
                                   stats::shapiro.test(res$samplingDistribution));

  ### Anderson-Darling test
  res$ad.sampleDist <- ad.test_from_nortest(res$sampleVector);
  res$ad.samplingDist <- ad.test_from_nortest(res$samplingDistribution);

  ### Kolomogorov-Smirnof test
  suppressWarnings(res$ks.sampleDist <-
                     stats::ks.test(res$sampleVector,
                                    "pnorm",
                                    alternative = "two.sided"));
  suppressWarnings(res$ks.samplingDist <-
                     stats::ks.test(res$samplingDistribution,
                                    "pnorm",
                                    alternative = "two.sided"));

  ### Skewness and kurtosis
  res$dataShape.sampleDist <- dataShape(res$sampleVector, plots=FALSE);
  res$dataShape.samplingDist <- dataShape(res$samplingDistribution,
                                          sampleSizeOverride=ifelse(sampleSizeOverride,
                                                                    length(res$sampleVector),
                                                                    NULL),
                                          plots=FALSE);

  ### Set class for returnable object and return it
  class(res) <- 'normalityAssessment';
  return(res);

}

#' @method print normalityAssessment
#' @rdname normalityAssessment
#' @export
print.normalityAssessment <- function (x, ...) {

  if (x$sampleSize > 5000) {
    sw.sampleDist <- paste0("Shapiro-Wilk: p=", round(x$sw.sampleDist$p.value, x$digits),
                                   " (W=", round(x$sw.sampleDist$statistic, x$digits),
                                   "; NOTE: based on the first 5000 of ",
                                 x$sampleSize, " observations)");
  }
  else {
    sw.sampleDist <- paste0("Shapiro-Wilk: p=", round(x$sw.sampleDist$p.value, x$digits),
                                   " (W=", round(x$sw.sampleDist$statistic, x$digits),
                                   "; based on ", x$sampleSize, " observations)");
  }

  if (x$samples > 5000) {
    sw.samplingDist <- paste0("Shapiro-Wilk: p=", round(x$sw.samplingDist$p.value, x$digits),
                       " (W=", round(x$sw.samplingDist$statistic, x$digits),
                       "; NOTE: based on the first 5000 of ",
                       x$samples, " observations)");
  }
  else {
    sw.samplingDist <- paste0("Shapiro-Wilk: p=", round(x$sw.samplingDist$p.value, x$digits),
                           " (W=", round(x$sw.samplingDist$statistic, x$digits),
                           "; based on ", x$samples, " observations)");
  }

  ### Show output
  cat("## SAMPLE DISTRIBUTION ###\n");
  cat(paste0("Sample distribution of ", x$sampleSize,
             " observations\n",
             "Mean=", round(mean(x$sampleVector), x$digits),
             ", median=", round(stats::median(x$sampleVector), x$digits),
             ", SD=", round(stats::sd(x$sampleVector), x$digits),
             ", and therefore SE of the mean = ",
             round(stats::sd(x$sampleVector)/sqrt(x$sampleSize), x$digits),
             "\n\n"));
  print(x$dataShape.sampleDist, extraNotification=FALSE);
  cat(paste0("\n", sw.sampleDist, "\n",
             "Anderson-Darling: p=",
             round(x$ad.sampleDist$p.value, x$digits),
             # round(x$ad.sampleDist@test$p.value, x$digits),
             " (A=",
             round(x$ad.sampleDist$statistic, x$digits),
             # round(x$ad.sampleDist@test$statistic, x$digits),
             ")\n",
             "Kolmogorov-Smirnof: p=", round(x$ks.sampleDist$p.value, x$digits),
             " (D=", round(x$ks.sampleDist$statistic, x$digits), ")"));

  cat("\n\n## SAMPLING DISTRIBUTION FOR THE MEAN ###\n");
  cat(paste0("Sampling distribution of ", x$samples, " samples of n=", x$sampleSize, "\n",
             "Mean=", round(mean(x$samplingDistribution), x$digits),
             ", median=", round(stats::median(x$samplingDistribution), x$digits),
             ", SD=", round(sqrt(stats::var(x$samplingDistribution)), x$digits),
             "\n\n"));
  print(x$dataShape.samplingDist, extraNotification=FALSE);
  cat(paste0("\n", sw.samplingDist, "\n",
             "Anderson-Darling: p=",
             round(x$ad.samplingDist$p.value, x$digits),
             # round(x$ad.samplingDist@test$p.value, x$digits),
             " (A=",
             round(x$ad.samplingDist$statistic, x$digits),
             # round(x$ad.samplingDist@test$statistic, x$digits),
             ")\n",
             "Kolmogorov-Smirnof: p=", round(x$ks.samplingDist$p.value, x$digits),
             " (D=", round(x$ks.samplingDist$statistic, x$digits), ")"));

  ### Plots
  gridExtra::grid.arrange(x$plot.sampleDist,
                          x$plot.samplingDist,
                          x$qqPlot.sampleDist,
                          x$qqPlot.samplingDist,
                          ncol=2);

  invisible();
}

#' @method pander normalityAssessment
#' @rdname normalityAssessment
#' @export
pander.normalityAssessment <- function (x, headerPrefix = "#####",
                                        suppressPlot = FALSE, ...) {

  if (x$sampleSize > 5000) {
    sw.sampleDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.sampleDist$p.value, x$digits + 1),
                            " (W=", round(x$sw.sampleDist$statistic, x$digits),
                            "; NOTE: based on the first 5000 of ",
                            x$sampleSize, " observations)");
  }
  else {
    sw.sampleDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.sampleDist$p.value, x$digits + 1),
                            " (W=", round(x$sw.sampleDist$statistic, x$digits),
                            "; based on ", x$sampleSize, " observations)");
  }

  if (x$samples > 5000) {
    sw.samplingDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.samplingDist$p.value, x$digits + 1),
                              " (W=", round(x$sw.samplingDist$statistic, x$digits),
                              "; NOTE: based on the first 5000 of ",
                              x$samples, " observations)");
  }
  else {
    sw.samplingDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.samplingDist$p.value, x$digits + 1),
                              " (W=", round(x$sw.samplingDist$statistic, x$digits),
                              "; based on ", x$samples, " observations)");
  }

  ### Show output
  cat0("\n\n\n", headerPrefix, " Sample distribution\n\n");
  cat(paste0("Sample distribution of ", x$sampleSize,
             " observations  \n",
             "Mean=", round(mean(x$sampleVector), x$digits),
             ", median=", round(stats::median(x$sampleVector), x$digits),
             ", SD=", round(stats::sd(x$sampleVector), x$digits),
             ", and therefore SE of the mean = ",
             round(stats::sd(x$sampleVector)/sqrt(x$sampleSize), x$digits),
             "\n\n"));
  pander(x$dataShape.sampleDist, extraNotification=FALSE);
  cat(paste0("\n\n", sw.sampleDist, "  \n",
             "Anderson-Darling: ",
             formatPvalue(x$ad.sampleDist$p.value,x$digits + 1),
             # formatPvalue(x$ad.sampleDist@test$p.value,x$digits + 1),
             " (A=",
             round(x$ad.sampleDist$statistic, x$digits),
             # round(x$ad.sampleDist@test$statistic, x$digits),
             ")  \n",
             "Kolmogorov-Smirnof: ", formatPvalue(x$ks.sampleDist$p.value, x$digits + 1),
             " (D=", round(x$ks.sampleDist$statistic, x$digits), ")"));

  cat0("\n\n", headerPrefix, " Sampling distribution of the mean\n\n");
  cat(paste0("Sampling distribution of ", x$samples, " samples of n=", x$sampleSize, "  \n",
             "Mean=", round(mean(x$samplingDistribution), x$digits),
             ", median=", round(stats::median(x$samplingDistribution), x$digits),
             ", SD=", round(sqrt(stats::var(x$samplingDistribution)), x$digits),
             "\n\n"));
  pander(x$dataShape.samplingDist, extraNotification=FALSE);
  cat(paste0("\n\n", sw.samplingDist, "  \n",
             "Anderson-Darling: ",
             formatPvalue(x$ad.samplingDist$p.value, x$digits + 1),
             # formatPvalue(x$ad.samplingDist@test$p.value, x$digits + 1),
             " (A=",
             round(x$ad.samplingDist$statistic, x$digits),
             # round(x$ad.samplingDist@test$statistic, x$digits),
             ")  \n",
             "Kolmogorov-Smirnof: ", formatPvalue(x$ks.samplingDist$p.value, x$digits + 1),
             " (D=", round(x$ks.samplingDist$statistic, x$digits), ")"));
  cat("\n\n\n");
  ### Plots
  if (!suppressPlot) {
    gridExtra::grid.arrange(x$plot.sampleDist,
                            x$plot.samplingDist,
                            x$qqPlot.sampleDist,
                            x$qqPlot.samplingDist,
                            ncol=2);
  }
  invisible();
}