R/normalityAssessment.R

#' #' normalityAssessment and samplingDistribution
#' #'
#' #' normalityAssessment can be used to assess whether a variable and the
#' #' sampling distribution of its mean have an approximately normal distribution.
#' #'
#' #' samplingDistribution is a convenient wrapper for normalityAssessment that
#' #' makes it easy to quickly generate a sample and sampling distribution from
#' #' frequencies (or proportions).
#' #'
#' #' dataShape computes the skewness and kurtosis.
#' #'
#' #'
#' #' normalityAssessment provides a number of normality tests and draws
#' #' histograms of the sample data and the sampling distribution of the mean
#' #' (most statistical tests assume the latter is normal, rather than the first;
#' #' normality of the sample data guarantees normality of the sampling
#' #' distribution of the mean, but if the sample size is sufficiently large, the
#' #' sampling distribution of the mean is approximately normal even when the
#' #' sample data are not normally distributed). Note that for the sampling
#' #' distribution, the degrees of freedom are usually so huge that the normality
#' #' tests, negligible deviations from normality will already result in very
#' #' small p-values.
#' #'
#' #' samplingDistribution makes it easy to quickly assess the distribution of a
#' #' variables based on frequencies or proportions, and dataShape computes
#' #' skewness and kurtosis.
#' #'
#' #' @aliases normalityAssessment samplingDistribution dataShape
#' #' @param sampleVector Numeric vector containing the sample data.
#' #' @param samples Number of samples to use when constructing sampling
#' #' distribution.
#' #' @param digits Number of digits to use when printing results.
#' #' @param samplingDistColor Color to use when drawing the sampling
#' #' distribution.
#' #' @param normalColor Color to use when drawing the standard normal curve.
#' #' @param samplingDistLineSize Size of the line used to draw the sampling
#' #' distribution.
#' #' @param normalLineSize Size of the line used to draw the standard normal
#' #' distribution.
#' #' @param xLabel.sampleDist Label of x axis of the distribution of the sample.
#' #' @param yLabel.sampleDist Label of y axis of the distribution of the sample.
#' #' @param xLabel.samplingDist Label of x axis of the sampling distribution.
#' #' @param yLabel.samplingDist Label of y axis of the sampling distribution.
#' #' @param xLabs,yLabs The axis labels for the three plots (should be vectors of
#' #' three elements; the first specifies the X or Y axis label for the rightmost
#' #' plot (the histogram), the second for the middle plot (the QQ plot), and the
#' #' third for the rightmost plot (the box plot).
#' #' @param popValues The possible values (levels) of the relevant variable. For
#' #' example, for a dichotomous variable, this can be "c(1:2)" (or "c(1, 2)").
#' #' Note that samplingDistribution is for manually specifying the frequency
#' #' distribution (or proportions); if you have a vector with 'raw' data, just
#' #' call normalityAssessment directly.
#' #' @param popFrequencies The frequencies corresponding to each value in
#' #' popValues; must be in the same order! See the examples.
#' #' @param sampleSize Size of the sample; the sum of the frequencies if not
#' #' specified.
#' #' @param na.rm Whether to remove missing data first.
#' #' @param type Type of skewness and kurtosis to compute; either 1 (g1 and g2),
#' #' 2 (G1 and G2), or 3 (b1 and b2). See Joanes & Gill (1998) for more
#' #' information.
#' #' @param conf.level Confidence of confidence intervals.
#' #' @param plots Whether to display plots.
#' #' @param qqCI Whether to show the confidence interval for the QQ plot.
#' #' @param labelOutliers Whether to label outliers with their row number in the
#' #' box plot.
#' #' @param sampleFromPop If true, the sample vector is created by sampling from
#' #' the population information specified; if false, rep() is used to generate
#' #' the sample vector. Note that is proportions are supplied in popFrequencies,
#' #' sampling from the population is necessary!
#' #' @param sampleSizeOverride Whether to use the sample size of the sample as
#' #' sample size for the sampling distribution, instead of the sampling
#' #' distribution size. This makes sense, because otherwise, the sample size and
#' #' thus sensitivity of the null hypothesis significance tests is a function of
#' #' the number of samples used to generate the sampling distribution.
#' #' @param extraNotification Whether to be particularly informative.
#' #' @param headerPrefix A prefix to insert before the heading (e.g. to use Markdown
#' #' headings).
#' #' @param suppressPlot Whether to suppress (`TRUE`) or print (`FALSE`) the plot.
#' #' @param x The object to print/pander.
#' #' @param ... Additional arguments are passed on, usually to the default methods.
#' #' @return
#' #'
#' #' An object with several results, the most notably of which are:
#' #' \item{plot.sampleDist}{Histogram of sample distribution}
#' #' \item{sw.sampleDist}{Shapiro-Wilk normality test of sample distribution}
#' #' \item{ad.sampleDist}{Anderson-Darling normality test of sample distribution}
#' #' \item{ks.sampleDist}{Kolmogorov-Smirnof normality test of sample
#' #' distribution} \item{kurtosis.sampleDist}{Kurtosis for sample distribution}
#' #' \item{skewness.sampleDist}{Skewness for sample distribution}
#' #' \item{plot.samplingDist}{Histogram of sampling distribution}
#' #' \item{sw.samplingDist}{Shapiro-Wilk normality test of sampling distribution}
#' #' \item{ad.samplingDist}{Anderson-Darling normality test of sampling
#' #' distribution} \item{ks.samplingDist}{Kolmogorov-Smirnof normality test of
#' #' sampling distribution} \item{dataShape.samplingDist}{Skewness and kurtosis
#' #' for sampling distribution}
#' #' @keywords utilities
#' #' @rdname normalityAssessment
#' #' @examples
#' #'
#' #' ### Note: the 'not run' is simply because running takes a lot of time,
#' #' ###       but these examples are all safe to run!
#' #' \dontrun{
#' #'
#' #' normalityAssessment(rnorm(35));
#' #'
#' #' ### Create a distribution of three possible values and
#' #' ### show the sampling distribution for the mean
#' #' popValues <- c(1, 2, 3);
#' #' popFrequencies <- c(20, 50, 30);
#' #' sampleSize <- 100;
#' #' samplingDistribution(popValues = popValues,
#' #'                      popFrequencies = popFrequencies,
#' #'                      sampleSize = sampleSize);
#' #'
#' #' ### Create a very skewed distribution of ten possible values
#' #' popValues <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
#' #' popFrequencies <- c(2, 4, 8, 6, 10, 15, 12, 200, 350, 400);
#' #' samplingDistribution(popValues = popValues,
#' #'                      popFrequencies = popFrequencies,
#' #'                      sampleSize = sampleSize, digits=5);
#' #' }
#' #'
#' #' @export normalityAssessment
#' normalityAssessment <- function(sampleVector, samples = 10000, digits=2,
#'                                 samplingDistColor = "#2222CC",
#'                                 normalColor = "#00CC00",
#'                                 samplingDistLineSize = 2,
#'                                 normalLineSize = 1,
#'                                 xLabel.sampleDist = NULL,
#'                                 yLabel.sampleDist = NULL,
#'                                 xLabel.samplingDist = NULL,
#'                                 yLabel.samplingDist = NULL,
#'                                 sampleSizeOverride = TRUE) {
#'
#'   ### Create object for returning results
#'   res <- list(sampleVector.raw = sampleVector,
#'               sampleVector = sampleVector[stats::complete.cases(sampleVector)],
#'               sampleSize = length(sampleVector[stats::complete.cases(sampleVector)]),
#'               samples = samples,
#'               digits = digits);
#'
#'   ### Construct temporary dataset for
#'   ### plotting sample distribution
#'   normalX <- c(seq(min(res$sampleVector), max(res$sampleVector),
#'                    by=(max(res$sampleVector) - min(res$sampleVector))/(res$sampleSize-1)));
#'   normalY <- stats::dnorm(normalX, mean=mean(res$sampleVector),
#'                           sd=stats::sd(res$sampleVector));
#'   sampleDistY <- res$sampleVector;
#'   tempDat <- data.frame(normalX = normalX, normalY = normalY, sampleDist = sampleDistY);
#'   tempBinWidth <- (max(res$sampleVector) - min(res$sampleVector)) / 30;
#'
#'   ### Generate labels if these weren't specified
#'   if (is.null(xLabel.sampleDist)) {
#'     xLabel.sampleDist <- ufs::extractVarName(deparse(substitute(sampleVector)));
#'   }
#'   if (is.null(yLabel.sampleDist)) {
#'     yLabel.sampleDist <- paste0('Frequencies for n=', res$sampleSize);
#'   }
#'
#'   ### Plot sample distribution
#'   res$plot.sampleDist <- normalHist(tempDat$sampleDist,
#'                                     xLabel=xLabel.sampleDist,
#'                                     yLabel=yLabel.sampleDist,
#'                                     distributionColor=samplingDistColor,
#'                                     normalColor=normalColor,
#'                                     distributionLineSize=samplingDistLineSize,
#'                                     normalLineSize=normalLineSize)$plot +
#'     ggplot2::ggtitle("Sample distribution");
#'
#'   res$qqPlot.sampleDist <- ggqq(tempDat$sampleDist);
#'
#'   ### Take 'samples' samples of sampleSize people and store the means
#'   ### (first generate an empty vector to store the means)
#'   res$samplingDistribution <- replicate(samples,
#'                                         mean(sample(res$sampleVector,
#'                                                     size=res$sampleSize,
#'                                                     replace=TRUE)));
#'   # res$samplingDistribution <- c();
#'   # for (i in 1:samples) {
#'   #   res$samplingDistribution[i] <- mean(sample(res$sampleVector, size=res$sampleSize,
#'   #                                              replace=TRUE));
#'   # }
#'
#'   ### Construct temporary dataset for
#'   ### plotting sampling distribution
#'   normalX <- c(seq(min(res$samplingDistribution), max(res$samplingDistribution),
#'                    by=(max(res$samplingDistribution) - min(res$samplingDistribution))/(res$samples-1)));
#'   normalY <- stats::dnorm(normalX, mean=mean(res$samplingDistribution),
#'                           sd=stats::sd(res$samplingDistribution));
#'   samplingDistY <- res$samplingDistribution;
#'   tempDat <- data.frame(normalX = normalX, normalY = normalY, samplingDist = samplingDistY);
#'   tempBinWidth <- (max(res$samplingDistribution) - min(res$samplingDistribution)) / 30;
#'
#'   ### Generate labels if these weren't specified
#'   if (is.null(xLabel.samplingDist)) {
#'     xLabel.samplingDist <- ufs::extractVarName(deparse(substitute(sampleVector)));
#'   }
#'   if (is.null(yLabel.samplingDist)) {
#'     yLabel.samplingDist <- paste0('Frequencies for ', res$samples, ' samples of n=', res$sampleSize);
#'   }
#'
#'   ### Plot sampling distribution
#'   res$plot.samplingDist <- normalHist(tempDat$samplingDist,
#'                                       xLabel=xLabel.samplingDist,
#'                                       yLabel=yLabel.samplingDist,
#'                                       distributionColor=samplingDistColor,
#'                                       normalColor=normalColor,
#'                                       distributionLineSize=samplingDistLineSize,
#'                                       normalLineSize=normalLineSize)$plot +
#'     ggplot2::ggtitle("Sampling distribution");
#'
#'   res$qqPlot.samplingDist <- ggqq(tempDat$samplingDist,
#'                                   sampleSizeOverride = res$sampleSize);
#'
#'   ### Shapiro Wilk test - if there are more than 5000
#'   ### datapoints, only use the first 5000 datapoints
#'   res$sw.sampleDist <- ifelseObj(res$sampleSize > 5000,
#'                               stats::shapiro.test(res$sampleVector[1:5000]),
#'                               stats::shapiro.test(res$sampleVector));
#'   res$sw.samplingDist <- ifelseObj(res$samples > 5000,
#'                                    stats::shapiro.test(res$samplingDistribution[1:5000]),
#'                                    stats::shapiro.test(res$samplingDistribution));
#'
#'   ### Anderson-Darling test
#'   res$ad.sampleDist <- ad.test_from_nortest(res$sampleVector);
#'   res$ad.samplingDist <- ad.test_from_nortest(res$samplingDistribution);
#'
#'   ### Kolomogorov-Smirnof test
#'   suppressWarnings(res$ks.sampleDist <-
#'                      stats::ks.test(res$sampleVector,
#'                                     "pnorm",
#'                                     alternative = "two.sided"));
#'   suppressWarnings(res$ks.samplingDist <-
#'                      stats::ks.test(res$samplingDistribution,
#'                                     "pnorm",
#'                                     alternative = "two.sided"));
#'
#'   ### Skewness and kurtosis
#'   res$dataShape.sampleDist <- dataShape(res$sampleVector, plots=FALSE);
#'   res$dataShape.samplingDist <- dataShape(res$samplingDistribution,
#'                                           sampleSizeOverride=ifelse(sampleSizeOverride,
#'                                                                     length(res$sampleVector),
#'                                                                     NULL),
#'                                           plots=FALSE);
#'
#'   ### Set class for returnable object and return it
#'   class(res) <- 'normalityAssessment';
#'   return(res);
#'
#' }
#'
#' #' @method print normalityAssessment
#' #' @rdname normalityAssessment
#' #' @export
#' print.normalityAssessment <- function (x, ...) {
#'
#'   if (x$sampleSize > 5000) {
#'     sw.sampleDist <- paste0("Shapiro-Wilk: p=", round(x$sw.sampleDist$p.value, x$digits),
#'                                    " (W=", round(x$sw.sampleDist$statistic, x$digits),
#'                                    "; NOTE: based on the first 5000 of ",
#'                                  x$sampleSize, " observations)");
#'   }
#'   else {
#'     sw.sampleDist <- paste0("Shapiro-Wilk: p=", round(x$sw.sampleDist$p.value, x$digits),
#'                                    " (W=", round(x$sw.sampleDist$statistic, x$digits),
#'                                    "; based on ", x$sampleSize, " observations)");
#'   }
#'
#'   if (x$samples > 5000) {
#'     sw.samplingDist <- paste0("Shapiro-Wilk: p=", round(x$sw.samplingDist$p.value, x$digits),
#'                        " (W=", round(x$sw.samplingDist$statistic, x$digits),
#'                        "; NOTE: based on the first 5000 of ",
#'                        x$samples, " observations)");
#'   }
#'   else {
#'     sw.samplingDist <- paste0("Shapiro-Wilk: p=", round(x$sw.samplingDist$p.value, x$digits),
#'                            " (W=", round(x$sw.samplingDist$statistic, x$digits),
#'                            "; based on ", x$samples, " observations)");
#'   }
#'
#'   ### Show output
#'   cat("## SAMPLE DISTRIBUTION ###\n");
#'   cat(paste0("Sample distribution of ", x$sampleSize,
#'              " observations\n",
#'              "Mean=", round(mean(x$sampleVector), x$digits),
#'              ", median=", round(stats::median(x$sampleVector), x$digits),
#'              ", SD=", round(stats::sd(x$sampleVector), x$digits),
#'              ", and therefore SE of the mean = ",
#'              round(stats::sd(x$sampleVector)/sqrt(x$sampleSize), x$digits),
#'              "\n\n"));
#'   print(x$dataShape.sampleDist, extraNotification=FALSE);
#'   cat(paste0("\n", sw.sampleDist, "\n",
#'              "Anderson-Darling: p=",
#'              round(x$ad.sampleDist$p.value, x$digits),
#'              # round(x$ad.sampleDist@test$p.value, x$digits),
#'              " (A=",
#'              round(x$ad.sampleDist$statistic, x$digits),
#'              # round(x$ad.sampleDist@test$statistic, x$digits),
#'              ")\n",
#'              "Kolmogorov-Smirnof: p=", round(x$ks.sampleDist$p.value, x$digits),
#'              " (D=", round(x$ks.sampleDist$statistic, x$digits), ")"));
#'
#'   cat("\n\n## SAMPLING DISTRIBUTION FOR THE MEAN ###\n");
#'   cat(paste0("Sampling distribution of ", x$samples, " samples of n=", x$sampleSize, "\n",
#'              "Mean=", round(mean(x$samplingDistribution), x$digits),
#'              ", median=", round(stats::median(x$samplingDistribution), x$digits),
#'              ", SD=", round(sqrt(stats::var(x$samplingDistribution)), x$digits),
#'              "\n\n"));
#'   print(x$dataShape.samplingDist, extraNotification=FALSE);
#'   cat(paste0("\n", sw.samplingDist, "\n",
#'              "Anderson-Darling: p=",
#'              round(x$ad.samplingDist$p.value, x$digits),
#'              # round(x$ad.samplingDist@test$p.value, x$digits),
#'              " (A=",
#'              round(x$ad.samplingDist$statistic, x$digits),
#'              # round(x$ad.samplingDist@test$statistic, x$digits),
#'              ")\n",
#'              "Kolmogorov-Smirnof: p=", round(x$ks.samplingDist$p.value, x$digits),
#'              " (D=", round(x$ks.samplingDist$statistic, x$digits), ")"));
#'
#'   ### Plots
#'   gridExtra::grid.arrange(x$plot.sampleDist,
#'                           x$plot.samplingDist,
#'                           x$qqPlot.sampleDist,
#'                           x$qqPlot.samplingDist,
#'                           ncol=2);
#'
#'   invisible();
#' }
#'
#' #' @method pander normalityAssessment
#' #' @rdname normalityAssessment
#' #' @export
#' pander.normalityAssessment <- function (x, headerPrefix = "#####",
#'                                         suppressPlot = FALSE, ...) {
#'
#'   if (x$sampleSize > 5000) {
#'     sw.sampleDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.sampleDist$p.value, x$digits + 1),
#'                             " (W=", round(x$sw.sampleDist$statistic, x$digits),
#'                             "; NOTE: based on the first 5000 of ",
#'                             x$sampleSize, " observations)");
#'   }
#'   else {
#'     sw.sampleDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.sampleDist$p.value, x$digits + 1),
#'                             " (W=", round(x$sw.sampleDist$statistic, x$digits),
#'                             "; based on ", x$sampleSize, " observations)");
#'   }
#'
#'   if (x$samples > 5000) {
#'     sw.samplingDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.samplingDist$p.value, x$digits + 1),
#'                               " (W=", round(x$sw.samplingDist$statistic, x$digits),
#'                               "; NOTE: based on the first 5000 of ",
#'                               x$samples, " observations)");
#'   }
#'   else {
#'     sw.samplingDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.samplingDist$p.value, x$digits + 1),
#'                               " (W=", round(x$sw.samplingDist$statistic, x$digits),
#'                               "; based on ", x$samples, " observations)");
#'   }
#'
#'   ### Show output
#'   cat0("\n\n\n", headerPrefix, " Sample distribution\n\n");
#'   cat(paste0("Sample distribution of ", x$sampleSize,
#'              " observations  \n",
#'              "Mean=", round(mean(x$sampleVector), x$digits),
#'              ", median=", round(stats::median(x$sampleVector), x$digits),
#'              ", SD=", round(stats::sd(x$sampleVector), x$digits),
#'              ", and therefore SE of the mean = ",
#'              round(stats::sd(x$sampleVector)/sqrt(x$sampleSize), x$digits),
#'              "\n\n"));
#'   pander(x$dataShape.sampleDist, extraNotification=FALSE);
#'   cat(paste0("\n\n", sw.sampleDist, "  \n",
#'              "Anderson-Darling: ",
#'              formatPvalue(x$ad.sampleDist$p.value,x$digits + 1),
#'              # formatPvalue(x$ad.sampleDist@test$p.value,x$digits + 1),
#'              " (A=",
#'              round(x$ad.sampleDist$statistic, x$digits),
#'              # round(x$ad.sampleDist@test$statistic, x$digits),
#'              ")  \n",
#'              "Kolmogorov-Smirnof: ", formatPvalue(x$ks.sampleDist$p.value, x$digits + 1),
#'              " (D=", round(x$ks.sampleDist$statistic, x$digits), ")"));
#'
#'   cat0("\n\n", headerPrefix, " Sampling distribution of the mean\n\n");
#'   cat(paste0("Sampling distribution of ", x$samples, " samples of n=", x$sampleSize, "  \n",
#'              "Mean=", round(mean(x$samplingDistribution), x$digits),
#'              ", median=", round(stats::median(x$samplingDistribution), x$digits),
#'              ", SD=", round(sqrt(stats::var(x$samplingDistribution)), x$digits),
#'              "\n\n"));
#'   pander(x$dataShape.samplingDist, extraNotification=FALSE);
#'   cat(paste0("\n\n", sw.samplingDist, "  \n",
#'              "Anderson-Darling: ",
#'              formatPvalue(x$ad.samplingDist$p.value, x$digits + 1),
#'              # formatPvalue(x$ad.samplingDist@test$p.value, x$digits + 1),
#'              " (A=",
#'              round(x$ad.samplingDist$statistic, x$digits),
#'              # round(x$ad.samplingDist@test$statistic, x$digits),
#'              ")  \n",
#'              "Kolmogorov-Smirnof: ", formatPvalue(x$ks.samplingDist$p.value, x$digits + 1),
#'              " (D=", round(x$ks.samplingDist$statistic, x$digits), ")"));
#'   cat("\n\n\n");
#'   ### Plots
#'   if (!suppressPlot) {
#'     gridExtra::grid.arrange(x$plot.sampleDist,
#'                             x$plot.samplingDist,
#'                             x$qqPlot.sampleDist,
#'                             x$qqPlot.samplingDist,
#'                             ncol=2);
#'   }
#'   invisible();
#' }

Try the rosetta package in your browser

Any scripts or data that you put into this service are public.

rosetta documentation built on March 7, 2023, 7:40 p.m.