Nothing
#' #' normalityAssessment and samplingDistribution
#' #'
#' #' normalityAssessment can be used to assess whether a variable and the
#' #' sampling distribution of its mean have an approximately normal distribution.
#' #'
#' #' samplingDistribution is a convenient wrapper for normalityAssessment that
#' #' makes it easy to quickly generate a sample and sampling distribution from
#' #' frequencies (or proportions).
#' #'
#' #' dataShape computes the skewness and kurtosis.
#' #'
#' #'
#' #' normalityAssessment provides a number of normality tests and draws
#' #' histograms of the sample data and the sampling distribution of the mean
#' #' (most statistical tests assume the latter is normal, rather than the first;
#' #' normality of the sample data guarantees normality of the sampling
#' #' distribution of the mean, but if the sample size is sufficiently large, the
#' #' sampling distribution of the mean is approximately normal even when the
#' #' sample data are not normally distributed). Note that for the sampling
#' #' distribution, the degrees of freedom are usually so huge that the normality
#' #' tests, negligible deviations from normality will already result in very
#' #' small p-values.
#' #'
#' #' samplingDistribution makes it easy to quickly assess the distribution of a
#' #' variables based on frequencies or proportions, and dataShape computes
#' #' skewness and kurtosis.
#' #'
#' #' @aliases normalityAssessment samplingDistribution dataShape
#' #' @param sampleVector Numeric vector containing the sample data.
#' #' @param samples Number of samples to use when constructing sampling
#' #' distribution.
#' #' @param digits Number of digits to use when printing results.
#' #' @param samplingDistColor Color to use when drawing the sampling
#' #' distribution.
#' #' @param normalColor Color to use when drawing the standard normal curve.
#' #' @param samplingDistLineSize Size of the line used to draw the sampling
#' #' distribution.
#' #' @param normalLineSize Size of the line used to draw the standard normal
#' #' distribution.
#' #' @param xLabel.sampleDist Label of x axis of the distribution of the sample.
#' #' @param yLabel.sampleDist Label of y axis of the distribution of the sample.
#' #' @param xLabel.samplingDist Label of x axis of the sampling distribution.
#' #' @param yLabel.samplingDist Label of y axis of the sampling distribution.
#' #' @param xLabs,yLabs The axis labels for the three plots (should be vectors of
#' #' three elements; the first specifies the X or Y axis label for the rightmost
#' #' plot (the histogram), the second for the middle plot (the QQ plot), and the
#' #' third for the rightmost plot (the box plot).
#' #' @param popValues The possible values (levels) of the relevant variable. For
#' #' example, for a dichotomous variable, this can be "c(1:2)" (or "c(1, 2)").
#' #' Note that samplingDistribution is for manually specifying the frequency
#' #' distribution (or proportions); if you have a vector with 'raw' data, just
#' #' call normalityAssessment directly.
#' #' @param popFrequencies The frequencies corresponding to each value in
#' #' popValues; must be in the same order! See the examples.
#' #' @param sampleSize Size of the sample; the sum of the frequencies if not
#' #' specified.
#' #' @param na.rm Whether to remove missing data first.
#' #' @param type Type of skewness and kurtosis to compute; either 1 (g1 and g2),
#' #' 2 (G1 and G2), or 3 (b1 and b2). See Joanes & Gill (1998) for more
#' #' information.
#' #' @param conf.level Confidence of confidence intervals.
#' #' @param plots Whether to display plots.
#' #' @param qqCI Whether to show the confidence interval for the QQ plot.
#' #' @param labelOutliers Whether to label outliers with their row number in the
#' #' box plot.
#' #' @param sampleFromPop If true, the sample vector is created by sampling from
#' #' the population information specified; if false, rep() is used to generate
#' #' the sample vector. Note that is proportions are supplied in popFrequencies,
#' #' sampling from the population is necessary!
#' #' @param sampleSizeOverride Whether to use the sample size of the sample as
#' #' sample size for the sampling distribution, instead of the sampling
#' #' distribution size. This makes sense, because otherwise, the sample size and
#' #' thus sensitivity of the null hypothesis significance tests is a function of
#' #' the number of samples used to generate the sampling distribution.
#' #' @param extraNotification Whether to be particularly informative.
#' #' @param headerPrefix A prefix to insert before the heading (e.g. to use Markdown
#' #' headings).
#' #' @param suppressPlot Whether to suppress (`TRUE`) or print (`FALSE`) the plot.
#' #' @param x The object to print/pander.
#' #' @param ... Additional arguments are passed on, usually to the default methods.
#' #' @return
#' #'
#' #' An object with several results, the most notably of which are:
#' #' \item{plot.sampleDist}{Histogram of sample distribution}
#' #' \item{sw.sampleDist}{Shapiro-Wilk normality test of sample distribution}
#' #' \item{ad.sampleDist}{Anderson-Darling normality test of sample distribution}
#' #' \item{ks.sampleDist}{Kolmogorov-Smirnof normality test of sample
#' #' distribution} \item{kurtosis.sampleDist}{Kurtosis for sample distribution}
#' #' \item{skewness.sampleDist}{Skewness for sample distribution}
#' #' \item{plot.samplingDist}{Histogram of sampling distribution}
#' #' \item{sw.samplingDist}{Shapiro-Wilk normality test of sampling distribution}
#' #' \item{ad.samplingDist}{Anderson-Darling normality test of sampling
#' #' distribution} \item{ks.samplingDist}{Kolmogorov-Smirnof normality test of
#' #' sampling distribution} \item{dataShape.samplingDist}{Skewness and kurtosis
#' #' for sampling distribution}
#' #' @keywords utilities
#' #' @rdname normalityAssessment
#' #' @examples
#' #'
#' #' ### Note: the 'not run' is simply because running takes a lot of time,
#' #' ### but these examples are all safe to run!
#' #' \dontrun{
#' #'
#' #' normalityAssessment(rnorm(35));
#' #'
#' #' ### Create a distribution of three possible values and
#' #' ### show the sampling distribution for the mean
#' #' popValues <- c(1, 2, 3);
#' #' popFrequencies <- c(20, 50, 30);
#' #' sampleSize <- 100;
#' #' samplingDistribution(popValues = popValues,
#' #' popFrequencies = popFrequencies,
#' #' sampleSize = sampleSize);
#' #'
#' #' ### Create a very skewed distribution of ten possible values
#' #' popValues <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
#' #' popFrequencies <- c(2, 4, 8, 6, 10, 15, 12, 200, 350, 400);
#' #' samplingDistribution(popValues = popValues,
#' #' popFrequencies = popFrequencies,
#' #' sampleSize = sampleSize, digits=5);
#' #' }
#' #'
#' #' @export normalityAssessment
#' normalityAssessment <- function(sampleVector, samples = 10000, digits=2,
#' samplingDistColor = "#2222CC",
#' normalColor = "#00CC00",
#' samplingDistLineSize = 2,
#' normalLineSize = 1,
#' xLabel.sampleDist = NULL,
#' yLabel.sampleDist = NULL,
#' xLabel.samplingDist = NULL,
#' yLabel.samplingDist = NULL,
#' sampleSizeOverride = TRUE) {
#'
#' ### Create object for returning results
#' res <- list(sampleVector.raw = sampleVector,
#' sampleVector = sampleVector[stats::complete.cases(sampleVector)],
#' sampleSize = length(sampleVector[stats::complete.cases(sampleVector)]),
#' samples = samples,
#' digits = digits);
#'
#' ### Construct temporary dataset for
#' ### plotting sample distribution
#' normalX <- c(seq(min(res$sampleVector), max(res$sampleVector),
#' by=(max(res$sampleVector) - min(res$sampleVector))/(res$sampleSize-1)));
#' normalY <- stats::dnorm(normalX, mean=mean(res$sampleVector),
#' sd=stats::sd(res$sampleVector));
#' sampleDistY <- res$sampleVector;
#' tempDat <- data.frame(normalX = normalX, normalY = normalY, sampleDist = sampleDistY);
#' tempBinWidth <- (max(res$sampleVector) - min(res$sampleVector)) / 30;
#'
#' ### Generate labels if these weren't specified
#' if (is.null(xLabel.sampleDist)) {
#' xLabel.sampleDist <- ufs::extractVarName(deparse(substitute(sampleVector)));
#' }
#' if (is.null(yLabel.sampleDist)) {
#' yLabel.sampleDist <- paste0('Frequencies for n=', res$sampleSize);
#' }
#'
#' ### Plot sample distribution
#' res$plot.sampleDist <- normalHist(tempDat$sampleDist,
#' xLabel=xLabel.sampleDist,
#' yLabel=yLabel.sampleDist,
#' distributionColor=samplingDistColor,
#' normalColor=normalColor,
#' distributionLineSize=samplingDistLineSize,
#' normalLineSize=normalLineSize)$plot +
#' ggplot2::ggtitle("Sample distribution");
#'
#' res$qqPlot.sampleDist <- ggqq(tempDat$sampleDist);
#'
#' ### Take 'samples' samples of sampleSize people and store the means
#' ### (first generate an empty vector to store the means)
#' res$samplingDistribution <- replicate(samples,
#' mean(sample(res$sampleVector,
#' size=res$sampleSize,
#' replace=TRUE)));
#' # res$samplingDistribution <- c();
#' # for (i in 1:samples) {
#' # res$samplingDistribution[i] <- mean(sample(res$sampleVector, size=res$sampleSize,
#' # replace=TRUE));
#' # }
#'
#' ### Construct temporary dataset for
#' ### plotting sampling distribution
#' normalX <- c(seq(min(res$samplingDistribution), max(res$samplingDistribution),
#' by=(max(res$samplingDistribution) - min(res$samplingDistribution))/(res$samples-1)));
#' normalY <- stats::dnorm(normalX, mean=mean(res$samplingDistribution),
#' sd=stats::sd(res$samplingDistribution));
#' samplingDistY <- res$samplingDistribution;
#' tempDat <- data.frame(normalX = normalX, normalY = normalY, samplingDist = samplingDistY);
#' tempBinWidth <- (max(res$samplingDistribution) - min(res$samplingDistribution)) / 30;
#'
#' ### Generate labels if these weren't specified
#' if (is.null(xLabel.samplingDist)) {
#' xLabel.samplingDist <- ufs::extractVarName(deparse(substitute(sampleVector)));
#' }
#' if (is.null(yLabel.samplingDist)) {
#' yLabel.samplingDist <- paste0('Frequencies for ', res$samples, ' samples of n=', res$sampleSize);
#' }
#'
#' ### Plot sampling distribution
#' res$plot.samplingDist <- normalHist(tempDat$samplingDist,
#' xLabel=xLabel.samplingDist,
#' yLabel=yLabel.samplingDist,
#' distributionColor=samplingDistColor,
#' normalColor=normalColor,
#' distributionLineSize=samplingDistLineSize,
#' normalLineSize=normalLineSize)$plot +
#' ggplot2::ggtitle("Sampling distribution");
#'
#' res$qqPlot.samplingDist <- ggqq(tempDat$samplingDist,
#' sampleSizeOverride = res$sampleSize);
#'
#' ### Shapiro Wilk test - if there are more than 5000
#' ### datapoints, only use the first 5000 datapoints
#' res$sw.sampleDist <- ifelseObj(res$sampleSize > 5000,
#' stats::shapiro.test(res$sampleVector[1:5000]),
#' stats::shapiro.test(res$sampleVector));
#' res$sw.samplingDist <- ifelseObj(res$samples > 5000,
#' stats::shapiro.test(res$samplingDistribution[1:5000]),
#' stats::shapiro.test(res$samplingDistribution));
#'
#' ### Anderson-Darling test
#' res$ad.sampleDist <- ad.test_from_nortest(res$sampleVector);
#' res$ad.samplingDist <- ad.test_from_nortest(res$samplingDistribution);
#'
#' ### Kolomogorov-Smirnof test
#' suppressWarnings(res$ks.sampleDist <-
#' stats::ks.test(res$sampleVector,
#' "pnorm",
#' alternative = "two.sided"));
#' suppressWarnings(res$ks.samplingDist <-
#' stats::ks.test(res$samplingDistribution,
#' "pnorm",
#' alternative = "two.sided"));
#'
#' ### Skewness and kurtosis
#' res$dataShape.sampleDist <- dataShape(res$sampleVector, plots=FALSE);
#' res$dataShape.samplingDist <- dataShape(res$samplingDistribution,
#' sampleSizeOverride=ifelse(sampleSizeOverride,
#' length(res$sampleVector),
#' NULL),
#' plots=FALSE);
#'
#' ### Set class for returnable object and return it
#' class(res) <- 'normalityAssessment';
#' return(res);
#'
#' }
#'
#' #' @method print normalityAssessment
#' #' @rdname normalityAssessment
#' #' @export
#' print.normalityAssessment <- function (x, ...) {
#'
#' if (x$sampleSize > 5000) {
#' sw.sampleDist <- paste0("Shapiro-Wilk: p=", round(x$sw.sampleDist$p.value, x$digits),
#' " (W=", round(x$sw.sampleDist$statistic, x$digits),
#' "; NOTE: based on the first 5000 of ",
#' x$sampleSize, " observations)");
#' }
#' else {
#' sw.sampleDist <- paste0("Shapiro-Wilk: p=", round(x$sw.sampleDist$p.value, x$digits),
#' " (W=", round(x$sw.sampleDist$statistic, x$digits),
#' "; based on ", x$sampleSize, " observations)");
#' }
#'
#' if (x$samples > 5000) {
#' sw.samplingDist <- paste0("Shapiro-Wilk: p=", round(x$sw.samplingDist$p.value, x$digits),
#' " (W=", round(x$sw.samplingDist$statistic, x$digits),
#' "; NOTE: based on the first 5000 of ",
#' x$samples, " observations)");
#' }
#' else {
#' sw.samplingDist <- paste0("Shapiro-Wilk: p=", round(x$sw.samplingDist$p.value, x$digits),
#' " (W=", round(x$sw.samplingDist$statistic, x$digits),
#' "; based on ", x$samples, " observations)");
#' }
#'
#' ### Show output
#' cat("## SAMPLE DISTRIBUTION ###\n");
#' cat(paste0("Sample distribution of ", x$sampleSize,
#' " observations\n",
#' "Mean=", round(mean(x$sampleVector), x$digits),
#' ", median=", round(stats::median(x$sampleVector), x$digits),
#' ", SD=", round(stats::sd(x$sampleVector), x$digits),
#' ", and therefore SE of the mean = ",
#' round(stats::sd(x$sampleVector)/sqrt(x$sampleSize), x$digits),
#' "\n\n"));
#' print(x$dataShape.sampleDist, extraNotification=FALSE);
#' cat(paste0("\n", sw.sampleDist, "\n",
#' "Anderson-Darling: p=",
#' round(x$ad.sampleDist$p.value, x$digits),
#' # round(x$ad.sampleDist@test$p.value, x$digits),
#' " (A=",
#' round(x$ad.sampleDist$statistic, x$digits),
#' # round(x$ad.sampleDist@test$statistic, x$digits),
#' ")\n",
#' "Kolmogorov-Smirnof: p=", round(x$ks.sampleDist$p.value, x$digits),
#' " (D=", round(x$ks.sampleDist$statistic, x$digits), ")"));
#'
#' cat("\n\n## SAMPLING DISTRIBUTION FOR THE MEAN ###\n");
#' cat(paste0("Sampling distribution of ", x$samples, " samples of n=", x$sampleSize, "\n",
#' "Mean=", round(mean(x$samplingDistribution), x$digits),
#' ", median=", round(stats::median(x$samplingDistribution), x$digits),
#' ", SD=", round(sqrt(stats::var(x$samplingDistribution)), x$digits),
#' "\n\n"));
#' print(x$dataShape.samplingDist, extraNotification=FALSE);
#' cat(paste0("\n", sw.samplingDist, "\n",
#' "Anderson-Darling: p=",
#' round(x$ad.samplingDist$p.value, x$digits),
#' # round(x$ad.samplingDist@test$p.value, x$digits),
#' " (A=",
#' round(x$ad.samplingDist$statistic, x$digits),
#' # round(x$ad.samplingDist@test$statistic, x$digits),
#' ")\n",
#' "Kolmogorov-Smirnof: p=", round(x$ks.samplingDist$p.value, x$digits),
#' " (D=", round(x$ks.samplingDist$statistic, x$digits), ")"));
#'
#' ### Plots
#' gridExtra::grid.arrange(x$plot.sampleDist,
#' x$plot.samplingDist,
#' x$qqPlot.sampleDist,
#' x$qqPlot.samplingDist,
#' ncol=2);
#'
#' invisible();
#' }
#'
#' #' @method pander normalityAssessment
#' #' @rdname normalityAssessment
#' #' @export
#' pander.normalityAssessment <- function (x, headerPrefix = "#####",
#' suppressPlot = FALSE, ...) {
#'
#' if (x$sampleSize > 5000) {
#' sw.sampleDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.sampleDist$p.value, x$digits + 1),
#' " (W=", round(x$sw.sampleDist$statistic, x$digits),
#' "; NOTE: based on the first 5000 of ",
#' x$sampleSize, " observations)");
#' }
#' else {
#' sw.sampleDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.sampleDist$p.value, x$digits + 1),
#' " (W=", round(x$sw.sampleDist$statistic, x$digits),
#' "; based on ", x$sampleSize, " observations)");
#' }
#'
#' if (x$samples > 5000) {
#' sw.samplingDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.samplingDist$p.value, x$digits + 1),
#' " (W=", round(x$sw.samplingDist$statistic, x$digits),
#' "; NOTE: based on the first 5000 of ",
#' x$samples, " observations)");
#' }
#' else {
#' sw.samplingDist <- paste0("Shapiro-Wilk: ", formatPvalue(x$sw.samplingDist$p.value, x$digits + 1),
#' " (W=", round(x$sw.samplingDist$statistic, x$digits),
#' "; based on ", x$samples, " observations)");
#' }
#'
#' ### Show output
#' cat0("\n\n\n", headerPrefix, " Sample distribution\n\n");
#' cat(paste0("Sample distribution of ", x$sampleSize,
#' " observations \n",
#' "Mean=", round(mean(x$sampleVector), x$digits),
#' ", median=", round(stats::median(x$sampleVector), x$digits),
#' ", SD=", round(stats::sd(x$sampleVector), x$digits),
#' ", and therefore SE of the mean = ",
#' round(stats::sd(x$sampleVector)/sqrt(x$sampleSize), x$digits),
#' "\n\n"));
#' pander(x$dataShape.sampleDist, extraNotification=FALSE);
#' cat(paste0("\n\n", sw.sampleDist, " \n",
#' "Anderson-Darling: ",
#' formatPvalue(x$ad.sampleDist$p.value,x$digits + 1),
#' # formatPvalue(x$ad.sampleDist@test$p.value,x$digits + 1),
#' " (A=",
#' round(x$ad.sampleDist$statistic, x$digits),
#' # round(x$ad.sampleDist@test$statistic, x$digits),
#' ") \n",
#' "Kolmogorov-Smirnof: ", formatPvalue(x$ks.sampleDist$p.value, x$digits + 1),
#' " (D=", round(x$ks.sampleDist$statistic, x$digits), ")"));
#'
#' cat0("\n\n", headerPrefix, " Sampling distribution of the mean\n\n");
#' cat(paste0("Sampling distribution of ", x$samples, " samples of n=", x$sampleSize, " \n",
#' "Mean=", round(mean(x$samplingDistribution), x$digits),
#' ", median=", round(stats::median(x$samplingDistribution), x$digits),
#' ", SD=", round(sqrt(stats::var(x$samplingDistribution)), x$digits),
#' "\n\n"));
#' pander(x$dataShape.samplingDist, extraNotification=FALSE);
#' cat(paste0("\n\n", sw.samplingDist, " \n",
#' "Anderson-Darling: ",
#' formatPvalue(x$ad.samplingDist$p.value, x$digits + 1),
#' # formatPvalue(x$ad.samplingDist@test$p.value, x$digits + 1),
#' " (A=",
#' round(x$ad.samplingDist$statistic, x$digits),
#' # round(x$ad.samplingDist@test$statistic, x$digits),
#' ") \n",
#' "Kolmogorov-Smirnof: ", formatPvalue(x$ks.samplingDist$p.value, x$digits + 1),
#' " (D=", round(x$ks.samplingDist$statistic, x$digits), ")"));
#' cat("\n\n\n");
#' ### Plots
#' if (!suppressPlot) {
#' gridExtra::grid.arrange(x$plot.sampleDist,
#' x$plot.samplingDist,
#' x$qqPlot.sampleDist,
#' x$qqPlot.samplingDist,
#' ncol=2);
#' }
#' invisible();
#' }
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.