In mascaretti/moxier: An Introduction To Statistics With R

library(knitr)
library(learnr)
knitr::opts_chunk$set(echo = TRUE, exercise = FALSE)

Hypothesis Testing and Anova

Two-Sample Z tests for comparing the means (unpaired)

Data Import

We import the data

t1 <- moxier::barcellona
t2 <- moxier::milano

p <- dim(t1)[2]
n1 <- dim(t1)[1]
n2 <- dim(t2)[1]

and plot them.

plot(rbind(t1, t2), col = c(rep("red", n1), rep("black", n2)))

Hypothesis

We select the level of significance

alpha <- 0.01

and the null hypothesis.

delta.mean.H0 <- 0

We then compute the estimate of the difference between the means

(delta.mean.est <- mean(t1[, "Temperatura"]) - mean(t2[, "Temperatura"]))

and estimate the standard deviation (if it is not known)

(sd.est <- sqrt(var(t1[, "Temperatura"]) / n1 + var(t2[, "Temperatura"]) / n2))

We compute the Z-test statistic

(zstat <- (delta.mean.est - delta.mean.H0) / sd.est)

and compute the boundaries of the rejection region.

(cfr.z <- qnorm(1 - alpha / 2))

We check if this leads to the rejection of $H_{0}$ or not

abs(zstat) > cfr.z

and compute the p-value.

(pvalue <- ifelse(zstat >= 0, (1 - pnorm(zstat)) * 2, pnorm(zstat) * 2))

We finally compute the two-sample Z intervals for comparing the means (unpaired)

CI <- c(
  inf = delta.mean.est - sd.est * qnorm(1 - alpha / 2),
  center = delta.mean.est,
  sup = delta.mean.est + sd.est * qnorm(1 - alpha / 2)
)
kable(CI)

Multivariate two-sample tests for comparing the means (unpaired)

We consider a component-wise testing approach: an adjusted two-sample Z test for comparing the means (unpaired).

We select the level of signficance,

alpha <- 0.01

the null hypothesism

delta.mean.H0 <- c(0, 0, 0)

we compute three estimates of the three differences of the means.

(delta.mean.est <- colMeans(t1) - colMeans(t2))

After that, we estiamte the standard deviations of the three estimates (if unkwonw),

(sd.est <- sqrt(apply(t1, 2, var) / n1 + apply(t2, 2, var) / n2))

we compute the Z-test statistics,

(zstat <- (delta.mean.est - delta.mean.H0) / sd.est)

adjust the level of signficance (e.g. Bonferroni adjustment);

(alpha.adj <- alpha / p)

we compute the boundaries of the rejection regions,

(cfr.z <- qnorm(1 - alpha.adj / 2))

check if the boundaries lead to the rejection of $H_0$ or not,

abs(zstat) > cfr.z

compute the p-value,

(pvalue <- ifelse(zstat >= 0, (1 - pnorm(zstat)) * 2, pnorm(zstat) * 2))

the adjusted p-value,

(pvalue.adj <- pmin(pvalue * p, 1))

and finally show the adjusted two-sample Z intervals for comparing the means (unpaird) (e.g. Bonferroni intervals)

CI <- cbind(
  inf = delta.mean.est - sd.est * qnorm(1 - alpha.adj / 2),
  center = delta.mean.est,
  sup = delta.mean.est + sd.est * qnorm(1 - alpha.adj / 2)
)
kable(CI)

Global Testing Approach (Hotelling's test)

We now approach the issue by using a global testing approach, namely Hotelling's test.

We select the level of signficance,

alpha <- 0.01

choose the null hypothesis (a vector in $\mathbb{R}^3$).

delta.mean.H0 <- c(0, 0, 0)

compute three estimates of the the three differences of the means (again, a vector in $\mathbb{R}^3$),

(delta.mean.est <- colMeans(t1) - colMeans(t2))

we estimate the covariance matrix (a matrix in $\mathbb{R}^{3 \times 3}$).

(cov.est <- cov(t1) / n1 + cov(t2) / n2)

We then compute the $\chi^{2}$-test statistic,

(chi2stat <- (delta.mean.est - delta.mean.H0) %*% solve(cov.est) %*% (delta.mean.est - delta.mean.H0))

compute the boundary of the elliptical rejection region,

(cfr.chi2 <- qchisq(1 - alpha, p))

check if this leads to the rejection of $H_{0}$ or not

chi2stat > cfr.chi2

and compute the p-value.

(pvalue <- 1 - pchisq(chi2stat, p))

Moreover, we show the $\chi^2$ confide:nce region for comparing the means (unpaired). They can't be visualised whenever $p > 3$. We are interested in checking the shades along the three axes (i.e., simultaneous confidence intervals)

CI.sim <- cbind(
  inf = delta.mean.est - sd.est * sqrt(qchisq(1 - alpha, p)),
  center = delta.mean.est,
  sup = delta.mean.est + sd.est * sqrt(qchisq(1 - alpha, p))
)
kable(CI.sim)

One-Way ANOVA

Introduction

We load the data. We have $p = 1$ and $g = 6$.

chickwts
summary(chickwts)

We build the vector of responses (e.g. weight) and the vector of treatment levels (e.g. feed type)

response <- chickwts[, "weight"]
treatment <- chickwts[, "feed"]

We compute the sample size,

n <- length(response)

the treatment group sample sizes

ng <- table(treatment)

and the number of treatments.

treatment.levels <- levels(treatment)
g <- length(treatment.levels)

We also plot the treatment groups.

boxplot(response ~ treatment)

Computing ANOVA

We wish to test for the equality of the $g$ means of the treatment groups:

$H_{0}: \, \tau_{1} = \tau_{2} = \tau_{3} = \tau_{4} = 0$
$H_{1}: \, H_{0}^c$

We compute the estimates of the $g$ means under the null hypothesis,

rep(mean(response), g)

and we compute the estimates of the six means under the alternative hypothesis. Note that these are valid (albeit sub-optimal) estimates even under the null hypothesis.

tapply(response, treatment, mean)

We perform the ANOVA decomposition.

fit <- aov(response ~ treatment)
summary(fit)

And we perform pairwise comparision (e.g. Bonferroni). We need to make $g \dot \frac{g - 1}{2}$ comparisons.

mean.est.matrix <- matrix(NA, nrow = g, ncol = g, dimnames = list(treatment.levels, treatment.levels))
zstat.matrix <- matrix(NA, nrow = g, ncol = g, dimnames = list(treatment.levels, treatment.levels))
pvalue.matrix <- matrix(NA, nrow = g, ncol = g, dimnames = list(treatment.levels, treatment.levels))
pvalue.adj.matrix <- matrix(NA, nrow = g, ncol = g, dimnames = list(treatment.levels, treatment.levels))
for (i in 1:(g - 1)) {
  for (j in (i + 1):g) {
    test.ij <- t.test(response[treatment == treatment.levels[i]], response[treatment == treatment.levels[j]])
    mean.est.matrix[i, j] <- diff(test.ij$estimate)
    zstat.matrix[i, j] <- test.ij$statistic
    pvalue.matrix[i, j] <- test.ij$p.value
    pvalue.adj.matrix[i, j] <- min(test.ij$p.value * (g * (g - 1) / 2), 1)
  }
}

We now plot the pairwise estimated differences of the means,

(mean.est.matrix)

the pairwise estimated test statistics,

(zstat.matrix)

the pairwise p-values

(pvalue.matrix)

and the pairwise adjusted p-values.

(pvalue.adj.matrix)
pvalue.adj.matrix < 0.05

Two-way ANOVA

We compute a two-way ANOVA for the following problem:

Response: Distance per fuel litre [km/l]
Factor 1: Gasoline Supplier (0: Esso, 1: Shell)
Factor 2: Gasoline Type (0: 95, 1: 98)

We set the variables.

km <- c(18.7, 16.8, 20.1, 22.4, 14.0, 15.2, 22.0, 23.3)
gaz.supplier <- factor(c("Esso", "Esso", "Esso", "Esso", "Shell", "Shell", "Shell", "Shell"))
gaz.type <- factor(c("95", "95", "98", "98", "95", "95", "98", "98"))
gaz.supplier.type <- factor(c("Esso95", "Esso95", "Esso98", "Esso98", "Shell95", "Shell95", "Shell98", "Shell98"))

We estimate the means under the hypothesis of no effect of supplier nor type.

(M <- mean(km))

We estimate the means under the hypothesis of no effect of type only.

(Md.supplier <- tapply(km, gaz.supplier, mean))

We estimate the means under the hypothesis of no effect of supplier only.

(M.type <- tapply(km, gaz.type, mean))

We estimate the means under the hypothesis of an effect of both supplier and type.

(M.supplier.type <- tapply(km, gaz.supplier.type, mean))

And perform the analysis.

summary.aov(aov(km ~ gaz.supplier + gaz.type + gaz.supplier:gaz.type))

summary.aov(aov(km ~ gaz.supplier + gaz.type))

summary.aov(aov(km ~ gaz.type))