$\$

Overview

$\$

# install.packages("latex2exp")

library(latex2exp)

#knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width=6, fig.height=4) 


set.seed(230)
# get some images and data that are used in this document
SDS230::download_data("gingko_RCT.rda")
SDS230::download_data("alcohol.rda")
SDS230::download_image("gingko_pills.jpg")
# a function to get the MAD statistic
get_MAD_stat <- function(quantitative_data, grouping_data) {

 # we can use the by() function to get the means separately for each group
 group_means <- as.vector(by(quantitative_data, grouping_data, mean, na.rm = TRUE))

  total <- 0 
  num_group_pairs <- 0

  for (iGroup1 in 1:(length(group_means) - 1)) {
    for (iGroup2 in (iGroup1 + 1):(length(group_means))) {

      total <- total + abs(group_means[iGroup1] - group_means[iGroup2])
      num_group_pairs <- num_group_pairs + 1

    }
  }

  total/num_group_pairs

}  # end of the MAD function 

$\$

Part 1: Randomization test for comparing two means in R

Let's us examine the randomized controlled trial experiment by Solomon et al (2002) to see if there is evidence that taking a gingko pills improves memory. To read the original paper see: https://jamanetwork.com/journals/jama/fullarticle/195207

Step 1: State the null and alternative hypotheses

$H_0: \mu_{gingko} - \mu_{control} = 0$ $H_A: \mu_{gingko} - \mu_{control} \ne 0$

$\alpha = 0.05$

$\$

Step 2a: Plot the data

load("gingko_RCT.rda")


# plot the data
boxplot(gingko, placebo, 
        names = c("Gingko", "Placebo"),
        ylab = "Memory score")



# create a stripchart
data_list <- list(gingko, placebo) 


stripchart(data_list, 
           group.names = c("Gingko", "Placebo"), 
           method = "jitter",
           xlab = "Memory score", 
           col = c("red", "blue"))

$\$

Step 2b: Calculate the observed statistic

(obs_stat <- mean(gingko) - mean(placebo))

$\$

Step 3: Create the null distribution

# combine the data from the treatment and placebo groups together
combined_data <- c(gingko, placebo)
n_gingko <- length(gingko)
total <- length(combined_data)


# use a for loop to create shuffled treatment and placebo groups and shuffled statistics 
null_distribution <-  NULL
for (i in 1:10000) {

       # shuffle data
        shuff_data <- sample(combined_data)

        # create fake treatment and control groups
        shuff_gingko   <-  shuff_data[1:n_gingko]
        shuff_placebo  <-  shuff_data[(n_gingko + 1):total]

        # save the statistic of interest
        null_distribution[i] <- mean(shuff_gingko) - mean(shuff_placebo)


}


# plot the null distribution as a histogram
hist(null_distribution, 
     breaks = 20,
     main = "Null distribution", 
     xlab = TeX("$\\bar{x}_{shuff-gingko} - \\bar{x}_{shuff-placebo}$"))

$\$

Step 4: Calculate a p-value

# plot the null distribution again with a red line a the value of the observed statistic
hist(null_distribution, 
     breaks = 20,
     main = "Null distribution", 
     xlab = TeX("$\\bar{x}_{shuff-gingko} - \\bar{x}_{shuff-placebo}$"))


abline(v = obs_stat, col = "red")



# calculate the p-value
(p_value_left_tail <- sum(null_distribution <= obs_stat)/10000)
(p_value_right_tail <- sum(null_distribution >= abs(obs_stat))/10000)

(p_value <- p_value_left_tail + p_value_right_tail)

$\$

Step 5: Make a decision

Since r p_value is greater than $\alpha = 0.05$ we can not reject the null hypothesis. Thus if we are using the Neyman-Pearson paradigm, we do not have sufficient evidence to say that the pill is effective.

$\$

$\$

Part 2: Permutation test comparing more than two means in R

Is there differences between the average amount of beer that countries in different continents drink? Let's explore this more below!

Note: This question was inspired by these sources: - https://blog.minitab.com/en/michelle-paret/what-is-anova-and-who-drinks-the-most-beer - https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/

Please check out those websites if you are interested in reading more!

$\$

Step 1: State the null and alternative hypotheses

$H_0: \mu_{Africa} = \mu_{Asia} = \mu_{Europe} = \mu_{North America} = \mu_{Oceania} = \mu_{South America}$

$H_A: \mu_{i} \ne \mu_{j}$ for some groups i and j

$\alpha = 0.05$

$\$

Step 2: Calculate the statistic of interest and visualize the data

# SDS230::download_data("alcohol.rda")

load("alcohol.rda")

# how many countries are in each continent
table(alcohol$continent)

# look at the average number of beers consumed per capita in each country
by(alcohol$beer_servings, alcohol$continent, mean)


# view a box plot of the data
boxplot(beer_servings ~ continent, data = alcohol)


# use the get_MAD_stat() defined at the top of this R Markdown document to get the observed statistic
(obs_stat <- get_MAD_stat(alcohol$beer_servings, alcohol$continent))

$\$

Step 3: Create a null distribution

To create a null distribution, we need to create a statistic consistent with the null hypothesis, and then repeat the process many times.

We can create a MAD statistic consistent with the null hypothesis by shuffling the continent names and then recalculate the MAD statistic on the shuffled data. To shuffle the data we can use the sample() function.

Because calculating the MAD statistic is slow, let's only create 1,000 points in our null distribution.

# create the null distribution
null_dist <- NULL
for (i in 1:1000) {

  null_dist[i] <- get_MAD_stat(alcohol$beer_servings, sample(alcohol$continent))

}

$\$

Step 4: Calculate the p-value

# visualize the null distribution
hist(null_dist, xlim = c(0, 80))
abline(v = obs_stat, col = "red")


# calculate the p-value
(pval <- sum(null_dist >= obs_stat)/length(null_dist))

$\$

Step 5: Conclusion

The p-value is less than our significance level of $alpha = 0.05$ so we reject the null hypothesis and conclude that there are differences in alcohol consumption between different continents.

$\$

What is the underlying population/random process we are making claims about???

Every time you do an analysis that involves statistical inference (e.g., every time you create a confidence interval, run a hypothesis tests, etc.) you are trying to answer a question about some larger underlying population or random process. For the hypothesis test run here that looks at the relationship between beer consumption and continents, what is the underlying population/process we are trying to make inferences about?



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.