$\$

Overview

$\$

# install.packages("latex2exp")

library(latex2exp)

#knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width=6, fig.height=4) 


set.seed(230)
# get some images and data that are used in this document
SDS230::download_data("gingko_RCT.rda")
SDS230::download_data("alcohol.rda")
SDS230::download_image("gingko_pills.jpg")
# a function to get the MAD statistic
get_MAD_stat <- function(quantitative_data, grouping_data) {

 # we can use the by() function to get the means separately for each group
 group_means <- as.vector(by(quantitative_data, grouping_data, mean, na.rm = TRUE))

  total <- 0 
  num_group_pairs <- 0

  for (iGroup1 in 1:(length(group_means) - 1)) {
    for (iGroup2 in (iGroup1 + 1):(length(group_means))) {

      total <- total + abs(group_means[iGroup1] - group_means[iGroup2])
      num_group_pairs <- num_group_pairs + 1

    }
  }

  total/num_group_pairs

}  # end of the MAD function 

$\$

Part 1: Randomization test for comparing two means in R

Let's us examine the randomized controlled trial experiment by Solomon et al (2002) to see if there is evidence that taking a gingko pills improves memory. To read the original paper see: https://jamanetwork.com/journals/jama/fullarticle/195207

Step 1: State the null and alternative hypotheses

$H_0: \mu_{Gingko} - \mu_{Placebo} = 0$ $H_A: \mu_{Gingko} - \mu_{Placebo} \ne 0$

$\alpha = 0.05$

$\$

Step 2a: Plot the data

load("gingko_RCT.rda")



# plot the data
boxplot(gingko, placebo, 
        names = c("Gingko", "Placebo"))



# create a stripchart
data_list <- list(gingko, placebo)

stripchart(data_list,
           group.names = c("Gingko", "Placebo"),
           col = c("red", "blue"),
           method = "jitter")

$\$

Step 2b: Calculate the observed statistic

(obs_stat <- mean(gingko) - mean(placebo))

$\$

Step 3: Create the null distribution

# combine the data from the treatment and placebo groups together
combo_data <- c(gingko, placebo)
n_gingko <- length(gingko)
n_total <- length(combo_data)


# use a for loop to create shuffled treatment and placebo groups and shuffled statistics 
null_dist <- NULL
for (i in 1:10000) {

       # shuffle data
        shuff_data <- sample(combo_data)

        # create fake treatment and control groups
        shuff_treat <- shuff_data[1:n_gingko]
        shuff_control <- shuff_data[(n_gingko + 1):n_total]

        # save the statistic of interest
        null_dist[i] <- mean(shuff_treat) - mean(shuff_control)

}
# plot the null distribution as a histogram

hist(null_dist, breaks = 20)
abline(v = obs_stat, col = "red")
abline(v = -1 * obs_stat, col = "red")

$\$

Step 4: Calculate a p-value

# plot the null distribution again with a red line a the value of the observed statistic




# calculate the p-value

(pval_left <- sum(null_dist <= obs_stat)/length(null_dist))
(pval_right <- mean(null_dist >= -1 * obs_stat))

(pval <- pval_left + pval_right)

$\$

Step 5: Make a decision

$\$

$\$

Part 2: Permutation test comparing more than two means in R

Is there differences between the average amount of beer that countries in different continents drink? Let's explore this more below!

Note: This question was inspired by these sources: - https://blog.minitab.com/en/michelle-paret/what-is-anova-and-who-drinks-the-most-beer - https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/

Please check out those websites if you are interested in reading more!

$\$

Step 1: State the null and alternative hypotheses

$\$

Step 2: Calculate the statistic of interest and visualize the data

load("alcohol.rda")



# look at the average number of beers consumed per capita in each country




# view a box plot of the data




# use the get_MAD_stat() defined at the top of this R Markdown document to get the observed statistic

$\$

Step 3: Create a null distribution

To create a null distribution, we need to create a statistic consistent with the null hypothesis, and then repeat the process many times.

We can create a MAD statistic consistent with the null hypothesis by shuffling the continent names and then recalculate the MAD statistic on the shuffled data. To shuffle the data we can use the sample() function.

Because calculating the MAD statistic is slow, let's only create 1,000 points in our null distribution.

# create the null distribution

$\$

Step 4: Calculate the p-value

# visualize the null distribution





# calculate the p-value

$\$

Step 5: Conclusion

$\$

What is the underlying population/random process we are making claims about???

Every time you do an analysis that involves statistical inference (e.g., every time you create a confidence interval, run a hypothesis tests, etc.) you are trying to answer a question about some larger underlying population or random process. For the hypothesis test run here that looks at the relationship between beer consumption and continents, what is the underlying population/process we are trying to make inferences about?



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.