$\$
Welcome to homework 10, the last homework problem set of the semester! The purpose of this homework is to practice running hypothesis tests for comparing multiple proportions (chi-squared tests) and for comparing multiple means (analysis of variance). Please submit a compiled pdf with your answers to the exercises to Gradescope by 11pm on Sunday April 21st.
Functions you might find useful are dchisq()
, pchisq()
, barplot()
, sum()
, df()
, and pf()
, along with the list of functions found on Canvas. For full credit on problems involving R, be sure to label all your figures and to "show your work" by making sure all values that are answers to questions are printed/visible in your R Markdown pdf.
As always, if you need help with any of the homework assignments, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions forum. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions. Finally, reviewing the class slides and videos will be helpful if you run into difficulties.
download.file("https://raw.githubusercontent.com/emeyers/SDS100/master/ClassMaterial/data/movies.Rda", "movies.Rda", mode = "wb")
library(SDS100) library(knitr) options(scipen=999) set.seed(100)
$\$
In the first set of exercises you will run a hypothesis test to compare counts in multiple categories. I encourage you to review the examples in the Lock5 textbook for additional practice. For all your answers be sure to "show your work" by printing the output of your calculations in the R chunks.
$\$
Rock-paper-scissors is a hand game played between two people, where people make hand gestures in the shape of either a rock, a piece of paper, or a pair of scissors. In the problem below we will analyze the first move people typically play in games of rock-paper-scissors. If you are not familiar with the game, please check out either this youtube video or the wikipedia article that explains the rules (and if you're still unsure, I am happy to discuss the rules during office hours).
The table below shows the choices made by 119 players on the first turn of a rock-paper-scissors game. A player gains an advantage in playing this game if there is evidence that the choices made on the first turn are not equally distributed among the three options. In the following set of exercises we will run a $\chi^2$ goodness-of-fit test to see if there is evidence that any of the proportions are different from 1/3.
| Option Selected | Observed Counts (O) | ------------------|---------------------| | Rock | 66 | | Paper | 39 | | Scissor | 14 | | Total | 119 |
Exercise 1.1 (5 points): As described above, in this set of exercises we will run a $\chi^2$ goodness-of-fit hypothesis test to see if there is evidence that any of the proportions for the first move in rock-paper-scissors games differ from 1/3. Let's start this hypothesis test by stating the null and alternative hypotheses using the appropriate symbols.
Answers
$\$
Exercise 1.2 (points 8): Now let's complete step 2 of hypothesis testing by calculating the $\chi^2$ statistic. To start doing this, first fill in the table below showing the counts you would expect to get for Rock, Paper and Scissors if the null hypothesis was true.
After you have filled in this table, then use the code section below to calculate the $\chi^2$ statistic. Recall that the $\chi^2$ statistic is defined as: $\chi^2 = \sum_i^k \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the $i^{th}$ observed count (from the table above exercise 1.1) and $E_i$ is the $i^{th}$ expected count (from the table you filled out below). Be sure to print out the value you get for the $\chi^2$ in the code section below as well.
Answers
| Option Selected | Expected Counts (E) | ------------------|---------------------| | Rock | fill in | | Paper | fill in | | Scissors | fill in | | Total | 119 |
$\$
Exercise 1.3 (7 points): Next do step 3 of hypothesis testing by plotting the null distribution. Start by writing down the appropriate degrees of freedom for the null distribution in this problem in the answer section. Then use the dchisq(x_vals, df)
function to plot the null chi-squared distribution for the null distribution. In particular, use the following steps to plot the null distribution.
Use the x_vals
vector below for the appropriate x-values.
Create an object called y_vals
that has the density function values by using dchisq(x_vals, df)
with the appropriate degrees of freedom.
Use plot(x_vals, y_vals)
to plot this null distribution.
Add a vertical red line at the observed $\chi^2$ statistic value you calculated in exercise 1.2.
Finally, in the answer section, write down what you expect the p-value to be based on looking at the plot you created of the null distribution.
Answers
x_vals <- seq(0, 40)
$\$
Exercise 1.4 (5 points): Now do step 4 of hypothesis testing by calculating the p-value using the pchisq(obs_stat, df)
function, and print out the p-value.
Hint: To determine which tail to use, think about what a small $\chi^2$ means and what a $\chi^2$ means with respect to the null hypothesis.
$\$
Exercise 1.5 (4 points): Based on the p-value, what do you conclude? Do these results affect how you will play rock-paper-scissors in the future?
Answer
$\$
In the next set of exercise you will run a $\chi^2$ test for association between two categorical variables. In particular, you will assess whether the age distribution for the usage of social networking sites changed between 2008 and 2010.
To address this question, the Pew Research Center, conducted a survey of randomly sampled American adults in 2008 and 2010, asking them about their use of social networking sites such as Facebook. The table below shows the age distribution for the number of people in the random sample that claimed to use social networking sites in 2008 and in 2010. For more information on this study please see this website
| Age | 2008 | 2010 | Total | -------------------|--------|---------| | 18-22 | 138 | 152 | 290 | | 23-35 | 197 | 303 | 500 | | 36-49 | 108 | 246 | 354 | | 50+ | 52 | 246 | 298 | | Total | 495 | 947 | 1442 |
$\$
Exercise 2.1 (5 points): To assess whether the age distribution for the usage of social networking sites changed between 2008 and 2010, let's start our $\chi^2$ test for association between two categorical variables with step 1 by stating the null and alternative hypotheses in words.
Answers:
$\$
Exercise 2.2 (6 points): Before going on to step 2 of calculating the observed statistic, let's first visualize the data. To do this create a vector called observed_2008
that has the age distribution data from 2008, and a vector called observed_2010
that has the age distribution data from 2010. Next create a bar plot for the data from 2008 and a bar plot for the data from 2010. In the answer section below, describe how you think the age distributions have changed (if at all) from 2008 to 2010.
Answers
$\$
Exercise 2.3 (12 points): Let's now do step 2 of our hypothesis test by calculating the observed $\chi^2$ statistic. Helpful hint: In order to do this, you will need to calculate the expected counts under the assumption that there is no difference in the age distributions between 2008 and 2010. Recall from class lecture videos, the expected counts for a given cell can be found using the formula: $Expected ~ Count ~ = \frac{R_i ~ \cdot ~ C_j}{N}$, where $R_i$ is the row total count for row i, $C_j$ is the column total count for column j, and N is the overall total count.
First, please fill in the table in the answer section below showing the expected counts (under the null hypothesis) for 2008 and 2010 for each age bracket.
Next, in the answer section below, report the degrees of freedom that should be used based on running a $\chi^2$ test of associate on tables of this size.
Next calculate the $\chi^2$ statistic using the R chunk below. One way you can do this is as follows:
Create an object called row_totals
which is the sum of the observed_2008
vector and the observed_2010
vector.
Next create an object called expected_2008
which is the row_totals
multiplied by the 2008 total count and divided by the overall total count. Similarly create a vector called expected_2010
which is the row_totals
multiplied by the 2010 total count and divided by the overall total count.
Create an object called chi_stat_2008
which has the $\chi^2$ statistic calculated for the data from 2008 using the formula: $\chi^2 = \sum_i^k \frac{(O_i - E_i)^2}{E_i}$. Hint: recalled that if you have two vectors, v1
and v2
, then writing v1 - v2
will subtract each element of v2
from the elements in v1
. Similarly, if you have a vector v
then writing v^2
will square each element in v
. The sum() function, which sums all elements of a vector, will also be useful.
Create an object called chi_stat_2010
using the same calculations as in step 3 but for the 2010 data.
Create the final $\chi^2$ statistic by adding the values in chi_stat_2008
and chi_stat_2010
together. Be sure to print the value out as well.
Answers
The expected count table:
| Age | 2008 | 2010 | Total | -------------------|--------|---------| | 18-22 | | | 290 | | 23-35 | | | 500 | | 36-49 | | | 354 | | 50+ | | | 298 | | Total | 495 | | 1442 |
$\$
Exercise 2.4 (5 points): Now calculate the p-value from the $\chi^2$ statistic and degrees of freedom you calculated in exercise 2.3 using the pchisq()
function. Print out the value of this statistic as well.
$\$
Exercise 2.5 (5 points): Finally make a decision. Does it appear the age distributions of social media users is the same in 2008 and 2010?
Answer:
$\$
On homework 7 part 1, you examined whether movies that have particular Motion Picture Association of America (MPAA) ratings (i.e., G, PG, PG-13 or R) are enjoyed more by movie critics using data from over 500 movies randomly selected from the Rotten Tomatoes website. To do your analysis, you ran a randomization test that created a null that was based on using the MAD statistic. In the exercises below, you will reanalyze this question, however instead of running a randomization test, you will run a one-way analysis of variance (ANOVA) to determine whether several means are the same.
As you will recall from homework 7, the code below loads the movie data and creates two vectors. The first vector is called mpaa_rating
and has the MPAA rating for each movie (i.e., the vector contains categorical data with the levels of: G, PG, PG-13, and R). The second vector is called critics_score
and contains ratings critics gave to movies on a scale from 1 to 100. You will use these two vectors (mpaa_rating
and critics_score
) to examine whether the average (mean) movie scores differ depending on the rating MPAA rating level. If you are interested in learning more about this full dataset a codebook describing the variables can be found on this website.
$\$
Exercise 3.0 (2 points):
To start your analysis, report the sample size N for these two vectors, mpaa_rating
and critics_score
, using the length()
function.
# load the data load('movies.Rda') # only keep movies rated "G", "PG", "PG-13", "R" (you can ignore this) movies2 <- movies[movies$mpaa_rating %in% c("G", "PG", "PG-13", "R"), ] movies2$mpaa_rating <- droplevels(movies2$mpaa_rating) # get the MPAA ratings and the rotten tomatoes critic scores # you will only use these two vectors for the rest of these exercises mpaa_rating <- movies2$mpaa_rating critics_score <- movies2$critics_score # display the sample size of the two vectors you will use in your analyses
$\$
Exercise 3.1 (3 points): Start your hypothesis test with step 1 by stating the null and alternative hypotheses in symbols and words. Also state the alpha level that is most commonly used.
In words
In symbols
The significance level
$\$
Part 3.2 (4 points): In order for the F-distribution to be a valid null distribution for our F-statistic, two conditions (i.e., assumptions) must hold. The two conditions that need to be met are: 1) the data from each group must be relatively normal, and 2) the variances (or standard deviations) in each group must be approximately the same.
We can check the first condition as to whether the data is relatively normal by creating box plots of the data. As long as the boxplots look reasonably symmetric and there are not too many extreme outliers this condition should be met. Please use the R chunk below to create a boxplot of our data by group (in this case, rating). Remember, we can create a box plot comparing several groups using the syntax boxplot(data ~ grouping)
. Then in the answer section report whether the data in each group looks relatively normal.
We can check the second condition as to whether the variances within each group are approximately the same by comparing the standard deviations between the groups. As long as the largest standard deviation is not twice as big as the smallest standard deviation this condition is met. The code below uses the by()
function to calculate the standard deviation separately for each group (the by(data, grouping, function)
works by taking three arguments and applies the function
arguments separately to data
separately for each grouping
). In the answer section below, describe whether the output of the by()
function indicates whether the equal variance condition is met.
Answer
# create a boxplot comparing the groups # calculating the standard deviation separately for each group by(critics_score, mpaa_rating, sd)
Part 3.3 (3 points): Next let's compute the F-statistic that we discussed in class to compare the mean critic scores between the different MPAA rating levels. To do this we can use the get_F_stat(quantitative_data, grouping_data)
function from the ClassTools library. Save the F-statistic value to an object called obs_stat
and report the value of this statistic below.
# Calculate the F statistic
Answer:
$\$
Exercise 3.4 (6 points): Now let's do step 3 of the hypothesis test by plotting the null distribution. Start by calculating the values of the two parameters, degrees of freedom 1 and 2, and write down their values in the answer section below. Then create your y-values using df(x_vals, df1, df2)
, using the x_vals
object that is given in the R chunk. Plot your null distribution with plot(x_vals, y_vals, type = "l")
. Finally, add the observed statistic as a red vertical line to the plot. Based on looking at this null distribution, what do you think the p-value is?
Answer
The degrees of freedom are:
# use these x values to plot the density function x_vals <- seq(0, 10, by = .01) # created the denisty function values and plot them # add a vertical line at the value of the observed statistic
Answers
$\$
Exercise 3.5 (4 points): Let's do step 4 of the hypothesis test by calculating the p-value using the pf(obs_stat, df1, df2)
function. Report what the p-value is (and make sure you look at the correct tail). Is this close to what you estimated by looking at the null distribution above?
Answers
$\$
Exercise 3.6 (3 points): Now complete step 5 of hypothesis testing by making a judgement. Are you able to reject the null hypothesis? What do you conclude?
Answer
$\$
Exercise 3.7 (5 points): R also has a built in ANOVA function that can do all the steps of an ANOVA for you. To create an ANOVA table using this function requires doing two steps:
you need to fit the model using the syntax: fit <- aov(data ~ grouping)
you can then print the ANOVA table using summary(fit)
Please run the code described above to create a ANOVA table in the R chunk below and print out this ANOVA table. Do the results match what you got above? (Note that the p-value is labelled 'Pr(>F)' in the summary output.)
Answer Do the results match?
$\$
Bonus exercise (0 points): As mentioned above, on homework 7 you ran a hypothesis test examining this same question of whether critics scores differed based on MPAA rating but you did it using a permutation test using the MAD statistic. As a bonus question, repeat hypothesis step 3-5 for running a permutation test, but use the F-statistic instead of the MAD statistic (i.e., use the get_F_stat()
function, etc.). Do you come to the same conclusion?
Answer
# create the null distribution # plot the null distribution with a red vertical line for the statistic value # calculate the p-value
$\$
Please state the question you will address in your final project and where you will get the data from. Also, load the data into R in the chunk below to verify that you can get the data into R.
$\$
Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 10.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.