$\$
The purpose of this homework is to gain practice running analysis of variance hypothesis tests. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled pdf with your answers through Gradescope by 11pm on Sunday December 3rd.
As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.
SDS230::download_data("popout_data.rda") SDS230::download_image("popout_stimuli.png")
library(knitr) library(latex2exp) library(dplyr) library(ggplot2) options(scipen=999) opts_chunk$set(tidy.opts=list(width.cutoff=60)) set.seed(230) # set the random number generator to always give the same random numbers
$\$
In this homework you will analyze data from a cognitive science experiment that examined how the human visual system processes information. In particular, it is well known that when an object is very visually distinct from its surroundings, the object appears to "pop-out" and humans can rapidly identify the object. However, it is not known whether reaction times are affected by the position of where the object appears, or if having multiple irrelevant distractor objects affects reaction times. Understanding how humans process different visual stimuli could lead to better computer vision systems and could give insight into neural disorders (e.g., simultanagnosia).
The data you will analyze was collected in an experiment run by students at Hampshire College in the spring of 2017. In the experiment, participants were shown images of diamond shapes on a screen, where one side of the diamond was cut off (see images below). If the left side of the diamond was cutoff, participants had to press the "z" key, and if the right side of the diamond was cut off, participants had to press the "/" key. Participants needed to respond as quickly and as accurately as possible to these images.
On some trials only a single diamond was shown. On other trials a "salient" diamond of one color was presented with 8 other "distractor" diamonds of a different color, and participants needed to respond to the diamond that had a unique color. The diamonds appeared in one of 9 locations (in a 3 x 3 grid), and the diamonds were either red or green.
Our analyses on this homework will focus on whether reaction times were affected by the position of the salient diamond, and whether reaction times were affected by the presence of distractor diamonds.
The study was done using a "factorial design" where all permutations of each condition were shown in blocks of 72 stimuli. A complete experiment consisted of 10 blocks (720 stimuli in total). The stimuli conditions that made up each block were:
position
(9 levels): The position where the target (popout) stimulus appeared.
condition
(2 levels): Whether a trial was a isolated stimulus trial, or
whether it as a cluttered trial where distractors also appeared.
color
(2 levels): The color of the target stimulus that participants needed
to respond to, which could be red or green. When present, the distractor stimuli
were always the opposite color as the target stimulus (e.g., if the target was
red, then the distractors were all green).
cur_dir
(2 levels): whether the right or left side of the diamond was cut
off. This also indicates whether the participants needed to press the "z" or "/"
key. The cut off sides of all the distractors stimuli were random chosen and
irrelevant for the task.
In addition to these factor variables that describe the stimulus conditions
present on each experimental trial, the data frame also contains the
quantitative variable reaction_time
that measures participants' reaction times
on each trial in milliseconds. The reaction_time
will be the response variable
(y) that we will model as a function of some of the explanatory variables listed
above (namely the position
and condition
variables). Finally, the data frame
had a few additional variables, including the participant
variable, which is
an unique ID for each of the 8 participant who participated in the study. On the
final problem on this problem set, you might want to explore these additional
variables.
Important Note: The data from this experiment has not yet been published, so please do not share this data outside of the class.
$\$
$\$
To start we will run an one-way analysis of variance to see if mean reaction times differ depending on the spatial position of where an object appeared on the screen. The positions of where the objects could appear on the screen are shown in the schematic below, where position 1 means the upper left corner of the screen, position 8 is the lower middle of the screen, etc.
| | | | |-----|-----|-----| | 1 | 2 | 3 | | 4 | 5 | 6 | | 7 | 8 | 9 |
The data set is loaded in the R chunk below and some basic processing is applied to simplify the data set. For the purpose of our analyses, we will only analyze trials where the salient ("pop-out") item did not appear at the central location (i.e, we will not analyze data from position 5).x
# load the data load('popout_data.rda') # simplify the data set a little popout_data <- popout_data |> na.omit() |> filter(position != 5) |> mutate(position = droplevels(position)) dim(popout_data)
$\$
Part 1.1 (3 points):
Let's start running our one-way ANOVA in the same way we run all hypothesis tests, namely by stating the null and alternative hypotheses in symbols and words. Also state the alpha level that is most commonly used.
Answers
In words
In symbols
The significance level
$\$
Part 1.2a (4 points):
Now that you have stated your hypotheses, you are ready to begin analyzing the data. A first useful step in doing this is to visualize the data, so please use ggplot to create a box plot that shows reaction times as a function of the 8 different positions we are analyzing.
From looking at your visualization, in the answer section please discuss whether you think there is a statistically significant difference in mean reaction times depending on position.
Answer
$\$
Part 1.2b (4 points): In order for the F-distribution to be a valid null distribution for our F-statistic, two conditions must be met which are:
Also, as with almost all hypothesis tests, the data points need to be independent, which we will assume that is the case here.
One can check these conditions either at the start of the analysis, or at the end before one draws a final conclusion. Let's check these conditions now!
We can check the first condition, as to whether the variances within each group are approximately the same, by comparing the standard deviations of reaction times within each group (position). As long as the largest standard deviation is not twice as big as the smallest standard deviation this condition is roughly met. Please use dplyr to check this condition, and print the standard deviations for each position as well as the ratio between the largest and smallest standard deviation. In the answer section below, describe whether this condition appears to be met.
Answer
$\$
Part 1.2c (6 points): We can check the second condition, as to whether the data is relatively normal in each group, by visually inspecting the data. We could do this in a few ways, including by creating a histogram of the residuals or by creating a Q-Q plot of the residuals. Here, the residuals are the difference between each point and its group mean.
To create these plots it will be useful to first create a data frame called
popout_residuals
that is derived from the popout_data
data frame. Like the
popout_data
data frame, every row in popout_residuals
represents a single
trial, but popout_residuals
also has two additional variables. The first
variable is called group_mean_rt
and should have the mean reaction time for
each group (position). The second variable is called rt_residuals
and should
contain the difference between the reaction time for each trial and the mean
reaction time for the corresponding position (i.e., the values in the
group_mean_rt
variable). Using the group_by()
function in conjunction with
the mutate()
function will be helpful for adding these variables to the data
frame.
Once you have created the popout_residuals
data frame, create the Q-Q plot
using the qqnorm()
function. Based on this visualization, do the residuals
appear normally distributed?
Note: as discussed in class, the ANOVA is fairly robust to departures from homoscedasticity and normality, particularly for large data sets. However it is still useful to check these conditions.
Answer
$\$
Part 1.2d (6 points):
As you should see from looking a the Q-Q plot of the residuals, the residuals do not appear to be that normally distributed. In particular, reaction times are always positive values, and is often the case with data that only has positive values, the data is right skewed.
To deal with this right skew, please create a transformed version of the
reaction times, by taking their natural log using the log()
function. In
particular, create a new data frame called popout_log_data
that is the same as
the original popout_data
except that the values in the reaction_time
variable should be log transformed. Then, use the popout_log_data
to recreate
the popout_residuals
data frame on the log transformed data, and then use this
data to recreate the Q-Q plot of the residuals to show that the residuals are
now more normally distributed.
For all the analyses in the rest of the problem set, please use the
popout_log_data
data frame since this transformed data more closely matches
the ANOVA conditions.
$\$
Part 1.2e (12 points): Now that we have examined the ANOVA conditions, let's use dplyr to create an ANOVA table. ANOVA tables have the following form:
| Source | df | Sum of Sq. | Mean Square | F-stat | p-value | |-------------|---------|-------------|------------------|---------|---------| | Groups | K - 1 | SSG | MSG = SSG/(K-1) | MSG/MSE | | | Error | N - K | SSE | MSE = SSE/(N-K) | | | | Total | N - 1 | SSTotal | | | |
Where:
\begin{align} SSG &= \sum_{i = 1}^K n_i (\bar{x}i - \bar{x}{tot})^2 \ SSE &= \sum_{i = 1}^K\sum_{j = 1}^{n_i}(x_{ij} - \bar{x}i)^2 \ SST &= \sum{i = 1}^K\sum_{j = 1}^{n_i}(x_{ij} - \bar{x}_{tot})^2 \end{align}
For every cell in the table above with an expression in symbols, use dplyr to calculate a numeric replacement and write it into the table in the answer section below.
To do this, use the popout_residuals
data frame you created in part 1.2d (that
is based on the log transformed reaction times) to fill in the ANOVA table,
which will be easier than using the popout_log_data
data frame. Be sure to
print out all the values as you compute them in the R chunk below to "show your
work" (i.e., print out the dfs, SSG, SSE, MSG, etc.). You will fill in the
p-value in this table later during part 1.5 of this homework.
Note: You can use the lm()
aov
or anova()
to check your answers, but you
are not allowed to use these functions to compute the F-statistic; i.e., you
need to use just the popout_log_data
data frame and dplyr to fill in the ANOVA
table and to compute the F-statistic.
Answer:
| Source | df | Sum of Sq. | Mean Square | F-stat | p-value | |-------------|---------|-------------|------------------|---------|------------| | Groups | | | | | | | Error | | | | | | | Total | | | | | |
$\$
Exercise 1.3 (5 points): Now let's create plot the null distribution for
this hypothesis test. To create the appropriate F-distribution, use the degrees
of freedom you calculated in part 1.2e. Then use either base R graphics or
ggplot to plot the F-distribution density function (you can get the density
values using the df()
function), and add the observed statistic as a red
vertical line to the plot. From looking at this null distribution, what do you
think the p-value is?
Answer
$\$
Part 1.4 (4 points): Now do step 4 of hypothesis testing by calculating the
p-value using the pf()
function. Report what the p-value is (and make sure you
look at the correct tail). Is this close to what you estimated by looking at the
null distribution above? Also, fill in the p-value in the ANOVA table in part
1.2e above.
Answers
$\$
Part 1.5 (3 points): Now complete step 5 of hypothesis testing by making a judgment. Are you able to reject the null hypothesis? What do you conclude?
Answer
$\$
Exercise 1.6 (5 points): As we have discussed, we can use R's lm()
function to run an ANOVA. Running an ANOVA and creating table requires two
steps:
We must fit a model using the syntax: fit <- lm(response_variable ~ categorical_predictor, data = my_data)
We can then print an ANOVA table using anova(fit)
Please use the lm()
function to fit a model that predicts reaction times from
the different stimulus positions. Save the model that you fit to an object
called fit_position
. Then use the anova()
function to see whether the
results match the results you had in parts 1.1 to 1.5 above. Report below
whether the results match.
Answer:
$\$
Exercise 1.7 (5 points):
When we use the lm()
function to run an ANOVA, we can use the plot()
function on the model we have fit to get diagnostic plots which we can use to
assess whether assumptions underlying the ANOVA were met (i.e., whether the
residuals are normally distributed with equal variance). Please use the plot()
function on the fit_position
model you created above to create diagnostic
plots. Also describe what is determining the x-axis location of the points in
the residuals vs. fitted values plot (upper left plot).
Answer:
$\$
Let's now run a two-way ANOVA to assess at once whether position and cluttered displays can affect reaction times.
$\$
Part 2.1 (9 points): Start your hypothesis test by stating the null and alternative hypotheses in symbols and words. State these null and alternative hypotheses for the main effect of position, the main effect of whether the condition is an isolated or cluttered trial, and also for the interaction effect of position and whether the condition is an isolated/cluttered trial (i.e., you should have 3 sets of hypotheses).
In words
In symbols
In symbols
In words
In symbols
$\$
Part 2.2 (4 points):
Now fit a model that only has main effects for position and isolated/cluttered condition (i.e., that does not contain an interaction term). Are the main effects statistically significant?
Answer
$\$
Now let's visualize the data to see if there is an interaction between position and isolated/cluttered condition.
Use the interaction.plot()
function to visualize the position on the x-axis
with different lines for the different isolated and cluttered conditions. For
this plot, use the original popout_data
rather than the log transformed data
in order to make the y-axis scale more interpretable (i.e., the y-axis scale
will be in terms of milliseconds rather than log milliseconds).
Also, try to make the plot look nice by making sure the labels are meaningful and choose a decent color scheme. Based on the this visualization, does there seem to be an interaction between position and isolated/cluttered conditions?
Answer:
$\$
Now fit a model has both the main effects and an interaction effect for position and isolated/cluttered condition. Are the main effects and interaction term statistically significant? What do you conclude?
Answer:
$\$
In order for our inferences to be valid, the assumptions underlying the ANOVA should be met. Please check these assumptions now and report whether they appear to be met.
Answer
$\$
This is a "challenge problem" that you should try to figure out without getting help from the TAs.
Please explore some other aspect of the experiment to see if you can gain additional insight by creating at least one additional plot, and by running one additional ANOVA. For example, there are many other variables in the original data frame that you could explore to see how they affect reaction times, some of which have big effects. You could also run a "repeated measures" ANOVA where you treat each participant as a factor to see how reaction times differ between participants. We can discuss repeated measure ANOVAs more in class as well.
Once you have figured out an interesting analysis, describe what your plot and ANOVA results show in the answer section (e.g., comment on any statistically significant relationships you see).
Answer
$\$
Describe what you are planning on doing for your final project. Also, load the
data you will use in your project in the R chunk below and print out the first
few rows of the data you are using (e.g., by using the head()
function). As
mentioned before, I encourage you to email TA's and ULA's, or to talk to me if
you need any help with your project.
Answer
$\$
Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 10.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.