knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial walks through several types of anlayses that can be done using the shroom package. The focus will be on situations that can be analyzed using t-tests and similar methods to compare 2 groups. In particular
Things not yet implemented
Load the shroom package. You can download it with instal.packages("shroom") if you haven't used it ever before. See "Loading the shroom package" for more information.
library(shroom)
Load other essential packages
library(ggpubr) #plotting library(cowplot) library(dplyr) #data cleaning library(car) #? library(data.tree)# plotting data structure library(treemap) library(plotrix) #std.err function library(bbmle) #AIC table library(lme4) library(lawstat) #? library(effsize) #Cohen's dw
The data are internal to the package and so can be easily loaded.
data("wingscores")
For a real analysis you need to start from the .csv file with the raw data, eg data_ind_flies_expt1_2018.csv if you are working with the 2018 shroom data. This data is included when you download the package. If you search your computer for it you should be able to locate it. It is in a diretory called "extdata" within the directory for the shroom package which you downloaded.
Using the "Import Dataset" button is helpful because it brings up a file navigation system that is similar to a normal program and generates the code you need to load the data.
Its good to copy and paste this code into your script so you always have it.
Note that the code generated by "Import Dataset" creates an absolute path for the file name. This means that if you move the file that the original code will no longer work and you will have to got back through the "Import Dataset" process again.
Because I am familiar with the ins and outs of R I use "relative file paths"; if I move files, my code accomodates them being moved. However, this is a bit abstract and is more advanced. You are encouraged to skip this section if you are new to R.
There here() fuction of there here packages tells me where an RStudio project file is.
here::here()
I can construct a valid, flexible flie path using here::here()
# You don't need to do this path. <- here::here("inst/extdata")
What is in the directory
list.files(path.)
Full file path
file. <- here::here("inst/extdata/data_ind_fly_wing_scores_2018.csv")
I then load the data
# load data using path and file saved in file. object flies <- read.csv(file.)
This information is covered in more detail in "Subsetting a dataframe with dplyr."
dim(flies) names(flies) head(flies)
flies <- flies %>% select(-stock.num, -loaded, -file, -gene, -temp.C)
Data frame now has fewer columns
dim(flies)
We will walk through analyses comparing just two groups: female control flies versus female experimental flies. In subsequent tutorials we'll look at more complex data.
Because these data result from summarizing the work done by each student, I will refer to it as student-level data. In contrast, the raw data generated by the student is fly-level data because each data point represents a single fly.
student1.F <- flies %>% filter(student.ID == 1 & sex == "F")
Data frame now has many fewer rows
dim(student1.F)
Examine the data and consider the issues raised by Consider the issues raied by Zuur et al 2010
We can think of the data like this
working <- student1.F working$i <- 1:nrow(working) i.E <- which(working$E.or.C == "E") i.use <- c(min(working$i[i.E]) ,min(working$i[i.E])+1 ,max(working$i[i.E])-1 ,max(working$i[i.E]) ,min(working$i[-i.E]) ,min(working$i[-i.E])+1 ,max(working$i[-i.E])-1 ,max(working$i[-i.E])) i.medians <- round(c(median(working$i[i.E]) ,median(working$i[-i.E])),0) working[-i.use, "i"] <- "*\n*\n*" working$pathString <- with(working, paste(allele, E.or.C, i, sep = "/")) i <- sort(c(i.use,i.medians)) working.tree <- as.Node(working[i,]) SetGraphStyle(working.tree, rankdir="LR") plot(working.tree)
gghistogram(data = student1.F, x = "wing.score", fill = "E.or.C", add = "mean", #add line formean position = "dodge") #
ggboxplot(data = student1.F, y = "wing.score", x = "E.or.C", add = "mean", fill = "E.or.C")
Numeric summaries are useful when writing about the results and summarizing them for potential future meta-anlysts.
Step:
student1.F %>% group_by(E.or.C) %>% summarize(mean = mean(wing.score), n = n(), sd = sd(wing.score), se = plotrix::std.error(wing.score), CI95 = 1.96*std.error(wing.score))
Consider these questions after you run the test:
t.test(wing.score ~ E.or.C, data = student1.F)
Cohen's D is an relative effect size measure. It is sometiems called a standardized mean difference (SMD) because it is the difference between the means of two groups, "standardized" by a standard deviation pooled from from the 2 groups: (mean.1 - mean.2)/pooled.SD, where pooled.SD is a "weighted" mean of the two SDs. In meta analysis it is more commone now I think to use a simliar effecti size, Hedge's g.
Cohen's d makes all the same assumptions as a t-test BUT does NOT have the ability to correct like Welch's t-test.
effsize::cohen.d(wing.score ~ E.or.C, data = student1.F)
m.H0 <- lm(wing.score ~ 1, data = student1.F) m.Ha <- lm(wing.score ~ E.or.C, data = student1.F)
We can also calcualte a means model by dropping the intercept
m.means <- lm(wing.score ~ -1 + E.or.C, data = student1.F)
The model has two paramters
summary(m.Ha)
The general model in R-like notation is wing.score ~ treatment
THis is defined by 2 parameters the intercept and the treatment effect. The treatment effect in this case corresponds to the difference between the two treatments.
The general model with both parameters is: wing.score ~ intercept + treatment.effect
We can plug in the values we have: wing.score ~ 4.8 + -0.63*(Experimental?)
where Experimental? = 0 if its the control treatment and 1 if its the experimental treatment. TWo equations are therefore implied
Control flies: wing.score ~ 4.8 + -0.630 Experimental flies: wing.score ~ 4.8 + -0.631
Which simplfies to
Control flies: wing.score ~ 4.8 Experimental flies: wing.score ~ 4.8 + -0.63
We can compare the models using p-values or AIC.
Get p-value by comparing models using anova()
anova(m.H0, #null model m.Ha) #alternative model
Compare models with AIC. AICtab is from the package bbmle
bbmle::AICtab(m.H0, m.Ha)
AICc is a better choice than AIC. This requires ICtab()
bbmle::ICtab(m.H0, m.Ha, type = "AICc")
In this context, resid subtracts the group mean (E or C) from each value. This is how much each observation differs from the group mean.
resid(m.Ha)
The mean of the residuals is approximately zero
mean(resid(m.Ha))
hist(resid(m.Ha))
plot(m.Ha, which = 2)
We can wrap our y variable wing.score in log()
m.H0.log <- lm(log(wing.score) ~ 1, data = student1.F) m.Ha.log <- lm(log(wing.score) ~ E.or.C, data = student1.F)
#raw data anova(m.H0, m.Ha) #transformed data anova(m.H0.log, m.Ha.log)
Not much better...
hist(resid(m.Ha.log))
More informative that histogram of the residuals; dots should fall on the line...
plot(m.Ha.log, 2)
# column with group variables summary(student1.F$E.or.C) # residuls of model column with group var plot(resid(m.Ha.log) ~ student1.F$E.or.C)
Run the test
wilcox.test(wing.score ~ E.or.C, data = student1.F)
You might see an error "cannot compute exact p-value with ties." The computation of the p-values depend on whether there are ties or not in the data; I doubt this is an issue ever in pratice.
Compare p-values: both are very small. The t-test p-value is a bit smaller. Often when you use a test and violate some of its assumptions, your p-values are too big.
wilcox.test(wing.score ~ E.or.C, data = student1.F, exact = F)$p.value t.test(wing.score ~ E.or.C, data = student1.F)$p.value
Seperate out each group into seperate object
S1.F.E <- student1.F %>% filter(E.or.C == "E") S1.F.C <- student1.F %>% filter(E.or.C == "C")
run test
brunner.munzel.test(x = S1.F.E$wing.score, y = S1.F.C$wing.score)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.