knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This tutorial walks through several types of anlayses that can be done using the shroom package. The focus will be on situations that can be analyzed using t-tests and similar methods to compare 2 groups. In particular

TO DO

Things not yet implemented

Preliminaries

Load packages

Load the shroom package. You can download it with instal.packages("shroom") if you haven't used it ever before. See "Loading the shroom package" for more information.

library(shroom)

Load other essential packages

library(ggpubr)  #plotting
library(cowplot)

library(dplyr)   #data cleaning

    library(car)     #?

    library(data.tree)# plotting data structure
    library(treemap)

library(plotrix) #std.err function

library(bbmle)   #AIC table
library(lme4)
library(lawstat) #?
library(effsize) #Cohen's dw

Load data

Loading data directly from the shroom package

The data are internal to the package and so can be easily loaded.

data("wingscores")

Loading the data from the shroom package

For a real analysis you need to start from the .csv file with the raw data, eg data_ind_flies_expt1_2018.csv if you are working with the 2018 shroom data. This data is included when you download the package. If you search your computer for it you should be able to locate it. It is in a diretory called "extdata" within the directory for the shroom package which you downloaded.

  1. Save the file data_ind_flies_expt1_2018.csv to where you want to do your work for this project
  2. Set your working directory to the same folder where the file is saved. This can be done through "Session" option of the RStudio menu (this will occur automatically if you are using an R project).
  3. Load the data using read.csv OR the "Import Dataset" button within RStudio

Using the "Import Dataset" button is helpful because it brings up a file navigation system that is similar to a normal program and generates the code you need to load the data.

Its good to copy and paste this code into your script so you always have it.

Note that the code generated by "Import Dataset" creates an absolute path for the file name. This means that if you move the file that the original code will no longer work and you will have to got back through the "Import Dataset" process again.

Adanced: Loading data using relative file paths

Because I am familiar with the ins and outs of R I use "relative file paths"; if I move files, my code accomodates them being moved. However, this is a bit abstract and is more advanced. You are encouraged to skip this section if you are new to R.

There here() fuction of there here packages tells me where an RStudio project file is.

here::here()

I can construct a valid, flexible flie path using here::here()

# You don't need to do this
path. <- here::here("inst/extdata")

What is in the directory

list.files(path.)

Full file path

file. <- here::here("inst/extdata/data_ind_fly_wing_scores_2018.csv")

I then load the data

# load data using path and file saved in file. object
flies <- read.csv(file.)

Cleaning and subsetting the wing data

This information is covered in more detail in "Subsetting a dataframe with dplyr."

Examine raw data

dim(flies)
names(flies)
head(flies)

Clean the raw data

flies <- flies %>% select(-stock.num,
                          -loaded,
                          -file,
                          -gene,
                          -temp.C)

Data frame now has fewer columns

dim(flies)

Data analysis

We will walk through analyses comparing just two groups: female control flies versus female experimental flies. In subsequent tutorials we'll look at more complex data.

Nomeclature: "Student"" vs. "fly-level"" data

Because these data result from summarizing the work done by each student, I will refer to it as student-level data. In contrast, the raw data generated by the student is fly-level data because each data point represents a single fly.

Data from 1 student: t-test

student1.F <- flies %>% filter(student.ID ==  1 &
                                    sex == "F")

Data frame now has many fewer rows

dim(student1.F)

Data exploration

Examine the data and consider the issues raised by Consider the issues raied by Zuur et al 2010

Structure of the data

We can think of the data like this

working <- student1.F
working$i <- 1:nrow(working)

i.E <- which(working$E.or.C == "E")

i.use <- c(min(working$i[i.E])
,min(working$i[i.E])+1
,max(working$i[i.E])-1
,max(working$i[i.E])
,min(working$i[-i.E])
,min(working$i[-i.E])+1
,max(working$i[-i.E])-1
,max(working$i[-i.E]))

i.medians <- round(c(median(working$i[i.E])
,median(working$i[-i.E])),0)



working[-i.use, "i"] <- "*\n*\n*"



working$pathString <- with(working,
                              paste(allele,
                                    E.or.C,
                                    i,
                                    sep = "/"))

i <- sort(c(i.use,i.medians))
working.tree <- as.Node(working[i,])

SetGraphStyle(working.tree,
              rankdir="LR")

plot(working.tree)
Plot histogram using ggpubr::gghistogram
gghistogram(data = student1.F,
            x = "wing.score",
            fill = "E.or.C", 
            add = "mean",       #add line formean
            position = "dodge") #
Plot boxplot
ggboxplot(data = student1.F,
            y = "wing.score",
            x = "E.or.C",
            add = "mean",
            fill = "E.or.C")
Numeric summaries

Numeric summaries are useful when writing about the results and summarizing them for potential future meta-anlysts.

Step:

student1.F %>% 
  group_by(E.or.C) %>% 
  summarize(mean = mean(wing.score),
            n = n(),
            sd = sd(wing.score),
            se = plotrix::std.error(wing.score),
            CI95 = 1.96*std.error(wing.score))

Data analysis: t-test

Consider these questions after you run the test:

t.test(wing.score ~ E.or.C, 
       data = student1.F)

Cohen's D effect size

Cohen's D is an relative effect size measure. It is sometiems called a standardized mean difference (SMD) because it is the difference between the means of two groups, "standardized" by a standard deviation pooled from from the 2 groups: (mean.1 - mean.2)/pooled.SD, where pooled.SD is a "weighted" mean of the two SDs. In meta analysis it is more commone now I think to use a simliar effecti size, Hedge's g.

Cohen's d makes all the same assumptions as a t-test BUT does NOT have the ability to correct like Welch's t-test.

effsize::cohen.d(wing.score ~ E.or.C, 
       data = student1.F)

Analysis: T-test as linear regression

Fit linear models
m.H0 <- lm(wing.score ~ 1,      data = student1.F)
m.Ha <- lm(wing.score ~ E.or.C, data = student1.F)

We can also calcualte a means model by dropping the intercept

m.means <- lm(wing.score ~ -1 + E.or.C, 
              data = student1.F)
Intepret model equation

The model has two paramters

summary(m.Ha)

The general model in R-like notation is wing.score ~ treatment

THis is defined by 2 parameters the intercept and the treatment effect. The treatment effect in this case corresponds to the difference between the two treatments.

The general model with both parameters is: wing.score ~ intercept + treatment.effect

We can plug in the values we have: wing.score ~ 4.8 + -0.63*(Experimental?)

where Experimental? = 0 if its the control treatment and 1 if its the experimental treatment. TWo equations are therefore implied

Control flies: wing.score ~ 4.8 + -0.630 Experimental flies: wing.score ~ 4.8 + -0.631

Which simplfies to

Control flies: wing.score ~ 4.8 Experimental flies: wing.score ~ 4.8 + -0.63

Inference

We can compare the models using p-values or AIC.

Get p-value by comparing models using anova()

anova(m.H0, #null model
      m.Ha) #alternative model

Compare models with AIC. AICtab is from the package bbmle

bbmle::AICtab(m.H0, m.Ha)

AICc is a better choice than AIC. This requires ICtab()

bbmle::ICtab(m.H0, 
              m.Ha,
             type = "AICc")
Model diagnostics

In this context, resid subtracts the group mean (E or C) from each value. This is how much each observation differs from the group mean.

resid(m.Ha)

The mean of the residuals is approximately zero

mean(resid(m.Ha))
Histogram of residuals
hist(resid(m.Ha))
Look at normality with a qq-plot

plot(m.Ha, which = 2)

Transformations

We can wrap our y variable wing.score in log()

m.H0.log <- lm(log(wing.score) ~ 1, data = student1.F)
m.Ha.log <- lm(log(wing.score) ~ E.or.C, 
               data = student1.F)
Look at p-value for transformed data
#raw data
anova(m.H0, 
      m.Ha)

#transformed data
anova(m.H0.log, 
      m.Ha.log)
Look at residuals
Histogram of residuals for log transformation

Not much better...

hist(resid(m.Ha.log))
qqplot of residuals for log transformation

More informative that histogram of the residuals; dots should fall on the line...

plot(m.Ha.log,  2)
Look at heteroskedasticity
# column with group variables
summary(student1.F$E.or.C)

#    residuls of model       column with group var
plot(resid(m.Ha.log)    ~    student1.F$E.or.C)

Analysis Non parametric methods

Run the test

wilcox.test(wing.score ~ E.or.C, 
            data = student1.F)

You might see an error "cannot compute exact p-value with ties." The computation of the p-values depend on whether there are ties or not in the data; I doubt this is an issue ever in pratice.

Compare p-values: both are very small. The t-test p-value is a bit smaller. Often when you use a test and violate some of its assumptions, your p-values are too big.

wilcox.test(wing.score ~ E.or.C, 
            data = student1.F,
            exact = F)$p.value
t.test(wing.score ~ E.or.C, 
            data = student1.F)$p.value

Wilcox test, normality, and skew

Seperate out each group into seperate object

S1.F.E <- student1.F %>% filter(E.or.C == "E")
S1.F.C <- student1.F %>% filter(E.or.C == "C")

run test

brunner.munzel.test(x = S1.F.E$wing.score,
                    y = S1.F.C$wing.score)



brouwern/shroom documentation built on May 24, 2019, 7:54 a.m.