knitr::opts_chunk$set(echo = FALSE, warnings = F, message = F)
# Prep stuff ## libraries library(readxl) library(ggpubr) library(cowplot) library(dplyr) library(pander,warn.conflicts = F) library(plotrix) #load data dat.full <- readxl::read_xlsx("telomores.xlsx",sheet = "orig") # clean up data dat.full$cort <- as.numeric(dat.full$cort) dat.full$Group <- with(dat.full, paste(treatment,sex)) dat.full$Group <- gsub("control","Control",dat.full$Group) dat.full$Group <- gsub("disturbed","Noise exposure",dat.full$Group) dat.full$Group <- gsub(" F","\nFemales",dat.full$Group) dat.full$Group <- gsub(" M","\nMales",dat.full$Group) dat.full$cort.log <- log(dat.full$cort) dat.full$telomere.length.log <- log(dat.full$telomere.length) names(dat.full) <- gsub("treatment", "Treatment", names(dat.full)) names(dat.full) <- gsub("sex", "Sex", names(dat.full))
# Set up supset of focal columns dat.focal.cols <- dat.full[,c("Treatment","Sex","telomere.length","cort")] dat.foc.cols.rnd <- dat.focal.cols dat.foc.cols.rnd$telomere.length <- dat.foc.cols.rnd$telomere.length %>% round(2) dat.foc.cols.rnd <- dat.foc.cols.rnd %>% arrange(Treatment,telomere.length) dat.foc.cols.rnd$telo.bin.07 <- ifelse(dat.foc.cols.rnd$telomere.length < 0.8, "*","") dat.foc.cols.rnd$telo.bin.08 <- ifelse(dat.foc.cols.rnd$telomere.length < 0.9 & dat.foc.cols.rnd$telomere.length >= 0.8, "*","") dat.foc.cols.rnd$telo.bin.09 <- ifelse(dat.foc.cols.rnd$telomere.length < 1 & dat.foc.cols.rnd$telomere.length >= 0.9, "*","") # dat.foc.cols.rnd$telo.bin.10 <- "" # dat.foc.cols.rnd$telo.bin.11 <- ""
pander(dat.foc.cols.rnd)
The meaning of the columns:
We can assigned each telomere length to a "bin" of similar values. I've decided all telomeres between 0.7 and 0.799 go into the "0.7" bin, all telomeres between 0.8 and 0.899 go in the "0.8" bin, etc. Basically we're rounding down each value to the nearest 0.1.
TASK: Determine the following things for these data
#Summary stats summ.tab <- dat.foc.cols.rnd %>% group_by(Treatment) %>% summarise(sample.size = n(), min.telomere = min(telomere.length), median.telo = median(telomere.length), max.telomore = max(telomere.length), "outliers?" = "") summ.tab.blnk <- summ.tab summ.tab.blnk[,-1] <- "_____"
#output summary stats #pander(summ.tab.blnk) pander(summ.tab)
TASK: Count up the number of telomeres in each "bin". Put a star next to the mode (most common observation)
## table for tabulation tabulation.tab <- expand.grid(telomere.bin = seq(0.7,1.6,0.1), N.control = "", N.disturbed = "",stringsAsFactors = F) tabulation.tab[1:3,"N.disturbed"] <- c(0,3,4)
pander(tabulation.tab)
Histograms are useful for examining the distribution of data. We can turn our tabulations into a histograms For each bin we can fill in one box of the grid per observation in that bin. For the "disturbed treatment" I have filled in 0 boxes for the 0.7 bin (which spans from 0.7 to 0.8 on the graph), 3 boxes for the 0.8 bin, and 4 boxes for the 0.9 bins.
Task: Make the rest of the histogram for the control and disturbed group. Draw your median as a verticle line through the distribution.
gg.hist.blank <- gghistogram(data = dat.foc.cols.rnd, x = "telomere.length", desc_stat = "mean_ci", title = "Hist. of __________ treatment \n telomere lengths", #fill = "sex", xlab = "X-axis: Variable of interest\nTelomere length in bins", ylab = "Y-axis: Number of observations", color = "white", bins = 10, xlim = c(0.7,1.6)) + geom_vline(xintercept = seq(0.7,1.6,0.1)) + geom_hline(yintercept = seq(0,10,1)) + scale_x_continuous(breaks = seq(0.7,1.6,0.1)) + scale_y_continuous(breaks = seq(0,10,2)) + theme( # text = element_text(size=20), axis.text.x = element_text(size = 8)) plot_grid(gg.hist.blank,gg.hist.blank)
Boxplots are another tool for examining the distribution of data. Boxplots have the following elements:
Percentiles: Percentiles relate to the relative rank of an observation. If you have 100 observations, the 25th percentile is 25th ranked observation. The median is the 50th percentile, or middle-ranked. The 75th percentile is the 75th ranked: 75 observations are lower, and 24 are bigger. Percentiles make the most sense where there are 100 observations; the math for how it works for different numbers of observation isn't important for this course. How ties are dealt with also isn't a focus.
Important note: A confusing aspect about boxplots is that they usually remove outliers from determination of the upper and lower quartiles. The lines that extend up and down therefore don't go all the way to the true maximum and minimum.
Not always, but sometimes boxplots include points for each observation.
TASKS:
box. <- ggboxplot(data = dat.foc.cols.rnd, y = "telomere.length", x = "Treatment", #fill = "Sex", title = "Box plot of telomere length", xlab= "Predictor: Experimental group", ylab= "Response: Telomere length",add = "dotplot") box.
Histograms and boxplots are frequently used for understanding the overall distribution of the data and are very good at showing things like the amount of spread or variation in the data.
In publications scientists frequently focus just on the mean value (aka "average") and make plots showing how means vary between groups of interest, such as control vs. experimental groups. When making plots with the mean we usually represent how variable the data is using error bars which extend up and down from the mean. These are sometimes called error plots.
Its confusing, but the length of a single error bar can represent usually 1 of 3 things:
All of these things are related but they indicate different things. In order to accurately interpret scientific results you need to be familiar with each one.
The table below has the mean, standard deviation and other summaries for the telomere data. The standard deviation represents the amount of variation in the data. When lots of data is spread far away from the mean, the standard deviation (SD) is big. When most of the values are the same, the SD is small.
In some scientific fields people make plots showing the mean and standard deviation. The top of the upper error bar is at the mean + 1 SD, and the bottom of the lower error bars is at the mean - 1 SD. You see this sometimes in biology but its becoming less common.
The standard error (SE) is the standard deviation divided by the square root of the sample size: SE = SD/sqrt(N). Don't worry - we won't focus on calculating this by hand. Frequently in biology plots are made using the standard error. The top of the upper error bar is at the mean + 1 SE, and the bottom of the lower error bars is at the mean - 1 SE.
The 95% confidence interval is calculated as two times the standard error. To be specific, the interval is everything from the bottom of the lower bar to the top of the upper bar. One arm of the interval is equal to twice the standard error. So a confidence interval spans from the mean - 2SE to the mean + 2SE.
95% confidence intervals allow us to do a bit of statics just by eye. When you have the means of 2 groups and their error bars only overlap a bit, statisticians would say the groups are "significantly different" from each other. When there is a significant statistical difference between 2 groups it lends support to the argument that there are relevant biological differences between 2 groups. However, if there is a lot of overlap between the confidence intervals of 2 means, the difference between means is considered non-significant.
Example calculation:
Control upper 95% CI = mean + 2 x SE \newline = 0.95 + 2 x 0.044 \newline = 0.95 + 0.088
So the 95% confidence interval extends up from the mean to 1.038
TASK: Calculate the 95% Confidence Interval (CI) for the 2 treatments using the standard error (SE) and mean provided
#Summary stats summ.tab2 <- dat.foc.cols.rnd %>% group_by(Treatment) %>% summarise(N = n(), mean.telo = mean(telomere.length), SD = sd(telomere.length), SE = std.error(telomere.length), "95% CI top" = mean(telomere.length)+2*std.error(telomere.length), "95% CI bottom" = mean(telomere.length)-2*std.error(telomere.length)) pander(summ.tab2)
N = sample size | SD = standard deviation | SE = standard error | CI = confidence interval
TASK: Use the mean & 95% confidence interval limits you just calcualted to make an Error Plot. Do you think there is a difference between the control and disturbed treatments?
error. <- ggerrorplot(data = dat.foc.cols.rnd, y = "telomere.length", x = "Treatment", desc_stat = "mean_ci", #color = "white" , title = "Error plot: means & 95% confidence interval", xlab= "Predictor: Experimental group\n", ylab= "Response: Telomere length\n") error.
So far we have looked at a numeric variable (telomere length) split up into 2 groups. Often we want to examine the relationship between 2 numeric variables. This can be done with a scatterplots. Scatterplots can give you insight into how 2 variables are related.
TASK: Do you think there is a correlation between the 2 variables in this plot? Is the correlation (relationship) positive (+) or negative (-)?
scatter. <- ggscatter(data = na.omit(dat.foc.cols.rnd), y = "telomere.length", x = "cort", add = "reg.line", #color = white, cor.coef = T, title = "Scatter plot: Telomeres vs corticosterone", xlab = "Predictor: Blood corticosterone\nconcentration", ylab = "Response: Telomere length\n") scatter.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.