In brouwern/sparrowtelomeres: What the Package Does (One Line, Title Case)

knitr::opts_chunk$set(echo = FALSE, warnings = F, message = F)

# Prep stuff
## libraries
library(readxl)
library(ggpubr)
library(cowplot)
library(dplyr)
library(pander,warn.conflicts = F)
library(plotrix)

#load data
dat.full <- readxl::read_xlsx("telomores.xlsx",sheet = "orig")

# clean up data
dat.full$cort <- as.numeric(dat.full$cort)
dat.full$Group <- with(dat.full,
                       paste(treatment,sex))

dat.full$Group <- gsub("control","Control",dat.full$Group)
dat.full$Group <- gsub("disturbed","Noise exposure",dat.full$Group)

dat.full$Group <- gsub(" F","\nFemales",dat.full$Group)
dat.full$Group <- gsub(" M","\nMales",dat.full$Group)
dat.full$cort.log <- log(dat.full$cort)
dat.full$telomere.length.log <- log(dat.full$telomere.length)


names(dat.full) <- gsub("treatment",
                        "Treatment",
                        names(dat.full))

names(dat.full) <- gsub("sex",
                        "Sex",
                        names(dat.full))

# Set up supset of focal columns
dat.focal.cols <- dat.full[,c("Treatment","Sex","telomere.length","cort")]
dat.foc.cols.rnd <- dat.focal.cols
dat.foc.cols.rnd$telomere.length <- dat.foc.cols.rnd$telomere.length %>% round(2)
dat.foc.cols.rnd <- dat.foc.cols.rnd  %>% arrange(Treatment,telomere.length)

dat.foc.cols.rnd$telo.bin.07 <- ifelse(dat.foc.cols.rnd$telomere.length < 0.8,
                                  "*","")
dat.foc.cols.rnd$telo.bin.08 <- ifelse(dat.foc.cols.rnd$telomere.length < 0.9 &  
                                  dat.foc.cols.rnd$telomere.length >= 0.8,
                                  "*","")
dat.foc.cols.rnd$telo.bin.09 <- ifelse(dat.foc.cols.rnd$telomere.length < 1 &  
                                  dat.foc.cols.rnd$telomere.length >= 0.9,
                                  "*","")

# dat.foc.cols.rnd$telo.bin.10 <- ""
# dat.foc.cols.rnd$telo.bin.11 <- ""

Bird Telomere & Corticosterone Data

pander(dat.foc.cols.rnd)

The meaning of the columns:

Treatment = experimental treatment; "control" = exposed only to natural sounds; "disturbed" = exposed to artificial noise
Sex = sex of baby bird (M = male, F = Female)
telomere.length = length of telomeres at the end of the DNA strand (in kilobases). A length of 1.34 is 1.34*1000 = 1340 kilobases
cort = concentration of corticosterone in blood
telo.bin.07 = is the telomere length between 0.7 and 0.799?
telo.bin.08 = is the telomere length between 0.8 and 0.899?

"Binning" the data

We can assigned each telomere length to a "bin" of similar values. I've decided all telomeres between 0.7 and 0.799 go into the "0.7" bin, all telomeres between 0.8 and 0.899 go in the "0.8" bin, etc. Basically we're rounding down each value to the nearest 0.1.

Summary statistics

TASK: Determine the following things for these data

Sample size in each group (N; the number of observations)
Minimum: Smallest telomere size
Median: Approximate median value. The median is the "middle value". If you have an odd number of observation it is the exact middle. If you have an even number of observations its the mid-point between the two observations tied for the middle.
Maximum: The Largest telomere
Outerlies/Extreme observations Are there any "outliers"? (values MUCH bigger or MUCH smaller than the rest?)

#Summary stats
summ.tab <- dat.foc.cols.rnd %>% 
  group_by(Treatment) %>% 
  summarise(sample.size = n(),
            min.telomere = min(telomere.length),
            median.telo = median(telomere.length),
            max.telomore = max(telomere.length),
            "outliers?" = "")

summ.tab.blnk <- summ.tab
summ.tab.blnk[,-1] <- "_____"

#output summary stats
#pander(summ.tab.blnk)
pander(summ.tab)

Tabulation the number of observations per "bin"

TASK: Count up the number of telomeres in each "bin". Put a star next to the mode (most common observation)

## table for tabulation
tabulation.tab <- expand.grid(telomere.bin  = seq(0.7,1.6,0.1),
            N.control = "",
            N.disturbed = "",stringsAsFactors = F)

tabulation.tab[1:3,"N.disturbed"] <- c(0,3,4)

pander(tabulation.tab)

Plotting the histogram

Histograms are useful for examining the distribution of data. We can turn our tabulations into a histograms For each bin we can fill in one box of the grid per observation in that bin. For the "disturbed treatment" I have filled in 0 boxes for the 0.7 bin (which spans from 0.7 to 0.8 on the graph), 3 boxes for the 0.8 bin, and 4 boxes for the 0.9 bins.

Task: Make the rest of the histogram for the control and disturbed group. Draw your median as a verticle line through the distribution.

gg.hist.blank <- gghistogram(data = dat.foc.cols.rnd,
          x = "telomere.length",
          desc_stat = "mean_ci",
          title = "Hist. of __________ treatment \n telomere lengths",
          #fill = "sex",
          xlab  = "X-axis: Variable of interest\nTelomere length in bins",
          ylab = "Y-axis: Number of observations",
          color = "white",
          bins = 10,
          xlim = c(0.7,1.6)) + 
  geom_vline(xintercept = seq(0.7,1.6,0.1)) +
  geom_hline(yintercept = seq(0,10,1)) +
  scale_x_continuous(breaks =  seq(0.7,1.6,0.1)) +
  scale_y_continuous(breaks =  seq(0,10,2)) +
  theme( # text = element_text(size=20),
        axis.text.x = element_text(size = 8)) 

plot_grid(gg.hist.blank,gg.hist.blank)

Boxplots

Boxplots are another tool for examining the distribution of data. Boxplots have the following elements:

A single line which represents the median (middle value, aka the 50% percentile)
A box which represents observations falling between the 50th & 75th percentile (between the median and the approximate 75th ranked observation).
A box which presents observations falling 25th and 50th percentile
A vertical line which extends up from the 75th percentile to approximately to the maximum (outliers are ignored!). This line represents the upper quartile.
A vertical line which extends down from the 25th percentile approximately to the minimum value (outliers are ignored!). This line represents the lower quartile.

Percentiles: Percentiles relate to the relative rank of an observation. If you have 100 observations, the 25th percentile is 25th ranked observation. The median is the 50th percentile, or middle-ranked. The 75th percentile is the 75th ranked: 75 observations are lower, and 24 are bigger. Percentiles make the most sense where there are 100 observations; the math for how it works for different numbers of observation isn't important for this course. How ties are dealt with also isn't a focus.

Important note: A confusing aspect about boxplots is that they usually remove outliers from determination of the upper and lower quartiles. The lines that extend up and down therefore don't go all the way to the true maximum and minimum.

Not always, but sometimes boxplots include points for each observation.

TASKS:

Label the key parts of these boxplots
Write you median, min, and max values next to the appropriate places on the box plot. Did the computer decide there were outliers?
Calculate the percentage of data points that fall into the lower boxes.
Calculate the percentage of data points that fall into the upper boxes.

box. <- ggboxplot(data = dat.foc.cols.rnd,
          y = "telomere.length",
          x = "Treatment",
          #fill = "Sex",
          title = "Box plot of telomere length",
          xlab= "Predictor: Experimental group",
          ylab= "Response: Telomere length",add = "dotplot")
box.

Plotting means & error bars

Histograms and boxplots are frequently used for understanding the overall distribution of the data and are very good at showing things like the amount of spread or variation in the data.

In publications scientists frequently focus just on the mean value (aka "average") and make plots showing how means vary between groups of interest, such as control vs. experimental groups. When making plots with the mean we usually represent how variable the data is using error bars which extend up and down from the mean. These are sometimes called error plots.

Error bars

Its confusing, but the length of a single error bar can represent usually 1 of 3 things:

The standard deviation (SD)
The standard error (SE)
A 95% confidence interval (95% CI).

All of these things are related but they indicate different things. In order to accurately interpret scientific results you need to be familiar with each one.

Mean, standard deviation & standard error

The table below has the mean, standard deviation and other summaries for the telomere data. The standard deviation represents the amount of variation in the data. When lots of data is spread far away from the mean, the standard deviation (SD) is big. When most of the values are the same, the SD is small.

In some scientific fields people make plots showing the mean and standard deviation. The top of the upper error bar is at the mean + 1 SD, and the bottom of the lower error bars is at the mean - 1 SD. You see this sometimes in biology but its becoming less common.

The standard error (SE) is the standard deviation divided by the square root of the sample size: SE = SD/sqrt(N). Don't worry - we won't focus on calculating this by hand. Frequently in biology plots are made using the standard error. The top of the upper error bar is at the mean + 1 SE, and the bottom of the lower error bars is at the mean - 1 SE.

The 95% confidence interval is calculated as two times the standard error. To be specific, the interval is everything from the bottom of the lower bar to the top of the upper bar. One arm of the interval is equal to twice the standard error. So a confidence interval spans from the mean - 2SE to the mean + 2SE.

95% confidence intervals allow us to do a bit of statics just by eye. When you have the means of 2 groups and their error bars only overlap a bit, statisticians would say the groups are "significantly different" from each other. When there is a significant statistical difference between 2 groups it lends support to the argument that there are relevant biological differences between 2 groups. However, if there is a lot of overlap between the confidence intervals of 2 means, the difference between means is considered non-significant.

Example calculation:

Control upper 95% CI = mean + 2 x SE \newline = 0.95 + 2 x 0.044 \newline = 0.95 + 0.088

So the 95% confidence interval extends up from the mean to 1.038

TASK: Calculate the 95% Confidence Interval (CI) for the 2 treatments using the standard error (SE) and mean provided

#Summary stats
summ.tab2 <- dat.foc.cols.rnd %>% 
  group_by(Treatment) %>% 
  summarise(N = n(),
            mean.telo = mean(telomere.length),
            SD = sd(telomere.length),
            SE = std.error(telomere.length),
            "95% CI top" = mean(telomere.length)+2*std.error(telomere.length),
            "95% CI bottom" = mean(telomere.length)-2*std.error(telomere.length))
pander(summ.tab2)

N = sample size | SD = standard deviation | SE = standard error | CI = confidence interval

TASK: Use the mean & 95% confidence interval limits you just calcualted to make an Error Plot. Do you think there is a difference between the control and disturbed treatments?

error. <- ggerrorplot(data = dat.foc.cols.rnd,
          y = "telomere.length",
          x = "Treatment",
          desc_stat = "mean_ci",
          #color = "white"  , 
          title = "Error plot: means & 95% confidence interval",
          xlab= "Predictor: Experimental group\n",
          ylab= "Response: Telomere length\n")

error.

Scatter plots

So far we have looked at a numeric variable (telomere length) split up into 2 groups. Often we want to examine the relationship between 2 numeric variables. This can be done with a scatterplots. Scatterplots can give you insight into how 2 variables are related.

TASK: Do you think there is a correlation between the 2 variables in this plot? Is the correlation (relationship) positive (+) or negative (-)?

scatter. <- ggscatter(data = na.omit(dat.foc.cols.rnd),
          y = "telomere.length",
          x = "cort",
          add = "reg.line",
          #color = white,
          cor.coef = T,
          title = "Scatter plot: Telomeres vs corticosterone",
          xlab = "Predictor: Blood corticosterone\nconcentration",
          ylab = "Response: Telomere length\n")

scatter.