Question 1 - Simple Linear Regression

Load the relevant packages.

library("jrModellingBio")
library("broom")
library("tidyverse")

\newthought{Lets return} to the yeast data to practice some more advanced analyses.

The data can be obtained from:

data(yeast, package = "jrModellingBio")
  1. Lets fit a linear regression model as follows: Use mcg as the y or response variable and use gvh as the x or explanatory variable. Should gvh be included in the model to explain variatio in mcg?
m = lm(mcg ~ gvh, data = yeast)
tidy_m = tidy(m)
tidy_m
##The p-value for the gradient is 0.01
##This suggests gvh is useful in explaining variation in 
  1. Plot the data with gvh on the x-axis and mcg on the y-axis. You can use ggplot2 or base, but as in the notes we'll be using ggplot2 for the solutions. Add a dashed red line indicating the line of best fit.
library("ggplot2")
ggplot(yeast, aes(x = gvh, y = mcg)) +
    geom_point() +
    geom_abline(intercept = tidy_m$estimate[1],
                    slope = tidy_m$estimate[2],
                linetype = 2, colour = "red")
ggplot(yeast, aes(x = gvh, y = mcg)) +
    geom_point() +
    geom_abline(intercept = tidy_m$estimate[1],
                    slope = tidy_m$estimate[2],
                linetype = 2, colour = "red") +
    theme_bw()
  1. Plot the standardised residuals against the fitted values. Does the graph look random? Hint: Use augment()
##Model diagnosics look good
m_aug = augment(m)
ggplot(m_aug, aes(x = .fitted, y = .std.resid)) +
    geom_point() +
    geom_hline(
        yintercept = c(0, -2, 2),
        linetype = c(2, 3, 3),
        colour = c("red", "green", "green")
    )
  1. Construct a q-q plot of the standardised residuals.
ggplot(m_aug, aes(sample = .std.resid)) +
    geom_qq() +
    geom_qq_line(colour = "steelblue",
                linetype = 2) 
##Model diagnosics look good
  1. What is wrong with this model? Discuss (in less than 20 words).
# When we fit a regression model we are implicitly assuming a causal relationship.
# In this case we are talking about two measurements of signal peptides, one is
# unlikely to be caused by the other but they are correlated because some (as yet) 
# unseen variable is causally related to both.

Question 2 - Multiple Linear Regression

\newthought{Lets now} fit a more sensible model with some different data. We will look at the occurrence of infectious disease illnesses across three states in the USA caused by different microbial pathogens. Load the outbreaks dataset:

data(outbreaks, package = "jrModellingBio")

and examine its contents with your favourite functions for this job. Note you can find out a little more about this data by reading its help file ?outbreaks.

  1. First fit a simple linear model predicting illnesses from year. Examine the qq-plot of the standardised residuals. What do you notice? What should we do to fix this?
m = lm(illnesses ~ year, data = outbreaks)
m_aug = augment(m)
ggplot(m_aug, aes(sample = .std.resid)) +
    geom_qq() +
    geom_qq_line(colour = "steelblue",
                linetype = 2)
# This looks way off, sensible first step here is to transform the y variable. Lets try a log transformation.
  1. Try fitting the model again but with log transformed illnessess. Hint: use log(illnesses) in place of illnesses. Now re-exmamine the qq-plot? Have we fixed the problem?
m = lm(log(illnesses) ~ year, data = outbreaks)
m_aug = augment(m)
ggplot(m_aug, aes(sample = .std.resid)) +
    geom_qq() +
    geom_qq_line(colour = "steelblue",
                linetype = 2)
# Not perfect but much better, fine for our purposes.
  1. Now with your log-model, lets try adding another variable into our model, lets try adding species as an explanatory (x) variable. Make sure that you allow for an interaction term between year and species.
m = lm(log(illnesses) ~ year * species, data = outbreaks)
  1. Examine the model fit and parameters using the glance() and tidy() functions from broom. How much of the variation in log(illnesses) is explained by the model? Are illnesses caused by all 3 species on the decline (ignore whether this is significant or not for now)?
glance(m)
tidy(m)
# r2 is 0.06 so nearly 6% of the variation in log illnesses is captured by this model
# Salmonella is on the increase
  1. Construct a graph that shows the outputs of your linear model. Remember you can use geom_smooth() and its method argument to fit and plot a linear model.
ggplot(outbreaks, aes(x = year, y = log(illnesses))) +
    geom_smooth(aes(colour = species), method = "lm")
ggplot(outbreaks, aes(x = year, y = log(illnesses))) +
    geom_smooth(aes(colour = species), method = "lm")

Question 3 = ANOVA

\newthought{Lets take a look} at whether infectious disease illnesses vary in abundance by state.

  1. Fit an anova model with log(illnesses) explained by state. Are there differences between treatment groups (States)?
m = aov(log(illnesses) ~ state, data = outbreaks)
tidy_m = tidy(m)
tidy_m    
## The p value is small.
## This suggests a difference may exist.
  1. Now we know that States differ in the abundance of infectious disease illnesses but which differences are significant. Recall that we can use Tukeys Honest Significant Differences (see ?TukeyHSD). Perform the Tukey HSD analysis on your anova model. What does this tell you?
TukeyHSD(m)
# California has a higher incidence of illnesses while the other two are equal. 
  1. Construct a boxplot or density plot that shows the distributions of illnesses across the different States.
ggplot(outbreaks, aes(x = state, y = log(illnesses))) +
    geom_boxplot(aes(fill = state))

ggplot(outbreaks, aes(x = log(illnesses))) +
    geom_density(aes(fill = state), alpha = 0.5)
ggplot(outbreaks, aes(x = log(illnesses))) +
    geom_density(aes(fill = state), alpha = 0.5)

Question 5 - Hierarchical clustering

\newthought{Previously} when we used the yeast data set, we considered individual variables and their relationships. This time we want to analyse the relationships between all variables and discover if these lead to any natural grouping of the observations. In other words can we group loci on the basis of various properties of their amino acid sequences? We could then ask whether these potential groupings correspond to the sucellular localisations. To keep it simple we will look at 2 of the localisation classes, extracellular and membrane proteins with a cleaved signal peptide.

Lets get started.

data(yeast, package = "jrModellingBio")
# restrict the data two two localisations
yeast = filter(yeast, class == "EXC" | class == "ME1")

Next we want to separate out the sequence labels and localisation categories.

## extract the sequence labels of localisations
seq_localisations = yeast$class
yeast = yeast %>% select(mcg, gvh, alm, mit, vac, nuc)

Ok over to you...

  1. Start by carrying out a hierarchical clustering analysis (see ?hclust). You will need to first calculate a distance matrix, see ?dist. Feel free to experiment with different distance methods and clustering functions.
clusters = hclust(dist(yeast))
  1. Try using plot() to create a cluster dendogram, and use FALSE in the label argument to remove the text labels.
plot(hclust(dist(yeast)), labels = seq_localisations)

These next few sections are quite tricky, contain concepts that we didn't cover and are aimed at a very specific use case. If you don't see yourself making figures like Figure 4 then feel free to skip these and move on to question 6 on principal components analysis.

  1. We can make the dendrogram significantly more pretty and informative with a recent addition to ggplot: ggraph. Install this package if you dont have it already. We will need the ggraph package as well as one of its dependencies tidygraph, both very powerful tools for working with graph data.
library("tidygraph")
library("ggraph")
  1. Now convert your hclust object into a tbl_graph object using the as_tbl_graph() function from tidygraph.
cluster_tbl = as_tbl_graph(clusters)
  1. Next we need to curate a set of labels, to control the colour of the points, in our case the subcellular localisations. Make a tibble of the labels as follows:
seq_labels = tibble(
    label = factor(seq_along(seq_localisations)),
    localisations =  seq_localisations
) 

Now we need to join this into the nodes part of our graph data. Use the activate() function to ensure you are manipulating the nodes part of the graph data, then perform a left_join() from dplyr to add in our label data to the nodes.

cluster_tbl = cluster_tbl %>% 
    activate(nodes) %>% 
    left_join(seq_labels)
  1. Finally make the graph in figure 4. You can use standard ggplot2 syntax and the geom_edge_elbow() and geom_node_point() as layers. Just remember that we need to creat the gglot object with ggrapp() instead of ggplot()`
ggraph(cluster_tbl, "tree") + 
    geom_edge_elbow(width = 0.2) +
    geom_node_point(aes(filter = leaf, colour = localisations), size = 2)
ggraph(cluster_tbl, "tree") + 
    geom_edge_elbow(width = 0.2) +
    geom_node_point(aes(filter = leaf, colour = localisations), size = 2)

Question 6 - Principal components analysis (PCA)

PCA has several advantages;

##Round to 2dp
signif(cor(yeast), 2)
  1. Carry out a PCA on this data set. Remeber it doesn't hurt to scale the data even if we think they are on similar axes. The worst that can happen is no scaling occurs.
##Run principle components
prcomp(yeast, scale = TRUE)
  1. Construct a biplot of the data.
biplot(prcomp(yeast, scale = TRUE))

Solutions

Solutions are contained within this package:

vignette("solutions2", package = "jrModelling")


jr-packages/jrModellingBio documentation built on Dec. 11, 2019, 9:41 a.m.