Load the relevant packages.
library("jrModellingBio") library("broom") library("tidyverse")
\newthought{Lets return} to the yeast data to practice some more advanced analyses.
The data can be obtained from:
data(yeast, package = "jrModellingBio")
mcg
as the y or response variable and use gvh
as the x or explanatory variable. Should gvh
be included in the model to explain variatio in mcg
?m = lm(mcg ~ gvh, data = yeast) tidy_m = tidy(m) tidy_m ##The p-value for the gradient is 0.01 ##This suggests gvh is useful in explaining variation in
gvh
on the x-axis and mcg
on the y-axis. You can use ggplot2 or base, but as in the notes we'll be using ggplot2 for the solutions. Add a dashed red line indicating the line of best fit.library("ggplot2") ggplot(yeast, aes(x = gvh, y = mcg)) + geom_point() + geom_abline(intercept = tidy_m$estimate[1], slope = tidy_m$estimate[2], linetype = 2, colour = "red")
ggplot(yeast, aes(x = gvh, y = mcg)) + geom_point() + geom_abline(intercept = tidy_m$estimate[1], slope = tidy_m$estimate[2], linetype = 2, colour = "red") + theme_bw()
augment()
##Model diagnosics look good m_aug = augment(m) ggplot(m_aug, aes(x = .fitted, y = .std.resid)) + geom_point() + geom_hline( yintercept = c(0, -2, 2), linetype = c(2, 3, 3), colour = c("red", "green", "green") )
ggplot(m_aug, aes(sample = .std.resid)) + geom_qq() + geom_qq_line(colour = "steelblue", linetype = 2) ##Model diagnosics look good
# When we fit a regression model we are implicitly assuming a causal relationship. # In this case we are talking about two measurements of signal peptides, one is # unlikely to be caused by the other but they are correlated because some (as yet) # unseen variable is causally related to both.
\newthought{Lets now} fit a more sensible model with some different data. We will look at the occurrence of infectious disease illnesses across three states in the USA caused by different microbial pathogens. Load the outbreaks
dataset:
data(outbreaks, package = "jrModellingBio")
and examine its contents with your favourite functions for this job. Note you can find out a little more about this data by reading its help file ?outbreaks
.
illnesses
from year
. Examine the qq-plot of the standardised residuals. What do you notice? What should we do to fix this?m = lm(illnesses ~ year, data = outbreaks) m_aug = augment(m) ggplot(m_aug, aes(sample = .std.resid)) + geom_qq() + geom_qq_line(colour = "steelblue", linetype = 2) # This looks way off, sensible first step here is to transform the y variable. Lets try a log transformation.
log(illnesses)
in place of illnesses
. Now re-exmamine the qq-plot? Have we fixed the problem?m = lm(log(illnesses) ~ year, data = outbreaks) m_aug = augment(m) ggplot(m_aug, aes(sample = .std.resid)) + geom_qq() + geom_qq_line(colour = "steelblue", linetype = 2) # Not perfect but much better, fine for our purposes.
species
as an explanatory (x) variable. Make sure that you allow for an interaction term between year
and species
.m = lm(log(illnesses) ~ year * species, data = outbreaks)
glance()
and tidy()
functions from broom
. How much of the variation in log(illnesses)
is explained by the model? Are illnesses caused by all 3 species on the decline (ignore whether this is significant or not for now)?glance(m) tidy(m) # r2 is 0.06 so nearly 6% of the variation in log illnesses is captured by this model # Salmonella is on the increase
geom_smooth()
and its method
argument to fit and plot a linear model.ggplot(outbreaks, aes(x = year, y = log(illnesses))) + geom_smooth(aes(colour = species), method = "lm")
ggplot(outbreaks, aes(x = year, y = log(illnesses))) + geom_smooth(aes(colour = species), method = "lm")
\newthought{Lets take a look} at whether infectious disease illnesses vary in abundance by state.
log(illnesses)
explained by state
. Are there differences between treatment groups (States)?m = aov(log(illnesses) ~ state, data = outbreaks) tidy_m = tidy(m) tidy_m ## The p value is small. ## This suggests a difference may exist.
?TukeyHSD
). Perform the Tukey HSD analysis on your anova model. What does this tell you?TukeyHSD(m) # California has a higher incidence of illnesses while the other two are equal.
ggplot(outbreaks, aes(x = state, y = log(illnesses))) + geom_boxplot(aes(fill = state)) ggplot(outbreaks, aes(x = log(illnesses))) + geom_density(aes(fill = state), alpha = 0.5)
ggplot(outbreaks, aes(x = log(illnesses))) + geom_density(aes(fill = state), alpha = 0.5)
\newthought{Previously} when we used the yeast data set, we considered individual variables and their relationships. This time we want to analyse the relationships between all variables and discover if these lead to any natural grouping of the observations. In other words can we group loci on the basis of various properties of their amino acid sequences? We could then ask whether these potential groupings correspond to the sucellular localisations. To keep it simple we will look at 2 of the localisation classes, extracellular and membrane proteins with a cleaved signal peptide.
Lets get started.
data(yeast, package = "jrModellingBio") # restrict the data two two localisations yeast = filter(yeast, class == "EXC" | class == "ME1")
Next we want to separate out the sequence labels and localisation categories.
## extract the sequence labels of localisations seq_localisations = yeast$class yeast = yeast %>% select(mcg, gvh, alm, mit, vac, nuc)
Ok over to you...
?dist
. Feel free to experiment with different distance methods and clustering functions.clusters = hclust(dist(yeast))
plot()
to create a cluster dendogram, and use FALSE
in the label argument to remove the text labels.plot(hclust(dist(yeast)), labels = seq_localisations)
These next few sections are quite tricky, contain concepts that we didn't cover and are aimed at a very specific use case. If you don't see yourself making figures like Figure 4 then feel free to skip these and move on to question 6 on principal components analysis.
ggraph
package as well as one of its dependencies tidygraph
, both very powerful tools for working with graph data.library("tidygraph") library("ggraph")
hclust
object into a tbl_graph
object using the as_tbl_graph()
function from tidygraph.cluster_tbl = as_tbl_graph(clusters)
seq_labels = tibble( label = factor(seq_along(seq_localisations)), localisations = seq_localisations )
Now we need to join this into the nodes part of our graph data. Use the activate()
function to ensure you are manipulating the nodes part of the graph data, then perform a left_join()
from dplyr
to add in our label data to the nodes.
cluster_tbl = cluster_tbl %>% activate(nodes) %>% left_join(seq_labels)
geom_edge_elbow()
and geom_node_point()
as layers. Just remember that we need to creat the gglot object with ggrapp()
instead of ggplot()`ggraph(cluster_tbl, "tree") + geom_edge_elbow(width = 0.2) + geom_node_point(aes(filter = leaf, colour = localisations), size = 2)
ggraph(cluster_tbl, "tree") + geom_edge_elbow(width = 0.2) + geom_node_point(aes(filter = leaf, colour = localisations), size = 2)
PCA has several advantages;
(c) help identify loci similar across all aa sequence properties
Calculate the correlation matrix of the yeast
data set.
##Round to 2dp signif(cor(yeast), 2)
##Run principle components prcomp(yeast, scale = TRUE)
biplot(prcomp(yeast, scale = TRUE))
Solutions are contained within this package:
vignette("solutions2", package = "jrModelling")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.