malte.thodberg@bio.ku.dk
The plan:
cpm
function. Example datasets:
iris
dataset from datasets
(default R-package). oliveoil
dataset from the pdfCluster
package.Real datasets:
zebrafish
dataset fission
dataset pasilla
datasetAdvanced dataset for playing around:
tissues
datasetThe plot examples will mainly be based on the popular ggplot2
package, since
ggplot2
have powerful features for coloring plots:
library(ggplot2)
But feel free to use whatever package you like, i.e. base
, lattice
, etc.
iris
datasetLet's first look at the iris dataset:
# Load the data data(iris) # Size of the dataframe dim(iris) # First few lines of the dataset head(iris)
iris
dataset# Summary statistics summary(iris)
The iris
data is 4D - we will use PCA to generate a 2D representation
of the data.
We also have a some grouping of the data - the Species
column.
Let's try out PCA on the iris
dataset! As with everything in R, this is a
one-liner:
pca_iris <- prcomp(iris[,-5])
That was easy! Let's look what's inside:
summary(pca_iris)
Since the data is 4D we get 4 PCs. The first PC captures almost 92%
of the variation in the dataset, while the second PC captures 5%. Using just
these two PCs, we can make a plot containing 97% of the variation in the
dataset! The positions of each observation along each PC (The position on the
"picture") is stored in pca_iris$x
as a matrix object.
When plotting, it is usually a good idea to put all the data you want plot in a
single data.frame
, especially if you are using ggplot2
:
plot_iris <- data.frame(pca_iris$x, Species=iris$Species) head(plot_iris)
ggplot(plot_iris, aes(x=PC1, y=PC2, color=Species)) + geom_point() + coord_fixed()
ggplot(plot_iris, aes(x=PC3, y=PC4, color=Species)) + geom_point() + coord_fixed()
with(plot_iris, plot(x=PC1, y=PC2, col=Species)); grid()
oliveoil
datasetRepeat the previous analysis for the the oliveoil
dataset from the
pdfCluster
package!
PC1
& PC2
and PC2
& PC3
I highly suggest you try writing the code yourself! If you are struggling, you can look at the cheatsheets on the following slides.
We will go around the room to answer questions.
# Load the data suppressPackageStartupMessages(library(pdfCluster)) data(oliveoil) # Size of the dataframe dim(oliveoil) # First few lines of the dataset head(oliveoil)
# Summary statistics summary(oliveoil)
Example code for performing the PCA:
# PCA without the first two columns pca_oil <- prcomp(oliveoil[,-c(1,2)]) # Save the data for plotting plot_oil <- data.frame(pca_oil$x, oliveoil[,1:2])
Go back in the slides: Which object can you call the summary
function on to inspect the contained variance?
Try playing around with different combinations of PCs by editing the code below:
ggplot(plot_oil, aes(x=PC1, y=PC2, color=region, shape=macro.area)) + geom_point() + coord_fixed()
PCA is one of the most widely used methods for dimensionality reduction - but by not the only one!
We have skipped over some very important concepts in the this very shorts introduction:
prcomp(scale=TRUE)
argument).Other popular methods not covered here:
pheatmap
)zebrafish
datasetNext is a worked example on a real RNA-Seq dataset:
library(Shenzen2016) data("zebrafish")
As explained earlier, we need to trim and normalize the data.
Let's first discard genes with less than 10 reads in in three samples
above_threshold <- rowSums(zebrafish$Expression >= 10) >= 3 EM_zebra <- subset(zebrafish$Expression, above_threshold)
The threshold on which to trim is somewhat arbitrary and depends heavily on the given dataset. Often you set the threshold for expression on what can be externally experimentally validated.
We then use edgeR
to normalize the data. First we have to store the data as a DGEList
library(edgeR) # Create dge list object dge_zebra <- DGEList(EM_zebra) # Calculate normalization factors dge_zebra <- calcNormFactors(dge_zebra, method="TMM") # Normalize the expression values logTMM_zebra <- cpm(dge_zebra, log=TRUE, prior.count=3)
We perform the PCA just as before. We have to transpose the EM first, since genomic data is usually stored with samples as columns rather than rows:
pca_zebra <- prcomp(t(logTMM_zebra)) summary(pca_zebra)
plot_zebra <- data.frame(pca_zebra$x, zebrafish$Design) head(plot_zebra)
ggplot(plot_zebra, aes(x=PC1, y=PC2, color=gallein)) + geom_point() + coord_fixed()
ggplot(plot_zebra, aes(x=PC3, y=PC4, color=gallein)) + geom_point() + coord_fixed()
What does the plot mean?
- Is this what you expected to see?
- What could be the explanation?
- Experiments are not guaranteed to look like how you want them!
yeast
datasetpasilla
datasettissues
datasetRepeat the previous analysis for the the yeast
, pasilla
and, if you have time, the tissues
dataset.
PC1
& PC2
and PC2
& PC3
I highly suggest you try writing the code yourself! If you are struggling, you can look at the cheatsheets on the following slides.
We will go around the room to answer questions.
Perfrom the PCA:
# Load the data data("yeast") data("pasilla") data("tissues") # Inspect the dimensionality dim(yeast$Expression) # Inspect the design by yeast$Design
# Set a dataset: either yeast, pasilla or tissues dataset <- pasilla # Trim: Play around with numbers here EM <- subset(dataset$Expression, rowSums(dataset$Expression >= 5) >= 3) # Calculate normalization factors: Play around with the method argument (RLE, "none") dge <- calcNormFactors(DGEList(EM), method="TMM") # Normalize the expression values (Advanced, play around with prior.count) logTMM <- cpm(dge, log=TRUE, prior.count=5) # Perform the PCA and save data for plotting pca <- prcomp(t(logTMM)) plot_data <- data.frame(pca$x, dataset$Design)
Play around with different components and settings for color and shape
ggplot(plot_data, aes(x=PC1, y=PC2, color=condition, shape=type)) + geom_point() + coord_fixed()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.