Teachers and slides

Practical Outline

The plan:

Datasets

Example datasets:

Real datasets:

Advanced dataset for playing around:

Plotting libraries

The plot examples will mainly be based on the popular ggplot2 package, since ggplot2 have powerful features for coloring plots:

library(ggplot2) 

But feel free to use whatever package you like, i.e. base, lattice, etc.

The iris dataset

Let's first look at the iris dataset:

# Load the data 
data(iris)

# Size of the dataframe 
dim(iris)

# First few lines of the dataset 
head(iris) 

The iris dataset

# Summary statistics 
summary(iris) 

The iris data is 4D - we will use PCA to generate a 2D representation of the data.

We also have a some grouping of the data - the Species column.

Performing the PCA

Let's try out PCA on the iris dataset! As with everything in R, this is a one-liner:

pca_iris <- prcomp(iris[,-5]) 

That was easy! Let's look what's inside:

summary(pca_iris) 

Since the data is 4D we get 4 PCs. The first PC captures almost 92% of the variation in the dataset, while the second PC captures 5%. Using just these two PCs, we can make a plot containing 97% of the variation in the dataset! The positions of each observation along each PC (The position on the "picture") is stored in pca_iris$x as a matrix object.

Extracting the PCs

When plotting, it is usually a good idea to put all the data you want plot in a single data.frame, especially if you are using ggplot2:

plot_iris <- data.frame(pca_iris$x, Species=iris$Species) 
head(plot_iris)

Plotting the PCA in 2D

ggplot(plot_iris, aes(x=PC1, y=PC2, color=Species)) + geom_point()  + coord_fixed()

Plotting the PCA in 2D

ggplot(plot_iris, aes(x=PC3, y=PC4, color=Species)) + geom_point() + coord_fixed()

Base plotting of the same thing

with(plot_iris, plot(x=PC1, y=PC2, col=Species)); grid() 

Exercise 1: The oliveoil dataset

Repeat the previous analysis for the the oliveoil dataset from the pdfCluster package!

  1. Load the data, and determine the dimensionality.
  2. Perform the PCA
  3. Inspect the amount of variance explained by each PC
  4. Make a plot of the samples using PC1 & PC2 and PC2 & PC3
  5. Do you see any pattern? Discuss with the people near you.

I highly suggest you try writing the code yourself! If you are struggling, you can look at the cheatsheets on the following slides.

We will go around the room to answer questions.

Cheat sheet: Loading the data

# Load the data 
suppressPackageStartupMessages(library(pdfCluster))
data(oliveoil)

# Size of the dataframe 
dim(oliveoil)

# First few lines of the dataset 
head(oliveoil) 

Cheat sheet: Summarising the data

# Summary statistics 
summary(oliveoil) 

Cheat sheet: Analysing the data

Example code for performing the PCA:

# PCA without the first two columns
pca_oil <- prcomp(oliveoil[,-c(1,2)]) 

# Save the data for plotting
plot_oil <- data.frame(pca_oil$x, oliveoil[,1:2]) 

Go back in the slides: Which object can you call the summary function on to inspect the contained variance?

Plotting the PCA in 2D

Try playing around with different combinations of PCs by editing the code below:

ggplot(plot_oil, aes(x=PC1, y=PC2, color=region, shape=macro.area)) + geom_point() + coord_fixed() 

Wrapping up

PCA is one of the most widely used methods for dimensionality reduction - but by not the only one!

We have skipped over some very important concepts in the this very shorts introduction:

Other popular methods not covered here:

The zebrafish dataset

Next is a worked example on a real RNA-Seq dataset:

library(Shenzen2016)
data("zebrafish")

Trimming the data

As explained earlier, we need to trim and normalize the data.

Let's first discard genes with less than 10 reads in in three samples

above_threshold <- rowSums(zebrafish$Expression >= 10) >= 3
EM_zebra <- subset(zebrafish$Expression, above_threshold)

The threshold on which to trim is somewhat arbitrary and depends heavily on the given dataset. Often you set the threshold for expression on what can be externally experimentally validated.

Normalizing the data

We then use edgeR to normalize the data. First we have to store the data as a DGEList

library(edgeR)

# Create dge list object
dge_zebra <- DGEList(EM_zebra)

# Calculate normalization factors
dge_zebra <- calcNormFactors(dge_zebra, method="TMM")

# Normalize the expression values
logTMM_zebra <- cpm(dge_zebra, log=TRUE, prior.count=3)

Performing the PCA

We perform the PCA just as before. We have to transpose the EM first, since genomic data is usually stored with samples as columns rather than rows:

pca_zebra <- prcomp(t(logTMM_zebra))
summary(pca_zebra)

Extracting the PCs

plot_zebra <- data.frame(pca_zebra$x, zebrafish$Design) 
head(plot_zebra) 

Plotting the PCA in 2D

ggplot(plot_zebra, aes(x=PC1, y=PC2, color=gallein)) + geom_point() + coord_fixed() 

Plotting the PCA in 2D

ggplot(plot_zebra, aes(x=PC3, y=PC4, color=gallein)) + geom_point() + coord_fixed()

Interpreting the plot

What does the plot mean?

  • Is this what you expected to see?
  • What could be the explanation?
  • Experiments are not guaranteed to look like how you want them!

Exercise 2: The yeast dataset

Exercise 2: The pasilla dataset

Exercise 2: The tissues dataset

Exercise 2: PCA using R

Repeat the previous analysis for the the yeast, pasilla and, if you have time, the tissues dataset.

  1. Load the data, and determine the dimensionality.
  2. Perform the PCA
  3. Inspect the amount of variance explained by each PC
  4. Make a plot of the samples using PC1 & PC2 and PC2 & PC3
  5. Do you see any pattern? Discuss with the people near you.

I highly suggest you try writing the code yourself! If you are struggling, you can look at the cheatsheets on the following slides.

We will go around the room to answer questions.

Cheatsheet: Loading the data

Perfrom the PCA:

# Load the data
data("yeast")
data("pasilla")
data("tissues")

# Inspect the dimensionality
dim(yeast$Expression)

# Inspect the design by
yeast$Design

Cheatsheet: Analysing the data

# Set a dataset: either yeast, pasilla or tissues
dataset <- pasilla

# Trim: Play around with numbers here 
EM <- subset(dataset$Expression, rowSums(dataset$Expression >= 5) >= 3)

# Calculate normalization factors: Play around with the method argument (RLE, "none")
dge <- calcNormFactors(DGEList(EM), method="TMM")

# Normalize the expression values (Advanced, play around with prior.count)
logTMM <- cpm(dge, log=TRUE, prior.count=5)

# Perform the PCA and save data for plotting
pca <- prcomp(t(logTMM))
plot_data <- data.frame(pca$x, dataset$Design) 

Cheatsheet: plotting

Play around with different components and settings for color and shape

ggplot(plot_data, aes(x=PC1, y=PC2, color=condition, shape=type)) + geom_point() + coord_fixed() 


MalteThodberg/Shenzen2016 documentation built on May 7, 2019, 2:09 p.m.