In MalteThodberg/Shenzen2016: Datasets for Introduction to Advanced Topics in Bioinformatics 2016

Teachers and slides

Malte Thodberg
PhD Student ~ 1.5 / 3 years
Focus: Clinical Bioinformatics
Mail: malte.thodberg@bio.ku.dk
Background: BSc Molecular Biomedicine & MSc Bioinformatics

Practical Outline

The plan:

Brief mention of data dimensionality.
Brief example of PCA using small example datasets.
Normalizing expression using edgeR's cpm function.
Exploring RNA-Seq data with PCA.

Datasets

Example datasets:

iris dataset from datasets (default R-package).
oliveoil dataset from the pdfCluster package.

Real datasets:

zebrafish dataset
yeast dataset (fission yeast)
pasilla dataset

Advanced dataset for playing around:

tissues dataset

Plotting libraries

The plot examples will mainly be based on the popular ggplot2 package, since ggplot2 has powerful features for coloring plots:

library(ggplot2)

But feel free to use whatever package you like, e.g. base, lattice, etc.

The `iris` dataset

Let's first look at the iris dataset:

# Load the data 
data(iris)

# Size of the dataframe 
dim(iris)

# First few lines of the dataset 
head(iris)

The `iris` dataset

# Summary statistics 
summary(iris)

The iris data is 4D - we will use PCA to generate a 2D representation of the data.

We also have a grouping of the data - the Species column.

Performing the PCA

Let's try out PCA on the iris dataset! As with everything in R, this is a one-liner:

pca_iris <- prcomp(iris[,-5])

That was easy! Let's look at what's inside:

summary(pca_iris)

Since the data is 4D, we get 4 PCs. The first PC captures about 92% of the variation in the dataset, while the second PC captures about 5%. Using just these two PCs, we can make a plot containing more than 97% of the variation in the dataset! The position of each observation along each PC (its position in the "picture") is stored in pca_iris$x as a matrix object.
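The percentages reported by summary() can be recomputed by hand from the standard deviations stored in the PCA object. Here is a minimal sketch, assuming the same unscaled prcomp call as above; the names variances, prop_var and cum_var are only illustrative:

```r
# Recompute the "Proportion of Variance" row of summary(pca_iris) by hand
data(iris)
pca_iris <- prcomp(iris[, -5])

# sdev holds the standard deviation of each PC; squaring gives the variance
variances <- pca_iris$sdev^2

# Proportion of the total variance captured by each PC
prop_var <- variances / sum(variances)

# Cumulative proportion: how much a plot of the first k PCs contains
cum_var <- cumsum(prop_var)

round(prop_var, 3)
round(cum_var, 3)
```

The first two entries of cum_var tell you how much of the variation a PC1 vs PC2 plot contains.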

Extracting the PCs

When plotting, it is usually a good idea to put all the data you want to plot in a single data.frame, especially if you are using ggplot2:

plot_iris <- data.frame(pca_iris$x, Species=iris$Species) 
head(plot_iris)

Plotting the PCA in 2D

ggplot(plot_iris, aes(x=PC1, y=PC2, color=Species)) + geom_point()  + coord_fixed()

Plotting the PCA in 2D

ggplot(plot_iris, aes(x=PC3, y=PC4, color=Species)) + geom_point() + coord_fixed()

Base plotting of the same thing

with(plot_iris, plot(x=PC1, y=PC2, col=Species)); grid()

Exercise 1: The `oliveoil` dataset

Repeat the previous analysis for the oliveoil dataset from the pdfCluster package!

Load the data, and determine the dimensionality.
Perform the PCA
Inspect the amount of variance explained by each PC
Make a plot of the samples using PC1 & PC2 and PC2 & PC3
Do you see any pattern? Discuss with the people near you.

I highly suggest you try writing the code yourself! If you are struggling, you can look at the cheatsheets on the following slides.

We will go around the room to answer questions.

Cheat sheet: Loading the data

# Load the data 
suppressPackageStartupMessages(library(pdfCluster))
data(oliveoil)

# Size of the dataframe 
dim(oliveoil)

# First few lines of the dataset 
head(oliveoil)

Cheat sheet: Summarising the data

# Summary statistics 
summary(oliveoil)

Cheat sheet: Analysing the data

Example code for performing the PCA:

# PCA without the first two columns
pca_oil <- prcomp(oliveoil[,-c(1,2)]) 

# Save the data for plotting
plot_oil <- data.frame(pca_oil$x, oliveoil[,1:2])

Go back in the slides: Which object can you call the summary function on to inspect the contained variance?

Plotting the PCA in 2D

Try playing around with different combinations of PCs by editing the code below:

ggplot(plot_oil, aes(x=PC1, y=PC2, color=region, shape=macro.area)) + geom_point() + coord_fixed()

Wrapping up

PCA is one of the most widely used methods for dimensionality reduction - but by no means the only one!

We have skipped over some very important concepts in this very short introduction:

The relation between PCA and Singular Value Decomposition (SVD): how the PCA is actually calculated.
Loadings of features: How "important" the features are for the different PCs.
Scaling of the data: How to combine measurements on different scales (play around with this using the scale.=TRUE argument of prcomp).
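For the curious, the three concepts above can be poked at with a few lines of base R. This is only a sketch on the iris data; the object names (Xc, sv, sdev_svd, scores_svd, pca_scaled) are made up for this illustration:

```r
data(iris)
X <- as.matrix(iris[, -5])

# prcomp centers each column by default (center = TRUE)
pca <- prcomp(X)

# Relation to SVD: decompose the centered data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)

# Singular values give the PC standard deviations
sdev_svd <- sv$d / sqrt(nrow(X) - 1)

# Positions along the PCs are U %*% D (the sign of each PC is arbitrary)
scores_svd <- sv$u %*% diag(sv$d)

# Loadings: the weight of each original feature in each PC
pca$rotation

# Scaling: give every feature unit variance before the PCA
pca_scaled <- prcomp(X, scale. = TRUE)
summary(pca_scaled)
```

Compare summary(pca) with summary(pca_scaled): once the features are on the same scale, the first PC no longer dominates to the same degree.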

Other popular methods not covered here:

Multidimensional scaling (MDS)
t-distributed Stochastic Neighbor Embedding (t-SNE)
Clustered heatmaps (see package pheatmap)
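As a small taste of MDS: classical MDS on plain Euclidean distances gives the same picture as PCA, up to the arbitrary sign of each axis. A minimal sketch using cmdscale and dist from base R (the variable names are only illustrative):

```r
data(iris)
X <- iris[, -5]

# PCA coordinates, as before
pca <- prcomp(X)

# Classical MDS on Euclidean distances between the samples, keeping 2 dimensions
mds <- cmdscale(dist(X), k = 2)

# Largest difference between the two sets of coordinates, ignoring signs
max_diff <- max(abs(abs(mds) - abs(pca$x[, 1:2])))
max_diff
```

MDS becomes interesting in its own right when you plug in other distance measures than the Euclidean one.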

The `zebrafish` dataset

Next is a worked example on a real RNA-Seq dataset:

Dataset from the paper: Silencing of odorant receptor genes by G protein βγ signaling ensures the expression of one odorant receptor per olfactory sensory neuron by Ferreira et al.
URL to paper: https://www.ncbi.nlm.nih.gov/pubmed/24559675
RNA-Seq data for 6 zebrafish samples.
Samples grouped by Gallein treated (Trt) or untreated controls (Ctl)
The data contains raw counts for both genes and so-called experimental spike-ins (which we won't use)

library(Shenzen2016)
data("zebrafish")

Trimming the data

As explained earlier, we need to trim and normalize the data.

Let's first discard genes that do not have at least 10 reads in at least three samples:

above_threshold <- rowSums(zebrafish$Expression >= 10) >= 3
EM_zebra <- subset(zebrafish$Expression, above_threshold)

The threshold on which to trim is somewhat arbitrary and depends heavily on the given dataset. Often the expression threshold is set according to what can be validated by an independent experiment.
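If the rowSums trick looks cryptic, here is how it works on a tiny made-up count matrix (the gene names and numbers are hypothetical, chosen only to show which genes pass):

```r
# Hypothetical toy count matrix: 4 genes (rows) x 6 samples (columns)
counts <- matrix(c(
  50, 60, 55, 70, 65, 80,   # geneA: well expressed in all samples
   0,  2,  1,  0,  3,  1,   # geneB: barely detected
  12, 15,  0,  0, 11,  0,   # geneC: at least 10 reads in exactly 3 samples
  10,  9, 25,  0,  0,  0    # geneD: at least 10 reads in only 2 samples
), nrow = 4, byrow = TRUE,
dimnames = list(paste0("gene", LETTERS[1:4]), paste0("S", 1:6)))

# counts >= 10 is a TRUE/FALSE matrix; rowSums counts the TRUEs per gene
n_passing <- rowSums(counts >= 10)

# Keep genes passing the read threshold in at least 3 samples
keep <- n_passing >= 3
filtered <- counts[keep, , drop = FALSE]
filtered
```

Only geneA and geneC survive. Try changing the two thresholds and see which genes are kept.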

Normalizing the data

We then use edgeR to normalize the data. First we have to store the data as a DGEList:

library(edgeR)

# Create dge list object
dge_zebra <- DGEList(EM_zebra)

# Calculate normalization factors
dge_zebra <- calcNormFactors(dge_zebra, method="TMM")

# Normalize the expression values
logTMM_zebra <- cpm(dge_zebra, log=TRUE, prior.count=3)
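To get a feeling for what counts per million means, here is a base R sketch on a made-up matrix. It is deliberately simplified: it ignores the TMM normalization factors and adds a fixed pseudocount of 1 instead of using edgeR's prior.count, so its numbers will not match the output of cpm exactly.

```r
# Hypothetical counts: 3 genes x 2 samples, S2 sequenced twice as deep as S1
counts <- matrix(c(
  100,  200,
  300,  600,
  600, 1200
), nrow = 3, byrow = TRUE,
dimnames = list(c("g1", "g2", "g3"), c("S1", "S2")))

# Library size: total number of reads in each sample
lib_size <- colSums(counts)

# Counts per million: divide every column by its library size, multiply by 1e6
cpm_manual <- t(t(counts) / lib_size) * 1e6

# Log-transform with a simple pseudocount to avoid log2(0)
log_cpm_manual <- log2(cpm_manual + 1)

cpm_manual
```

After the scaling, the two samples have identical values even though S2 has twice as many reads: the difference in sequencing depth is gone.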

Performing the PCA

We perform the PCA just as before. We have to transpose the expression matrix (EM) first, since prcomp expects samples as rows, while genomic data is usually stored with samples as columns:

pca_zebra <- prcomp(t(logTMM_zebra))
summary(pca_zebra)
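Why the transpose matters can be seen on a small simulated matrix (random numbers, not real expression data; all names here are made up). prcomp treats rows as observations, so without t() you would get one point per gene instead of one point per sample:

```r
set.seed(1)

# Hypothetical expression matrix: 100 genes (rows) x 6 samples (columns)
em <- matrix(rnorm(600), nrow = 100,
             dimnames = list(paste0("gene", 1:100), paste0("S", 1:6)))

# Transposed: the 6 samples are the observations
pca_samples <- prcomp(t(em))
dim(pca_samples$x)

# Not transposed: the 100 genes are the observations
pca_genes <- prcomp(em)
dim(pca_genes$x)

# With 6 samples, the last PC carries essentially no variance after centering
pca_samples$sdev
```

This is also why a dataset with 6 samples gives at most 5 informative PCs, no matter how many genes it contains.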

Extracting the PCs

plot_zebra <- data.frame(pca_zebra$x, zebrafish$Design) 
head(plot_zebra)

Plotting the PCA in 2D

ggplot(plot_zebra, aes(x=PC1, y=PC2, color=gallein)) + geom_point() + coord_fixed()

Plotting the PCA in 2D

ggplot(plot_zebra, aes(x=PC3, y=PC4, color=gallein)) + geom_point() + coord_fixed()

Interpreting the plot

What does the plot mean?

Is this what you expected to see?

What could be the explanation?

Experiments are not guaranteed to look the way you want them to!

Exercise 2: The `yeast` dataset

Dataset from the paper: A global non-coding RNA system modulates fission yeast protein levels in response to stress. by Leong et al.
URL to paper: https://www.ncbi.nlm.nih.gov/pubmed/24559675
RNA-Seq data for 12 fission yeast samples.
Contains wildtype (wt) and mutant (mut) strains, with a transcription factor mutated.
Contains samples at 0 and 30 minutes after stress stimulation.
The data contains raw counts for genes.

Exercise 2: The `pasilla` dataset

Dataset from the paper: Conservation of an RNA regulatory map between Drosophila and mammals. by Brooks et al.
URL to paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3032923/
RNA-Seq data for 7 Drosophila samples.
Contains treated (knock-down) and untreated samples, prepared using two different protocols.
The data contains raw counts for genes.

Exercise 2: The `tissues` dataset

Dataset from the paper: Alternative isoform regulation in human tissue transcriptomes. by Wang et al.
URL to paper: http://www.ncbi.nlm.nih.gov/pubmed?term=18978772
RNA-Seq data for 22 human tissue samples.
Samples sorted by cell type or tissue type.
The data contains raw counts for genes.

Exercise 2: PCA using R

Repeat the previous analysis for the yeast, pasilla and, if you have time, the tissues dataset.

Load the data, and determine the dimensionality.
Perform the PCA
Inspect the amount of variance explained by each PC
Make a plot of the samples using PC1 & PC2 and PC2 & PC3
Do you see any pattern? Discuss with the people near you.

I highly suggest you try writing the code yourself! If you are struggling, you can look at the cheatsheets on the following slides.

We will go around the room to answer questions.

Cheatsheet: Loading the data

Load and inspect the data:

# Load the data
data("yeast")
data("pasilla")
data("tissues")

# Inspect the dimensionality
dim(yeast$Expression)

# Inspect the design
yeast$Design

Cheatsheet: Analysing the data

# Set a dataset: either yeast, pasilla or tissues
dataset <- pasilla

# Trim: Play around with numbers here 
EM <- subset(dataset$Expression, rowSums(dataset$Expression >= 5) >= 3)

# Calculate normalization factors: Play around with the method argument ("RLE", "none")
dge <- calcNormFactors(DGEList(EM), method="TMM")

# Normalize the expression values (Advanced, play around with prior.count)
logTMM <- cpm(dge, log=TRUE, prior.count=5)

# Perform the PCA and save data for plotting
pca <- prcomp(t(logTMM))
plot_data <- data.frame(pca$x, dataset$Design)

Cheatsheet: plotting

Play around with different components and settings for color and shape

ggplot(plot_data, aes(x=PC1, y=PC2, color=condition, shape=type)) + geom_point() + coord_fixed()

MalteThodberg/Shenzen2016 documentation built on May 7, 2019, 2:09 p.m.
