# Rmd setup chunk: knitr must be attached before opts_chunk is used
library(knitr)
opts_chunk$set(echo = FALSE)

library(toolmania)
# Packages used below; some of these may already be attached by toolmania
library(Biostrings)  # readDNAStringSet, readAAStringSet
library(ggplot2)
library(reshape2)    # dcast
library(gplots)      # heatmap.2

working_dir <- '/home/phillipl/projects/toolmania/code/toolmania'
setwd(working_dir)
input_aa_file <- '/home/phillipl/projects/toolmania/data/CAP177prot/all_timesv3_prot.fasta'
input_dna_file <- '/home/phillipl/projects/toolmania/data/CAP177/all_timesv3.fasta'
output_dir <- '/home/phillipl/projects/toolmania/results/pentropy'

# Build-and-render script (run interactively, not from inside the Rmd)
library(devtools)
setwd(working_dir)
document()
build()
install()
load_all()
knit2html(file.path(working_dir, 'inst/pentropy.Rmd'))
browseURL(paste('file://', file.path(getwd(), 'pentropy.html'), sep = ''))

Input Data

seq_dna_data <- readDNAStringSet(input_dna_file)
seq_aa_data <- readAAStringSet(input_aa_file)
seq_data <- seq_aa_data
print(seq_data)

Entropy over position at time point 2000V3

seq_at_time_data <- get_data_of('timepoint', seq_data, '_2000V3_')
pentropy <- compute_pentropy(seq_at_time_data)
p <- ggplot(pentropy, aes(x = pos, y = value)) + geom_line()
print(p)
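For context, compute_pentropy presumably reduces each alignment column to its Shannon entropy. A minimal standalone sketch of that idea (the function name, the log base 2, and the gap handling are my assumptions, not toolmania's actual implementation):

```r
# Per-position Shannon entropy for an aligned set of sequences, given as
# plain character strings of equal length (use as.character(seq_data) to
# convert an XStringSet). Gaps are counted like any other symbol here.
shannon_entropy_by_pos <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))  # one row per sequence
  apply(chars, 2, function(column) {
    p <- table(column) / length(column)        # symbol frequencies
    -sum(p * log2(p))                          # entropy in bits
  })
}

aln <- c("MKV", "MKI", "MRV")
shannon_entropy_by_pos(aln)  # position 1 is fully conserved, so its entropy is 0
```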

Number of sequences in fasta files per time point

time_data <- count_sequences_per_point_of('time', seq_data, '_', 2)
p <- plot_sequences_per_point_of('time', time_data)
print(p)

Experimenting with grid of entropy plots

all_pos_data <- compute_all_pentropy(seq_data)

p <- ggplot(all_pos_data, aes(x = pos, y = value)) + 
  geom_line() +
  facet_wrap( ~ time_point)
print(p)

Experimenting with a heat map for entropy plots

p <- ggplot(all_pos_data, aes(x = pos, y = time_point, fill = value)) + 
  geom_tile()
print(p)

Some summary stats for the entropy by position aggregated over time

aggre_pos_data <- aggregate_over_points(all_pos_data)
entropy_stats <- dcast(aggre_pos_data, pos ~ variable)

kable(head(round(entropy_stats,3)))

See the Appendix for the rest of the table.

Average entropy per position aggregated over time

This can easily be done for any of the metrics. This is just an example.

p <- ggplot(entropy_stats, aes(x = pos, y = mean_entropy)) + geom_line()
print(p)

Clustering positions based on entropy

Work in progress

We want to assign each position in the sequence to a cluster such that when one position in a cluster undergoes a change, most of the other positions in that same cluster also undergo a change.

entropy_matrix <- dcast(all_pos_data, pos ~ time_point)
entropy_matrix <- as.matrix(entropy_matrix[, -1])  # drop the pos column
entropy_matrix <- t(entropy_matrix)                # rows = time points, columns = positions

heatmap.2(entropy_matrix, trace = 'none')
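The heatmap above clusters positions with heatmap.2's default Euclidean distance. The stated goal, positions that change together, could also be made explicit with a correlation-based distance between entropy trajectories. A sketch on toy data (the layout mirrors entropy_matrix: time points as rows, positions as columns), offered as one possible alternative rather than what the plot above does:

```r
# Toy matrix: 10 time points x 6 positions, where positions 4-6 are
# noisy copies of positions 1-3 and so should cluster with them.
set.seed(1)
base <- matrix(rnorm(10 * 3), nrow = 10)
toy_entropy <- cbind(base, base + rnorm(30, sd = 0.1))
colnames(toy_entropy) <- paste0("pos", 1:6)

# cor() correlates columns, so 1 - cor gives a small distance between
# positions whose entropy rises and falls together over time.
pos_dist <- as.dist(1 - cor(toy_entropy))
pos_clust <- hclust(pos_dist, method = "average")
cutree(pos_clust, k = 3)  # k is an arbitrary choice here
```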

Ok, so this is turning out to be much harder than I anticipated. There are so many different things that you can do. I am going to use this kind of like a lab book and just record all my thoughts. We can later delete and cleanup. It should also be pretty helpful to have a log of the development process when we write it up for a paper.

DNA vs AA

It works much better with amino acids. Can you think of any situation in which we will want to look at DNA instead?

Scaling the heatmap

The heatmap must be scaled so that the clustering works properly. Since the goal is to cluster positions together, we need to scale across positions within each time point so that the total entropy at each time point is comparable.

When this is done, the plot changes like this:

sem <- t(scale(t(entropy_matrix)))
print('manual scaling')
heatmap.2(sem, trace = 'none')
print('automatic scaling')
heatmap.2(entropy_matrix, trace = 'none', scale = 'row')

The first plot was manually scaled and the second plot was scaled by the heatmap function. I am not sure why they yield different results. I need to investigate the heatmap.2 documentation further.

I am going to assume that I did something wrong with the manual scaling and just use the automatic scales for now.
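A likely explanation for the discrepancy, worth checking against the gplots documentation: heatmap.2's scale argument standardises values only for the colour mapping, after the dendrograms have been computed from the raw data, whereas pre-scaling with scale() changes the distances the clustering sees. The row-scaling idiom itself can be verified on a toy matrix (base R only; scale() works column-wise, hence the double transpose):

```r
# scale() centres and scales *columns*, so wrapping it in t() makes it
# operate on rows: each row ends up with mean 0 and sd 1.
m <- matrix(c(1,  2,  3,
              10, 20, 30), nrow = 2, byrow = TRUE)
row_scaled <- t(scale(t(m)))

rowMeans(row_scaled)      # 0 0
apply(row_scaled, 1, sd)  # 1 1
row_scaled[1, ]           # -1 0 1
```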

Clustering the time points

I do not think that we need to cluster the time points, so I switched it off:

heatmap.2(entropy_matrix, trace = 'none', scale = 'row', Rowv = FALSE, key = FALSE)

I realize that the plot now has a large unused margin on the left. I don't know how to fix it yet, but I can figure it out in a couple of hours if we need it done.

Statistical Support for the clusters

library(pvclust)
# Computationally expensive, so it was run once and the result stored on disk:
# library(snowfall)
# sfInit(parallel = TRUE, cpus = 6)
# cl <- sfGetCluster()
# pvalue_cluster <- parPvclust(cl, sem, nboot = 10000)
load("pvclust_10000.pvclust")
plot(pvalue_cluster)
pvrect(pvalue_cluster)

Note the differences with the clustering used for the heatmap. This is because there are all kinds of funny settings that need to be optimized and played with.

Down Selection before clustering

cutoff_percentage <- 0.75

Since we expect a large portion of the protein to be highly conserved, a case can be made for discarding the most conserved `r cutoff_percentage*100`% (for example, 75% here) of the protein before performing the clustering.

cutoff_value <- quantile(entropy_stats$max_entropy, cutoff_percentage)

p <- ggplot(entropy_stats, aes(x = max_entropy)) + 
  geom_histogram() +  # distribution of max entropy; geom_bar() expects a discrete x
  geom_vline(xintercept = cutoff_value)
print(p)

active_positions <- entropy_stats$pos[entropy_stats$max_entropy > cutoff_value]
# Assumes pos runs 1..ncol(entropy_matrix), so positions can index columns directly
active_matrix <- entropy_matrix[, active_positions]
colnames(active_matrix) <- as.character(active_positions)
heatmap.2(active_matrix, trace = 'none', scale = 'row', Rowv = FALSE, key = FALSE)

Ahhh, much better I think.

Appendix

Some summary stats for the entropy by position aggregated over time

kable(round(entropy_stats,3))


philliplab/toolmania documentation built on May 25, 2019, 5:06 a.m.