knitr::opts_chunk$set(echo = TRUE)
Figure 4.14 in Pevsner is an idealized curve based on equations that exactly describe the the normal distribution and the extreme value distribution (EVD). The following code will run simluations to create an EVD based based on on amino acids.
library(Biostrings)
We need a list of all the 1-letter codes for amino acids. We can make this from the LETTERS object in are. Below I use negative indexing to do this. I make a vector of letters that are NOT amino acids, and then remove them.
#1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 #A, C,D,E,F,G,H,I, K,L,M,N, P,Q,R,S,T, V,W, Y #A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z LETTERS[c(2,10,15,21,24,26)] not.an.aa <- c(2,10,15,21,24,26) aas <- LETTERS[-not.an.aa]
I need to create random sequences of polypeptides. One way to do this is using the runif() function (random-uniform) and set the range from 0.5 to 20.5, then round off the result
First, make up some random numbers
rand.num <- runif(n = 100,min = 0.5, max = 20.5) rand.num <- round(rand.num) rand.num
Note that all of the random numbers occur at least once.
hist(rand.num)
Now use those random number as indices to pull out the amino acid codes. If a letter occurs more than once, it gets pulled out each time.
aas[rand.num]
I slightly more direct way to do this is with the sample() function
sample(x = aas, size = 100, replace = TRUE)
sample() allows you to vary the probabilities for each possibility. As we know from studying PAM matrices amino acids vary in how often they occur in nature.
This code sets up the frequencies from Dayhoff's original paper. (The use of the function order()) puts it in alphabetical order
frequencies <- c(0.089, 0.087, 0.085, 0.081, 0.070, 0.065, 0.058, 0.051, 0.050, 0.047, 0.041, 0.040, 0.040, 0.038, 0.037, 0.034, 0.033, 0.030, 0.015, 0.010) aa <- c("Gly", "Ala", "Leu", "Lys", "Ser", "Val", "Thr", "Pro", "Glu", "Asp", "Arg", "Asn", "Phe", "Gln", "Ile", "His", "Cys", "Tyr", "Met", "Trp") freqs <- data.frame(aa = aa[order(aa)], frequencies = frequencies[order(aa)]) hist(freqs$frequencies, breaks = 20)
I can set up my sampling based on the relative frequencies like this
sample(x = freqs$aa, size = 100, replace = TRUE, prob = freqs$frequencies)
I can summarize the results using table().
x <- sample(x = freqs$aa, size = 100, replace = TRUE, prob = freqs$frequencies) table(x)
I can encapsulate this in a custom function I'll call rand_seq_aa()
rand_seq_aa <- function(n = 100, aas, probs){ sample(x = aas, size = n, replace = TRUE, probs = probs) }
Every single time I run it, it gives me a random polypeptide.
rand_seq_aa()
Now I generate a database of random sequences
#I want each polypeptide to be 1000 aa seq.length <- 120 # I want 1000 fake sequences seq.n <- 1000 # I'll store them here random.seqs <- matrix(data = NA, nrow = seq.n, ncol =seq.length) rownames(random.seqs) <- paste0("r.seq",1:seq.n) random.seqs[1:10,1:10] # Use a loop to run my rand_seq_aa() function 1000 times for(i in 1:nrow(random.seqs)){ random.seqs[i,] <- rand_seq_aa(n = seq.length) } # Each row represents a sequence ## first 10 aa of first 10 sequences random.seqs[1:10,1:10] # first 10 aa of last 10 seqs random.seqs[990:1000,1:10]
I will compare a real Shroom sequence against my database of 1000 fake sequences.
shrm <- c("MKTPENLEEPSATPNPSRTPTERFVYLEALLEGGAPWGFTLKGGLERGEPLIISKIEEGGKADSVSSGLQAGDEVIHINEVALSSPRREAVSLVKGSYKTLRLVVRRDVCAAPGHADPGT")
Check that everything is the same size
nchar(shrm) nchar(shrm) == seq.length
I am going to do a local, pairwise alignment between my shroom sequence and each random sequence. I need a place to keep my alignment scores.
random.seqs.score <- rep(NA,seq.length)
Now I will loop through each row of my random sequences, grab the sequence, format it into a character string using paste(), and align it with pairwiseAlignment().
(If you don't want all the text printe out you can comment out the cat() and print() commands with a hash tag).
for(i in 1:nrow(random.seqs)){ # turn row of matrix into a character vector rand.seq.i <- paste(random.seqs[i,], sep = "",collapse = "") # print some output cat("\n") cat("Iteration", i) print(substr(rand.seq.i,1,20)) # do alignment score.i <- pairwiseAlignment(shrm, rand.seq.i, type = "local", gapOpening=1000, substitutionMatrix = "BLOSUM50", gapExtension=1, scoreOnly = T) #store score random.seqs.score[i] <- score.i }
What we have is a set of scores between a real sequence and a random sequence. We'd expact this all to be really lower socres, but due to random chance we might get a big one. Let's look at the distribution
Most scores are around 30, but there are a few that are pretty good.
hist(random.seqs.score)
Let's characterize this distribution of scores and compare it to a normal distribution
First, we want the mean, median, min, max and sd of the scores
scores.mean <- mean(random.seqs.score)
To make the plotting easier I am going to re-center the distribution at zero. I do this by subtracting the mean from the scores
random.seqs.score.centered <- random.seqs.score-scores.mean
This doesn't change the shape of the histogram, just puts the mean at 0. This is very common in stats.
hist(random.seqs.score.centered)
Now I want the median, min, max and sd of the centered data.
scores.med <- median(random.seqs.score.centered) score.min <- min(random.seqs.score.centered) score.max <- max(random.seqs.score.centered) scores.sd <- sd(random.seqs.score.centered)
I'm going to do my blotting with a wider range of x values. This is set with the xlim arguement. The abline() function puts reference lines for the mean and median
x.axis.range <- c(-1*score.max,score.max) hist(random.seqs.score.centered, breaks = 20, xlim = x.axis.range) abline(v = 0, col = 2) abline(v = scores.med, col = 3)
Now I'm going to overlay a normal distribution on the data. This code is a bit dense. See https://www.statmethods.net/graphs/density.html for another example, though its not well annotated.
First, I need a sequence of numbers for my x-axis range
scores.seq <- seq(-1*score.max, score.max)
I save my histogram as an object
my.hist <- hist(random.seqs.score.centered, breaks = 20, xlim = x.axis.range)
Now I calcualte the data the represents the normal curve
normal.curve.ht <-dnorm(scores.seq, mean = 0, sd=scores.sd)
I need to tweak the height
normal.curve.ht <- normal.curve.ht*diff(my.hist$mids[1:2])*length(random.seqs.score.centered)
Now plot the histogram of the scores I simualted and a normal curve. Note how on the left the curve is above the histogram bars, and on the right it below it. Moreover, the tail on the left of the historgram stops short, while on the right large values occur even when the blue line of the normal distribution is basically flat. What this means is that a normal distribution woudl have more small values than we see in our simulation, and fewer large values.
hist(random.seqs.score.centered, breaks = 20, xlim = x.axis.range) abline(v = 0, col = 2) abline(v = scores.med, col = 3) lines(scores.seq, normal.curve.ht, col="blue", lwd=2)
There are statistical functions that can test for normality. They tell us that our data are most definitely not normal!
shapiro.test(random.seqs.score)
Log transformations can make things more normal - not here!
shapiro.test(log(random.seqs.score))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.