knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(blaststats)
This script replicates the regression analysis discussed in Altschul and Gish (1996). It assumed you have Tabe 1 already loaded into memory OR load it using the code below.
Regression analysis is used to calculate K and lambda from mu (u) and the sequence lengths. The starting point is the following equation:
Plain text version of equation:
mu = (ln(m X n X K))/lambda
Rendered version of equation.
$$\mu = \frac{ \ln(mnK)}{\lambda}$$
Altschul and Gish (1996) indicate that this equation can be re-arranged into a linear equation of the form y = M X x + b
where
This gives us y = M X x + b mu = (1 / lambda) X ln(m X n) + ln(K)/lambda
From our simulations we will input sequences of length m and length n, and we will get out a distribution of scores. From the distribution of scores we'll be able to estimate mu. We can then make a scatter plot of mu versus ln(m X n) and use linear regression to estimate a line of best fit of the form y = M X x + b. The parameters for this line, M and b, can then be re-arranged algebracially to estimate lambda and ln(K)
M = 1/lambda
which rearranges to
M*lambda = 1
and then
lambda = 1/M.
So, the inverse of the slope from our regression will give us lambda.
Once we have lambda, we can solve for K
b = ln(K)/lambda
b X lambda = ln(K)
exp(b X lambda) = exp(ln(K))
exp(b X lambda) = K
We just got lambda above, so we plug that value into the equation and get K.
We can see how the original linear equation gets converted be recalling that ln(a X b) = ln(a) + ln(b)
Therefore, ln(m X n X K) can be re-arragned to ln(k) + ln(m X n)
So, starting with mu = (ln(m X n X K))/lambda
We apply the addition rule of logs we just outlined mu = (ln(m X n) + ln(K))/lambda
and we get mu = ln(m X n)/lambda + ln(K)/lambda
which is equivalent to
mu = (1/lambda) X ln(m X n)+ ln(K)/lambda
Rendered equation
$\mu = \frac{1}{\lambda}ln(mn) + \frac{ln(K)}{\lambda}$
data(table1)
There is a non-linear relationship betwen length of the simulated sequences and the estimate of mu
plot(table1$u ~ table1$mn, xlab = "sequence length (m=n)", ylab = "mu (u)")
Taking the log of m*n (the search space)
plot(table1$u ~ table1$ln.nm, xlab = "log(m*n)", ylab = "mu (u)")
Plot log(m*n) vs u, with the axes adjusted so the origin (x = 0, y = 0) is visible.
plot(table1$u ~ table1$ln.nm, xlim = c(0,16), ylim = c(-15,50), xlab = "log(m*n)", ylab = "mu (u)") abline(h = 0) abline(v = 0)
Linear regression using lm()
# linear regression lin.reg <- lm(u ~ ln.nm, data = table1) # look at output summary(lin.reg) # look at intercept and slope coef(lin.reg)
Plot regression line on plot scaled normally
# plot data; no adjusted to xlim or ylim plot(table1$u ~ table1$ln.nm) # add regression line abline(lin.reg, lty = 2)
Plot regression line on plot scaled with origin visible
# plot data plot(table1$u ~ table1$ln.nm, xlim = c(0,16), ylim = c(-15,50), xlab = "log(m*n)", ylab = "mu (u)") #add regression line abline(lin.reg, lty = 2) #add reference lines abline(h = 0) abline(v = 0) # show where y intercept is arrows(x1 =0, x0 = 1, y0 = -13.69782, y1 = -13.69782 ,length = 0.2, lwd = 3, col = 2)
Extract parameters from linear regression
#y = M*x + b #y = M*ln(m*n) + b # M = slope # b = intercept b <- coef(lin.reg)[1] #intercept; ln(K)/lambda M <- coef(lin.reg)[2] #slope; 1/lambda # M = 1 / lambda # re-arrange to: # M*lambda = 1 # lambda = 1/M lambda <- 1/M # 0.261 reported in original paper lambda # b = ln(K)/lambda # b*lambda = ln(K) # exp(b*lambda) = K K <- exp(b*lambda) # 0.026 reported in Table 2 K
This is same idea as above, except using the "edge-corrected" distances discussed by Altschul and Gish. This reduces the discrepencies between the results of their experiment and theoretical predictions.
lin.reg2 <- lm(u ~ ln.n.m., data = table1) summary(lin.reg2) b2 <- coef(lin.reg2)[1] #intercept; ln(K)/lambda M2 <- coef(lin.reg2)[2] #slope; 1/lambda lambda2 <- 1/M2 # 0.261 reported in original paper lambda2 K2 <- exp(b2*lambda2) # 0.026 reported in Table 2 K2
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.