In gobbios/cfp: Christof's functions

library(cfp)

slope of elements

```rTwo example sequences and the slope of the occurence of A in response to position in the sequence. In the left panel the slope is steeper because A is much more likely to occur at the beginning than at the end. In the right panel, the slope is flatter, i.e. A might occur both at the beginning and at the end of the sequence."} par(mfrow=c(1, 2)) S <- c("AAABBAABBBBB") S <- unlist(strsplit(S, "")) x <- data.frame(pos=1:length(S), isA=as.numeric(S=="A")) plot(x$pos, x$isA, pch=S, xlab="position", ylab="prob of A occured", xlim=c(1,12), cex.axis=0.7, cex.lab=0.7, cex=0.7) mod <- lm(isA ~ pos, x) points(x$pos, predict(mod), "l", col="red", lwd=2) mod <- glm(isA ~ pos, x, family = binomial) points(x$pos, predict(mod, type="response"), "l", col="blue", lwd=2, lty=3) legend(x = 1, y = 0.4, lty=c(1,3), col=c("red", "blue"), legend=c("linear", "logistic"), cex=0.5)

S <- c("AAABBAA") S <- unlist(strsplit(S, "")) x <- data.frame(pos=1:length(S), isA=as.numeric(S=="A")) plot(x$pos, x$isA, pch=S, xlab="position", ylab="prob of A occured", xlim=c(1,12), cex.axis=0.7, cex.lab=0.7, cex=0.7) mod <- lm(isA ~ pos, x) points(x$pos, predict(mod), "l", col="red", lwd=2) mod <- glm(isA ~ pos, x, family = binomial) points(x$pos, predict(mod, type="response"), "l", col="blue", lwd=2, lty=3)

legend("bottomleft", lty=c(1,3), col=c("red", "blue"), legend=c("linear", "logistic"), cex=0.5)

legend(x = 1, y = 0.4, lty=c(1,3), col=c("red", "blue"), legend=c("linear", "logistic"), cex=0.5)

# transition metrics


```r
smat <- matrix(1, ncol=26, nrow=26); rownames(smat) <- colnames(smat) <- letters


templ <- as.matrix(expand.grid(letters, letters ))

# random names
testnames <- sample(authornames, 400)
# create results
res <- transitions(S = testnames[1], strucmat = smat, xstart = F, xstop = F, out = "bayes")
res <- mat2long(res)[,1:2]

i=1
for(i in 1:length(testnames)) {
  tempres <- mat2long(transitions(S = testnames[i], strucmat = smat, xstart = F, xstop = F, out = "bayes"))
  res <- cbind(res, round(tempres$val, 4))
  colnames(res)[ncol(res)] <- paste0("nm", i)
}

res$RS <- rowMeans(res[, 3:ncol(res)], na.rm = T)

res <- res[res$RS > 0, ]
res <- res[order(res$RS, decreasing = TRUE), ]
res <- res[, c("Rx", "Cx", "RS")]
head(res, 10)

entropy

One simulation to illustrate entropy is the following. We assume Kershenbaum's [-@kershenbaum2015a, p.1455] description of entropy as measure of "the unpredictability of a sequence, or the lack of uniformity of a sequence, so that a completely predictable sequence (e.g. consisting of the same element repeated over and over) would have an entropy of zero, whereas a completely unpredictable (random) sequence would have an entropy of one". We can generate sequences of 100 elements, randomly chosen from ten letters. But we make sure that the random choice of letters is 'systematic' such that at one extreme, all letters are equally likely to occur, while at the other extreme only A should be chosen. For this we vary the shape-2 paramter of a beta distribution.

So, at large shape values, we have random sequences and hence larger entropy than at small shape values (more biased towards repeating earlier letters (e.g. only A)) and hence small entropy values.

N <- 1000
shp2 <- runif(N, 0.05, 1)
LET <- rev(LETTERS[1:10])
en1 <- en2 <- numeric(N)
i=1
for(i in 1:N) {
  probs <- as.numeric(table(cut(rbeta(1000, shape1 = 1, shp2[i]), breaks=seq(0,1,0.1) )))/1000
  s <- sample(LET, 100, replace = T, prob = probs)
  en1[i] <- entropy(s, 1)
  en2[i] <- entropy(s, 2)
}

par(mfrow=c(1,2))
plot(shp2, en1, pch=16, col=grey(0.3, 0.1))
plot(shp2, en2, pch=16, col=grey(0.3, 0.1))

Levenshtein distance

This is essentially a measure of how many steps are necessary to convert one string into another. If two sequences are similar to each other, few steps are necessary (or none at all if two sequences are identical). If two sequences are very different a large number of edits is necessary. Possible edits of sequences/strings are insertions, deletions and substitutions. The classic example is the transition from kitten to sitting.

kitten -> sitten (substitute "s" for "k")
sitten -> sittin (substitute "i" for "e")
sittin -> sitting (insert "g")

Now, we can adapt this measure to sequences of communication signals. For example, a squences of two grunts and a scream is very similar to a sequence of grunt and scream, which requires only one deletion:

s1 <- c("grunt", "grunt", "scream")
s2 <- c("grunt", "scream")
lss(s1, s2, normalize = FALSE)

In contrast, a sequence of grunt, scream is more different from the sequence lip smack, wave, moan, grunt:

s1 <- c("grunt", "scream")
s2 <- c("lip smack", "wave", "moan", "grunt")
lss(s1, s2, normalize = FALSE)

Finally, we can normalize the LD so that it ranges between 0 and 1. This is because the theoretical maximum LD is the length of the longer string/sequence. However, during this normalization step, we actually also reverse the interpretation, so the measure become a similarity index. Hence, very similar sequences will have values close to 1 and very different sequences will have values closer to 0.

s1 <- c("grunt", "grunt", "scream")
s2 <- c("grunt", "scream")
lss(s1, s2, normalize = TRUE)

s1 <- c("grunt", "scream")
s2 <- c("lip smack", "wave", "moan", "grunt")
lss(s1, s2, normalize = TRUE)

Things to be included unary

composition: proportion of A calls in the sequence
does the seq start with an A (should be similar if not identical to transition start -> A)
mean call intervals
N-gram distribution: two possibilities: 1) four 2-grams (AA, BB, AB, BA) or the slope sorted by value
transitions (normalise row-wise, and requires structure matrix!!! (which combinations are actually allowed or not))
Shannon entropy (1st order (proportion of A and B), and 2nd order)