Exploring Genetic Distances in R

Distances between genetic material in R is a complicated topic. There are two main packages that implements distance functions: ape::dist.dna Biostrings::stringDist and Biostrings::pairwiseAlignment

The problems with these functions are: ape::dist.dna * Only for DNA not for AA * Cannot handle gap opening / gap extention penalties? (Not sure) * Always computes a distance matrix for a vector of strings. Computationally inefficient for computing a distance between a single string and a vector (or need some coding to make it efficient) Biostrings::stringDist * Works fine for like an edit distance * Uses Biostrings::pairwiseAlignment to compute alignment based distances. Thus all the problems of that function applies * Always computes a distance matrix for a vector of strings. Computationally inefficient for computing a distance between a single string and a vector (or need some coding to make it efficient) * Biostrings::pairwiseAlignment * Aligns a single pattern to a vector of subjects. * Can return various statistics * When using a substitution matrix, the similarity score is returned and not the distance - hence it still need to be normalized. * Does not return a matrix of distances, so some code is needed to convert the vectors into a matrix of distances (or a distance object)

We need to compute distances based mostly on AA with custom substitution matrics. This means that pairwiseAlignment / stringDist are the only options.

Furthermore, two different functions are needed:

dist(BStringSet) -> distance matrix dist2(BStringSet, BStringSet) -> matrix with distances between all elements in first stringset and the second stringset, but not the distances between the individuals in either stringset and other members of the same stringset

This can be simplified to just one function where the second BStringSet can be null indicating that the 'normal' distance matrix must be computed. Or will this lead to confusion? -> write two separate functions and later wrap them if needed.

Naming convention for the functions: Really hard to come up with names that maintain consistancy with the other distance functions already in use. Drop consistancy with ape - since Im using Biostrings, there is no point in trying to stay consistent with ape. * stringDistVec and stringDistQuery for a distance matrix between a vector of strings and a distance to some query sequence? This is not working for me, but this can be the internal names which them get wrapped by stringDist2

stringDist2(pattern, subject = NULL, ...) if (is.null(subject)) stringDistVec(pattern, ...) else stringDistQuery(pattern, subject, ...)

stringDistVec <- function(pattern){ x <- stringDist(pattern) y <- maxAlignmentScores(patter) x / y # check elementwize behaviour }

stringDistQuery <- function(pattern, subject, ...){ all_sim_scores <- matrix(-1, nrow = length(pattern), ncol = length(subject)) for (subject_id in seq_along(subject)){ the_subject <- subject[subject_id] all_sim_scores[, subject_id] <- score(pairwiseAlignment(pattern, the_subject, ...)) }

max_sim_scores <- maxAlignmentScores(pattern, subject)

all_sim_scores / max_sim_scores # check elementwize behaviour }

maxAlignmentScores(pattern, subject = NULL, ...) if (is.null(subject)) maxAlignmentScoresVec(pattern, ...) else maxAlignmentScoresQuery(pattern, subject, ...)

apply_max <- function(x, y){ # Very large optimization possible? z <- data.frame(x = x, y = y) return(apply(z, 1, max)) }

maxAlignmentScoresVec(pattern, ...) max_scores <- maxAlignemntScores1d(pattern) outer(max_scores, max_scores, apply_max)

maxAlignmentScoresQuery(pattern, subject, ...) max_scores_x <- maxAlignemntScores1d(pattern) max_scores_y <- maxAlignemntScores1d(subject) outer(max_scores_x, max_scores_y, weird_max)

maxAlignmentScore <- function(x, scores = diag(BLOSUM50), alphabet = AA_ALPHABET){ # self_alignment_scores is a names vector counts <- letterFrequency(x, alphabet) sum(scores[match(names(counts), names(scores))]*counts) }

maxAlignmentScore1d <- Vectorize(maxAlignmentScore) # Figure out how vectorize # works



philliplab/ViralHaplotyper documentation built on May 25, 2019, 5:05 a.m.