Nothing
#' @name fluidity
#' @title Computing genomic fluidity for a pan-genome
#'
#' @description Computes the genomic fluidity, which is a measure of population diversity.
#'
#' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
#' @param n.sim An integer specifying the number of random samples to use in the computations.
#'
#' @details The genomic fluidity between two genomes is defined as the number of unique gene
#' families divided by the total number of gene families (Kislyuk et al, 2011). This is averaged
#' over \samp{n.sim} random pairs of genomes to obtain a population estimate.
#'
#' The genomic fluidity between two genomes describes their degree of overlap with respect to gene
#' cluster content. If the fluidity is 0.0, the two genomes contain identical gene clusters. If it
#' is 1.0 the two genomes are non-overlapping. The difference between a Jaccard distance (see
#' \code{\link{distJaccard}}) and genomic fluidity is small, they both measure overlap between
#' genomes, but fluidity is computed for the population by averaging over many pairs, while Jaccard
#' distances are computed for every pair. Note that only presence/absence of gene clusters are
#' considered, not multiple occurrences.
#'
#' The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}.
#'
#' @return A vector with two elements, the mean fluidity and its sample standard deviation over
#' the \samp{n.sim} computed values.
#'
#' @references Kislyuk, A.O., Haegeman, B., Bergman, N.H., Weitz, J.S. (2011). Genomic fluidity:
#' an integrative view of gene diversity within microbial populations. BMC Genomics, 12:32.
#'
#' @author Lars Snipen and Kristian Hovde Liland.
#'
#' @seealso \code{\link{panMatrix}}, \code{\link{distJaccard}}.
#'
#' @examples
#' # Loading a pan-matrix in this package
#' data(xmpl.panmat)
#'
#' # Fluidity based on this pan-matrix
#' fluid <- fluidity(xmpl.panmat)
#'
#' @importFrom stats sd
#'
#' @export fluidity
#'
fluidity <- function(pan.matrix, n.sim = 10){
pan.matrix[which(pan.matrix > 0, arr.ind=T)] <- 1
flu <- rep(0, n.sim)
for(i in 1:n.sim){
ii <- sample(nrow(pan.matrix), 2)
flu[i] <- (sum(pan.matrix[ii[1],] > 0 & pan.matrix[ii[2],] == 0)
+ sum(pan.matrix[ii[1],] == 0 & pan.matrix[ii[2],] > 0)) / (sum(pan.matrix[ii[1],]) + sum(pan.matrix[ii[2],]))
}
flu.vec <- c(Mean = mean(flu), Std = sd(flu))
return(flu.vec)
}
#' @name distJaccard
#' @title Computing Jaccard distances between genomes
#'
#' @description Computes the Jaccard distances between all pairs of genomes.
#'
#' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
#'
#' @details The Jaccard index between two sets is defined as the size of the intersection of
#' the sets divided by the size of the union. The Jaccard distance is simply 1 minus the Jaccard index.
#'
#' The Jaccard distance between two genomes describes their degree of overlap with respect to gene
#' cluster content. If the Jaccard distance is 0.0, the two genomes contain identical gene clusters.
#' If it is 1.0 the two genomes are non-overlapping. The difference between a genomic fluidity (see
#' \code{\link{fluidity}}) and a Jaccard distance is small, they both measure overlap between genomes,
#' but fluidity is computed for the population by averaging over many pairs, while Jaccard distances are
#' computed for every pair. Note that only presence/absence of gene clusters are considered, not multiple
#' occurrences.
#'
#' The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}.
#'
#' @return A \code{dist} object (see \code{\link{dist}}) containing all pairwise Jaccard distances
#' between genomes.
#'
#' @author Lars Snipen and Kristian Hovde Liland.
#'
#' @seealso \code{\link{panMatrix}}, \code{\link{fluidity}}, \code{\link{dist}}.
#'
#' @examples
#' # Loading a pan-matrix in this package
#' data(xmpl.panmat)
#'
#' # Jaccard distances
#' Jdist <- distJaccard(xmpl.panmat)
#'
#' # Making a dendrogram based on the distances,
#' # see example for distManhattan
#'
#' @export distJaccard
#'
distJaccard <- function(pan.matrix){
pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
D <- matrix(0, nrow = nrow(pan.matrix), ncol = nrow(pan.matrix))
rownames(D) <- colnames(D) <- rownames(pan.matrix)
for(i in 1:(nrow(pan.matrix) - 1)){
for(j in (i+1):nrow(pan.matrix)){
cs <- pan.matrix[i,] + pan.matrix[j,]
D[j,i] <- D[i,j] <- 1 - sum(cs > 1)/sum(cs > 0)
}
}
return(as.dist(D))
}
#' @name distManhattan
#' @title Computing Manhattan distances between genomes
#'
#' @description Computes the (weighted) Manhattan distances beween all pairs of genomes.
#'
#' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
#' @param scale An optional scale to control how copy numbers should affect the distances.
#' @param weights Vector of optional weights of gene clusters.
#'
#' @details The Manhattan distance is defined as the sum of absolute elementwise differences between
#' two vectors. Each genome is represented as a vector (row) of integers in \samp{pan.matrix}. The
#' Manhattan distance between two genomes is the sum of absolute difference between these rows. If
#' two rows (genomes) of the \samp{pan.matrix} are identical, the corresponding Manhattan distance
#' is \samp{0.0}.
#'
#' The \samp{scale} can be used to control how copy number differences play a role in the distances
#' computed. Usually we assume that going from 0 to 1 copy of a gene is the big change of the genome,
#' and going from 1 to 2 (or more) copies is less. Prior to computing the Manhattan distance, the
#' \samp{pan.matrix} is transformed according to the following affine mapping: If the original value in
#' \samp{pan.matrix} is \samp{x}, and \samp{x} is not 0, then the transformed value is \samp{1 + (x-1)*scale}.
#' Note that with \samp{scale=0.0} (default) this will result in 1 regardless of how large \samp{x} was.
#' In this case the Manhattan distance only distinguish between presence and absence of gene clusters.
#' If \samp{scale=1.0} the value \samp{x} is left untransformed. In this case the difference between 1
#' copy and 2 copies is just as big as between 1 copy and 0 copies. For any \samp{scale} between 0.0 and
#' 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still
#' present. In this way you can decide if the distances between genomes should be affected, and to what
#' degree, by differences in copy numbers beyond 1. Notice that as long as \samp{scale=0.0} (and no
#' weighting) the Manhattan distance has a nice interpretation, namely the number of gene clusters that
#' differ in present/absent status between two genomes.
#'
#' When summing the difference across gene clusters we can also up- or downweight some clusters compared
#' to others. The vector \samp{weights} must contain one value for each column in \samp{pan.matrix}. The
#' default is to use flat weights, i.e. all clusters count equal. See \code{\link{geneWeights}} for
#' alternative weighting strategies.
#'
#' @return A \code{dist} object (see \code{\link{dist}}) containing all pairwise Manhattan distances
#' between genomes.
#'
#' @author Lars Snipen and Kristian Hovde Liland.
#'
#' @seealso \code{\link{panMatrix}}, \code{\link{distJaccard}}, \code{\link{geneWeights}}.
#'
#' @examples
#' # Loading a pan-matrix in this package
#' data(xmpl.panmat)
#'
#' # Manhattan distances between genomes
#' Mdist <- distManhattan(xmpl.panmat)
#'
#' \dontrun{
#' # Making a dendrogram based on shell-weighted distances
#' library(ggdendro)
#' weights <- geneWeights(xmpl.panmat, type = "shell")
#' Mdist <- distManhattan(xmpl.panmat, weights = weights)
#' ggdendrogram(dendro_data(hclust(Mdist, method = "average")),
#' rotate = TRUE, theme_dendro = FALSE) +
#' labs(x = "Genomes", y = "Shell-weighted Manhattan distance", title = "Pan-genome dendrogram")
#' }
#'
#' @importFrom stats dist
#'
#' @export distManhattan
#'
distManhattan <- function(pan.matrix, scale = 0.0, weights = rep(1, ncol(pan.matrix))){
if((scale > 1) | (scale < 0)){
warning( "scale should be between 0.0 and 1.0, using scale = 0.0" )
scale <- 0.0
}
idx <- which(pan.matrix > 0, arr.ind = T)
pan.matrix[idx] <- 1 + (pan.matrix[idx] - 1) * scale
pan.matrix <- t(t(pan.matrix) * weights)
return(dist(pan.matrix, method = "manhattan"))
}
#' @name geneWeights
#' @title Gene cluster weighting
#'
#' @description This function computes weights for gene cluster according to their distribution in a pan-genome.
#'
#' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
#' @param type A text indicating the weighting strategy.
#'
#' @details When computing distances between genomes or a PCA, it is possible to give weights to the
#' different gene clusters, emphasizing certain aspects.
#'
#' As proposed by Snipen & Ussery (2010), we have implemented two types of weighting: The default
#' \samp{"shell"} type means gene families occuring frequently in the genomes, denoted shell-genes, are
#' given large weight (close to 1) while those occurring rarely are given small weight (close to 0).
#' The opposite is the \samp{"cloud"} type of weighting. Genes observed in a minority of the genomes are
#' referred to as cloud-genes. Presumeably, the \samp{"shell"} weighting will give distances/PCA reflecting
#' a more long-term evolution, since emphasis is put on genes who have just barely diverged away from the
#' core. The \samp{"cloud"} weighting emphasizes those gene clusters seen rarely. Genomes with similar
#' patterns among these genes may have common recent history. A \samp{"cloud"} weighting typically gives
#' a more erratic or \sQuote{noisy} picture than the \samp{"shell"} weighting.
#'
#' @return A vector of weights, one for each column in \code{pan.matrix}.
#'
#' @references Snipen, L., Ussery, D.W. (2010). Standard operating procedure for computing pangenome
#' trees. Standards in Genomic Sciences, 2:135-141.
#'
#' @author Lars Snipen and Kristian Hovde Liland.
#'
#' @seealso \code{\link{panMatrix}}, \code{\link{distManhattan}}.
#'
#' @examples
#' # See examples for distManhattan
#'
#' @export geneWeights
#'
geneWeights <- function(pan.matrix, type = c("shell", "cloud")){
ng <- dim( pan.matrix )[1]
nf <- dim( pan.matrix )[2]
pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
cs <- colSums(pan.matrix)
midx <- grep(type[1], c("shell", "cloud"))
if(length(midx) == 0){
warning("Unknown weighting:", type, ", using shell weights")
midx <- 1
}
W <- rep(1, ncol(pan.matrix))
x <- 1:nrow(pan.matrix)
ww <- 1 / (1 + exp(((x - 1) - (max(x) - 1)/2) / ((max(x) - 1) / 10)))
if(midx == 1) ww <- 1 - ww
for(i in x) W[cs == i] <- ww[i]
return(W)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.