README.md

ProKlust (“Prokaryotic Clusters”) was written with a focus on taxonomical data. It obtains, filters and visualizes clusters from multiple identity/similarity matrices.

ProKlust could be employed to analyze any identity/similarity matrix, such as ANI or barcoding gene identity. Additionally, it contains useful filter options to deal with taxonomical data.

How to install

library(devtools)
install_github("camilagazolla/ProKlust") # Install this package
library(ProKlust)

Package dependencies

How to use

Inputs

IMPORTANT: if the user wishes to send a list of matrices (instead of a vector of file names), he MUST convert it to a list, e.g.

percentage <- read.table(file = "ANIb_percentage_identity.tab", header = T, row.names = 1, sep = "\t")
coverage <- read.table(file = "ANIb_alignment_coverage.tab", header = T, row.names = 1, sep = "\t")
filesList <- list(percentage, coverage)
thresholds <- c(0.95, 0.70)
basicResult <- prokluster(files = filesList, cutoffs = thresholds)
plotc(basicResult$graph)

Filters

Outputs

A genome/gene that is part of a component does not necessarily share identity/similarity values above the established cut-off with all the other genomes/genes of that component, but it must share an identity/similarity value above the cut-off for at least one other genome/gene. Cliques, instead, are formed by genome/gene that all share identity/similarity values above the chosen criteria. A genome/gene could belong at the same time to different cliques within the same component.

Examples

#Example 1.1
basicResult1.1 <- prokluster(files = "ANIb_percentage_identity.tab", cutoffs = 0.9)
basicResult1.1
plotc(basicResult1.1$graph)

#Example 1.2
percentage <- read.table(file = "ANIb_percentage_identity.tab", header = T, row.names = 1, sep = "\t")
basicResult1.2 <- prokluster(files = percentage, cutoffs = 0.9)

#Example 2.1
files <- c("ANIb_percentage_identity.tab", "ANIb_alignment_coverage.tab")
thresholds <- c(0.95, 0.70)
renamedResults1.1 <- prokluster(files = files, cutoffs = thresholds, nodesDictionary = "dictionary.tab", filterRemoveIsolated = TRUE)

#Example 2.2
coverage <- read.table(file = "ANIb_alignment_coverage.tab", header = T, row.names = 1, sep = "\t")
filesList <- list(percentage, coverage)
basicResult2.2 <- prokluster(files = filesList, cutoffs = thresholds)

#Example 3
renamedResults2 <- prokluster(files = files, cutoffs = thresholds, nodesDictionary = "dictionary.tab", filterDifferentNamesConnected = TRUE)

#Example 4
nodesNames <- read.table(file= "dictionary.tab", sep = "\t", header = F, stringsAsFactors=FALSE)
renamedResults3 <- prokluster(files = files, cutoffs = thresholds, nodesPreviousNames = nodesNames$V1, nodesTranslatedNames = nodesNames$V2, filterSameNamesNotConnected = T)

Example for FastANI:

$ cd bins #dir with genomes
$ mkdir out
$ ls *fna > list
$ mv list out
$ for f in *fna; do fastANI -q "${f}" --rl out/list -o "${f}.fastANI" --minFraction 0; mv "${f}.fastANI" out; done
$ cd out/
$ cat *ANI > fastANIout.txt

Generating a tabbed-delimited "pairwise" identity matrix on R:

library(ProKlust)
library(tidyr)

# Importing fastANI results
identity <- (read.table(file = "fastANIout.txt", sep = "\t")) [1:3]
identity <- pivot_wider(identity, names_from =V1, values_from = V3)
identity <- as.data.frame(identity)
rownames(identity) <- identity$V2
identity.sorted <- identity[order(identity["V2"]),]
identity.sorted[,1] <- NULL

basicResult <- prokluster(file = identity.sorted, cutoffs = 95)
basicResult
plotc(basicResult$graph)

Workflow

A) The average of each pair from the pairwise input matrix/matrices is/are obtained. A Boolean matrix/matrices is/are obtained according to the cut-off values chosen by the user. If more than one matrix is used as input, the final generated matrix is obtained by multiplying the elements of the matrices. A graph is formed by connecting the nodes which present the positive values. In this example, nodes correspond to genomes and edges correspond to ANI ≥ 95% with coverage alignment ≥ 50%. The data could be filtered to retain components containing more than one species name or unconnected nodes containing the same species names.

B) Overview of the hierarchical-based clustering approach. These approaches return tree-shaped diagrams with non-overlapping clusters.

Citation

If you use ProKlust in your research please cite:

Volpiano CG, Sant’Anna FH, Ambrosini A, de São José JFB, Beneduzi A, Whitman WB, de Souza EM, Lisboa BB, Vargas LK and Passaglia LMP (2021) Genomic Metrics Applied to Rhizobiales (Hyphomicrobiales): Species Reclassification, Identification of Unauthentic Genomes and False Type Strains. Front. Microbiol. 12:614957. https://doi.org/10.3389/fmicb.2021.614957



camilagazolla/ProKlust documentation built on May 22, 2021, 11:10 p.m.