bDist: Computes distances between sequences
In micropan: Microbial Pan-Genome Analysis

Description Usage Arguments Details Value Author(s) See Also Examples

Computes distance between all sequences based on the BLAST bit-scores.

1	bDist(blast.files = NULL, blast.tbl = NULL, e.value = 1, verbose = TRUE)

`blast.files`	A text vector of BLAST result filenames.
`blast.tbl`	A table with BLAST results.
`e.value`	A threshold E-value to immediately discard (very) poor BLAST alignments.
`verbose`	Logical, indicating if textual output should be given to monitor the progress.

The essential input is either a vector of BLAST result filenames (blast.files) or a table of the BLAST results (blast.tbl). It is no point in providing both, then blast.tbl is ignored.

For normal sized data sets (e.g. less than 100 genomes), you would provide the BLAST filenames as the argument blast.files to this function. Then results are read, and distances are computed. Only if you have huge data sets, you may find it more efficient to read the files using readBlastSelf and readBlastPair separately, and then provide as the argument blast.tbl] the table you get from binding these results. In all cases, the BLAST result files must have been produced by blastpAllAll.

Setting a small e.value threshold can speed up the computation and reduce the size of the output, but you may loose some alignments that could produce smallish distances for short sequences.

The distance computed is based on alignment bitscores. Assume the alignment of query A against hit B has a bitscore of S(A,B). The distance is D(A,B)=1-2*S(A,B)/(S(A,A)+S(B,B)) where S(A,A) and S(B,B) are the self-alignment bitscores, i.e. the scores of aligning against itself. A distance of 0.0 means A and B are identical. The maximum possible distance is 1.0, meaning there is no BLAST between A and B.

This distance should not be interpreted as lack of identity! A distance of 0.0 means 100% identity, but a distance of 0.25 does not mean 75% identity. It has some resemblance to an evolutinary (raw) distance, but since it is based on protein alignments, the type of mutations plays a significant role, not only the number of mutations.

The function returns a table with columns Dbase, Query, Bitscore and Distance. Each row corresponds to a pair of sequences (Dbase and Query sequences) having at least one BLAST hit between them. All pairs not listed in the output have distance 1.0 between them.

Lars Snipen and Kristian Hovde Liland.

blastpAllAll, readBlastSelf, readBlastPair, bClust, isOrtholog.

# Using BLAST result files in this package...
prefix <- c("GID1_vs_GID1_",
            "GID2_vs_GID1_",
            "GID3_vs_GID1_",
            "GID2_vs_GID2_",
            "GID3_vs_GID2_",
            "GID3_vs_GID3_")
bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz"))

# We need to uncompress them first...
blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz")
ok <- file.copy(from = bf, to = blast.files)
blast.files <- unlist(lapply(blast.files, xzuncompress))

# Computing pairwise distances
blast.dist <- bDist(blast.files)

# Read files separately, then use bDist
self.tbl <- readBlastSelf(blast.files)
pair.tbl <- readBlastPair(blast.files)
blast.dist <- bDist(blast.tbl = bind_rows(self.tbl, pair.tbl))

# ...and cleaning...
ok <- file.remove(blast.files)

# See also example for blastpAl