mash: MASH distance estimation.

View source: R/mash.R

mash R Documentation

Description

MASH (fast genome and metagenome distance estimation using MinHash) is a fast sequence distance estimator based on the MinHash algorithm. It is designed to work with genomes and metagenomes in the form of assemblies or reads (https://mash.readthedocs.io/). This function is a wrapper that runs mash in the background and imports the result into R as a mash object.
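To make the distance itself concrete, the sketch below computes the Mash distance from a Jaccard estimate j and k-mer size k, using the formula given in Ondov et al. (2016): D = -(1/k) ln(2j / (1 + j)). The helper name mash_distance is hypothetical and is not part of this package; capping the value at 1 when no hashes are shared is an assumption about how the zero-overlap case is reported.

```r
# Mash distance from a Jaccard estimate j and k-mer size k (Ondov et al. 2016):
#   D = -(1/k) * ln(2j / (1 + j))
# mash_distance() is a hypothetical helper for illustration, not a PATO function.
mash_distance <- function(j, k = 21) {
  d <- -log(2 * j / (1 + j)) / k
  # j = 0 yields Inf; capping at 1 is an assumption for the no-shared-hash case
  pmin(1, d)
}

# Distances for a few Jaccard values at the default k-mer size of this function
j <- c(1, 0.5, 0.1, 0)
data.frame(jaccard = j, dist = mash_distance(j, k = 21))
```

Identical sketches (j = 1) give a distance of 0, and the distance grows as the shared fraction of hashes shrinks.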

Usage

mash(file_list, n_cores = 4, sketch = 1000, kmer = 21, type = "prot")

Arguments

file_list

Data frame with the full paths to the genome files (gene or protein multi-FASTA).

n_cores

Number of cores to use.

sketch

Sketch size to use for distance estimation.

kmer

K-mer size.

type

Type of sequence: 'nucl' (nucleotides) or 'prot' (amino acids).
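A minimal sketch of how the file_list argument might be prepared, assuming a directory of protein multi-FASTA files. The directory, the ".faa" extension, the placeholder files and the column name V1 are all assumptions made for illustration; the call to mash() is left commented out because it needs the package and real sequence files.

```r
# Hypothetical directory holding the genome files (created here as empty
# placeholders so that the example is self-contained).
genome_dir <- file.path(tempdir(), "genomes")
dir.create(genome_dir, showWarnings = FALSE)
file.create(file.path(genome_dir, c("genomeA.faa", "genomeB.faa")))

# One-column data frame with the full path to each file.
# The column name V1 is an assumption, not a documented requirement.
file_list <- data.frame(
  V1 = normalizePath(list.files(genome_dir, pattern = "\\.faa$",
                                full.names = TRUE)),
  stringsAsFactors = FALSE
)

# With the package loaded and real multi-FASTA files in place, the call
# would follow the documented usage:
# my_mash <- mash(file_list, n_cores = 4, sketch = 1000, kmer = 21, type = "prot")
```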

Value

A mash object

Note

A mash object is a list of two elements.

The first is a square, symmetric matrix with the pairwise distances among genomes. The genome names are used as row names and column names.

The second is a data.table/data.frame that lists all the distances in long format. The table has the columns c("Source","Target","Dist").
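The two-element layout described above can be mimicked with a toy object in base R. This is only a sketch of the shape: the genome names and distances are made up, and the element names "matrix" and "table" are assumptions, since the names used in a real mash object are not stated here.

```r
# Toy long-format table with the documented columns Source, Target, Dist.
# Genome names and distance values are invented for illustration.
genomes <- c("gA", "gB", "gC")
pairs <- expand.grid(Source = genomes, Target = genomes,
                     stringsAsFactors = FALSE)
pairs$Dist <- c(0,   0.1, 0.3,
                0.1, 0,   0.2,
                0.3, 0.2, 0)

# Square matrix with genomes as row and column names, filled from the table
# by (Source, Target) name pairs.
dist_matrix <- matrix(0, nrow = 3, ncol = 3,
                      dimnames = list(genomes, genomes))
dist_matrix[cbind(pairs$Source, pairs$Target)] <- pairs$Dist

# Element names are hypothetical; only the two-element structure is documented.
toy_mash <- list(matrix = dist_matrix, table = pairs)

# The matrix form plugs directly into base R clustering tools.
tree <- hclust(as.dist(toy_mash$matrix))
```

The matrix form suits clustering and heatmaps, while the long table is convenient for filtering pairs by a distance threshold.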

References

Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

Mash Screen: High-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. bioRxiv. 2019 Mar. doi: 10.1101/557314.


irycisBioinfo/PATO documentation built on Oct. 19, 2023, 3:07 p.m.