This R package allows you to easily determine the Multi Locus Sequence Type (MLST) of your genomes. It also works as an interface between PubMLST through their RESTful API, so you don't have to bother downloading and collecting files: the application does it automatically.
The first step in your analysis should be to check the in pubmlst.org database if your organism of interest is available. So, first load the package and then run listPubmlst_orgs()
function, printing only the first 50 elements:
library(quickMLST) listPubmlst_orgs()[1:50]
Lets say we are interested in Leptospira genus, which is in the place 43 in the list above. So:
listPubmlst_orgs() -> lst lst[43]
Now, lets check for available MLST schemes for this organism:
listPubmlst_schemes(org = lst[43])
As you can see, listPubmlst_schemes
return a list with the loci names corresponding to each scheme. As an attribute of each list element there is information about each mlst scheme.
Now you can choose between two ways: the easy way and the hard way.
The hard way implies calling downloadPubmlst_seq(org = lst[43], scheme = 1)
and then downloadPubmlst_profile(org = lst[43], scheme = 1)
functions included in this package to download the scheme fasta files and the profile tab file for the organism and the scheme of interest, and then passing the files to the subsequent doMLST()
function to schemeFastas
and schemeProfile
arguments.
The easy way is to left those arguments NULL
(default), and let the doMLST()
function do it for you.
Let see an example with toy data attached on this package:
#First we list the atteched tar.gz file system.file('extdata', 'toyExample.tar.gz', package = 'quickMLST') -> tgz untar(tarfile = tgz, exdir = getwd(), list = T) -> genomes #Decompress them untar(tarfile = tgz,exdir = getwd()) genomes
In this example we have 3 pathogenic leptospira genomes, in fasta format.
Lets determine the MLST for the scheme 3.
doMLST(infiles = genomes, # The fasta files org = lst[43], # The organism, in this case is "leptospira" scheme = 3, # Scheme id number write.new = FALSE, # Don't write fasta files for new alleles found dir = getwd(), # Put MLST allele files in this dir n_threads = 3) -> res # Use 3 threads #Output: res
As you can see, a data.frame
is returned. Each row is a genome, and each column is a scheme locus. The number refers to the allele number id.
A "u"
means that a new allele was found, e.g. res$lip32_3[1]
: this allele is not yet reported in the pubmlst database. If option write.new
is set to TRUE
, then a fasta file is written in dir
with this new allele.
A <NA>
means that no allele was found, i.e. no blastn local alignment pass the inclusion threshold (by default, this threshold are a percentage identity grater or equal to 90, and a subject coverage greater or equal to 0.9). In this example this was no the case for any of the screened genomes.
The last column refers to the Sequence Type (ST). If possible, the function identifies the ST of each genome, otherwise a NA
is returned (e.g. res$ST[1]
).
An easy way of obtaining the composition of the 3 mlst schemes available for this organism would be:
lapply(1:3,function(x){ doMLST(infiles = genomes, # The fasta files org = lst[43], # The organism, in this case is "leptospira" scheme = x, # Scheme id number. Will iterate between 1 and 3. write.new = FALSE, # Don't write fasta files for new alleles found dir = getwd(), # Put MLST allele files in this dir n_threads = 3) }) -> allres allres
That's it. Now we have the MLST of our genomes for the 3 available schemes.
You should check the files downloaded from PubMLST on your working directory .
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.