This R package allows you to easily determine the Multi Locus Sequence Type (MLST) of your genomes. It also works as an interface between PubMLST through their RESTful API, so you don't have to bother downloading and collecting files: the application does it automatically.

Quick standard workflow

The first step in your analysis should be to check the in pubmlst.org database if your organism of interest is available. So, first load the package and then run listPubmlst_orgs() function, printing only the first 50 elements:

library(quickMLST)
listPubmlst_orgs()[1:50]

Lets say we are interested in Leptospira genus, which is in the place 43 in the list above. So:

listPubmlst_orgs() -> lst
lst[43]

Now, lets check for available MLST schemes for this organism:

listPubmlst_schemes(org = lst[43])

As you can see, listPubmlst_schemes return a list with the loci names corresponding to each scheme. As an attribute of each list element there is information about each mlst scheme.

Now you can choose between two ways: the easy way and the hard way.

The hard way implies calling downloadPubmlst_seq(org = lst[43], scheme = 1) and then downloadPubmlst_profile(org = lst[43], scheme = 1) functions included in this package to download the scheme fasta files and the profile tab file for the organism and the scheme of interest, and then passing the files to the subsequent doMLST() function to schemeFastas and schemeProfile arguments.

The easy way is to left those arguments NULL (default), and let the doMLST() function do it for you.

Let see an example with toy data attached on this package:

#First we list the atteched tar.gz file
system.file('extdata', 'toyExample.tar.gz', package = 'quickMLST') -> tgz
untar(tarfile = tgz, exdir = getwd(), list = T) -> genomes
#Decompress them
untar(tarfile = tgz,exdir = getwd())
genomes

In this example we have 3 pathogenic leptospira genomes, in fasta format.

Lets determine the MLST for the scheme 3.

doMLST(infiles = genomes, # The fasta files
       org = lst[43], # The organism, in this case is "leptospira"
       scheme = 3, # Scheme id number
       write.new = FALSE, # Don't write fasta files for new alleles found
       dir = getwd(), # Put MLST allele files in this dir
       n_threads = 3) -> res # Use 3 threads

#Output:
res

As you can see, a data.frame is returned. Each row is a genome, and each column is a scheme locus. The number refers to the allele number id.

A "u" means that a new allele was found, e.g. res$lip32_3[1]: this allele is not yet reported in the pubmlst database. If option write.new is set to TRUE, then a fasta file is written in dir with this new allele.

A <NA> means that no allele was found, i.e. no blastn local alignment pass the inclusion threshold (by default, this threshold are a percentage identity grater or equal to 90, and a subject coverage greater or equal to 0.9). In this example this was no the case for any of the screened genomes.

The last column refers to the Sequence Type (ST). If possible, the function identifies the ST of each genome, otherwise a NA is returned (e.g. res$ST[1]).

An easy way of obtaining the composition of the 3 mlst schemes available for this organism would be:

lapply(1:3,function(x){

  doMLST(infiles = genomes, # The fasta files
         org = lst[43], # The organism, in this case is "leptospira"
         scheme = x, # Scheme id number. Will iterate between 1 and 3.
         write.new = FALSE, # Don't write fasta files for new alleles found
         dir = getwd(), # Put MLST allele files in this dir
         n_threads = 3)

}) -> allres

allres

That's it. Now we have the MLST of our genomes for the 3 available schemes.

You should check the files downloaded from PubMLST on your working directory .



iferres/quickMLST documentation built on Dec. 22, 2020, 5:10 p.m.