knitr::opts_chunk$set(echo = TRUE)

This R package allows you to easily determine the Multi Locus Sequence Type (MLST) of your genomes. It also works as an interface between PubMLST through their RESTful API, so you don't have to bother downloading and collecting files: the application does it automatically.

Installation

The easiest way to install this package is using devtools:

devtools::install_github('iferres/MLSTar')

Requirements

MLSTar depends on BLAST+ software, and must be installed in your $PATH prior to run this package.

Standard workflow

The first step in your analysis should be to check the in pubmlst.org database if your organism of interest is available. So, first load the package and then run listPubmlst_orgs() function, printing only the first 50 elements:

library(MLSTar)
listPubmlst_orgs()[1:50]

Lets say we are interested in Leptospira genus, which is in the place 47 in the list above. So:

lst <- listPubmlst_orgs() 
lst[47]

Now, lets check for available MLST schemes for this organism:

listPubmlst_schemes(org = lst[47])

As you can see, listPubmlst_schemes return a list with the loci names corresponding to each scheme. As an attribute of each list element there is information about each mlst scheme.

Now you can choose between two ways: the easy way and the hard way.

The hard way implies calling downloadPubmlst_seq(org = lst[43], scheme = 1) and then downloadPubmlst_profile(org = lst[43], scheme = 1) functions included in this package to download the scheme fasta files and the profile tab file for the organism and the scheme of interest, and then passing the files to the subsequent doMLST() function to schemeFastas and schemeProfile arguments.

The easy way is to left those arguments NULL (default), and let the doMLST() function do it for you.

Let see an example with toy data attached on this package:

#First we list the atteched tar.gz file
tgz <- system.file('extdata', 'example.tar.gz', package = 'MLSTar')
genomes <- untar(tarfile = tgz, list = T)
#Decompress them
untar(tarfile = tgz,exdir = tempdir())
genomes <- list.files(path = tempdir(), 
                      pattern = paste0(genomes, collapse = '|'), 
                      full.names = TRUE)
genomes

In this example we have 10 pathogenic Leptospira borgpetersenii genomes( * ), in fasta format. ( * : Because of portability, just the corresponding alleles of each genomes are written in the fasta files for the scheme 1, and not the whole genomes. The purpose is to show the functions and not to provide a real case example.)

Lets determine the MLST for the scheme 1.

x <- doMLST(
  infiles = genomes, # The fasta files
  org = lst[47], # The organism, in this case is "leptospira"
  scheme = 1, # Scheme id number
  write = "none", # Don't write fasta files for alleles found
  ddir = paste0(tempdir(),'pubmlst_lep_sch1', collapse = '/') # Write downloaded sequences and profiles here.
  )   

x

The result is an object of class "mlst" which is a list of 2 data.frames and a series of attributes. The first data.frame show the result for this set of genomes, and the second is the MLST profile used:

# Attributes
attributes(x)

# Result
x$result

# Profile (leptospira, scheme 1)
head(x$profile)

As you can see, each row is a genome, and each column is a scheme locus. The number refers to the allele number id.

A "u" plus an integer in the result data.frame means that a new allele was found, e.g. the mreA_1 gene in the last genome: this allele has not been yet reported in the pubmlst database. New alleles are named according to whether are the same or not, for example, if the same mreA_1 allele had been found in other genome, we would see another "u1" in the correspondig cell of the data.frame. If a different, but not reported nethier, allele had been found, we will see a "u1" in one of the cell, and a "u2" in the other. If option write is set to "new", then a fasta file is written with this new allele. If this option is set to "all", all alleles are written (both reported and not reported at pubmlst.org).

A <NA> means that no allele was found, i.e. no blastn local alignment pass the inclusion threshold (by default, this threshold are a percentage identity grater or equal to 90, and a subject coverage greater or equal to 0.9). In this example this was no the case for any of the screened genomes.

The last column refers to the Sequence Type (ST). If possible, the function identifies the ST of each genome, otherwise a NA is returned.

Minimum Spanning Tree

The mlst class defined in this package include a plot method which uses APE for compute a minimum spanning tree (mst), and igraph to build and object of class igraph and to plot it. In this case red nodes corresponds to the pubmlst reported STs, blue nodes are the ones detected, and white ones are those for which no ST was possible to assign.

set.seed(4)
plot(x)

The function plot returns invisibly an object of class igraph which can be further analized using that package.

set.seed(4)
g <- plot(x, plot = FALSE)
g

Beware with plotting the whole profile: a extensive MLST profile with, for instance, a model organism could consume a lot of resources and take a log time. The distance and mst computation scale exponentially with the number of elements. In this cases you can choose to plot just your isolates:

plot(x, what = 'result')

Binary Tree

The mst is the default plot method, but a binary tree can also be plotted, and in this case an object of class phylo (see APE package) is returned invisibly:

layout(cbind(1,2))
plot(x, type = 'phylo')
plot(x, type = 'phylo', 
     plot.phylo.args = list(type = 'fan'), 
     tiplabels.args = list(offset = 0.1))

See ?plot.mlst for more information about the plotting methods. New methods will be added in the future. Any suggestions are welcome.

# System cleanup:
file.remove(genomes)
unlink(paste0(tempdir(), 'pubmlst_lep_sch1'))


iferres/quickMLST documentation built on Dec. 22, 2020, 5:10 p.m.