Making BLAST search all against all genomes
Runs a reciprocal all-against-all BLAST search to look for similarity of proteins within and across genomes. The main job is done by the BLAST+ software.
A text vector with the names of the FASTA files where the protein sequences of each genome is found.
The name of the folder where the result files should end up.
The chosen E-value threshold in BLAST. Default is e.value=1, a smaller value will speed up the search at the cost of less sensitivity.
An integer to separate multiple jobs. You may want to run several jobs in parallell, and each job should have different number here to avoid confusion on databases. Default is job=1.
A basic step in pangenomics and many other comparative studies is to cluster proteins into groups or families. One commonly used approach is based on reciprocal BLASTing. This function uses the BLAST+ software available for free from NCBI (Camacho et al, 2009).
A vector listing FASTA files of protein sequences is given as input in in.files. These files must have the GID-tag in the first token of every header, and in their filenames as well, i.e. all input files should first be prepared by
panPrep to ensure this. Note that only protein sequences are considered here. If your coding genes are stored as DNA, please translate them to protein prior to using this function, see
A BLAST database is made from each genome in turn. Then all genomes are queried against this database, and for every pair of genomes a result file is produced. If two genomes have GID-tags GID111, and GID222 then both result file GID111_vs_GID222.txt and GID222_vs_GID111.txt will be found in out.folder after the completion of this search. This reciprocal (two-way) search is required because of the heuristics of BLAST.
The out.folder is scanned for already existing result files, and
blastAllAll never overwrites an existing result file. If a file with the name GID111_vs_GID222.txt already exists in the out.folder, this particular search is skipped. This makes it possible to run multiple jobs in parallell, writing to the same out.folder. It also makes it possible to add new genomes, and only BLAST the new combinations without repeating previous comparisons.
This search can be slow if the genomes contain many proteins and it scales quadratically in the number of input files. It is best suited for the study of a smaller number of genomes (less than say 100). By starting multiple R sessions, you can speed up the search by running
blastAllAll from each R session, using the same out.folder but different integers for the
The result files are text files, and can be read into R using
readBlastTable, but more commonly they are used as input to
bDist to compute distances between sequences for subsequent clustering.
The function produces N*N result files if in.files lists N sequence files. These result files are located in
out.folder. Existing files are never overwritten by
blastAllAll, if you want to re-compute something, delete the corresponding result files first.
The BLAST+ software must be installed on the system for this function to work, i.e. the commands makeblastdb and blastp must be recognized as valid commands if you run them in a terminal window.
Lars Snipen and Kristian Hovde Liland.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## Not run: # Using FASTA files in this package. # We need to uncompress them first... extdata.path <- file.path(path.package("micropan"),"extdata") filenames <- c("Mpneumoniae_M129_GID1.fsa", "Mpneumoniae_309_GID2.fsa", "Mpneumoniae_FH_GID3.fsa") pth <- lapply( file.path( extdata.path, paste( filenames, ".xz", sep="" ) ), xzuncompress ) #...blasting, assuming the BLAST+ software is properly installed # NB! This will take some minute(s)! blastAllAll(in.files=file.path(extdata.path,filenames),out.folder=".") # ...and compressing them again... pth <- lapply( file.path( extdata.path, filenames ), xzcompress ) ## End(Not run)