Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3 software.
A character vector of file names.
The full name of the database to scan.
The name of the folder to put the result files.
Number of CPU's to use.
Logical indicating if textual output should be given to monitor the progress.
The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM) describing patterns in biological sequences (Eddy, 2008). This function will make calls to the HMMER3 software to scan FASTA files of proteins against a pHMM database.
The files named in in.files must contain FASTA formatted protein sequences. These files
should be prepared by
panPrep to make certain each sequence, as well as the file name,
has a GID-tag identifying their genome. The database named in db must be a HMMER3 formatted
database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see
the HMMER3 documentation for help.
hmmerScan will query every input file against the named database. The database contains
profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every
input file is scanned to see if some of the patterns can be matched to some degree. Each input file
results in an output file with the same GID-tag in the name. The result files give tabular output, and
are plain text files. See
readHmmer for how to read the results into R.
Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set
up to use only 1 cpu per scan by default. By increasing
threads you can utilize multiple CPUs, typically
on a computing cluster.
Our experience is that from a multi-core laptop it is better to start this function in default mode
from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel
sessions can write results to the same folder.
This function produces files in the folder specified by out.folder. Existing files are
never overwritten by
hmmerScan, if you want to re-compute something, delete the
corresponding result files first.
The HMMER3 software must be installed on the system for this function to work, i.e. the command system("hmmscan") must be recognized as a valid command if you run it in the Console window.
Lars Snipen and Kristian Hovde Liland.
Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Computational Biology, 4(5).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## Not run: # Using a FASTA file in the micropan package # We need to uncompress it first... extdata.path <- file.path(path.package("micropan"),"extdata") filenames <- "Mpneumoniae_309_GID2.fsa" pth <- lapply( file.path( extdata.path, paste( filenames, ".xz", sep="" ) ), xzuncompress ) # Using a miniature pHMM database in the micropan package # We need to uncompress its datafiles first... db <- "microfam0.hmm" pth <- lapply( file.path( extdata.path, paste( db, c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"), sep="" ) ), xzuncompress ) # ...and scanning the FASTA-file against microfam0... hmmerScan(in.files=file.path(extdata.path,filenames), db=file.path(extdata.path,db),out.folder=".") # ...and compressing all files again... pth <- lapply( file.path( extdata.path, filenames ), xzcompress ) pth <- lapply( file.path( extdata.path, paste( db, c(".h3f",".h3i",".h3m",".h3p"), sep="" ) ), xzcompress ) ## End(Not run)