hmmerScan: Scanning a profile Hidden Markov Model database
In micropan: Microbial Pan-Genome Analysis

Description Usage Arguments Details Value Note Author(s) References See Also Examples

Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3 software.

1	hmmerScan(in.files, dbase, out.folder, threads = 0, verbose = TRUE)

`in.files`	A character vector of file names.
`dbase`	The full path-name of the database to scan (text).
`out.folder`	The name of the folder to put the result files.
`threads`	Number of CPU's to use.
`verbose`	Logical indicating if textual output should be given to monitor the progress.

The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM) describing patterns in biological sequences (Eddy, 2008). This function will make calls to the HMMER3 software to scan FASTA files of proteins against a pHMM database.

The files named in in.files must contain FASTA formatted protein sequences. These files should be prepared by panPrep to make certain each sequence, as well as the file name, has a GID-tag identifying their genome. The database named in db must be a HMMER3 formatted database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see the HMMER3 documentation for help.

hmmerScan will query every input file against the named database. The database contains profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every input file is scanned to see if some of the patterns can be matched to some degree. Each input file results in an output file with the same GID-tag in the name. The result files give tabular output, and are plain text files. See readHmmer for how to read the results into R.

Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set up to use only 1 cpu per scan by default. By increasing threads you can utilize multiple CPUs, typically on a computing cluster. Our experience is that from a multi-core laptop it is better to start this function in default mode from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel sessions can write results to the same folder.

This function produces files in the folder specified by out.folder. Existing files are never overwritten by hmmerScan, if you want to re-compute something, delete the corresponding result files first.

The HMMER3 software must be installed on the system for this function to work, i.e. the command system("hmmscan -h") must be recognized as a valid command if you run it in the Console window.

Lars Snipen and Kristian Hovde Liland.

Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Computational Biology, 4(5).

panPrep, readHmmer.

## Not run: 
# This example require the external software HMMER
# Using example files in this package
pf <- file.path(path.package("micropan"), "extdata", "xmpl_GID1.faa.xz")
dbf <- file.path(path.package("micropan"), "extdata",
                 str_c("microfam.hmm", c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz")))

# We need to uncompress them first...
prot.file <- tempfile(pattern = "GID1.faa", fileext=".xz")
ok <- file.copy(from = pf, to = prot.file)
prot.file <- xzuncompress(prot.file)
db.files <- str_c(tempfile(), c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"))
ok <- file.copy(from = dbf, to = db.files)
db.files <- unlist(lapply(db.files, xzuncompress))
db.name <- str_remove(db.files[1], "\\.[a-z0-9]+$")

# Scanning the FASTA file against microfam.hmm...
hmmerScan(in.files = prot.file, dbase = db.name, out.folder = ".")

# Reading results
hmm.file <- file.path(".", str_c("GID1_vs_", basename(db.name), ".txt"))
hmm.tbl <- readHmmer(hmm.file)

# ...and cleaning...
ok <- file.remove(prot.file)
ok <- file.remove(str_remove(db.files, ".xz"))

## End(Not run)