Scanning a profile Hidden Markov Model database

Description

Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3 software.

Usage

hmmerScan(in.files,db,out.folder,verbose=TRUE)

Arguments

in.files

A character vector of file names.

db

The full name of the database to scan.

out.folder

The name of the folder in which to put the result files.

verbose

Logical indicating if textual output should be given to monitor the progress.

Details

The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM) describing patterns in biological sequences (Eddy, 2008). This function will make calls to the HMMER3 software to scan FASTA files of proteins against a pHMM database.

The files named in in.files must contain FASTA formatted protein sequences. These files should be prepared by panPrep to make certain that each sequence, as well as the file name, has a GID-tag identifying its genome. The database named in db must be a HMMER3 formatted database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases; see the HMMER3 documentation for help.
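A quick sanity check on db before starting a long scan might look like the sketch below. It assumes that a HMMER3 formatted database consists of the four index files with extensions .h3f, .h3i, .h3m and .h3p stored next to the database name, as for the microfam0.hmm database used in the Examples. The helper name dbIsPressed is hypothetical and not part of micropan.

```r
# dbIsPressed is a hypothetical helper, not a micropan function.
# Assumption: a usable database has these four index files alongside it,
# as the microfam0.hmm example database in this page does.
dbIsPressed <- function(db) {
  index.files <- paste0(db, c(".h3f", ".h3i", ".h3m", ".h3p"))
  all(file.exists(index.files))
}
```

Calling dbIsPressed(file.path(extdata.path, db)) with the objects from the Examples would then return TRUE only when all four files have been uncompressed.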

hmmerScan will query every input file against the named database. The database contains profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every input file is scanned to see if some of the patterns can be matched to some degree. Each input file results in an output file with the same GID-tag in the name. The result files give tabular output, and are plain text files. See readHmmer for how to read the results into R.
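To see which genome an input or result file belongs to, the GID-tag can be pulled out of the file name. The sketch below assumes the tag has the form GID followed by digits, which is inferred from the example file name Mpneumoniae_309_GID2.fsa; getGID is a hypothetical helper, not a micropan function.

```r
# getGID is a hypothetical helper, not a micropan function.
# Assumption: a GID-tag is the text "GID" followed by one or more digits.
getGID <- function(files) {
  base <- basename(files)
  m <- regexpr("GID[0-9]+", base)
  hit <- as.vector(m) > 0            # regexpr gives -1 where nothing matched
  out <- rep(NA_character_, length(base))
  out[hit] <- regmatches(base, m)    # regmatches returns only the matches
  out
}

getGID("Mpneumoniae_309_GID2.fsa")
```

File names without a recognisable tag yield NA, which makes it easy to spot inputs that were not prepared by panPrep.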

Scanning large databases like Pfam-A takes time, usually several minutes per genome. Each scan is set up to use only one CPU. To increase speed, start this function from multiple R sessions (console windows). This function will not overwrite an existing result file, and multiple parallel sessions can write results to the same folder.
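One possible way to share the work between sessions is to give each session its own subset of the input files. The sketch below is only an illustration of that idea, not something hmmerScan requires; chunkFiles and the session index k are hypothetical names.

```r
# chunkFiles is a hypothetical helper, not a micropan function.
# It deals the input files out round-robin into n.sessions groups.
chunkFiles <- function(in.files, n.sessions) {
  group <- rep_len(seq_len(n.sessions), length(in.files))
  split(in.files, group)
}

chunks <- chunkFiles(c("a_GID1.fsa", "b_GID2.fsa", "c_GID3.fsa"), n.sessions = 2)

# In R session number k (k = 1 or 2 here), one would then run something like:
# hmmerScan(in.files = chunks[[k]], db = db, out.folder = out.folder)
```

Because all sessions can write to the same out.folder, the result files end up in one place regardless of which session produced them.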

Value

This function produces files in the folder specified by out.folder. Existing files are never overwritten by hmmerScan; if you want to re-compute something, delete the corresponding result files first.

Note

The HMMER3 software must be installed on the system for this function to work, i.e. the command hmmscan must be recognized as a valid command if you run it in a terminal window.
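Whether hmmscan is visible from R can be checked before calling hmmerScan. The following is a minimal sketch using base R's Sys.which; hasCommand is a hypothetical helper name, not part of micropan.

```r
# hasCommand is a hypothetical helper, not a micropan function.
# Sys.which returns an empty string when the command is not found on the PATH.
hasCommand <- function(cmd) {
  unname(nzchar(Sys.which(cmd)))
}

if (!hasCommand("hmmscan")) {
  message("hmmscan was not found on the PATH; install HMMER3 before running hmmerScan.")
}
```

Note that R may see a different PATH than an interactive terminal does, so a command that works in a terminal window is not guaranteed to be found here.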

Author(s)

Lars Snipen and Kristian Hovde Liland.

References

Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Computational Biology, 4(5).

See Also

panPrep, readHmmer.

Examples

## Not run: 
# Using a FASTA file in the micropan package
# We need to uncompress it first...
extdata.path <- file.path(path.package("micropan"), "extdata")
filenames <- "Mpneumoniae_309_GID2.fsa"
pth <- lapply(file.path(extdata.path, paste(filenames, ".xz", sep = "")), xzuncompress)

# Using a miniature pHMM database in the micropan package
# We need to uncompress its datafiles first...
db <- "microfam0.hmm"
pth <- lapply(file.path(extdata.path,
                        paste(db, c(".h3f.xz", ".h3i.xz", ".h3m.xz", ".h3p.xz"), sep = "")),
              xzuncompress)

# ...and scanning the FASTA-file against microfam0...
hmmerScan(in.files = file.path(extdata.path, filenames),
          db = file.path(extdata.path, db), out.folder = ".")

# ...and compressing all files again...
pth <- lapply(file.path(extdata.path, filenames), xzcompress)
pth <- lapply(file.path(extdata.path,
                        paste(db, c(".h3f", ".h3i", ".h3m", ".h3p"), sep = "")),
              xzcompress)

## End(Not run)
