blast_best_hit: Retrieve only the best BLAST hit for each query
In drostlab/metablastr: Perform Massive Local BLAST Searches

blast_best_hit

R Documentation

Retrieve only the best BLAST hit for each query

Description

This function performs a BLAST search between query and subject sequences and returns only the best hit for each query, based on the following criteria.

A best BLAST hit is defined as:

the hit with the smallest e-value;
if e-values are identical, the hit with the longest alignment length.
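
The two criteria above can be illustrated on a small hit table. This is a minimal sketch in base R, not the package's internal implementation; the data frame and its column names (`query_id`, `subject_id`, `evalue`, `alig_length`) are assumptions made for illustration.

```r
# Toy BLAST hit table; values and column names are illustrative assumptions.
hits <- data.frame(
  query_id    = c("q1", "q1", "q1", "q2", "q2"),
  subject_id  = c("s1", "s2", "s3", "s4", "s5"),
  evalue      = c(1e-50, 1e-50, 1e-10, 1e-5, 1e-20),
  alig_length = c(300, 450, 500, 120, 90),
  stringsAsFactors = FALSE
)

# Sort by query, then ascending e-value, then descending alignment length,
# so that the best hit of each query comes first within its group.
ordered <- hits[order(hits$query_id, hits$evalue, -hits$alig_length), ]

# Keep only the first row per query.
best <- ordered[!duplicated(ordered$query_id), ]
best
```

For `q1`, two hits share the smallest e-value, so the longer alignment (`s2`) is kept; for `q2`, the smallest e-value alone decides (`s5`).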

Usage

blast_best_hit(
  query,
  subject,
  search_type = "nucleotide_to_nucleotide",
  strand = "both",
  output.path = NULL,
  is.subject.db = FALSE,
  task = "blastn",
  db.import = FALSE,
  postgres.user = NULL,
  evalue = 0.001,
  out.format = "csv",
  cores = 1,
  max.target.seqs = 10000,
  db.soft.mask = FALSE,
  db.hard.mask = FALSE,
  blast.path = NULL
)

Arguments

`query`	path to input file in fasta format.
`subject`	path to subject file in fasta format or blast-able database.
`search_type`	type of query and subject sequences that will be compared via BLAST search. Options are:
	`search_type = "nucleotide_to_nucleotide"`
	`search_type = "nucleotide_to_protein"`
	`search_type = "protein_to_nucleotide"`
	`search_type = "protein_to_protein"`
`strand`	Query DNA strand(s) to search against database/subject. Options are: `strand = "both"` (Default): query against both DNA strands. `strand = "minus"` : query against minus DNA strand. `strand = "plus"` : query against plus DNA strand.
`output.path`	path to folder at which BLAST output table shall be stored. Default is `output.path = NULL` (hence `getwd()` is used).
`is.subject.db`	logical specifying whether or not the `subject` file is a file in fasta format (`is.subject.db = FALSE`; default) or a blast-able database that was formatted with `makeblastdb` (`is.subject.db = TRUE`).
`task`	BLAST search task option (depending on the selected `search_type`). Options are:
	For `search_type = "nucleotide_to_nucleotide"`:
		`task = "blastn"` : Standard nucleotide-nucleotide comparisons (default). Traditional BLASTN requiring an exact match of 11.
		`task = "blastn-short"` : Optimized nucleotide-nucleotide comparisons for query sequences shorter than 50 nucleotides.
		`task = "dc-megablast"` : Discontiguous megablast used to find somewhat distant sequences.
		`task = "megablast"` : Traditional megablast used to find very similar (e.g., intraspecies or closely related species) sequences.
		`task = "rmblastn"`
	For `search_type = "nucleotide_to_protein"`:
		`task = "blastx"` : Standard nucleotide-protein comparisons (default).
		`task = "blastx-fast"` : Optimized nucleotide-protein comparisons.
	For `search_type = "protein_to_nucleotide"`:
		`task = "tblastn"` : Standard protein-nucleotide comparisons (default).
		`task = "tblastn-fast"` : Optimized protein-nucleotide comparisons.
	For `search_type = "protein_to_protein"`:
		`task = "blastp"` : Standard protein-protein comparisons (default).
		`task = "blastp-fast"` : Improved BLAST searches using longer words for protein seeding.
		`task = "blastp-short"` : Optimized protein-protein comparisons for query sequences shorter than 30 residues.
`db.import`	shall the BLAST output be stored in a PostgreSQL database and shall a connection be established to this database? Default is `db.import = FALSE`. If users wish to only generate a BLAST output file without importing it into the current R session, they can specify `db.import = NULL`.
`postgres.user`	when `db.import = TRUE` and `out.format = "postgres"` is selected, the BLAST output is imported and stored in a PostgreSQL database. In that case, users need to have PostgreSQL installed and initialized on their system. Please consult the Installation Vignette for details.
`evalue`	Expectation value (E) threshold for saving hits (default: `evalue = 0.001`).
`out.format`	a character string specifying the format of the file in which the BLAST results shall be stored. Available options are:
	`out.format = "pair"` : Pairwise
	`out.format = "qa.ident"` : Query-anchored showing identities
	`out.format = "qa.nonident"` : Query-anchored no identities
	`out.format = "fq.ident"` : Flat query-anchored showing identities
	`out.format = "fq.nonident"` : Flat query-anchored no identities
	`out.format = "xml"` : XML
	`out.format = "tab"` : Tabular separated file
	`out.format = "tab.comment"` : Tabular separated file with comment lines
	`out.format = "ASN.1.text"` : Seqalign (Text ASN.1)
	`out.format = "ASN.1.binary"` : Seqalign (Binary ASN.1)
	`out.format = "csv"` : Comma-separated values
	`out.format = "ASN.1"` : BLAST archive (ASN.1)
	`out.format = "json.seq.aln"` : Seqalign (JSON)
	`out.format = "json.blast.multi"` : Multiple-file BLAST JSON
	`out.format = "xml2.blast.multi"` : Multiple-file BLAST XML2
	`out.format = "json.blast.single"` : Single-file BLAST JSON
	`out.format = "xml2.blast.single"` : Single-file BLAST XML2
	`out.format = "SAM"` : Sequence Alignment/Map (SAM)
	`out.format = "report"` : Organism Report
`cores`	number of cores for parallel BLAST searches.
`max.target.seqs`	maximum number of aligned sequences that shall be retained. Please be aware that `max.target.seqs` selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
`db.soft.mask`	shall low complexity regions be soft masked? Default is `db.soft.mask = FALSE`.
`db.hard.mask`	shall low complexity regions be hard masked? Default is `db.hard.mask = FALSE`.
`blast.path`	path to BLAST executables.
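
The pairing between `search_type` and `task` described in the arguments above can be summarized as a lookup table. The following is a hedged sketch only: `valid_tasks`, `check_task()` and `default_task()` are hypothetical helpers that are not part of metablastr, and the protein task name `blastp-fast` is assumed to follow the NCBI BLAST+ naming.

```r
# Hypothetical lookup of the task options listed for each search_type.
# The first entry of each vector is the documented default.
valid_tasks <- list(
  nucleotide_to_nucleotide = c("blastn", "blastn-short", "dc-megablast",
                               "megablast", "rmblastn"),
  nucleotide_to_protein    = c("blastx", "blastx-fast"),
  protein_to_nucleotide    = c("tblastn", "tblastn-fast"),
  protein_to_protein       = c("blastp", "blastp-fast", "blastp-short")
)

# Hypothetical helper: TRUE if `task` is listed for `search_type`.
check_task <- function(search_type, task) {
  if (!search_type %in% names(valid_tasks)) {
    stop("Unknown search_type: ", search_type)
  }
  task %in% valid_tasks[[search_type]]
}

# Hypothetical helper: documented default task for a search_type.
default_task <- function(search_type) {
  valid_tasks[[search_type]][1]
}

check_task("protein_to_protein", "blastp-short")
check_task("nucleotide_to_nucleotide", "blastp")
default_task("protein_to_nucleotide")
```

A check of this kind before calling `blast_best_hit()` can catch a `task` that does not match the chosen `search_type`, for example passing `task = "blastp"` together with the default `search_type = "nucleotide_to_nucleotide"`.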

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
library(metablastr)

# run a nucleotide-to-nucleotide search on the example files shipped with the package
blast_best_test <- blast_best_hit(
  query       = system.file('seqs/qry_nn.fa', package = 'metablastr'),
  subject     = system.file('seqs/sbj_nn_best_hit.fa', package = 'metablastr'),
  search_type = "nucleotide_to_nucleotide",
  output.path = tempdir(),
  db.import   = FALSE
)

# look at results
blast_best_test

## End(Not run)

drostlab/metablastr documentation built on Sept. 14, 2023, 10:43 a.m.