blast_best_hit | R Documentation |
Retrieve only the best BLAST hit for each query
Description
This function performs a BLAST search between query and subject sequences and
returns only the best hit based on the following criteria.
A best blast hit is defined as:
Usage
blast_best_hit(
query,
subject,
search_type = "nucleotide_to_nucleotide",
strand = "both",
output.path = NULL,
is.subject.db = FALSE,
task = "blastn",
db.import = FALSE,
postgres.user = NULL,
evalue = 0.001,
out.format = "csv",
cores = 1,
max.target.seqs = 10000,
db.soft.mask = FALSE,
db.hard.mask = FALSE,
blast.path = NULL
)
Arguments
query |
path to input file in fasta format.
|
subject |
path to subject file in fasta format or blast-able database.
|
search_type |
type of query and subject sequences that will be compared via BLAST search.
Options are:
-
search_type = "nucleotide_to_nucleotide"
-
search_type = "nucleotide_to_protein"
-
search_type = "protein_to_nucleotide"
-
search_type = "protein_to_protein"
|
strand |
Query DNA strand(s) to search against database/subject.
Options are:
-
strand = "both" (Default): query against both DNA strands.
-
strand = "minus" : query against minus DNA strand.
-
strand = "plus" : query against plus DNA strand.
|
output.path |
path to folder at which BLAST output table shall be stored.
Default is output.path = NULL (hence getwd() is used).
|
is.subject.db |
logical specifying whether or not the subject file is a file in fasta format (is.subject.db = FALSE ; default)
or a blast-able database that was formatted with makeblastdb (is.subject.db = TRUE ).
|
task |
BLAST search task option (depending on the selected search_type ). Options are:
-
search_type = "nucleotide_to_nucleotide"
-
task = "blastn" : Standard nucleotide-nucleotide comparisons (default) - Traditional BLASTN requiring an exact match of 11.
-
task = "blastn-short" : Optimized nucleotide-nucleotide comparisons for query sequences shorter than 50 nucleotides.
-
task = "dc-megablast" : Discontiguous megablast used to find somewhat distant sequences.
-
task = "megablast" : Traditional megablast used to find very similar (e.g., intraspecies or closely related species) sequences.
-
task = "rmblastn"
-
search_type = "nucleotide_to_protein"
-
search_type = "protein_to_nucleotide"
-
search_type = "protein_to_protein"
-
task = "blastp" : Standard protein-protein comparisons (default).
-
task = "blast-fast" : Improved BLAST searches using longer words for protein seeding.
-
task = "blastp-short" : Optimized protein-protein comparisons for query sequences shorter than 30 residues.
|
db.import |
shall the BLAST output be stored in a PostgresSQL database and shall a connection be established to this database? Default is db.import = FALSE .
In case users wish to to only generate a BLAST output file without importing it to the current R session they can specify db.import = NULL .
|
postgres.user |
when db.import = TRUE and out.format = "postgres" is selected, the BLAST output is imported and stored in a
PostgresSQL database. In that case, users need to have PostgresSQL installed and initialized on their system.
Please consult the Installation Vignette for details.
|
evalue |
Expectation value (E) threshold for saving hits (default: evalue = 0.001 ).
|
out.format |
a character string specifying the format of the file in which the BLAST results shall be stored.
Available options are:
-
out.format = "pair" : Pairwise
-
out.format = "qa.ident" : Query-anchored showing identities
-
out.format = "qa.nonident" : Query-anchored no identities
-
out.format = "fq.ident" : Flat query-anchored showing identities
-
out.format = "fq.nonident" : Flat query-anchored no identities
-
out.format = "xml" : XML
-
out.format = "tab" : Tabular separated file
-
out.format = "tab.comment" : Tabular separated file with comment lines
-
out.format = "ASN.1.text" : Seqalign (Text ASN.1)
-
out.format = "ASN.1.binary" : Seqalign (Binary ASN.1)
-
out.format = "csv" : Comma-separated values
-
out.format = "ASN.1" : BLAST archive (ASN.1)
-
out.format = "json.seq.aln" : Seqalign (JSON)
-
out.format = "json.blast.multi" : Multiple-file BLAST JSON
-
out.format = "xml2.blast.multi" : Multiple-file BLAST XML2
-
out.format = "json.blast.single" : Single-file BLAST JSON
-
out.format = "xml2.blast.single" : Single-file BLAST XML2
-
out.format = "SAM" : Sequence Alignment/Map (SAM)
-
out.format = "report" : Organism Report
|
cores |
number of cores for parallel BLAST searches.
|
max.target.seqs |
maximum number of aligned sequences that shall be retained. Please be aware that max.target.seqs selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
|
db.soft.mask |
shall low complexity regions be soft masked? Default is db.soft.mask = FALSE .
|
db.hard.mask |
shall low complexity regions be hard masked? Default is db.hard.mask = FALSE .
|
blast.path |
path to BLAST executables.
|
Author(s)
Hajk-Georg Drost
See Also
blast_nucleotide_to_nucleotide
, blast_protein_to_protein
,
blast_nucleotide_to_protein
, blast_protein_to_nucleotide
,
blast_best_reciprocal_hit
Examples
## Not run:
blast_best_test <- blast_best_hit(
query = system.file('seqs/qry_nn.fa', package = 'metablastr'),
subject = system.file('seqs/sbj_nn_best_hit.fa', package = 'metablastr'),
search_type = "nucleotide_to_nucleotide",
output.path = tempdir(),
db.import = FALSE)
# look at results
blast_best_test
## End(Not run)