diamond_best_hits: Perform a DIAMOND best hit search
In drostlab/homologr: Homology inference across multiple species

Description Usage Arguments Details Value Author(s) See Also Examples

This function performs a DIAMOND search (best hit) of a given set of protein sequences against a given database.

diamond_best_hits(
  query,
  subject,
  is_subject_db = FALSE,
  format = "fasta",
  sensitivity_mode = "ultra-sensitive",
  out_format = "csv",
  evalue = "1E-5",
  max_target_seqs = 5000,
  cores = 1,
  hard_mask = TRUE,
  diamond_exec_path = NULL,
  add_makedb_options = NULL,
  add_diamond_options = NULL,
  output_path = NULL
)

`query`	a character string specifying the path to the protein sequence file of interest (query organism).
`subject`	a character string specifying the path to the protein sequence file of interest (subject organism).
`is_subject_db`	logical specifying whether or not the `subject` file is a file in `fasta` format (`is_subject_db = FALSE`; default) or a `fasta` file that was previously converted into a blast-able database using `diamond makedb` (`is_subject_db = TRUE`).
`format`	a character string specifying the file format of the sequence file, e.g. `format` = `"fasta"`. Default is `format` = `"fasta"`.
`sensitivity_mode`	specify the level of alignment sensitivity. The higher the sensitivity level, the more deep homologs can be found, but at the cost of reduced computational speed. `sensitivity_mode = "fast"` : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70 `sensitivity_mode = "mid-sensitive"` : fast alignments between the `fast` mode and the sensitive mode in sensitivity. `sensitivity_mode = "sensitive"` : fast alignments, but full sensitivity for hits >40 `sensitivity_mode = "more-sensitive"` : more sensitive than the `sensitive` mode. `sensitivity_mode = "very-sensitive"` : sensitive alignment mode. `sensitivity_mode = "ultra-sensitive"` : most sensitive alignment mode (sensitivity as high as BLASTP).
`out_format`	a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are: `out_format = "pair"` : Pairwise `out_format = "xml"` : XML `out_format = "csv"` : Comma-separated file
`evalue`	Expectation value (E) threshold for saving hits (default: `evalue = 0.001`).
`max_target_seqs`	maximum number of aligned sequences that shall be retained. Please be aware that `max_target_seqs` selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
`cores`	number of cores for parallel DIAMOND searches.
`hard_mask`	shall low complexity regions be hard masked with TANTAN? Default is `db_hard_mask = TRUE`.
`diamond_exec_path`	a path to the DIAMOND executable or `conda/miniconda` folder.
`add_makedb_options`	a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. `add_make_options = "--taxonnames"` (Default is `add_diamond_options = NULL`).
`add_diamond_options`	a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. `add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits"` (Default is `add_diamond_options = NULL`).
`output_path`	a path to the location were the DIAMOND best hit output shall be stored. E.g. `output_path` = `getwd()` to store it in the current working directory, or `output_path` = `file.path("put", "your", "path", "here")`.

Given a set of protein sequences (query sequences), a best hit diamond search (DBH) is being performed.

A tibble as returned by the diamond_best_hits function, storing the query_ids in the first column and the subject_ids (best hit homologs) in the second column.

Hajk-Georg Drost

diamond_reciprocal_best_hits, diamond_protein_to_protein

## Not run: 
# performing homology inference using the diamond best hit (DBH) method using protein sequences
best_hits <- diamond_best_hits(
query   = system.file('seqs/ortho_thal_aa.fasta', package = 'homologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'homologr'),
seq_type     = "protein")
# look at results
best_hits

# store the DIAMOND output file to the current working directory
best_hits <- diamond_best_hits(
query  = system.file('seqs/ortho_thal_aa.fasta', package = 'homologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'homologr'),
seq_type     = "protein",
output_path  = getwd())
# look at results
best_hits

# run diamond_best_hits() with multiple cores
best_hits <- diamond_best_hits(
query   = system.file('seqs/ortho_thal_aa.fasta', package = 'homologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'homologr'),
cores   = 2)
# look at results
best_hits

# performing homology inference using the diamond best hit (DBH) method and
# specifying the path to the DIAMOND executable (here miniconda path)
best_hits <- diamond_best_hits(
query  = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'),
diamond_exec_path = "/opt/miniconda3/bin/")
# look at results
best_hits

## End(Not run)