diamond_protein_to_protein: Perform Protein to Protein DIAMOND2 Searches (BLASTP)
In drostlab/rdiamond: Seamless Integration of DIAMOND2 Sequence Searches in R

Description

Run a protein-to-protein DIAMOND2 search (BLASTP) of query sequences against a blast-able database or a fasta file.

Usage

diamond_protein_to_protein(
  query,
  subject,
  output_path = NULL,
  is_subject_db = FALSE,
  task = "blastp",
  sensitivity_mode = "ultra-sensitive",
  use_arrow_duckdb_connection = FALSE,
  evalue = 0.001,
  out_format = "csv",
  cores = 1,
  max_target_seqs = 500,
  hard_mask = TRUE,
  diamond_exec_path = NULL,
  add_makedb_options = NULL,
  add_diamond_options = NULL
)

Arguments

`query`	path to input file in fasta format.
`subject`	path to subject file in fasta format or blast-able database.
`output_path`	path to folder at which DIAMOND2 output table shall be stored. Default is `output_path = NULL` (hence `getwd()` is used).
`is_subject_db`	logical specifying whether or not the `subject` file is a file in fasta format (`is_subject_db = FALSE`; default) or a `fasta` file that was previously converted into a blast-able database using `diamond makedb` (`is_subject_db = TRUE`).
`task`	protein search task option. Options are: `task = "blastp"` : Standard protein-protein comparisons (default).
`sensitivity_mode`	specify the level of alignment sensitivity. The higher the sensitivity level, the more distant homologs can be found, but at the cost of reduced computational speed. Default is `sensitivity_mode = "ultra-sensitive"`. Options are:
	`sensitivity_mode = "faster"` : fastest alignment mode, but least sensitive. Designed for finding hits of >70% identity.
	`sensitivity_mode = "fast"` : fast alignment mode, less sensitive than the `default` mode. Designed for finding hits of >70% identity.
	`sensitivity_mode = "default"` : DIAMOND's own default mode. Designed for finding hits of >70% identity.
	`sensitivity_mode = "mid-sensitive"` : fast alignments with a sensitivity between the `fast` mode and the `sensitive` mode.
	`sensitivity_mode = "sensitive"` : fast alignments, but full sensitivity for hits of >40% identity.
	`sensitivity_mode = "more-sensitive"` : more sensitive than the `sensitive` mode.
	`sensitivity_mode = "very-sensitive"` : sensitive alignment mode.
	`sensitivity_mode = "ultra-sensitive"` : most sensitive alignment mode (sensitivity as high as BLASTP).
`use_arrow_duckdb_connection`	shall the DIAMOND2 hit output table be transformed to an in-process (big data disk-processing) arrow connection to DuckDB? This is useful when the DIAMOND2 output table is too large to fit into memory. Default is `use_arrow_duckdb_connection = FALSE`. Please consult the Installation Vignette for details.
`evalue`	Expectation value (E) threshold for saving hits (default: `evalue = 0.001`).
`out_format`	a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are:
	`out_format = "pair"` : Pairwise
	`out_format = "xml"` : XML
	`out_format = "csv"` : Comma-separated file
`cores`	number of cores for parallel DIAMOND searches.
`max_target_seqs`	maximum number of aligned sequences that shall be retained. Please be aware that `max_target_seqs` selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
`hard_mask`	shall low complexity regions be hard masked with TANTAN? Default is `hard_mask = TRUE`.
`diamond_exec_path`	a path to the DIAMOND executable or `conda/miniconda` folder.
`add_makedb_options`	a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. `add_makedb_options = "--taxonnames"` (Default is `add_makedb_options = NULL`).
`add_diamond_options`	a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. `add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits"` (Default is `add_diamond_options = NULL`).
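
To make the relationship between these arguments and the DIAMOND command line more concrete, the following is a minimal sketch of how a roughly equivalent `diamond blastp` call could be assembled by hand. The helper `sketch_blastp_args()` and the file names are hypothetical and are not part of rdiamond; the mapping of R arguments to flags is an assumption for illustration and may differ from what the package does internally. Only long-form DIAMOND flags (`--query`, `--db`, `--out`, `--evalue`, `--max-target-seqs`, `--threads` and the sensitivity switches) are used.

```r
# Hypothetical helper (NOT part of rdiamond): build the argument vector of a
# 'diamond blastp' call from values that mirror the R arguments above.
sketch_blastp_args <- function(query, db, out,
                               sensitivity_mode = "ultra-sensitive",
                               evalue = 0.001,
                               max_target_seqs = 500,
                               cores = 1,
                               add_diamond_options = NULL) {
  args <- c(
    "blastp",
    "--query", query,
    "--db", db,
    "--out", out,
    "--evalue", format(evalue, scientific = FALSE),
    "--max-target-seqs", as.character(max_target_seqs),
    "--threads", as.character(cores)
  )
  # DIAMOND's default mode has no flag; every other mode is a bare switch
  if (sensitivity_mode != "default") {
    args <- c(args, paste0("--", sensitivity_mode))
  }
  # extra options arrive as one string and are split on whitespace
  if (!is.null(add_diamond_options)) {
    args <- c(args, strsplit(add_diamond_options, "\\s+")[[1]])
  }
  args
}

# Assumed file names; a fasta subject would first be converted with makedb
makedb_args <- c("makedb", "--in", "sbj_aa.fa", "--db", "sbj_aa")
blastp_args <- sketch_blastp_args(
  query = "qry_aa.fa", db = "sbj_aa", out = "hits.tsv",
  add_diamond_options = "--block-size 4.0 --no-self-hits"
)

# Print the two commands instead of running them
cat("diamond", makedb_args, "\n")
cat("diamond", blastp_args, "\n")
```

With DIAMOND installed and the input files in place, each vector could be handed to `system2("diamond", args)`. In normal use `diamond_protein_to_protein()` takes care of this step, so the sketch is only meant as a reading aid for the arguments.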

Author(s)

Hajk-Georg Drost

Examples

## Not run: 
# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive"
diamond_example <- diamond_protein_to_protein(
  query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
  subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
  sensitivity_mode = "ultra-sensitive",
  output_path = tempdir(),
  use_arrow_duckdb_connection = FALSE
)

# look at DIAMOND results
diamond_example

# run diamond assuming that the diamond executable is available
# via the miniconda path ('diamond_exec_path = "/opt/miniconda3/bin/"')
# and using 2 cores as well as sensitivity_mode = "ultra-sensitive"
diamond_example_conda <- diamond_protein_to_protein(
  query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
  subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
  sensitivity_mode = "ultra-sensitive",
  diamond_exec_path = "/opt/miniconda3/bin/",
  output_path = tempdir(),
  use_arrow_duckdb_connection = FALSE,
  cores = 2
)

# look at DIAMOND results
diamond_example_conda

# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive" and adding command line options:
# "--block-size 4.0 --compress 1 --no-self-hits"
diamond_example_ultra_sensitive_add_diamond_options <- diamond_protein_to_protein(
  query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
  subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
  sensitivity_mode = "ultra-sensitive",
  max_target_seqs = 500,
  output_path = tempdir(),
  use_arrow_duckdb_connection = FALSE,
  add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits",
  cores = 1
)

# look at DIAMOND results
diamond_example_ultra_sensitive_add_diamond_options

# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive" and adding makedb command line options:
# "--taxonnames"
diamond_example_ultra_sensitive_add_makedb_options <- diamond_protein_to_protein(
  query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
  subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
  sensitivity_mode = "ultra-sensitive",
  max_target_seqs = 500,
  output_path = tempdir(),
  use_arrow_duckdb_connection = FALSE,
  add_makedb_options = "--taxonnames",
  cores = 1
)

# look at DIAMOND results
diamond_example_ultra_sensitive_add_makedb_options

## End(Not run)
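
A common next step after looking at a hit table like the ones returned above is to reduce it to one best hit per query. The sketch below shows this with base R on a small toy table, so it runs without DIAMOND installed. The column names `query_id`, `subject_id`, `evalue` and `bit_score` and all values are assumptions made for illustration; check `names()` on the object actually returned before adapting it.

```r
# Toy hit table; column names and values are illustrative assumptions,
# not the guaranteed layout of the rdiamond output
hits <- data.frame(
  query_id   = c("q1", "q1", "q2", "q2", "q3"),
  subject_id = c("s1", "s2", "s3", "s4", "s5"),
  evalue     = c(1e-30, 1e-50, 1e-5, 1e-12, 1e-3),
  bit_score  = c(120, 180, 45, 70, 30),
  stringsAsFactors = FALSE
)

# Sort by query and ascending e-value, then keep the first row per query
hits_sorted <- hits[order(hits$query_id, hits$evalue), ]
best_hits <- hits_sorted[!duplicated(hits_sorted$query_id), ]
rownames(best_hits) <- NULL

# Optionally apply a stricter e-value cutoff than the one used in the search
strict_hits <- best_hits[best_hits$evalue < 1e-10, ]

best_hits
strict_hits
```

Selecting by e-value after the search is a deliberate choice here: it keeps the ranking criterion explicit in R rather than relying on how `max_target_seqs` truncated the hit list.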

drostlab/rdiamond documentation built on Oct. 23, 2023, 1 p.m.