cd_hit_est: Cluster DNA sequences.
In joelnitta/baitfindR: Find Baits for Sequence Capture


This is a wrapper for the CD-HIT-EST algorithm. According to the CD-HIT user's guide, "CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity." cd-hit-est comes bundled with transdecoder, so it is run from there.

cd_hit_est(
  input,
  output,
  wd = here::here(),
  other_args = NULL,
  echo = pkgconfig::get_config("baitfindR::echo", fallback = FALSE),
  ...
)

`input`	Character vector of length one; the path to the input file for cd-hit-est. Should be DNA sequences in FASTA format (cd-hit-est clusters nucleotide data).
`output`	Character vector of length one; the name to assign to the output. Can include a path, in which case the output will be written there.
`wd`	Character vector of length one; the directory where the command will be run.
`other_args`	Character vector; other arguments to pass to cd-hit-est. Each argument should be a separate element of the vector.
`echo`	Logical; should the standard output and error be printed to the screen?
`...`	Additional arguments. Not used by this function, but meant to be used by `drake_plan` for tracking during workflows.
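
As an illustration of `other_args`, the sketch below builds an argument vector with one element per flag and one per value. `-c` (sequence identity threshold) and `-T` (number of threads) are cd-hit-est options; the values chosen, the file names, and the guard around the call are assumptions made for this example only.

```r
# Each flag and each value is its own element of the vector.
# The values are illustrative, not recommendations.
extra_args <- c("-c", "0.95", "-T", "2")

# Hypothetical file names, used only for this sketch
in_file <- "longest_orfs.cds"
out_file <- "longest_orfs.cd-hit-est"

# Only attempt the call when baitfindR is installed and the input exists
if (requireNamespace("baitfindR", quietly = TRUE) && file.exists(in_file)) {
  baitfindR::cd_hit_est(
    input = in_file,
    output = out_file,
    wd = getwd(),
    other_args = extra_args
  )
}
```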

Within the R environment, a list with the components described in the documentation for `run` (most likely `processx::run()`; the link was lost from this page).

Externally, two files will be written: according to the CD-HIT user's guide, "The output are two files: a fasta file of representative sequences and a text file of list of clusters."

The FASTA file will be named with the value of `output`; the list of clusters will have the same name with `.clstr` appended.
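
The cluster list is plain text: each cluster starts with a `>Cluster N` header, followed by one line per member, and the representative sequence is marked with a trailing `*`. A minimal sketch of reading it into a data frame with base R follows. `parse_clstr` is a hypothetical helper, not part of baitfindR, and the sample lines are made up for illustration.

```r
# Made-up lines in the .clstr layout (not real output)
clstr_lines <- c(
  ">Cluster 0",
  "0\t300nt, >seq1... *",
  "1\t280nt, >seq2... at +/97.50%",
  ">Cluster 1",
  "0\t150nt, >seq3... *"
)

# Hypothetical helper: one row per cluster member
parse_clstr <- function(lines) {
  is_header <- startsWith(lines, ">Cluster")
  # A running count of headers gives each line its 0-based cluster index
  cluster <- cumsum(is_header) - 1L
  members <- lines[!is_header]
  data.frame(
    cluster = cluster[!is_header],
    # The sequence name sits between ">" and the "..." that follows it
    name = sub("^[^>]*>(.*?)\\.\\.\\..*$", "\\1", members, perl = TRUE),
    # Representative sequences end with "*"
    representative = grepl("\\*\\s*$", members),
    stringsAsFactors = FALSE
  )
}

clusters <- parse_clstr(clstr_lines)
```

For a real run, the lines could be read with `readLines()` from the `.clstr` file written next to `output`.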

Joel H Nitta, joelnitta@gmail.com

http://www.bioinformatics.org/cd-hit/, http://transdecoder.github.io

## Not run: 
library(ape)
library(baitfindR)

# Make temp dir for storing output
temp_dir <- fs::dir_create(fs::path(tempdir(), "baitfindR_example"))
data("PSKY")

# Write downsized transcriptome to temp dir
write.FASTA(PSKY, fs::path(temp_dir, "PSKY"))

# Get CDS
transdecoder_long_orfs(
  transcriptome_file = fs::path(temp_dir, "PSKY"),
  wd = temp_dir
)

# Cluster similar genes in CDS
cd_hit_est(
  input = fs::path(temp_dir, "PSKY.transdecoder_dir", "longest_orfs.cds"),
  output = fs::path(temp_dir, "PSKY.cd-hit-est"),
  wd = temp_dir,
  echo = TRUE
)

# Check output
list.files(temp_dir)
head(readr::read_lines(fs::path(temp_dir, "PSKY.cd-hit-est")))
head(readr::read_lines(fs::path(temp_dir, "PSKY.cd-hit-est.clstr")))

# Cleanup
fs::file_delete(temp_dir)

## End(Not run)