cd_hit_est: Cluster DNA sequences.

Description

This is a wrapper for the CD-HIT-EST algorithm. According to the CD-HIT user's guide, "CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity." cd-hit-est comes bundled with TransDecoder, so this function runs the bundled copy.

Usage

cd_hit_est(
  input,
  output,
  wd = here::here(),
  other_args = NULL,
  echo = pkgconfig::get_config("baitfindR::echo", fallback = FALSE),
  ...
)

Arguments

input

Character vector of length one; the path to the input file for cd-hit-est. Should be DNA sequences in FASTA format, since cd-hit-est clusters nucleotide data.

output

Character vector of length one; the name to assign to the output. Can include a path, in which case the output will be written there.

wd

Character vector of length one; the directory where the command will be run.

other_args

Character vector; other arguments to pass to cd-hit-est. Each should be an element of the vector.

echo

Logical; should the standard output and error be printed to the screen?

...

Additional arguments. Not used by this function, but meant to be used by drake_plan for tracking during workflows.
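
As an illustration of the other_args format, the sketch below builds a vector for two cd-hit-est options, -c (sequence identity threshold) and -n (word length). It assumes that each flag and each value must be its own element of the vector, which is one reading of the description above. The helper build_other_args is hypothetical and not part of baitfindR.

```r
# Hypothetical helper (not part of baitfindR): flatten a named list of
# cd-hit-est options into a character vector with one token per element.
build_other_args <- function(opts) {
  stopifnot(is.list(opts), !is.null(names(opts)), all(nzchar(names(opts))))
  # Pair each flag with its value, then flatten to a plain character vector
  pairs <- Map(function(flag, value) c(flag, as.character(value)),
               names(opts), opts)
  unlist(pairs, use.names = FALSE)
}

# Assumed example settings: 95% identity threshold, word length 10
other_args <- build_other_args(list("-c" = 0.95, "-n" = 10))
print(other_args)

# The vector could then be passed to the wrapper (paths are placeholders):
# cd_hit_est(input = "in.cds", output = "out.cd-hit-est",
#            other_args = other_args)
```

Keeping the options in a named list makes them easy to vary between workflow runs, while the wrapper still receives a flat character vector.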

Value

Within the R environment, a list whose components are described in the documentation for run.

Externally, two files will be written: according to the CD-HIT user's guide, "The output are two files: a fasta file of representative sequences and a text file of list of clusters."

The fasta file will be named with the value of output; the list of clusters will have the same name with .clstr appended.
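
The .clstr file is plain text, so it can be read back into R for inspection. The following is a minimal sketch, assuming the usual CD-HIT layout in which each cluster starts with a ">Cluster N" header line and the representative sequence is marked with a trailing asterisk. The function parse_clstr is hypothetical, and the sample lines use made-up sequence names rather than real output.

```r
# Illustrative .clstr content (made-up sequence names, not real output)
clstr_lines <- c(
  ">Cluster 0",
  "0\t1200nt, >seqA... *",
  "1\t1100nt, >seqB... at +/97.50%",
  ">Cluster 1",
  "0\t900nt, >seqC... *"
)

# Hypothetical parser (not part of baitfindR): one row per clustered sequence
parse_clstr <- function(lines) {
  is_header <- startsWith(lines, ">Cluster")
  # Each member line belongs to the most recent header above it;
  # clusters are numbered from 0 in order of appearance
  cluster_id <- cumsum(is_header) - 1L
  members <- lines[!is_header]
  data.frame(
    cluster = cluster_id[!is_header],
    # Sequence name sits between ">" and the "..." that follows it
    sequence = sub("^.*>(.*?)\\.\\.\\..*$", "\\1", members, perl = TRUE),
    # The representative sequence of each cluster ends with "*"
    representative = grepl("\\*$", members),
    stringsAsFactors = FALSE
  )
}

clusters <- parse_clstr(clstr_lines)
print(clusters)
```

For real output, the lines could be read with readLines() on the path given to output with .clstr appended. Sequence names that themselves contain ">" or "..." are not handled by this sketch.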

Author(s)

Joel H Nitta, joelnitta@gmail.com

References

http://www.bioinformatics.org/cd-hit/, http://transdecoder.github.io

Examples

## Not run: 
library(ape)
library(baitfindR)

# Make temp dir for storing output
temp_dir <- fs::dir_create(fs::path(tempdir(), "baitfindR_example"))
data("PSKY")

# Write downsized transcriptome to temp dir
write.FASTA(PSKY, fs::path(temp_dir, "PSKY"))

# Get CDS
transdecoder_long_orfs(
  transcriptome_file = fs::path(temp_dir, "PSKY"),
  wd = temp_dir
  )

# Cluster similar genes in CDS
cd_hit_est(
  input = fs::path(temp_dir, "PSKY.transdecoder_dir", "longest_orfs.cds"),
  output = fs::path(temp_dir, "PSKY.cd-hit-est"),
  wd = temp_dir,
  echo = TRUE
)

# Check output
list.files(temp_dir)
head(readr::read_lines(fs::path(temp_dir, "PSKY.cd-hit-est")))
head(readr::read_lines(fs::path(temp_dir, "PSKY.cd-hit-est.clstr")))

# Cleanup: remove the temp directory and everything in it
fs::dir_delete(temp_dir)

## End(Not run)

joelnitta/baitfindR documentation built on May 7, 2020, 6:21 p.m.