runClustering: Run CD-HIT clustering
In timjeske/DEUS: Differential expression of unique sequences

Description Usage Arguments Value

Sequentially, each sequency is either assinged to an existing cluster or is classified as a new cluster representative if no matching cluster can be found.

1 2	runClustering(cdhit_path, sequences, out_dir, identity_cutoff, length_cutoff, wordlength, map, write_fastas = FALSE, optional = "")

`cdhit_path`	Path to cd-hit-est executable
`sequences`	Vector of sequences in FASTA style generated by the sequencesAsFasta
`out_dir`	Directory to save output files of clustering
`identity_cutoff`	Sequence identity cutoff used for clustering
`length_cutoff`	Length difference cutoff
`wordlength`	CD-Hit word length
`map`	A data frame with sequences as row names and sequence identifiers in first column. Can be generated by createMap
`write_fastas`	Boolean that indicates whether a fasta file will be generated for each cluster
`optional`	Optional execution parameters

A data frame with the columns 'SequenceID' and 'ClusterID' assigning each sequence to a cluster of similar sequences via their identifiers. Additionally, a file CD-HIT.fa, CD-HIT.fa.clstr and a folder Clusters is generated in the given output directory. The CD-HIT.fa file is the FASTA file of all cluster representatives. The CD-HIT.fa.clustr file lists all identified clusters and the assigned sequence identifiers together with the percentage of overlapping sequence with the cluster representative. In the Clusters directory there is a FASTA file for each cluster.

timjeske/DEUS documentation built on June 6, 2019, 12:59 p.m.