This grouping algorithm partly mimics the approach used by Roary, but instead of using BLAST in the second pass it uses cosine similarity of kmer feature vectors, which provides an even greater speedup. The algorithm first preclusters highly similar sequences with CD-HIT. It then extracts a representative from each precluster and groups the representatives using the standard FindMyFriends kmer cosine similarity.
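The kmer cosine similarity used in the second pass can be illustrated in base R. This is a minimal sketch rather than the FindMyFriends implementation: the helper names `kmers` and `cosineSim`, the toy sequences, and the kmer size of 2 are all assumptions chosen for illustration.

```r
# Hypothetical helper: all overlapping kmers of size k in a sequence.
kmers <- function(seq, k) {
  starts <- seq_len(nchar(seq) - k + 1)
  substring(seq, starts, starts + k - 1)
}

# Hypothetical helper: cosine similarity of the kmer count vectors
# of two sequences.
cosineSim <- function(seqA, seqB, k) {
  ka <- kmers(seqA, k)
  kb <- kmers(seqB, k)
  # Count both sequences over the same kmer vocabulary
  vocab <- union(ka, kb)
  va <- as.numeric(table(factor(ka, levels = vocab)))
  vb <- as.numeric(table(factor(kb, levels = vocab)))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

s1 <- "MKVLAAGIV"
s2 <- "MKVLAAGLV"  # one substitution relative to s1
s3 <- "QQWPRTEEN"  # shares no 2-mers with s1

cosineSim(s1, s2, 2)  # 0.75
cosineSim(s1, s3, 2)  # 0
```

Sequences that share most of their kmers score close to 1, while unrelated sequences score close to 0, which is what makes the measure usable as a fast stand-in for an alignment-based comparison.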
cdhitGrouping(object, ...)

## S4 method for signature 'pgVirtual'
cdhitGrouping(object, kmerSize, lowerLimit,
  maxLengthDif, geneChunkSize, cdhitOpts, cdhitIter = TRUE, nrep = 1,
  from = 0.9, by = 0.05)
| object | A pgVirtual subclass | 
| ... | parameters passed on. | 
| kmerSize | The size of the kmers used for the comparison. If two values are given, the first is used for the CD-HIT algorithm and the second for the cosine similarity calculations. | 
| lowerLimit | A numeric giving the lower bound of similarity. Similarities below this value are set to zero. | 
| maxLengthDif | The maximum deviation in sequence length to allow during preclustering with CD-HIT. Values below 1 describe a percentage (given as a fraction); values above 1 describe a fixed length. | 
| geneChunkSize | The maximum number of genes to pass to the CD-HIT algorithm. If object contains more genes than this, CD-HIT is run in chunks and the results are combined with a second CD-HIT pass before the final cosine similarity grouping. | 
| cdhitOpts | Additional arguments passed on to CD-HIT. It should be a named list with names corresponding to the arguments expected by the CD-HIT algorithm (without the dash). i, n and s/S will be overwritten based on the other parameters given to this function, and all values in cdhitOpts will be converted to character using as.character. | 
| cdhitIter | Logical. Should the preclustered groups be grouped by gradually lowering the threshold in CD-HIT, or by directly calculating kmer similarities between all preclusters and grouping by those? Defaults to TRUE. | 
| nrep | If  | 
| from | The starting similarity threshold to use for the iterative CD-HIT grouping. Together with by and nrep it defines the number of times and the levels at which CD-HIT is run. Defaults to 0.9. | 
| by | The step size to use for the iterative CD-HIT grouping. Defaults to 0.05. | 
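Two of the argument descriptions above can be made concrete in base R. The sketch below is illustrative only and does not reproduce the package internals: `applyLowerLimit` and `prepareOpts` are hypothetical helpers, the similarity values are made up, and dropping the reserved names is just one possible reading of "will be overwritten". `M` (memory limit) and `T` (threads) are CD-HIT flags; the values given here are arbitrary.

```r
# Assumption: similarities strictly below lowerLimit are replaced by zero.
applyLowerLimit <- function(sim, lowerLimit) {
  sim[sim < lowerLimit] <- 0
  sim
}

sim <- c(0.95, 0.40, 0.75)
applyLowerLimit(sim, lowerLimit = 0.5)
# 0.95 0.00 0.75

# cdhitOpts is a named list keyed by CD-HIT flag names without the dash.
cdhitOpts <- list(M = 0, T = 2, n = 3)

# Hypothetical helper: discard the names the documentation lists as
# overwritten (i, n, s, S) and convert the remaining values to character.
prepareOpts <- function(opts) {
  reserved <- c("i", "n", "s", "S")
  opts <- opts[!names(opts) %in% reserved]
  lapply(opts, as.character)
}

prepareOpts(cdhitOpts)
# list(M = "0", T = "2")
```

In practice this means a user-supplied `n` in cdhitOpts has no effect, since the kmer size for CD-HIT is taken from kmerSize instead.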
An object of the same class as 'object'.
pgVirtual: Grouping using cdhit for all pgVirtual subclasses
Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T. G., et al. (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, btv421.
Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W. (2012). CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, 28 (23), 3150–3152.
Li, W., Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Other grouping algorithms: gpcGrouping, graphGrouping, manualGrouping
# Load an example pangenome object
testPG <- .loadPgExample()
# Group its genes with CD-HIT preclustering, using the default settings
testPG <- cdhitGrouping(testPG)