cdhitGrouping: Gene grouping by preclustering with CD-HIT


Description

This grouping algorithm partly mimics the approach used by Roary, but instead of using BLAST in the second pass it uses cosine similarity of kmer feature vectors, which provides an even greater speedup. It first uses CD-HIT to precluster highly similar sequences, then extracts a representative from each precluster and groups the representatives using the standard FindMyFriends kmer cosine similarity.
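
The cosine similarity used in the second pass can be illustrated with a minimal base R sketch. This is not the FindMyFriends implementation: the helper names kmerWords and kmerCosine are hypothetical, and the sketch only counts overlapping kmers in two sequences and compares the resulting count vectors.

```r
# Hypothetical helper: all overlapping kmers (words of length k) in one sequence.
kmerWords <- function(sequence, k) {
  n <- nchar(sequence)
  if (n < k) return(character(0))
  starts <- seq_len(n - k + 1)
  substring(sequence, starts, starts + k - 1)
}

# Hypothetical helper: cosine similarity between the kmer count vectors
# of two sequences. Returns 0 if either sequence is shorter than k.
kmerCosine <- function(seqA, seqB, k) {
  wordsA <- kmerWords(seqA, k)
  wordsB <- kmerWords(seqB, k)
  if (length(wordsA) == 0 || length(wordsB) == 0) return(0)
  # Count each kmer over the shared vocabulary of both sequences
  vocabulary <- union(wordsA, wordsB)
  countsA <- tabulate(match(wordsA, vocabulary), nbins = length(vocabulary))
  countsB <- tabulate(match(wordsB, vocabulary), nbins = length(vocabulary))
  sum(countsA * countsB) / (sqrt(sum(countsA^2)) * sqrt(sum(countsB^2)))
}

# Two short sequences sharing two of their three 2-mers
kmerCosine("MKVL", "MKVA", k = 2)
```

Identical sequences give a similarity of 1 and sequences with no kmers in common give 0, which is why a lower limit on the similarity can be used to discard weak matches.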

Usage

cdhitGrouping(object, ...)

## S4 method for signature 'pgVirtual'
cdhitGrouping(object, kmerSize, lowerLimit,
  maxLengthDif, geneChunkSize, cdhitOpts, cdhitIter = TRUE, nrep = 1,
  from = 0.9, by = 0.05)

Arguments

object

A pgVirtual subclass

...

parameters passed on.

kmerSize

The size of the kmers used for the comparison. If two values are given, the first is used for the CD-HIT algorithm and the second for the cosine similarity calculations.

lowerLimit

A numeric giving the lower bound of similarity; similarities below this value are set to zero.

maxLengthDif

The maximum deviation in sequence length to allow during preclustering with CD-HIT. Values below 1 are interpreted as a fraction of the sequence length; values above 1 are interpreted as a fixed length.

geneChunkSize

The maximum number of genes to pass to the CD-HIT algorithm. If object contains more genes than this, CD-HIT will be run in chunks and combined with a second CD-HIT pass before the final cosine similarity grouping.

cdhitOpts

Additional arguments passed on to CD-HIT. It should be a named list with names corresponding to the arguments expected by the CD-HIT algorithm (without the dash). i, n and s/S will be overwritten based on the other parameters given to this function, and all values in cdhitOpts will be converted to character using as.character.

cdhitIter

Logical. Should the preclustered groups be grouped by gradually lowering the threshold in CD-HIT, or by directly calculating kmer similarities between all preclusters and grouping by those? Defaults to TRUE.

nrep

If cdhitIter = TRUE, controls how many iterations should be performed at each threshold level. Defaults to 1.

from

The starting similarity threshold to use for the iterative CD-HIT grouping. Together with by and nrep it defines how many times, and at which levels, CD-HIT is run. Defaults to 0.9.

by

The step size to use for the iterative CD-HIT grouping. Defaults to 0.05.
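
How several of these arguments might be interpreted can be sketched in base R. All helper names below are hypothetical and none of this is taken from the FindMyFriends source. In particular, the mapping of maxLengthDif onto the CD-HIT s and S options, and the use of a stopping threshold (the `to` argument) for the iterative schedule, are assumptions made for illustration only.

```r
# Hypothetical helper: translate maxLengthDif into a CD-HIT length option.
# Assumption: a fraction maps to s as 1 - maxLengthDif, a fixed length to S.
lengthOption <- function(maxLengthDif) {
  if (maxLengthDif < 1) {
    list(s = 1 - maxLengthDif)
  } else {
    list(S = maxLengthDif)
  }
}

# Hypothetical helper: split gene indices into chunks of at most geneChunkSize.
chunkGenes <- function(nGenes, geneChunkSize) {
  index <- seq_len(nGenes)
  unname(split(index, ceiling(index / geneChunkSize)))
}

# Hypothetical helper: drop the options that are documented as being
# overwritten (i, n, s, S) and convert the remaining values to character.
prepareCdhitOpts <- function(cdhitOpts) {
  reserved <- c("i", "n", "s", "S")
  kept <- cdhitOpts[setdiff(names(cdhitOpts), reserved)]
  lapply(kept, as.character)
}

# Hypothetical helper: thresholds visited when cdhitIter = TRUE.
# Assumption: thresholds descend from `from` in steps of `by` down to `to`,
# with each level repeated nrep times.
thresholdSchedule <- function(from = 0.9, by = 0.05, nrep = 1, to = 0.5) {
  rep(seq(from, to, by = -by), each = nrep)
}

lengthOption(0.1)
chunkGenes(10, 4)
prepareCdhitOpts(list(g = 1, n = 5))
thresholdSchedule(from = 0.9, by = 0.1, nrep = 2, to = 0.7)
```

Under these assumptions a smaller by or a larger nrep increases the number of CD-HIT runs, trading speed for a more gradual merging of preclusters.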

Value

An object of the same class as 'object'.

Methods (by class)

References

Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T. G., et al. (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, btv421.

Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W. (2012). CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, 28 (23), 3150–3152.

Li, W., Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22 (13), 1658–1659.

See Also

Other grouping algorithms: gpcGrouping, graphGrouping, manualGrouping

Examples

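# Load the example pangenome shipped with FindMyFriends
testPG <- .loadPgExample()

# Group genes with CD-HIT preclustering, without overriding any arguments
testPG <- cdhitGrouping(testPG)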

thomasp85/FindMyFriends documentation built on April 25, 2020, 1:06 p.m.