neighborhoodSplit: Split gene groups by neighborhood synteny


Description

This function evaluates already created gene groups and splits the members into new groups based on the synteny of the flanking genes and the similarity of the sequences. The splitting consists of multiple stages that every gene pair must pass in order to remain in the same group. First, the link between two genes is removed if they belong to the same organism. Then the synteny of the flanking genes is assessed, and if it does not pass the defined threshold the link between the gene pair is removed. Next, the kmer similarity of the two sequences is calculated, and if it is below a certain threshold the link is removed. Lastly, the lengths of the two sequences are compared, and if they differ by more than the allowed deviation the link is removed.

Based on this new graph, cliques are detected and sorted by the lowest within-clique sequence similarity and neighborhood synteny. The cliques are then added as new groups, provided their members are not already part of a new group, until all members are part of a new group. This approach ensures that all members of a new group pass the conditions when compared to all other members of the same group.

After the splitting, a refinement step merges gene groups that have high similarity and share a neighbor either up- or downstream, in order to avoid spurious errors resulting from the initial grouping.
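The pairwise filtering stages described above can be illustrated with a small base R sketch. This is not the FindMyFriends implementation: the toy data, the minSharedFlanks threshold, the use of shared flanking groups as a synteny measure, and the normalisation of the length difference by the longer sequence are all assumptions made for illustration.

```r
# Toy illustration of the staged link filtering (hypothetical data and names).
genes <- data.frame(
  id       = c("g1", "g2", "g3", "g4"),
  organism = c("A", "B", "C", "A"),
  length   = c(900, 910, 600, 905),
  stringsAsFactors = FALSE
)

# Gene groups of the flanking genes for each gene, in the same order as `genes`
flanks <- list(
  g1 = c(10, 11, 12, 13),
  g2 = c(10, 11, 12, 14),
  g3 = c(10, 11, 12, 13),
  g4 = c(20, 21, 22, 23)
)

# Symmetric matrix of pairwise sequence similarities (made-up values)
sim <- matrix(c(
  1.00, 0.95, 0.90, 0.97,
  0.95, 1.00, 0.88, 0.60,
  0.90, 0.88, 1.00, 0.85,
  0.97, 0.60, 0.85, 1.00
), nrow = 4, byrow = TRUE, dimnames = list(genes$id, genes$id))

minSharedFlanks <- 2    # assumed synteny threshold, not a package parameter
lowerLimit      <- 0.75
maxLengthDif    <- 0.1  # treated as a proportion in this sketch

n <- nrow(genes)
keep <- matrix(FALSE, n, n, dimnames = list(genes$id, genes$id))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    sameOrganism <- genes$organism[i] == genes$organism[j]
    sharedFlanks <- length(intersect(flanks[[i]], flanks[[j]]))
    lengthDif <- abs(genes$length[i] - genes$length[j]) /
      max(genes$length[i], genes$length[j])
    # A link survives only if the pair passes every stage
    keep[i, j] <- keep[j, i] <- !sameOrganism &&
      sharedFlanks >= minSharedFlanks &&
      sim[i, j] >= lowerLimit &&
      lengthDif <= maxLengthDif
  }
}
keep
```

In this toy data only the g1/g2 link survives: g1 and g4 come from the same organism, g4 shares no flanking groups with g2 or g3, and g3 is too short relative to g1 and g2. Clique detection on the remaining graph is not shown here.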

Usage

neighborhoodSplit(object, ...)

## S4 method for signature 'pgVirtualLoc'
neighborhoodSplit(object, flankSize,
  forceParalogues, kmerSize, lowerLimit, maxLengthDif,
  guideGroups = NULL, cdhitOpts = list())

Arguments

object

A pgVirtualLoc subclass

...

Parameters passed on to methods.

flankSize

The number of flanking genes on each side of the gene to use for comparison.

forceParalogues

Force the similarity of paralogue genes to 0.

kmerSize

The length of the kmers used for calculating sequence similarity.

lowerLimit

The lower limit of sequence similarity. Similarities below this value are set to 0.

maxLengthDif

The maximum deviation in sequence length to allow. A value between 0 and 1 is interpreted as a proportion of the sequence length. A value above 1 is interpreted as a fixed length difference.

guideGroups

An integer vector with prior grouping that, all else being equal, should be prioritized. Used internally.

cdhitOpts

A list of options to pass on to CD-Hit during the merging step. "l", "n" and "s"/"S" will be overridden.
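As a rough sketch of how lowerLimit and maxLengthDif could be interpreted, the base R helpers below apply a similarity cutoff and a two-mode length check. Both function names are hypothetical and not part of FindMyFriends. The choice to measure proportional differences against the longer sequence, and to treat values below 1 as proportions, are assumptions, since the argument descriptions do not specify these details.

```r
# Hypothetical helpers for illustration only; not FindMyFriends functions.

# Set every similarity below the limit to 0
applyLowerLimit <- function(sim, lowerLimit) {
  sim[sim < lowerLimit] <- 0
  sim
}

# Check whether two sequence lengths are within the allowed deviation
lengthsCompatible <- function(len1, len2, maxLengthDif) {
  dif <- abs(len1 - len2)
  if (maxLengthDif < 1) {
    # Proportional mode: assumed to be relative to the longer sequence
    dif / pmax(len1, len2) <= maxLengthDif
  } else {
    # Fixed mode: absolute difference in sequence length
    dif <= maxLengthDif
  }
}

applyLowerLimit(c(0.9, 0.74, 0.75), lowerLimit = 0.75)
lengthsCompatible(900, 910, maxLengthDif = 0.1)
lengthsCompatible(900, 600, maxLengthDif = 0.1)
lengthsCompatible(900, 600, maxLengthDif = 350)
```

With these assumptions, a 10-residue difference between sequences of about 900 residues passes a 0.1 proportional limit, while a 300-residue difference only passes when a sufficiently large fixed limit is given.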

Value

An object with the same class as object containing the new grouping.

Methods (by class)

See Also

Other group-splitting: kmerSplit

Examples

testPG <- .loadPgExample(geneLoc=TRUE, withGroups=TRUE)

# Too heavy to run
## Not run: 
testPG <- neighborhoodSplit(testPG, lowerLimit=0.75)

## End(Not run)

FindMyFriends documentation built on Nov. 8, 2020, 6:46 p.m.