gintools: Genomic DNA Integration Analysis Tools


Given a group of discrete factors (e.g. position IDs) and integer values, the function corrects or clusters the integer values based on their frequency within a defined window size.

.clusterSites(posID = NULL, value = NULL, grouping = NULL,
  psl.rd = NULL, weight = NULL, windowSize = 5L, byQuartile = FALSE,
  quartile = 0.7, parallel = TRUE, sonicAbund = FALSE)

`posID`	a vector of groupings for the value parameter (e.g. chromosome and strand). Required if the psl.rd parameter is not defined.
`value`	a vector of integer values that need to be corrected or clustered (e.g. positions). Required if the psl.rd parameter is not defined.
`grouping`	an additional grouping vector, of the same length as posID or psl.rd, by which to pool the rows (e.g. sample names). Default is NULL.
`psl.rd`	a GRanges object returned from `getIntegrationSites`. Default is NULL.
`weight`	a numeric vector of weights to use when calculating frequency of value by posID and grouping if specified. Default is NULL.
`windowSize`	size of window within which values should be corrected or clustered. Default is 5.
`byQuartile`	flag denoting whether the quartile-based technique should be employed. See the Note for details. Default is FALSE.
`quartile`	if byQuartile=TRUE, the quartile that serves as the threshold. Default is 0.70.
`parallel`	use a parallel backend to perform the calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, a serial version is run using `SerialParam`. The process is split by the grouping column.
`sonicAbund`	calculate breakpoint abundance using `getSonicAbund`. Default is FALSE.

A data frame with clusteredValues and frequency shown alongside the original input. If the psl.rd parameter is defined, a GRanges object is returned with three new columns appended at the end: clusteredPosition, clonecount, and clusterTopHit (a representative for a given cluster, chosen as the best scoring hit).

The algorithm for clustering when byQuartile=TRUE is as follows: for all values in each grouping, get a distribution and test whether their frequency is >= the quartile threshold. For values below the quartile threshold, test whether any of them overlap with the values that passed the threshold and fall within the defined windowSize. If there is a match, merge with the higher value; otherwise leave the value as is. This is only useful if the distribution is wide and multimodal. When byQuartile=FALSE, for each group the values within the defined window are merged with the next most frequently occurring value; if frequencies are tied, the lowest value is used to represent the cluster. When psl.rd is passed, multihits are ignored and only unique sites are clustered. All multihits will be tagged as a good 'clusterTopHit'.
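
The byQuartile=FALSE rule described above can be illustrated with a small base R sketch. It is not the package's implementation: `cluster_values_sketch` and its argument names are hypothetical, it handles a single group only, ignores weights, and makes one pass in which every value is mapped to the most frequent value within the window (the lowest value wins a tie). The real function may iterate or resolve overlapping windows differently.

```r
# Hypothetical helper, not part of the package: a single-pass approximation
# of the byQuartile=FALSE merging rule for one group of values.
cluster_values_sketch <- function(values, window_size = 5L) {
  values <- as.integer(values)

  # Distinct values and how often each one occurs.
  uniq <- sort(unique(values))
  freq <- tabulate(match(values, uniq), nbins = length(uniq))

  # For each distinct value, pick a representative among the distinct values
  # within +/- window_size: highest frequency first, lowest value on ties.
  representative <- vapply(uniq, function(v) {
    near <- abs(uniq - v) <= window_size
    cand_vals <- uniq[near]
    cand_freq <- freq[near]
    cand_vals[order(-cand_freq, cand_vals)][1]
  }, integer(1))

  clustered <- representative[match(values, uniq)]

  # Frequency of each clustered value after merging.
  clustered_freq <- tabulate(match(clustered, uniq), nbins = length(uniq))

  data.frame(
    value = values,
    clusteredValue = clustered,
    freq = clustered_freq[match(clustered, uniq)]
  )
}

res <- cluster_values_sketch(c(1000, 1000, 1000, 1002, 1003, 1003, 2000))
res
```

In this toy input, 1002 and 1003 lie within five positions of 1000, which is the most frequent value, so both are reassigned to 1000; 2000 has no neighbor in its window and is left unchanged.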

Nirav Malani

findIntegrations, getIntegrationSites, otuSites, isuSites, crossOverCheck, pslToRangedObject, getSonicAbund

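# Cluster raw positions, pooled by posID within each grouping
.clusterSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-',
                        'chr16-', 'chr11-'),
              value = c(rep(1000, 2), 5832, 1000, 12324, 65738, 928042),
              grouping = c('a', 'a', 'a', 'b', 'b', 'b', 'c'))

# Cluster integration sites from a random subset of the example PSL data
data(psl)
psl <- psl[sample(nrow(psl), 100), ]
psl.rd <- getIntegrationSites(pslToRangedObject(psl))
# Derive the grouping from the part of qName before the last hyphen
psl.rd$grouping <- sub("(.+)-.+", "\\1", psl.rd$qName)
.clusterSites(grouping = psl.rd$grouping, psl.rd = psl.rd)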