clusterSites: Cluster/Correct values within a window based on their...


View source: R/hiReadsProcessor.R

Description

Given a group of discrete factors (e.g. position IDs) and integer values, the function tries to correct or cluster the integer values based on their frequency within a defined window size.

Usage

clusterSites(
  posID = NULL,
  value = NULL,
  grouping = NULL,
  psl.rd = NULL,
  weight = NULL,
  windowSize = 5L,
  byQuartile = FALSE,
  quartile = 0.7,
  parallel = TRUE,
  sonicAbund = FALSE
)

Arguments

posID

a vector of groupings for the value parameter (e.g. chromosome and strand). Required if the psl.rd parameter is not defined.

value

a vector of integer values that need to be corrected or clustered (e.g. positions). Required if the psl.rd parameter is not defined.

grouping

an additional grouping vector, the same length as posID or psl.rd, by which to pool the rows (e.g. sample names). Default is NULL.

psl.rd

a GRanges object returned from getIntegrationSites. Default is NULL.

weight

a numeric vector of weights to use when calculating frequency of value by posID and grouping if specified. Default is NULL.

windowSize

size of window within which values should be corrected or clustered. Default is 5.

byQuartile

flag denoting whether the quartile-based technique should be employed. See Note for details. Default is FALSE.

quartile

if byQuartile=TRUE, then the quartile which serves as the threshold. Default is 0.70.

parallel

use a parallel backend to perform the calculation with BiocParallel. Defaults to TRUE. If no parallel backend is registered, then a serial version is run using SerialParam. The process is split by the grouping column.

sonicAbund

calculate breakpoint abundance using getSonicAbund. Default is FALSE.
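
To make the byQuartile and quartile arguments above more concrete, here is a minimal base R sketch of a frequency threshold taken at the 0.70 quantile. It is an illustration under assumptions, not the package's code: the toy values are made up, and clusterSites may compute or apply its threshold differently.

```r
# Toy positions within one posID/grouping combination (made-up data).
value <- c(rep(1000L, 6), 1002L, rep(1010L, 3), rep(2000L, 2))

# Frequency of each distinct value.
freq <- table(value)
counts <- as.vector(freq)
vals <- as.integer(names(freq))

# Threshold at the 0.70 quantile of the observed frequencies
# (assumption: base R's default quantile type is used here).
threshold <- quantile(counts, probs = 0.70, names = FALSE)

# Values whose frequency meets the threshold; the remaining values are
# the candidates for merging into a nearby value that passed.
passed <- vals[counts >= threshold]
below <- vals[counts < threshold]
```

Only values in `below` that lie within windowSize of a value in `passed` would then be candidates for merging.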

Value

a data frame with clusteredValues and frequency shown alongside the original input. If the psl.rd parameter is defined, then a GRanges object is returned with three new columns appended at the end: clusteredPosition, clonecount, and clusterTopHit (a representative for a given cluster, chosen by best scoring hit).

Note

The algorithm for clustering when byQuartile=TRUE is as follows: for all values in each grouping, get a distribution and test if their frequency is >= the quartile threshold. For values below the quartile threshold, test if any of them overlap with the ones that passed the threshold and are within the defined windowSize. If there is a match, then merge with the higher value, else leave it as is. This is only useful if the distribution is wide and multimodal.

When byQuartile=FALSE, for each group the values within the defined window are merged with the next most frequently occurring value; if frequencies are tied, then the lowest value is used to represent the cluster.

When psl.rd is passed, multihits are ignored and only unique sites are clustered. All multihits will be tagged as a good 'clusterTopHit'.
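
The byQuartile=FALSE behaviour described in this note can be sketched in base R as a greedy merge. This is a simplified illustration, not the package's implementation: `merge_within_window` is a hypothetical helper, it handles a single posID/grouping combination, it ignores the weight argument, and it assumes the window is inclusive (distance <= windowSize).

```r
# Hypothetical helper, NOT part of hiReadsProcessor.
merge_within_window <- function(value, windowSize = 5L) {
  value <- as.integer(value)

  # Frequency of each distinct value, in ascending order of value.
  freq <- table(value)
  vals <- as.integer(names(freq))
  counts <- as.vector(freq)

  # Visit candidate representatives from most to least frequent,
  # breaking frequency ties in favour of the lowest value.
  ord <- order(-counts, vals)

  rep_of <- rep(NA_integer_, length(vals))
  for (i in ord) {
    if (!is.na(rep_of[i])) next  # already absorbed by another value
    # Assumption: the window is inclusive on both sides.
    near <- abs(vals - vals[i]) <= windowSize & is.na(rep_of)
    rep_of[near] <- vals[i]
  }

  # Map each input value to its representative and count cluster sizes.
  clustered <- rep_of[match(value, vals)]
  data.frame(
    value = value,
    clusteredValue = clustered,
    freq = ave(seq_along(value), clustered, FUN = length)
  )
}

res <- merge_within_window(c(1000, 1000, 1002, 1003, 5832), windowSize = 5L)
```

In this toy input, 1002 and 1003 fall within five bases of the more frequent 1000 and are merged into it, while 5832 stays on its own.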

See Also

findIntegrations, getIntegrationSites, otuSites, isuSites, crossOverCheck, pslToRangedObject, getSonicAbund

Examples

clusterSites(
  posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 'chr16-', 'chr11-'),
  value = c(rep(1000, 2), 5832, 1000, 12324, 65738, 928042),
  grouping = c('a', 'a', 'a', 'b', 'b', 'b', 'c'),
  parallel = FALSE
)
## Not run: 
data(psl)
psl <- psl[sample(nrow(psl), 100), ]
psl.rd <- getIntegrationSites(pslToRangedObject(psl))
psl.rd$grouping <- sub("(.+)-.+", "\\1", psl.rd$qName)
clusterSites(grouping = psl.rd$grouping, psl.rd = psl.rd)

## End(Not run)

malnirav/hiReadsProcessor documentation built on July 29, 2021, 6:33 a.m.