clusterSites: Assigns CpG cluster memberships on CpG sites within 'BSraw'...
In BiSeq: Processing and analyzing bisulfite sequencing data


Within a BSraw object, clusterSites searches for agglomerations of CpG sites across all samples. In a first step, the data is reduced to CpG sites covered in at least round(perc.samples*ncol(object)) samples; these are called 'frequently covered CpG sites'. In a second step, regions are detected where at least min.sites frequently covered CpG sites are sufficiently close to each other (max.dist). Note that the frequently covered CpG sites only define the boundaries of the CpG clusters. The subsequent analysis uses the methylation data of all CpG sites within these clusters.

clusterSites(object, groups, perc.samples, min.sites, max.dist, mc.cores, ...)

`object`	A `BSraw`.
`groups`	OPTIONAL. A factor specifying two or more sample groups within the given object. See Details.
`perc.samples`	A numeric between 0 and 1. Is passed to `filterBySharedRegions`.
`min.sites`	A numeric. Clusters should comprise at least `min.sites` CpG sites which are covered in at least `perc.samples` of samples, otherwise clusters are dropped.
`max.dist`	A numeric. CpG sites which are covered in at least `perc.samples` of samples within a cluster should not be more than `max.dist` bp apart from their nearest neighbors.
`mc.cores`	Passed to `mclapply`. Default is 1.
`...`	Further arguments passed to the `filterBySharedRegions` function.

Three parameters are important: perc.samples, min.sites and max.dist. For example, if perc.samples=0.5, the algorithm detects all CpG sites that are covered in at least 50% of the samples. These are the frequently covered CpG sites. In the next step, the algorithm determines the distances between neighboring frequently covered CpG sites. When they are no more than max.dist base pairs apart, those frequently covered CpG sites and all less frequently covered CpG sites in between belong to the same cluster. In the third step, each cluster is checked for its number of frequently covered CpG sites. If this number is less than min.sites, the cluster is discarded.

In other words: 1. The perc.samples parameter defines which CpG sites are frequently covered. 2. The frequently covered CpG sites determine the boundaries of the clusters, depending on their distance to each other. 3. Clusters are discarded if they contain too few frequently covered CpG sites.
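The three steps above can be illustrated with a minimal base R sketch on made-up data. This is not the BiSeq implementation: the positions, coverage counts, parameter values and the "cluster_1" label format are all assumptions chosen for illustration, and the sketch ignores chromosomes, strands and sample groups.

```r
# Toy illustration of the clustering logic (NOT the BiSeq source code).
# All data and the cluster label format below are hypothetical.
perc.samples <- 0.5
min.sites    <- 3
max.dist     <- 50

pos       <- c(100, 120, 150, 400, 420, 430, 445, 900)  # CpG positions (bp)
n.covered <- c(3, 1, 3, 4, 1, 2, 3, 4)  # samples covering each site
n.samples <- 4

# Step 1: frequently covered sites are covered in enough samples
frequent <- n.covered >= round(perc.samples * n.samples)
fpos <- pos[frequent]

# Step 2: start a new run whenever the gap to the previous
# frequently covered site exceeds max.dist
run <- cumsum(c(TRUE, diff(fpos) > max.dist))

# Step 3: keep only runs with at least min.sites frequently covered sites
keep <- which(tabulate(run) >= min.sites)

# Cluster boundaries are the first and last frequently covered site of a run
starts <- fpos[!duplicated(run)][keep]
ends   <- fpos[!duplicated(run, fromLast = TRUE)][keep]

# Assign every site inside a kept cluster, including less frequently
# covered ones, to that cluster
cluster.id <- rep(NA_character_, length(pos))
for (i in seq_along(keep)) {
  inside <- pos >= starts[i] & pos <= ends[i]
  cluster.id[inside] <- paste0("cluster_", i)
}

result <- data.frame(pos = pos, cluster.id = cluster.id)[!is.na(cluster.id), ]
print(result)
```

In this toy data, the sites at 100 and 150 are close enough to form a run, but with only two frequently covered sites the run is discarded. The sites at 400, 430 and 445 form a retained cluster, and the less frequently covered site at 420 is included because it lies within the cluster boundaries.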

If the groups argument is given, perc.samples (or no.samples) is applied to all group levels.

A BSraw object reduced to CpG sites within CpG cluster regions. A cluster.id metadata column on the rowRanges assigns cluster memberships per CpG site.

Katja Hebestreit

filterBySharedRegions, mclapply

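library(BiSeq)
data(rrbs)  # example RRBS data shipped with BiSeq
# Keep clusters with at least 20 CpG sites covered in 4/5 of the samples,
# where neighboring frequently covered sites are at most 100 bp apart
rrbs.clust <- clusterSites(object = rrbs, groups = colData(rrbs)$group,
                           perc.samples = 4/5, min.sites = 20,
                           max.dist = 100)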