cluster_peak-method: Clustering the peaks with the k-mean alignment algorithm
In FunChIP: Clustering and Alignment of ChIP-Seq peaks based on their shapes


It classifies and aligns the peaks stored in the GRanges object. The method applies the k-mean alignment algorithm with shift of the peaks and a distance based on the convex combination of the L^p distances between the spline-smoothed peaks and their derivatives. The order p can be one of 1, 2 and ∞ (the latter is selected with `p = 0`).

## S4 method for signature 'GRanges'
cluster_peak(object, parallel = FALSE, num.cores = NULL,
    n.clust = NULL,  seeds = NULL, shift.peak = NULL, weight = NULL,
    subsample.weight = 100, alpha = 1, p = 1, t.max = 0.5,
    plot.graph.k = TRUE, verbose = TRUE, rescale = FALSE )

`object`	GRanges object of length N. It must contain the metadata columns `spline`, `spline_der`, `width_spline`, computed by smooth_peak.
`parallel`	logical. If `TRUE`, the clustering for different values of the parameter k in `n.clust` are run in parallel. Default is `FALSE`.
`num.cores`	integer. If `parallel` is `TRUE`, it indicates the number of cores used in the parallelization. If `NULL` (default), the number of cores is automatically identified.
`n.clust`	integer vector (or scalar). Number of clusters into which the data set is divided (a single value if `n.clust` is a scalar). For each value of the vector, the C++ function `kmean_function` is called.
`seeds`	vector. Indices of the initial centers of the clusters, needed to initialize the k-mean procedure. The k-mean alignment, like all the k-mean-like algorithms, is dependent on the choice of the initial centers of the clusters, and each initialization of the seeds can generate slightly different results. The values must be included in 1, …, N. The length of the vector must be equal to the maximum number of clusters analyzed (`max(n.clust)`), otherwise it is truncated to this value, or the missing values are randomly generated. If `NULL` (default), the seeds are detected as the most central values (i.e. peaks with minimum distance from the others) of the set of peaks. If `seeds='random'`, the centers are randomly generated.
`shift.peak`	logical. It indicates whether the alignment via a translation of the abscissae is performed (`shift.peak = TRUE`) or not (`shift.peak = FALSE`). If no value is provided (`shift.peak = NULL`, default), both analyses are performed.
`weight`	real. Weight w of the distance function (see Details for the definition of the distance function), needed to make the distance between splines and the distance between derivatives comparable. If no value is provided (default is `NULL`), it is computed as the median of the ratio between the pairwise distances of the data, d_0(i, j), and of the derivatives, d_1(i, j): w = median( d_0(i, j) / d_1(i, j) ), with i, j = 1, …, N.
`subsample.weight`	integer value. Number of data points used to define the `weight`, if not assigned. Using all the peaks to define the weight can be computationally expensive and therefore a subsampling is suggested. If `subsample.weight=NULL` all the data will be used. Default is 100, which is a reasonable trade off between running time and reliability of the estimation.
`alpha`	real value between 0 and 1. Value of the convex weight α of the distance to balance the distance between data and derivatives. See details for the definition. Default is 1.
`p`	integer value in {0, 1, 2}. Order of the L^p distance used. In particular `p = 0` stands for the L^{∞} distance, `p = 1` for L^1 and `p = 2` for L^2.
`t.max`	real value. It tunes the maximum shift allowed. In particular the maximum shift at each iteration is computed as max_shift = t.max * range(`object`), and the optimum registration coefficient is searched between -max_shift and +max_shift. range(`object`) is the maximum amplitude of the peaks. Default is 0.5.
`plot.graph.k`	logical. If `TRUE`, the graph of the average distance between the data and the corresponding center of the cluster is plotted as a function of the number of clusters. If `shift.peak = NULL`, both the analyses with and without alignment are performed, and two lines are drawn to show the decrease of the global distance introduced by the alignment procedure. Default is `TRUE`.
`verbose`	logical. If `TRUE`, some parameters of the algorithm and the progress of the iterations are shown; if `FALSE`, no information is provided. Default is `TRUE`, but consider setting the parameter to `FALSE` for parallel runs, to avoid overlapping outputs.
`rescale`	logical. If `TRUE` clustering is performed on scaled peaks. For the definition of scaled peaks see smooth_peak.
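The default rule for `weight` (median ratio of pairwise distances between curves and between their derivatives, computed on a subsample) can be illustrated in base R. This is only a sketch under stated assumptions: the toy `curves` matrix, the finite-difference derivatives and the helper `estimate_weight` are hypothetical and are not part of FunChIP, which works on the splines produced by smooth_peak.

```r
# Toy stand-in for smoothed peaks: one curve per row on a common grid.
# (Purely illustrative; FunChIP uses the splines computed by smooth_peak.)
set.seed(1)
grid    <- seq(-3, 3, length.out = 61)
heights <- runif(20, 5, 15)
centers <- runif(20, -0.5, 0.5)
curves  <- t(sapply(seq_along(heights),
                    function(i) heights[i] * exp(-(grid - centers[i])^2)))

# First derivatives approximated by finite differences along each row
derivs <- t(apply(curves, 1, diff)) / diff(grid)[1]

# Hypothetical helper: w = median(d_0(i, j) / d_1(i, j)) on a subsample of peaks.
# p follows the convention of the help page: 0 = sup norm, 1 = L^1, 2 = L^2.
estimate_weight <- function(curves, derivs, p = 2, subsample = 100) {
  method <- switch(as.character(p),
                   "0" = "maximum", "1" = "manhattan", "2" = "euclidean")
  n   <- nrow(curves)
  idx <- if (is.null(subsample) || subsample >= n) seq_len(n) else sample(n, subsample)
  d0  <- as.vector(dist(curves[idx, , drop = FALSE], method = method))
  d1  <- as.vector(dist(derivs[idx, , drop = FALSE], method = method))
  median(d0 / d1)
}

w <- estimate_weight(curves, derivs, p = 2, subsample = 10)
```

With `subsample = NULL` (or a value at least as large as the number of peaks) every pair of curves enters the median, which is more stable but scales quadratically with the number of peaks.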

See [Sangalli et al., 2010] and the package vignette for the complete description of the algorithm. The algorithm is completely defined once we fix the family of the warping function for the alignment and the distance function. In this function we focus only on the specific case of

warping functions: shifts with integer coefficients

h(t) = t + c,

with c an integer value;
distance: convex combination of the L^p distance between data and derivatives. The distance between f and g is

d(f, g) = (1 - α) || f - g ||_p + α w || f' - g' ||_p

The choice of || . ||_p corresponds to the value of p in input. In particular p = 0 stands for ||.||_L^∞, p = 1 for || . ||_L^1 and p = 2 for || . ||_L^2
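To make the two ingredients above concrete, the following base R sketch evaluates the distance d(f, g) on curves sampled on a unit-spaced grid and searches the best integer shift c by brute force. It is an illustration only: `lp_norm`, `peak_distance`, `shift_curve`, `fwd_diff` and `best_shift` are hypothetical helpers, derivatives are taken as forward differences, and the actual package performs these steps in its compiled code on the spline representation.

```r
# L^p norm of a sampled curve (p = 0 encodes the sup norm, as in the help page)
lp_norm <- function(x, p) {
  switch(as.character(p),
         "0" = max(abs(x)),
         "1" = sum(abs(x)),
         "2" = sqrt(sum(x^2)),
         stop("p must be 0, 1 or 2"))
}

# d(f, g) = (1 - alpha) * ||f - g||_p + alpha * w * ||f' - g'||_p
peak_distance <- function(f, g, df, dg, alpha = 1, w = 1, p = 2) {
  (1 - alpha) * lp_norm(f - g, p) + alpha * w * lp_norm(df - dg, p)
}

# Evaluate x(t + s) on the original grid, padding with zeros outside the support
shift_curve <- function(x, s) {
  n   <- length(x)
  out <- numeric(n)
  src <- seq_len(n) + s
  ok  <- src >= 1 & src <= n
  out[ok] <- x[src[ok]]
  out
}

# Forward difference, padded so that it has the same length as x
fwd_diff <- function(x) c(diff(x), 0)

# Try every integer shift in [-max_shift, max_shift] and keep the closest one
best_shift <- function(f, g, alpha = 1, w = 1, p = 2, max_shift = 5) {
  shifts <- -max_shift:max_shift
  d <- sapply(shifts, function(s) {
    fs <- shift_curve(f, s)
    peak_distance(fs, g, fwd_diff(fs), fwd_diff(g), alpha, w, p)
  })
  list(shift = shifts[which.min(d)], distance = min(d))
}

tt  <- 1:50
g   <- exp(-(tt - 25)^2 / 20)
f   <- exp(-(tt - 28)^2 / 20)   # same shape as g, displaced by 3 grid points
res <- best_shift(f, g, alpha = 0.5, max_shift = 5)
```

In this toy case `res$shift` recovers the displacement of 3 grid points. Setting `alpha = 1` would compare derivatives only, while `alpha = 0` would compare the curves themselves.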

The GRanges object with new metadata columns:

if shift.peak is TRUE or NULL, i.e. the clustering with alignment is performed, the following metadata columns are added:
- cluster_shift: for each peak, a vector of length equal to the maximum number of chosen clusters, containing at each position k the label of the cluster the peak is assigned to, when the total number of clusters is k and alignment is performed during the clustering. If k is not present in the n.clust vector, the corresponding value is NA.
- coef_shift: for each peak, a vector of length equal to the maximum number of chosen clusters, containing at each position k the shift coefficient assigned to the peak, when the total number of clusters is k and alignment is performed during clustering. If k is not present in the vector n.clust the corresponding value is NA.
- dist_shift: for each peak, a vector of length equal to the maximum number of chosen clusters, containing at each position k the distance of the specific peak from the corresponding center of the cluster, when the total number of clusters is k and alignment is performed during clustering. If k is not present in the vector n.clust the corresponding value is NA.
if shift.peak is FALSE or NULL, i.e. clustering is performed without alignment, the following metadata columns are added:
- cluster_NOshift: for each peak, a vector of length equal to the maximum number of chosen clusters, containing at each position k the label of the cluster the peak is assigned to, when the total number of clusters is k and no alignment is performed during clustering. If k is not present in the vector n.clust the corresponding value is NA.
- dist_NOshift: for each peak, a vector of length equal to the maximum number of chosen clusters, containing at each position k the distance of the peak from the corresponding center of the cluster, when the total number of clusters is k and no alignment is performed during clustering. If k is not present in the vector n.clust the corresponding value is NA.
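Since each of these columns stores one vector per peak, indexed by the number of clusters k, the labels for a given k can be pulled out with a small helper. The sketch below uses a plain list as a hypothetical stand-in for the `cluster_shift` column (as if `n.clust = c(2, 3)` had been requested, so position 1 is NA); `labels_for_k` is an illustrative name, not a FunChIP function.

```r
# Hypothetical stand-in for the cluster_shift column: one vector per peak,
# position k holding the label assigned when k clusters are requested.
cluster_shift <- list(
  c(NA, 1, 1),
  c(NA, 1, 2),
  c(NA, 2, 3),
  c(NA, 2, 3)
)

# Extract, for every peak, the label obtained with k clusters
labels_for_k <- function(column, k) {
  vapply(column, function(v) v[k], numeric(1))
}

labels_k3 <- labels_for_k(cluster_shift, 3)
table(labels_k3)   # cluster sizes for k = 3
```

On a real result, and assuming the metadata column behaves like a list of numeric vectors, the same helper could be applied to `clustered_peaks$cluster_shift`; positions for values of k not in `n.clust` come back as NA.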

Alice Parodi, Marco J. Morelli, Laura M. Sangalli, Piercesare Secchi, Simone Vantini

Sangalli, L. M., Secchi, P., Vantini, S. and Vitelli, V., 2010. K-mean alignment for curve clustering. Computational Statistics and Data Analysis, 54, 1219-1233.

choose_k

# load the data
data(peaks)

# cluster and align the data as a function of the
# number of clusters (from 1 to 5), with and without alignment.
# The automatically generated plot can be used to detect the
# optimal number of clusters and the classification method
# to be used (with or without alignment)

clustered_peaks <- cluster_peak ( peaks.data.summit, parallel = FALSE ,
                                  n.clust = 1:5, shift.peak = NULL,
                                  weight = 1, alpha = 1, p = 2,
                                  plot.graph.k = TRUE, verbose = TRUE )
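The quantity shown by `plot.graph.k` (average distance from the cluster center as a function of k) can also be recomputed from the returned distance columns. The sketch below does so on small hypothetical stand-ins for `dist_shift` and `dist_NOshift` (3 peaks, `n.clust = 1:3`); the numbers and the helper `mean_by_k` are invented for illustration and are not output of the package.

```r
# Hypothetical stand-ins for the dist_shift / dist_NOshift columns:
# one vector per peak, position k = distance from the center with k clusters.
dist_shift   <- list(c(4.0, 2.0, 1.0), c(5.0, 2.5, 1.5), c(6.0, 3.0, 0.5))
dist_noshift <- list(c(6.0, 4.0, 3.0), c(7.0, 4.5, 2.5), c(8.0, 5.0, 3.5))

# Stack the per-peak vectors into a peaks-by-k matrix and average over peaks;
# NA entries (values of k not in n.clust) are ignored.
mean_by_k <- function(column) colMeans(do.call(rbind, column), na.rm = TRUE)

avg_shift   <- mean_by_k(dist_shift)
avg_noshift <- mean_by_k(dist_noshift)
```

Plotting `avg_shift` and `avg_noshift` against k gives two decreasing curves; the gap between them indicates how much the alignment reduces the global distance, and an elbow in either curve suggests a candidate number of clusters.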