cluster_peak-method: Clustering the peaks with the k-mean alignment algorithm


Description

Classifies and aligns the peaks stored in the GRanges object. The method applies the k-mean alignment algorithm, with shifts of the peaks and a distance based on the convex combination of the L^p distances between the spline-smoothed peaks and between their derivatives. The order p can be 1, 2 or ∞.

Usage

## S4 method for signature 'GRanges'
cluster_peak(object, parallel = FALSE, num.cores = NULL,
    n.clust = NULL, seeds = NULL, shift.peak = NULL, weight = NULL,
    subsample.weight = 100, alpha = 1, p = 1, t.max = 0.5,
    plot.graph.k = TRUE, verbose = TRUE, rescale = FALSE)

Arguments

object

GRanges object of length N. It must contain the metadata columns spline, spline_der, width_spline, computed by smooth_peak.

parallel

logical. If TRUE, the clusterings for the different values of k in n.clust are run in parallel. Default is FALSE.

num.cores

integer. If parallel is TRUE, it indicates the number of cores used in the parallelization. If NULL (default), the number of cores is automatically identified.

n.clust

integer vector (or scalar). Number of clusters into which the data set is divided (a single value, if n.clust is a scalar). For each value of the vector, the C++ function kmean_function is called.

seeds

vector. Indices of the initial cluster centers, needed to initialize the k-mean procedure. Like all k-mean-like algorithms, k-mean alignment depends on the choice of the initial centers, and different seeds can produce slightly different results. The values must lie in 1, …, N. The length of the vector should equal the maximum number of clusters analyzed (max(n.clust)): a longer vector is truncated to this length, and a shorter one is completed with randomly generated values. If NULL (default), the seeds are chosen as the most central peaks (i.e. the peaks with minimum distance from the others). If seeds = 'random', the centers are randomly generated.
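The default seed rule described above (peaks with minimum distance from the others) can be illustrated with a minimal base R sketch. It assumes that "most central" means the smallest summed distance to all other peaks and uses Euclidean distances on toy sampled curves; the helper central_seeds is hypothetical and is not part of FunChIP, whose actual criterion may differ.

```r
# Toy data: 6 "peaks" sampled on a common grid (rows = peaks).
set.seed(1)
curves <- matrix(rnorm(6 * 20), nrow = 6)

# Pairwise Euclidean distances between the sampled curves (illustrative choice).
D <- as.matrix(dist(curves))

# Hypothetical helper: indices of the k peaks with the smallest total
# distance from all the others.
central_seeds <- function(D, k) {
  order(rowSums(D))[seq_len(k)]
}

seeds <- central_seeds(D, k = 3)
seeds
```

A vector built this way has length k and values in 1, …, N, which is the shape the seeds argument expects.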

shift.peak

logical. It indicates whether the alignment via a translation of the abscissae is performed (shift.peak = TRUE) or not (shift.peak = FALSE). If no value is provided (shift.peak = NULL, default), both analyses are performed.

weight

real. Weight w of the distance function (see Details for the definition of the distance function), needed to make the distance between splines and the distance between derivatives comparable. If no value is provided (default is NULL), it is computed as the median of the ratio between the pairwise distances of the data, d_0(i,j), and of the derivatives, d_1(i,j):

w = median_{i,j} ( d_0(i,j) / d_1(i,j) ),    i, j = 1, …, N.
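As a rough illustration of the median-ratio rule above, the following base R sketch computes w on toy Gaussian-shaped curves and their analytic derivatives. Everything here is an assumption made for the example: the curves, the grid, and the use of Euclidean distances between sampled values as a stand-in for d_0 and d_1. It is not the package's implementation.

```r
# Toy curves and their derivatives sampled on a common grid (rows = peaks).
x <- seq(0, 1, length.out = 50)
centers <- c(0.3, 0.4, 0.5, 0.6)
curves <- t(sapply(centers, function(m) exp(-(x - m)^2 / 0.01)))
derivs <- t(sapply(centers, function(m) {
  -2 * (x - m) / 0.01 * exp(-(x - m)^2 / 0.01)
}))

# Pairwise distances between curves (d0) and between derivatives (d1).
d0 <- as.matrix(dist(curves))
d1 <- as.matrix(dist(derivs))

# Median of the ratio d0 / d1 over the distinct pairs i < j.
pairs <- upper.tri(d0)
w <- median(d0[pairs] / d1[pairs])
w
```

Multiplying the derivative distances by w brings them to the same order of magnitude as the data distances.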

subsample.weight

integer value. Number of data points used to compute the weight, when weight is not assigned. Using all the peaks to compute the weight can be computationally expensive, so subsampling is suggested. If subsample.weight = NULL, all the data are used. Default is 100, which is a reasonable trade-off between running time and reliability of the estimate.

alpha

real value between 0 and 1. Convex weight α of the distance, balancing the distance between the data and the distance between the derivatives. See Details for the definition. Default is 1.
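To make the role of α concrete, here is a minimal sketch of a convex combination of a data distance d0 and a derivative distance d1. The exact form (1 - alpha) * d0 + alpha * w * d1 is an assumption for illustration, since this page does not spell the formula out, and combined_distance is a hypothetical helper, not a FunChIP function.

```r
# Hypothetical helper: convex combination of a data distance d0 and a
# derivative distance d1, with w rescaling d1 to the scale of d0.
# Assumed form: (1 - alpha) * d0 + alpha * w * d1.
combined_distance <- function(d0, d1, alpha = 1, w = 1) {
  stopifnot(alpha >= 0, alpha <= 1)
  (1 - alpha) * d0 + alpha * w * d1
}

combined_distance(d0 = 2, d1 = 0.5, alpha = 0,   w = 2)  # data only: 2
combined_distance(d0 = 2, d1 = 0.5, alpha = 1,   w = 2)  # derivatives only: 1
combined_distance(d0 = 2, d1 = 0.5, alpha = 0.5, w = 2)  # equal mix: 1.5
```

Under this assumed form, the default alpha = 1 would base the clustering on the derivatives alone.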

p

integer value in {0, 1, 2}. Order of the L^p distance used. In particular, p = 0 stands for the L^{∞} distance, p = 1 for L^1 and p = 2 for L^2. Default is 1.
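The three choices of p can be illustrated with a small base R sketch that approximates the L^p distance between two curves sampled on a common, equally spaced grid. The helper lp_distance and its grid-step argument h are hypothetical and serve only to show the p = 0 convention for the L^{∞} distance; the package computes its distances internally.

```r
# Hypothetical helper: discrete approximation of the L^p distance between
# two curves f and g sampled on a common grid with step h.
# Following the convention above, p = 0 selects the L^infinity (sup) distance.
lp_distance <- function(f, g, p = 1, h = 1) {
  stopifnot(p %in% c(0, 1, 2), length(f) == length(g))
  delta <- abs(f - g)
  switch(as.character(p),
         "0" = max(delta),
         "1" = sum(delta) * h,
         "2" = sqrt(sum(delta^2) * h))
}

f <- c(0, 1, 3, 1, 0)
g <- c(0, 2, 1, 1, 0)
lp_distance(f, g, p = 0)  # 2
lp_distance(f, g, p = 1)  # 3
lp_distance(f, g, p = 2)  # sqrt(5)
```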

t.max

real value. It tunes the maximum shift allowed. In particular the maximum shift at each iteration is computed as

max_shift = t.max * range(object)

and the optimal registration coefficient is searched for between -max_shift and +max_shift. Here range(object) is the maximum amplitude of the peaks. Default is 0.5.
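The effect of t.max can be sketched as follows. The example reads "maximum amplitude" as the widest abscissa extent among the peaks and searches over integer shifts; both are assumptions made for illustration, and the widths are invented values rather than output of the package.

```r
# Hypothetical widths (in base pairs) of the peaks' abscissa supports.
widths <- c(120, 200, 150)
t.max <- 0.5

# Maximum shift allowed at each iteration.
max_shift <- t.max * max(widths)

# Candidate integer shifts: the optimal one would be searched in this interval.
candidate_shifts <- seq(-max_shift, max_shift, by = 1)
range(candidate_shifts)
```

A smaller t.max narrows the interval and so restricts how far a peak can be translated during alignment.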

plot.graph.k

logical. If TRUE, the average distance between the data and the corresponding cluster centers is plotted as a function of the number of clusters. If shift.peak = NULL, the analysis is performed both with and without alignment, and two lines are drawn to show the decrease in the global distance introduced by the alignment procedure. Default is TRUE.

verbose

logical. If TRUE, some parameters of the algorithm and the progress of the iterations are shown; if FALSE, no information is printed. Default is TRUE, but consider setting it to FALSE for parallel runs, to avoid overlapping outputs.

rescale

logical. If TRUE, clustering is performed on scaled peaks. For the definition of scaled peaks see smooth_peak. Default is FALSE.

Details

See [Sangalli et al., 2010] and the package vignette for the complete description of the algorithm. The algorithm is completely defined once we fix the family of the warping function for the alignment and the distance function. In this function we focus only on the specific case of

Value

the GRanges object with new metadata columns:

Author(s)

Alice Parodi, Marco J. Morelli, Laura M. Sangalli, Piercesare Secchi, Simone Vantini

References

Sangalli, L. M., Secchi, P., Vantini, S. and Vitelli, V. (2010). K-mean alignment for curve clustering. Computational Statistics and Data Analysis, 54, 1219-1233.

See Also

choose_k

Examples

# load the data
data(peaks)

# cluster and align the data for each number of clusters
# from 1 to 5, both with and without alignment
# (shift.peak = NULL).
# The automatically generated plot can be used to choose
# the optimal number of clusters and the classification
# method (with or without alignment).

clustered_peaks <- cluster_peak(peaks.data.summit, parallel = FALSE,
                                n.clust = 1:5, shift.peak = NULL,
                                weight = 1, alpha = 1, p = 2,
                                plot.graph.k = TRUE, verbose = TRUE)

FunChIP documentation built on May 2, 2018, 3:14 a.m.