GCCclus | R Documentation |
Clustering of time series using the Generalized Cross Correlation (GCC) measure of linear dependency proposed in Alonso and Peña (2019).
GCCclus(x, lag, rs, thres, plot, printSummary = TRUE, lag.set, silh = 1)
x |
T by k data matrix: T data points in rows with each row being data at a given time point, and k time series in columns. |
lag |
Selected lag for computing the GCC between the pairs of series. Default value is computed inside the program. |
rs |
Relative size of the minimum group considered. Default value is 0.05. |
thres |
Percentile of the distribution of distances that defines which observations are not considered outliers. Default value is 0.9. |
plot |
If the value is TRUE, a clustermatrix plot of distances and a dendrogram are presented. Default is FALSE. |
printSummary |
If the value is TRUE, the function prints a summary table of the clustering. Default is TRUE. |
lag.set |
If lag is not specified and the user wants a non-consecutive set of lags instead of the consecutive lags from 1 to 'lag', the set can be defined as, for example, lag.set = c(1, 4, 7). |
silh |
If silh = 1, the standard silhouette statistic is used; if silh = 2, the modified procedure is used. Default value is 1. |
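As an illustration of how the arguments above combine, here is a minimal sketch. It assumes that GCCclus and the TaiwanAirBox032017 data are loaded from the SLBDD package, and the particular values of lag, rs and thres are arbitrary choices for demonstration, not recommendations.

```r
# Sketch only: assumes the SLBDD package is installed and provides
# GCCclus() and the TaiwanAirBox032017 data set.
if (requireNamespace("SLBDD", quietly = TRUE)) {
  library(SLBDD)
  data(TaiwanAirBox032017)
  x <- TaiwanAirBox032017[1:50, 1:8]

  # Modified silhouette procedure (silh = 2) with explicit, illustrative
  # values for the lag, the minimum relative group size and the threshold.
  out_mod <- GCCclus(x, lag = 2, rs = 0.1, thres = 0.9,
                     printSummary = FALSE, silh = 2)

  # Non-consecutive lags: leave 'lag' unspecified and pass lag.set instead.
  out_set <- GCCclus(x, printSummary = FALSE, lag.set = c(1, 4, 7))
}
```

The calls are wrapped in requireNamespace() so that the snippet does nothing, rather than failing, when the package is not available.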
First, the matrix of Generalized Cross Correlation (GCC) is built using the subroutine GCCmatrix. Then a hierarchical grouping is constructed, and the number of clusters is selected by either the silhouette statistic or a modified silhouette statistic. The modified silhouette procedure is as follows:
(1) Series that join the groups at a distance larger than a given threshold of the distribution of the distances are disregarded.
(2) A minimum size for the groups is defined by rs (relative size); groups smaller than rs are disregarded.
(3) The final groups are obtained in two steps:
First, the silhouette statistic is applied to the set of time series that satisfy conditions (1) and (2).
Second, the series disregarded in steps (1) and (2) are candidates to be assigned to their closest group. The median and the MAD of the group are used to check whether the series is an outlier with respect to that group. If it is an outlier, it is included in group 0, the group of outlier series. The distance between a series and a group is usually the distance to the closest series in the group (single linkage), but it could also be the distance to the mean of the group.
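The procedure above can be sketched in base R on a toy distance matrix. This is not the code used by GCCclus: the toy data, the use of the merge heights as the "distribution of the distances", and the outlier rule (distance greater than median + 3 * MAD of the within-group distances) are all assumptions made for illustration, and the minimum-size filter of step (2) is left out for brevity.

```r
library(cluster)  # silhouette(); 'cluster' is a recommended package shipped with R

# Toy stand-in for the GCC distance matrix: two groups of six evenly
# spaced points and one isolated point (hypothetical data).
pos <- c(seq(0, 0.5, by = 0.1), seq(3, 3.5, by = 0.1), 10)
DM <- as.matrix(dist(pos))
k <- nrow(DM)
thres <- 0.9

# Step (1): find the height at which each series first joins the tree and
# set aside those joining above the 'thres' quantile of the merge heights.
hc <- hclust(as.dist(DM), method = "single")
join_height <- numeric(k)
for (s in seq_len(nrow(hc$merge))) {
  singles <- -hc$merge[s, hc$merge[s, ] < 0]
  join_height[singles] <- hc$height[s]
}
core <- which(join_height <= quantile(hc$height, thres))

# Step (3), first part: choose the number of groups among the retained
# series by maximizing the average silhouette width.
core_d <- as.dist(DM[core, core])
hc_core <- hclust(core_d, method = "single")
ks <- 2:5
avg_sil <- sapply(ks, function(g) {
  mean(silhouette(cutree(hc_core, g), core_d)[, "sil_width"])
})
labels_core <- cutree(hc_core, ks[which.max(avg_sil)])

# Step (3), second part: try to assign each set-aside series to its closest
# group (single linkage); keep it in group 0 if it is too far away.
labels <- integer(k)  # 0 marks the outlier group
labels[core] <- labels_core
for (i in setdiff(seq_len(k), core)) {
  d_to_group <- tapply(DM[i, core], labels_core, min)
  g <- as.integer(names(d_to_group)[which.min(d_to_group)])
  members <- core[labels_core == g]
  within <- as.vector(as.dist(DM[members, members, drop = FALSE]))
  cutoff <- median(within) + 3 * mad(within)  # assumed outlier rule
  if (min(d_to_group) <= cutoff) labels[i] <- g
}
table(labels)
```

Working on a plain distance matrix keeps the sketch independent of how the GCC distances themselves are computed.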
A list containing:
- Table of the number of clusters found and the number of observations in each cluster. Group 0 indicates the outlier group, if it exists.
- sal: A list with four objects:
labels: assignments of time series to each of the groups.
groups: a list of matrices. Each matrix corresponds to the set of time series that make up each group. For example, $groups[[i]] contains the set of time series that belong to the ith group.
matrix: GCC distance matrix.
gmatrix: GCC distance matrices in each group.
Two plots are included: (1) a clustermatrix plot with the distances inside each group in the diagonal boxes and the distances between series in two groups in the off-diagonal boxes, and (2) the dendrogram.
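A quick way to see how the returned components are nested is to inspect the result with str(). The sketch below assumes that GCCclus and the TaiwanAirBox032017 data come from the SLBDD package; it deliberately avoids hard-coding component paths, since the exact nesting should be read from the printed structure.

```r
# Sketch only: assumes the SLBDD package is installed.
if (requireNamespace("SLBDD", quietly = TRUE)) {
  library(SLBDD)
  data(TaiwanAirBox032017)
  out <- GCCclus(TaiwanAirBox032017[1:50, 1:8], printSummary = FALSE)

  # Show the top two levels of the returned list (labels, groups,
  # matrix and gmatrix should appear among the components).
  str(out, max.level = 2)
}
```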
Alonso, A. M. and Peña, D. (2019). Clustering time series by linear dependency. Statistics and Computing, 29(4):655–676.
data(TaiwanAirBox032017)
output <- GCCclus(TaiwanAirBox032017[1:50, 1:8])