DR: Downhill Riding (DR) Procedure
In vlyubchich/funtimes: Functions for Time Series Analysis

DR	R Documentation

Description

Downhill riding procedure for selecting optimal tuning parameters in clustering algorithms, using an (in)stability probe.

Usage

DR(X, method, minPts = 3, theta = 0.9, B = 500, lb = -30, ub = 10)

Arguments

`X`	an `n \times k` matrix whose columns are the `k` objects to be clustered; each object contains `n` observations (the objects could be a set of time series).
`method`	the clustering method to be used, currently either “TRUST” (Ciampi et al. 2010) or “DBSCAN” (Ester et al. 1996). If the method is `DBSCAN`, then set `minPts`, and the optimal `\epsilon` is selected using DR. If the method is `TRUST`, then set `theta`, and the optimal `\delta` is selected using DR.
`minPts`	the minimum number of samples in an `\epsilon`-neighborhood of a point to be considered as a core point. The `minPts` is to be used only with the `DBSCAN` method. The default value is 3.
`theta`	connectivity parameter `\theta \in (0,1)`, which is to be used only with the `TRUST` method. The default value is 0.9.
`B`	number of random splits in calculating the Average Cluster Deviation (ACD). The default value is 500.
`lb, ub`	endpoints of the search range for the optimal parameter, given on the exponent scale described in Details. The default values are -30 and 10.
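
As a hedged illustration of the orientation expected for `X` (objects in columns, observations in rows), the sketch below prepares the built-in `iris` measurements the same way Example 1 does before calling `DR`. The variable names `features` and `X_iris` are illustrative only and are not part of the package.

```r
data(iris)
features <- scale(iris[, -5])   # 150 rows (flowers) by 4 standardized measurements
X_iris <- t(features)           # transpose so that each column is one object to cluster
dim(X_iris)                     # rows = observations per object, columns = objects
```

Here each flower is an object to be clustered, so the transposed matrix has 4 rows and 150 columns.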

Details

The parameters `lb` and `ub` are the endpoints of the search for the optimal parameter. The candidate values are calculated as `P := 1.1^x`, `x \in \{lb, lb + 0.5, lb + 1.0, \ldots, ub\}`. Although the default search range is sufficiently wide, in some cases `lb` and `ub` can be extended further if a warning message is given.
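
The candidate grid can be reproduced with a few lines of base R. This is a minimal sketch of the formula stated above, not the package's internal code; the names `x` and `P` are chosen here for illustration.

```r
lb <- -30                    # default lower endpoint from the Usage section
ub <- 10                     # default upper endpoint from the Usage section
x <- seq(lb, ub, by = 0.5)   # exponents lb, lb + 0.5, ..., ub
P <- 1.1^x                   # candidate values of the tuning parameter
length(P)                    # number of candidates in the grid
range(P)                     # smallest and largest candidate
```

With the defaults, the grid contains 81 candidates ranging from roughly 0.057 to roughly 2.59, so widening `lb` or `ub` extends the search geometrically rather than linearly.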

For more discussion of the properties of the considered clustering algorithms and the DR procedure, see Huang et al. (2016) and Huang et al. (2018).

Value

A list containing the following components:

`P_opt`	the value of the optimal parameter. If the method is `DBSCAN`, then `P_opt` is optimal `\epsilon`. If the method is `TRUST`, then `P_opt` is optimal `\delta`.
`ACD_matrix`	a matrix containing the `ACD` values for the different values of the tuning parameter. If the method is `DBSCAN`, then the tuning parameter is `\epsilon`. If the method is `TRUST`, then the tuning parameter is `\delta`.

Author(s)

Xin Huang, Yulia R. Gel

References

\insertAllCited

Examples

## Not run: 
## example 1
## use iris data to test DR procedure

data(iris)  
require(funtimes)  # provides DR()
require(dbscan)  # provides dbscan(), which is called below
require(clue)  # calculate NMI to compare the clustering result with the ground truth
require(scatterplot3d)

Data <- scale(iris[,-5])
ground_truth_label <- iris[,5]

# perform DR procedure to select optimal eps for DBSCAN 
# and save it in variable eps_opt
eps_opt <- DR(t(Data), method="DBSCAN", minPts = 5)$P_opt   

# apply DBSCAN with the optimal eps on iris data 
# and save the clustering result in variable res
res <- dbscan(Data, eps = eps_opt, minPts = 5)$cluster

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(ground_truth_label),
                   as.cl_partition(as.numeric(res)), method = "NMI") 
# visualize the clustering result and compare it with the ground truth result
# 3D visualization of clustering result using variables Sepal.Width, Sepal.Length, 
# and Petal.Length
# (DBSCAN labels noise points as cluster 0; shift labels by 1 so noise gets a visible color)
scatterplot3d(Data[,-4], color = res + 1)
# 3D visualization of ground truth result using variables Sepal.Width, Sepal.Length,
# and Petal.Length
scatterplot3d(Data[,-4],color = as.numeric(ground_truth_label))


## example 2
## use synthetic time series data to test DR procedure

require(funtimes)
require(clue) 
require(zoo)

# simulate 12 time series for 3 clusters, each cluster contains 4 time series
set.seed(114) 
samp_Ind <- sample(12, replace = FALSE)
time_points <- 30
X <- matrix(0,nrow=time_points,ncol = 12)
cluster1 <- sapply(1:4,function(x) arima.sim(list(order = c(1, 0, 0), ar = c(0.2)),
                                             n = time_points, mean = 0, sd = 1))
cluster2 <- sapply(1:4,function(x) arima.sim(list(order = c(2 ,0, 0), ar = c(0.1, -0.2)),
                                             n = time_points, mean = 2, sd = 1))
cluster3 <- sapply(1:4,function(x) arima.sim(list(order = c(1, 0, 1), ar = c(0.3), ma = c(0.1)),
                                             n = time_points, mean = 6, sd = 1))

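# sapply() already returns a time_points x 4 matrix (one series per column),
# so assign it without transposing to keep each series intact
X[,samp_Ind[1:4]] <- round(cluster1, 4)
X[,samp_Ind[5:8]] <- round(cluster2, 4)
X[,samp_Ind[9:12]] <- round(cluster3, 4)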


# create ground truth label of the synthetic data
ground_truth_label <- matrix(1, nrow = 12, ncol = 1)
for(k in 1:3){
    ground_truth_label[samp_Ind[(4*k - 4 + 1):(4*k)]] <- k
}

# perform DR procedure to select optimal delta for TRUST
# and save it in variable delta_opt
delta_opt <- DR(X, method = "TRUST")$P_opt 

# apply TRUST with the optimal delta on the synthetic data 
# and save the clustering result in variable res
res <- CSlideCluster(X, Delta = delta_opt, Theta = 0.9)  

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(as.numeric(ground_truth_label)),
                   as.cl_partition(as.numeric(res)), method = "NMI")

# visualize the clustering result and compare it with the ground truth result
# visualization of the clustering result obtained by TRUST
plot.zoo(X, type = "l", plot.type = "single", col = res, xlab = "Time index", ylab = "")
# visualization of the ground truth result 
plot.zoo(X, type = "l", plot.type = "single", col = ground_truth_label,
         xlab = "Time index", ylab = "")

## End(Not run)

vlyubchich/funtimes documentation built on May 6, 2023, 3:21 a.m.