discretize.jointly: Discretize Multivariate Continuous Data by a...

View source: R/discretize_jointly.R

Description

Discretize multivariate continuous data using a grid that captures the joint distribution by preserving clusters in the original data.

Usage

discretize.jointly(data, k = c(2:10), cluster_label = NULL, min_level = 2)

Arguments

data

a matrix containing two or more continuous variables. Columns are variables, rows are observations.

k

either a single number of clusters or a range of cluster numbers to search for in data. The default is 2 to 10 clusters. If a range is specified, the k in that range that maximizes the average silhouette width is chosen. If cluster_label is specified, k is ignored.

cluster_label

a vector of user-specified cluster labels for each observation in data. The user is free to choose any clustering. If unspecified, k-means clustering is used by default.

min_level

the minimum number of levels along each dimension. The default is 2.

Details

The function implements algorithms described in Wang et al. (2020).
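
The selection of k from a range can be illustrated with a minimal sketch. It assumes the cluster package for silhouette() and uses made-up data; the names k_range, avg_sil and best_k are hypothetical, and this is not the GridOnClusters source code.

```r
# Sketch only: pick k from a range by maximizing the average silhouette width.
# Assumes the 'cluster' package; not the package's actual implementation.
library(cluster)

set.seed(1)
# Made-up data: three well-separated 2-D blobs, 50 points each
centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
data <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))

k_range <- 2:6
d <- dist(data)  # pairwise Euclidean distances, reused for every k

# Average silhouette width of a k-means clustering for each candidate k
avg_sil <- sapply(k_range, function(k) {
  cl <- kmeans(data, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

best_k <- k_range[which.max(avg_sil)]
```

Computing the distance matrix once and reusing it keeps the loop over k cheap; only the k-means step is repeated.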

Value

A list that contains four items:

D

a matrix that contains the discretized version of the original data. Discretized values are one-based, that is, levels start at 1.

grid

a list of vectors containing decision boundaries for each variable/dimension.

clabels

a vector containing cluster labels for each observation in data.

csimilarity

a similarity score between clusters from joint discretization D and cluster labels clabels. The score is the adjusted Rand index.
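
How these items relate can be sketched in base R. The data, boundaries and labels below are made up for illustration and are not output of discretize.jointly; the sketch also assumes that each distinct row of D is treated as one group when computing the adjusted Rand index, and that a point lying exactly on a boundary falls into the higher level. The helper adjusted_rand_index is hypothetical.

```r
# Sketch only: hypothetical data, boundaries and labels.
data <- cbind(x = c(-1.2, -0.5, 0.4, 0.6, 2.3, 2.8),
              y = c( 5.0,  4.1, 1.5, 0.9, 3.2, 3.9))
grid <- list(c(0, 1),  # two boundaries on x give three levels
             c(2))     # one boundary on y gives two levels

# Bin each column by its own boundaries; adding 1 makes the levels one-based
D <- sapply(seq_len(ncol(data)), function(j) {
  findInterval(data[, j], grid[[j]]) + 1
})

# Adjusted Rand index between two labelings, from their contingency table.
# Undefined (division by zero) when both labelings have a single group.
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(as.vector(tab), 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Assumption: each occupied grid cell (distinct row of D) is one group
cell_id <- apply(D, 1, paste, collapse = "-")
clabels <- c(1, 1, 2, 2, 3, 3)  # hypothetical cluster labels
ari <- adjusted_rand_index(cell_id, clabels)
```

In this toy case every cluster occupies its own grid cell, so the two partitions coincide; a grid that splits or merges clusters would score lower.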

Author(s)

Jiandong Wang, Sajal Kumar and Mingzhou Song

References

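Wang J, Kumar S, Song M (2020). Joint grid discretization for biological pattern discovery. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics.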

See Also

See Ckmeans.1d.dp for discretizing univariate continuous data.

Examples

library(GridOnClusters)

# using a specified k
x = rnorm(100)
y = sin(x)
z = cos(x)
data = cbind(x, y, z)
discretized_data = discretize.jointly(data, k=5)$D

# using a range of k
x = rnorm(1000)
y = log1p(abs(x))
z = tan(x)
data = cbind(x, y, z)
discretized_data = discretize.jointly(data, k=c(3:10))$D

# using a clustering method other than k-means
library(cluster)
x = rnorm(1000)
y = log1p(abs(x))
z = sin(x)
data = cbind(x, y, z)

# pre-cluster the data using partitioning around medoids (PAM)
cluster_label = pam(x=data, diss = FALSE, metric = "euclidean", k = 5)$clustering
discretized_data = discretize.jointly(data, cluster_label = cluster_label)$D

GridOnClusters documentation built on Sept. 16, 2020, 1:08 a.m.