PDQ: Probabilistic Distance Clustering Adjusted for Cluster Size

View source: R/PDQ.R

PDQ    R Documentation

Probabilistic Distance Clustering Adjusted for Cluster Size

Description

An implementation of probabilistic distance clustering adjusted for cluster size (PDQ), a probabilistic distance clustering algorithm that optimizes the PD-clustering criterion. The algorithm can be used on continuous, count, or mixed-type data by setting Euclidean, chi-square, or Gower as the dissimilarity measure.
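As a rough illustration of the principle behind PDQ in the Iyigun and Ben-Israel reference, the membership probabilities satisfy p_k(x) d_k(x) / q_k = constant across clusters, where d_k(x) is the distance from x to center k and q_k is the size of cluster k. The sketch below is base R only and is not the package's internal code; the helper name pdq_membership and the toy inputs are assumptions made for illustration, and strictly positive distances are assumed.

```r
# Hypothetical helper, not part of FPDclustering.
# d: n x k matrix of strictly positive distances to the k centers
# q: length-k vector of positive cluster sizes
pdq_membership <- function(d, q) {
  inv <- 1 / sweep(d, 2, q, "/")  # q_k / d_k(x) for every unit and cluster
  inv / rowSums(inv)              # normalize so each row sums to 1
}

# Toy input: 2 units, 2 clusters of equal size
d <- matrix(c(1, 3,
              2, 2), nrow = 2, byrow = TRUE)
q <- c(0.5, 0.5)

p <- pdq_membership(d, q)
p

# p * d / q is constant within each row, as the principle requires
sweep(p * d, 2, q, "/")
```

With equal cluster sizes the probabilities depend only on the distances; a larger q_k raises the probability of cluster k at a given distance.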

Usage

PDQ(x=NULL,k=2,ini='kmd',dist='euc',cent=NULL,ord=NULL,cat=NULL,bin=NULL,cont=NULL,w=NULL)

Arguments

x

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters.

ini

A parameter that selects the initial centers. Options available are random ("random"), k-medoids ("kmd", the default), center ("center", the user inputs the centers), and k-modes ("kmode", for categorical data sets).

dist

A parameter that selects the distance measure used. Options available are Euclidean ("euc"), Gower ("gower"), and chi-square ("chi").

cent

User-supplied centers, used when ini is "center".

ord

Column indices of the x matrix indicating which columns are ordinal variables.

cat

Column indices of the x matrix indicating which columns are categorical variables.

bin

Column indices of the x matrix indicating which columns are binary variables.

cont

Column indices of the x matrix indicating which columns are continuous variables.

w

A numerical vector with one entry per column of the data, containing the variable weights used with Gower distance. Equal weights by default.
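To make the role of cont, cat, and w concrete, here is a minimal base R sketch of the textbook Gower dissimilarity between two observations. It is restricted to continuous and categorical columns and is not the package's implementation; how PDQ treats ordinal and binary columns is not covered. The function name gower_pair, its ranges argument, and the toy values are assumptions made for illustration.

```r
# Hypothetical helper, not part of FPDclustering.
# a, b   : numeric vectors holding two observations
# cont   : indices of continuous columns
# cat    : indices of categorical columns (coded as numbers)
# ranges : range of each continuous column, in the order given by cont
# w      : one weight per column, equal weights by default
gower_pair <- function(a, b, cont, cat, ranges, w = rep(1, length(a))) {
  s <- numeric(length(a))
  s[cont] <- abs(a[cont] - b[cont]) / ranges  # range-scaled absolute difference
  s[cat]  <- as.numeric(a[cat] != b[cat])     # 0 if equal, 1 otherwise
  sum(w * s) / sum(w)                         # weighted average over columns
}

a <- c(1, 4, 2)
b <- c(3, 4, 0)

# Equal weights
d_equal <- gower_pair(a, b, cont = 1:2, cat = 3, ranges = c(4, 2))

# Doubling the weight of the categorical column
d_weighted <- gower_pair(a, b, cont = 1:2, cat = 3, ranges = c(4, 2),
                         w = c(1, 1, 2))

c(d_equal, d_weighted)
```

Raising the weight of a column on which the two observations disagree increases their dissimilarity, which is the effect w has when dist is "gower".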

Value

A class FPDclustering list with components

label

A vector of integers indicating the cluster membership for each unit

centers

A matrix of cluster centers

probability

A matrix of the probabilities of each point belonging to each cluster

JDF

The value of the joint distance function (JDF)

iter

The number of iterations

jdfvector

The JDF values computed at each iteration

data

The data set
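The components listed above can be inspected directly on the returned list. The sketch below assumes the FPDclustering and mvtnorm packages are installed; the seed and the simulated data are arbitrary choices for illustration, and the comments on the shape of each component are assumptions based on the descriptions above.

```r
library(FPDclustering)
library(mvtnorm)  # provides rmvnorm

set.seed(1)  # arbitrary seed, only for reproducibility
data <- rbind(rmvnorm(100, mean = c(1, 5, 10), sigma = diag(1, 3)),
              rmvnorm(100, mean = c(4, 8, 13), sigma = diag(1, 3)))

fit <- PDQ(x = data, k = 2, ini = "kmd", dist = "euc")

table(fit$label)       # cluster sizes from the hard assignment
fit$centers            # cluster centers (one row per cluster is assumed)
head(fit$probability)  # membership probabilities for the first units
fit$JDF                # final value of the joint distance function
fit$iter               # number of iterations used
fit$jdfvector          # JDF value recorded at each iteration
```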

Author(s)

Cristina Tortora and Noe Vidales

References

Iyigun, Cem, and Adi Ben-Israel. Probabilistic distance clustering adjusted for cluster size. Probability in the Engineering and Informational Sciences 22.4 (2008): 603-621. doi.org/10.1017/S0269964808000351.

Tortora and Palumbo. Clustering mixed-type data using a probabilistic distance algorithm. submitted.

See Also

PDC

Examples


#Mixed type data
library(FPDclustering)
library(mvtnorm) # provides rmvnorm

sig=matrix(0.7,4,4)
diag(sig)=1 # create a correlation matrix
x1=rmvnorm(200,c(0,0,3,3)) # cluster 1 (default identity covariance)
x2=rmvnorm(200,c(4,4,6,6),sigma=sig) # cluster 2
l=c(rep(1,200),rep(2,200)) # true labels
x1=cbind(x1,rbinom(200,4,0.2),rbinom(200,4,0.2)) # add categorical variables
x2=cbind(x2,rbinom(200,4,0.7),rbinom(200,4,0.7))
x=rbind(x1,x2) # data set with 2 clusters

#### Performing PDQ
pdq_class<-PDQ(x=x,k=2, ini="random", dist="gower", cont= 1:4, cat = 5:6)

###Output
table(l,pdq_class$label)
plot(pdq_class)
summary(pdq_class)



###Continuous data example
# Gaussian generated data, no overlap
x<-rmvnorm(100, mean=c(1,5,10), sigma=diag(1,3))
y<-rmvnorm(100, mean=c(4,8,13), sigma=diag(1,3))
data<-rbind(x,y)

#### Performing PDQ
pdq1=PDQ(data,2,ini="random",dist="euc")
table(rep(c(2,1),each=100),pdq1$label)
Silh(pdq1$probability)
plot(pdq1)
summary(pdq1)


# Gaussian generated data with overlap
x2<-rmvnorm(100, mean=c(1,5,10), sigma=diag(1,3))
y2<-rmvnorm(100, mean=c(2,6,11), sigma=diag(1,3))
data2<-rbind(x2,y2)

#### Performing PDQ
pdq2=PDQ(data2,2,ini="random",dist="euc")
table(rep(c(1,2),each=100),pdq2$label)
plot(pdq2)
summary(pdq2)

FPDclustering documentation built on Aug. 31, 2022, 5:09 p.m.