Use kmeans() with default values to find the k=2 solution for the 2-dimensional data in cluster.csv.
setwd("./data")
data<-read.csv("cluster.csv",header=F)
setwd("../")
Plot initial data.
plot(data)
Cluster: kmeans() produces centers, cluster assignments, etc.
fit<-kmeans(data,2)
fit
fit$centers
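As a minimal sketch of what the fitted object holds, the following inspects a few of its components. It uses synthetic two-blob data rather than cluster.csv (the names demo and demo_fit are made up for illustration) and fixes the seed, because kmeans() picks its starting centers at random.

```r
## Synthetic stand-in for cluster.csv (assumed data: two well-separated blobs).
set.seed(1)
demo <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
              matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))
demo_fit <- kmeans(demo, centers = 2)

demo_fit$centers       # one row per cluster center
demo_fit$cluster       # cluster index (1 or 2) for every row of demo
demo_fit$size          # number of points in each cluster
demo_fit$tot.withinss  # total within-cluster sum of squares
```

Which blob ends up as cluster 1 and which as cluster 2 can change from run to run, so downstream code should not rely on the numbering.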
Plot: Key here is to use kmeans() output rather than moving data around.
plot(data$V1,data$V2,col=fit$cluster)
points(fit$centers,col=c("black","red"),pch=19)
Use kmeans() with default values to find the k=2 solution for the sonar data, restricted to its first two attributes.
setwd("./data")
train<-read.csv("sonar_train.csv",header=F)
test<-read.csv("sonar_test.csv",header=F)
setwd("../")
Plot just the first two columns of the sonar data.
plot(train[,1:2])
Cluster: kmeans() can use as many attributes as you want, but let's look at the clusters created by the first two.
fit<-kmeans(train[,1:2],2)
fit
Plot: Key here is to use kmeans() output rather than moving data around.
plot(train[,1:2],col=fit$cluster)
points(fit$centers,col="blue",pch=19)
plot(train[,1:2],pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
## get your y labels
y<-train[,61]
## re-plot points with color based on class labels.
points(train[,1:2],col=2+2*y,pch=19)
What if we used kmeans() to classify? What would our misclassification error be?
## transform cluster labels (1's and 2's) to -1s and 1s
## fraction of points whose transformed cluster label matches y
sum(fit$cluster*2-3==y)/length(y)
fit<-kmeans(train[,1:60],2)
## fraction matching y
sum(fit$cluster*2-3==y)/length(y)
## fraction not matching y (the misclassification error)
sum(fit$cluster*2-3!=y)/length(y)
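One caveat with the comparison above is that kmeans() numbers its clusters arbitrarily, so the transformed labels may come out flipped relative to y. A hedged sketch of one way to handle this is to score both mappings and keep the better one. The data here are synthetic stand-ins for the sonar files, and the names x_demo, y_demo and fit_demo are made up for illustration.

```r
## Synthetic stand-in for the sonar data (assumed): two blobs labelled -1 and 1.
set.seed(2)
x_demo <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                matrix(rnorm(60, mean = 4), ncol = 2))
y_demo <- rep(c(-1, 1), each = 30)
fit_demo <- kmeans(x_demo, 2)

pred <- fit_demo$cluster * 2 - 3      # clusters 1, 2 -> -1, 1
acc_as_is   <- mean(pred == y_demo)   # agreement with the labels as numbered
acc_flipped <- mean(-pred == y_demo)  # agreement with the numbering swapped

## Cluster numbering is arbitrary, so score the better of the two mappings.
accuracy <- max(acc_as_is, acc_flipped)
misclass <- 1 - accuracy
```

The two agreement rates always sum to 1, so an apparent error above 50% only means the cluster numbers were swapped.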
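Try with more centroids. Disaster: with 10 clusters, fit$cluster*2-3 takes the values -1, 1, 3, ..., 17, so only clusters 1 and 2 can ever match a label in y.
fit<-kmeans(train[,1:60],10)
sum(fit$cluster*2-3==y)/length(y)
sum(fit$cluster*2-3!=y)/length(y)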
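With more than two clusters, the 2*cluster-3 trick no longer produces class labels. One common alternative, sketched here on synthetic data and not part of the original script, is to assign each cluster the majority class of its members and score that. The names x_demo, y_demo, fit_demo, counts and majority are made up for illustration, and iter.max is raised from its default only to avoid convergence warnings.

```r
## Synthetic stand-in for the sonar data (assumed): two overlapping blobs.
set.seed(3)
x_demo <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
                matrix(rnorm(200, mean = 3), ncol = 2))
y_demo <- rep(c(-1, 1), each = 100)
fit_demo <- kmeans(x_demo, 10, iter.max = 50)

## Rows: clusters 1..10; columns: classes -1 and 1.
counts <- table(fit_demo$cluster, y_demo)

## Majority class per cluster; names(which.max()) returns the column label.
majority <- apply(counts, 1, function(row) as.numeric(names(which.max(row))))

## Look up each point's predicted class through its cluster number.
pred <- majority[as.character(fit_demo$cluster)]
misclass <- mean(pred != y_demo)
misclass
```

Because the class labels are used to name the clusters, this error is measured on the same data that defined the mapping and is optimistic as an estimate of predictive error.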
Gist: kmeans() is a good clustering tool. Not a good prediction tool.
First code it manually.
x<-c(1,2,3,5,6,7,8)
center1<-1
center2<-2
## k starts at 2 so that k-1 indexes the previous centers.
## center1 and center2 keep the full history of updates.
for (k in 2:10) {
  ## Put in cluster1 all x's where distance to c1 <= distance to c2.
  cluster1<-x[abs(x-center1[k-1])<=abs(x-center2[k-1])]
  ## Put in cluster2 all x's where distance to c1 > distance to c2.
  cluster2<-x[abs(x-center1[k-1])>abs(x-center2[k-1])]
  ## mean() takes the mean of all values in a cluster.
  center1[k]<-mean(cluster1)
  center2[k]<-mean(cluster2)
}
center1
center2
cluster1
cluster2
Compare to kmeans()
x<-c(1,2,3,5,6,7,8)
fit<-kmeans(x,2)
plot(x,col=fit$cluster)
## Euclidean distance between two points with dist()
x1<-c(2,2)
x2<-c(5,7)
data<-matrix(c(x1,x2),nrow=2,byrow=T)
data
dist(data)
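To make the comparison concrete, the sketch below puts the final centers from a manual loop next to the centers reported by kmeans(). The manual part is a compact restatement of the hand-coded updates (it keeps only the latest centers and not their history), and the names c1, c2, near1, manual and builtin are made up for illustration. Starting from centers 1 and 2, the manual updates settle on 2 and 6.5.

```r
x <- c(1, 2, 3, 5, 6, 7, 8)

## Manual alternating updates, starting from centers 1 and 2.
c1 <- 1
c2 <- 2
for (i in 1:10) {
  near1 <- abs(x - c1) <= abs(x - c2)  # TRUE where x is at least as close to c1
  c1 <- mean(x[near1])
  c2 <- mean(x[!near1])
}
manual <- sort(c(c1, c2))

## Built-in kmeans(); cluster order is arbitrary, so sort before comparing.
builtin <- sort(as.numeric(kmeans(x, 2)$centers))

manual
builtin
all.equal(manual, builtin)
```

Sorting both sets of centers removes the dependence on which cluster kmeans() happens to call 1 and which it calls 2.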