```r
options(digits.secs = 3)
library(lazyIris)
knitr::opts_chunk$set(dpi = 100, fig.width = 10, fig.height = 10)
```
lazyIris is a small implementation of k-nearest neighbours applied to the famous iris dataset.
First, ensure that the devtools package is installed, then install lazyIris directly from the package's GitHub repository.
```r
# check for and install devtools.
# install.packages("devtools")

# install and load.
# devtools::install_github("phil8192/lazy-iris")
require(lazyIris)
```
The package has preprocessed iris data attached.
```r
attach(iris.data)
```
Example data may be loaded from the inst/extdata directory by using the loadData function. In addition, the checkData function will perform any necessary data sanity checks.
```r
iris.data <- checkData(loadData())
```
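As a rough illustration of what a sanity check of this kind might cover, the sketch below verifies the expected columns, the absence of missing values, and that the measurements are positive. This is an assumption for illustration only; the package's actual `checkData` may perform different checks.

```r
# Sketch of the kind of sanity checks a function like checkData() might run.
# NOTE: illustrative only -- not the package's actual implementation.
check_sketch <- function(d) {
  stopifnot(is.data.frame(d))
  # the preprocessed iris data is expected to have these columns.
  stopifnot(all(c("sepal.length", "sepal.width",
                  "petal.length", "petal.width", "species") %in% colnames(d)))
  stopifnot(!anyNA(d))          # no missing values anywhere.
  stopifnot(all(d[, 1:4] > 0))  # all measurements should be positive.
  d
}
```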
The dataset consists of 4 features and 3 possible classes. Some of the features are highly correlated:
```r
cor(iris.data[, 1:4])
```
The package provides a means to visualise the relationship between the 4 features and the corresponding class.
```r
# plot all of the data.
visualise(iris.data, class.name = "species", main = "iris data",
          plot.hist = TRUE, plot.cor = TRUE)
```
In the above visualisation, the colours correspond to the classification of the species of iris plant:
| colour | species          |
|-------:|:-----------------|
| red    | Iris setosa      |
| green  | Iris versicolour |
| blue   | Iris virginica   |
The lower-left panels show the correlation between the 4 iris features, the diagonal panels contain a histogram of each feature's distribution, and the upper-right panels contain scatter plots of each feature pair, colour-coded by species.
The knn function makes it possible to query the data for neighbouring instances given an arbitrary list of features.
The following example obtains the top 10 nearest neighbours to a query:
```r
# form the query.
# in this example, the feature values are the mean values in the dataset,
# so the results may be interpreted as the top 10 "most average" instances.
query <- list(sepal.length = 5.84,
              sepal.width  = 3.05,
              petal.length = 3.76,
              petal.width  = 1.20)

# obtain the nearest neighbours.
top.10 <- knn(query, iris.data, k = 10)
print(top.10, row.names = FALSE)
```
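Under the hood, a query of this kind can be sketched in a few lines of base R using Euclidean distance. This is a simplified reimplementation for illustration, not the package's actual `knn` code, which may scale features or break ties differently:

```r
# Sketch of a k-nearest-neighbour query (illustrative, not lazyIris's code).
knn_sketch <- function(query, data, k, feature.cols = 1:4) {
  # align the query features with the data's column order.
  q <- unlist(query)[colnames(data)[feature.cols]]
  # Euclidean distance from the query to every row.
  d <- sqrt(rowSums(sweep(data[, feature.cols], 2, q)^2))
  # keep the k closest rows, annotated with their distances.
  res <- data[order(d)[1:k], ]
  res$distance <- sort(d)[1:k]
  res
}
```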
In addition to the N nearest neighbours, the function also returns each neighbour's distance from the query point. These distances can be used to predict the most likely class of the query point with the classifier function.
```r
prediction <- classifier(top.10$species, top.10$distance)
print(paste("prediction =", prediction$pred, "confidence =", prediction$conf))
```
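One way such a classifier can use the distances is distance-weighted majority voting, sketched below. This is an assumption about the general technique; the package's actual `classifier` may weight or normalise differently:

```r
# Sketch of distance-weighted majority voting (illustrative only).
classify_sketch <- function(classes, distances, eps = 1e-9) {
  w <- 1 / (distances + eps)       # closer neighbours carry more weight.
  votes <- tapply(w, classes, sum) # total weight per class.
  list(pred = names(which.max(votes)),
       conf = max(votes) / sum(votes))
}
```

With equal distances this reduces to a plain majority vote; with unequal distances, the class whose neighbours sit closer to the query wins.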
Given a list of nearest neighbours returned from the knn function, it is possible to visualise the query point and its nearest neighbours over all dimensions of the feature space using the visualise function.
```r
# visualise the result.
visualise(iris.data, class.name = "species", query = query, neighbours = top.10,
          main = "iris versicolour classification",
          plot.hist = TRUE, plot.cor = FALSE)
```
In the above plot, the query point is shown as a black point, and the neighbours returned by the knn query are highlighted as opaque circles.
In addition, the query point's position within each feature's distribution is marked with a black dashed vertical line on the corresponding histograms. The query point clearly falls within the Iris versicolour cluster (green).
Closer to a decision boundary (which is non-linear for the iris data), the class to which the query point belongs is ambiguous:
```r
q <- list(sepal.length = 6, sepal.width = 3,
          petal.length = 4.75, petal.width = 1.75)
top.10 <- knn(q, iris.data, k = 10)
visualise(iris.data, class.name = "species", query = q, neighbours = top.10,
          main = "iris versicolour/virginica",
          plot.hist = TRUE, plot.cor = FALSE)
```
From the top 10 results:
```r
print(top.10, row.names = FALSE)
```
The 10 (unweighted) neighbours yield a 50/50 classification:
```r
prediction <- with(knn(q, iris.data, k = 10), classifier(species, distance))
with(prediction, paste0("prediction = ", pred, " confidence = ", conf * 100, "%"))
```
A rare iris was discovered. It had petals as large as an iris versicolor, and a stem the size of a setosa...
```r
# construct the query as the mean iris setosa type.
q <- as.list(colMeans(iris.data[iris.data$species == "Iris-setosa", 1:4]))
q$petal.width  <- 4 * q$petal.width
q$petal.length <- 1.75 * q$petal.length
print(unlist(q))
```
Given the 10 nearest known neighbours to the discovery:
```r
top.10 <- knn(q, iris.data, k = 10)
print(top.10, row.names = FALSE)
```
The discovery is most likely (90% by majority voting) a setosa...
```r
unlist(classifier(top.10$species, top.10$distance))
```
Which can be seen in the following visualisation...
```r
visualise(iris.data, class.name = "species", query = q, neighbours = top.10,
          main = "neighbours of the peculiar iris",
          plot.hist = TRUE, plot.cor = FALSE)
```
In the interest of sampling bias, the discovery was hastily discarded.