options(digits.secs=3)
library(lazyIris)
knitr::opts_chunk$set(dpi=100, fig.width=10, fig.height=10)

Overview

lazyIris is a small implementation of k-nearest neighbours applied to the famous iris dataset.

Installing

First, ensure that the devtools package is installed and then install directly from the package github repository.

## check for and install devtools.
# install.packages("devtools")

# install and load.
# devtools::install_github("phil8192/lazy-iris")
require(lazyIris)

Loading data

The package has preprocessed iris data attached.

attach(iris.data)

Example data may be loaded from the inst/extdata directory by using the loadData function. In addition, the checkData function will perform any necessary data sanity checks.

iris.data <- checkData(loadData())

A quick look at the data

The dataset consists of 4 features and 3 possible classes. Some of the features are highly correlated:

cor(iris.data[, 1:4])
knitr::kable(cor(iris.data[, 1:4]))

Iris data visualisation

The package provides a means to visualise the relationship between the 4 features and the corresponding class.

# plot all the data.
visualise(iris.data, class.name="species", main="iris data", plot.hist=TRUE,
    plot.cor=TRUE)

In the above visualisation, the colours correspond to the classification of the species of iris plant:

| colour | species | |-------:|:-----------------| | red | Iris setosa | | green | Iris versicolour | | blue | Iris virginica |

The lower left panels show the correlation between the 4 iris features, the diagonal panels contain a histogram of the distribution of each feature, and finally, the upper right panels contain scatter plots of each possible feature combination colour coded by species.

Querying the data

The knn function makes it possible to query the data for neighbouring instances given an arbitrary list of features.

10-Nearest neighbours

The following example obtains the top 10 nearest neighbours to query:

# form the query.
# in this example, the feature values are actually the mean values in the
# dataset, thus the results may be interpreted as the top 10 "most average"
# instances.

query <- list(
    sepal.length=5.84,
    sepal.width=3.05,
    petal.length=3.76,
    petal.width=1.20)

# obtain the nearest-neighbours.
top.10 <- knn(query, iris.data, 10)
print(top.10, row.names=FALSE)
query <- list(sepal.length=5.84, sepal.width=3.05,
              petal.length=3.76, petal.width=1.20)
top.10 <- knn(query, iris.data, k=10)
knitr::kable(top.10, row.names=F)

Classification

In addition to the N-nearest neighours, the function also returns the distance from the query point. This distance can be used to predict the most likely class of the query point using the classifier function.

prediction <- classifier(top.10$species, top.10$distance)
print(paste("prediction =", prediction$pred,
            "confidence =", prediction$conf))

Visualising a query result for Versicolour (green)

Given a list of nearest neighbours returned from the knn function, it is possible to visualise the query point and it's nearest neighbours over all dimensions in the feature space by using the visualise function.

# visualise the result.
visualise(iris.data, class.name="species", query=query, neighbours=top.10,
    main="iris versicolour classification", plot.hist=TRUE, plot.cor=FALSE)

In the above plot, the query point is shown as a black point. The resulting neighbours from the knn query are highlighted (opaque) circles.

In addition, the query point with respect to the distribution of features has been highlighted with a black dashed vertical line over the corresponding feature histograms. Clearly the query point is within the Iris versicolour cluster (green).

Versicolour or Virginica (green/blue)

Closer to a decision boundary (which is non-linear for iris data), the class to which the query point belongs to is ambiguous:

q <- list(sepal.length=6, sepal.width=3, petal.length=4.75, petal.width=1.75)
top.10 <- knn(q, iris.data, k=10)
visualise(iris.data, class.name="species", query=q, neighbours=top.10,
    main="iris versicolour/virginica", plot.hist=TRUE, plot.cor=FALSE)

From the top 10 results:

print(top.10, row.names=F)
knitr::kable(top.10, row.names=F)

The 10 (unweighted) neighbours yield a 50/50 classification:

prediction <- with(knn(q, iris.data, k=10), classifier(species, distance))
with(prediction, paste0("prediction = ", pred, " confidence = ", conf*100, "%"))

A very peculiar iris...

A rare iris was discovered. It had petals as large as an iris versicolor, and a stem the size of a setosa...

# construct query as the mean iris setosa type. 
q <- as.list(colMeans(iris.data[iris.data$species == "Iris-setosa", 1:4]))
q$petal.width <- 4*q$petal.width
q$petal.length <- 1.75*q$petal.length
print(unlist(q))
knitr::kable(as.data.frame(q))

Given the 10 nearest known neighbours to the disovery:

top.10 <- knn(q, iris.data, k=10)
print(top.10, row.names=F)
knitr::kable(top.10, row.names=F)

The discovery is a most likely (90% by majority voting) setosa...

unlist(classifier(top.10$species, top.10$distance))
knitr::kable(as.data.frame(classifier(top.10$species, top.10$distance)))

Which can be seen in the following visualisation...

visualise(iris.data, class.name="species", query=q, neighbours=top.10, 
    main="neighbours of the peculiar iris", plot.hist=TRUE, plot.cor=FALSE)

In the interest of sampling bias, The discovery was hastily discarded.



phil8192/lazy-iris documentation built on May 25, 2019, 2:56 a.m.