``` {r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

``` {r setup}
library(KnnDist)
```
This vignette walks the reader through the KnnDist package using a published morphological dataset of rat lower first molars. The goal of this vignette is to provide the user with the code and tools to use this package's main functionality. The main goal of the k-nearest neighbour (k-NN) method is to accurately and efficiently classify specimens of unknown group membership to user-defined groups. This classification method uses a set number of nearest neighbours (called k) to assign group classification; for example, if there are two groups and k is set to 10, then the 10 nearest neighbours to the unknown specimen will be examined and the unknown specimen will be classified to whichever group makes up more than 5 of those neighbours. The tools presented here provide both classification based on user-supplied distances and approaches to optimising k (namely stepwise increase approaches).
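As a minimal base-R illustration of the majority-vote idea before turning to the package functions (the distances `d` and group labels `g` below are made up for this example):

``` {r Minimal majority vote illustration}
# Hypothetical distances from one unknown specimen to six known specimens
d <- c(0.10, 0.12, 0.15, 0.30, 0.31, 0.33)
g <- c('A', 'A', 'B', 'A', 'B', 'B')
k <- 3
Votes <- table(g[order(d)][1:k]) # groups of the k nearest neighbours
names(Votes)[which.max(Votes)]   # majority vote: 'A'
```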
To first present a very basic visualization of the principle behind k-NN we will examine a set of dummy data; we will then move on to examination of real hyperdimensional data. This is because the representation of hyperdimensional data in 2D can be misleading, and specimens that look like they are nearest neighbours in 2D may in fact be quite distant when multiple dimensions are taken into account. Below we can see the result of k=7 is 'red' (note that the sorted distances include the specimen's own zero distance to itself in the first position, so indices 1:8 capture its 7 nearest neighbours).
``` {r Visualisation of k-NN with simple 2D dummy data}
Nearest <- sort(DummyData$DistMat[1,], index.return=TRUE)
plot(DummyData$Coords, pch=16, col=DummyData$Groups)
apply(X = DummyData$Coords[Nearest$ix[1:8],], MARGIN = 1, FUN = function(SpecLoc, X){lines(rbind(SpecLoc, X))}, SpecLoc = DummyData$Coords[1,])
```
Rattini tribe species discrimination using dental morphology: dataset and distance calculations
-------------------------------------------------------------------------------------------------------

The dataset we will use here includes 14 species of the Rattini tribe with a total of 395 specimens.

``` {r Example dataset table of specimen numbers}
table(RatData$Info$Species)
```
The sample sizes are unequal, with the smallest being *Rattus nitidus*.
``` {r Identifying smallest sample size}
names(which(table(RatData$Info$Species)==min(table(RatData$Info$Species))))
```
For this dataset we will work with the raw shape data. The Procrustes distances can be calculated using the `ProcDistanceTable` function or the `ProcDistanceTablePar` function. This distance table takes some time to compute so it is provided in the package data. The `ProcDistanceTable` function is a wrapper of the `procdist` function from the `shapes` package that outputs a square distance matrix. This function is slow as it needs to calculate the pairwise Procrustes superimposition for each pair of specimens. Use the `ProcDistanceTablePar` function for large datasets to make use of parallel processing. Note this will only be faster for large datasets, as with small ones the time taken to recompile the results of each parallel process can create a bottleneck. For example, using an Intel(R) Core(TM) i7-8565U CPU with 4 cores and 8 threads to process this full dataset, single-core processing takes ~2.36 mins whereas parallel processing takes ~1.17 mins; with this processor, parallel processing becomes more efficient than single-threaded processing at around 60 specimens. Parallel processing is not demonstrated in this vignette as it can complicate automatic vignette and package checks; parallel versions of functions can typically be accessed using the same function name with 'Par' appended.

``` {r Calculating procrustes distance matrix among specimens}
RatDistMat <- RatData$FullProcrustesDistMat
str(RatData$FullProcrustesDistMat)
```
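For reference, a sketch of how the stored matrix could be recomputed (not run here; `LMs` is a hypothetical landmark array of the kind accepted by `shapes::procdist`, and the exact argument names of `ProcDistanceTable` are not shown in this vignette):

``` {r Recomputing the Procrustes distance table, eval = FALSE}
# Not run: recompute the full Procrustes distance table from a hypothetical
# landmark array LMs (landmarks x dimensions x specimens)
RatDistMat <- ProcDistanceTable(LMs)
# For large datasets (roughly >60 specimens on the CPU timed above) the
# parallel version may be faster
RatDistMat <- ProcDistanceTablePar(LMs)
```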
First, we can take a single individual and treat it as an unknown to assess the approach of the k-NN classification analysis. The function `KnnIDingSingleInd` will return the results of both a simple count of the majority vote and the results of a weighted majority vote (see below for details). We will begin by examining the simple majority vote process and results.
``` {r Classification by nearest 10 neighbours}
KnnIDingSingleInd(X = RatDistMat[1,-1], K = 10, GroupMembership = RatData$Info$Species[-1], TieBreaker = 'Report')
```
This is achieved by taking the distance matrix, sorting it and taking the *k* (10 in this case) nearest neighbours. A simple majority vote count can be used to determine group membership classification.

``` {r Classification by nearest 10 neighbours breakdown}
Distance2Unknown <- cbind(RatDistMat[1,-1], RatData$Info$Species[-1])
SortNearest <- sort(Distance2Unknown[,1], index.return=TRUE)
NearestNeighbours <- Distance2Unknown[SortNearest$ix,]
table(NearestNeighbours[1:10,2])
```
The TieBreaker argument implements one of three options for dealing with a tie: it can randomly select an individual ('Random'); report the multiple classification calls separated by '_' ('Report'); or remove the result, returning 'UnIDed' ('Remove'). This last TieBreaker option is primarily for leave-one-out correct cross-validation procedures (see below).
``` {r Classification by nearest 10 neighbours with a tied vote example}
KnnIDingSingleInd(X = RatDistMat[24,-24], K = 10, GroupMembership = RatData$Info$Species[-24], TieBreaker = 'Remove')
```
k-NN by weighted majority
-------------------------------------------------------------------------------------------------------

A possible approach to weighting the voting system is to divide 1 by the distance to the specimen being identified. The sum of the *k* nearest weights is then calculated for each group, and the specimen is assigned to the group with the highest summed weight. This can produce different results from the majority vote approach; the simplest example of this is when the unweighted majority vote (i.e. count) is tied, in which case the weighted result is likely to differ.

``` {r Classification by nearest 10 neighbours with difference in weighted v unweighted results}
KnnIDingSingleInd(X = RatDistMat[24,-24], K = 10, GroupMembership = RatData$Info$Species[-24], TieBreaker = 'Report')
```
For example, in the case below the simple vote is tied between *R. argentiventer* and *R. tanezumi*.
``` {r Classification by nearest 10 neighbours with a tied vote breakdown}
Distance2Unknown <- cbind(RatDistMat[24,-24], RatData$Info$Species[-24])
SortNearest <- sort(Distance2Unknown[,1], index.return=TRUE)
NearestNeighbours <- Distance2Unknown[SortNearest$ix,]
UnweightedScores <- table(NearestNeighbours[1:10,2])
sort(UnweightedScores, decreasing = TRUE)
```
The weighted approach returns a single result with no tied vote.

``` {r Classification by nearest 10 neighbours calculating the weighted result}
WeightedScores <- aggregate(1/as.numeric(NearestNeighbours[1:10,1]), FUN=sum, by=list(NearestNeighbours[1:10,2]))
WeightedScores[sort(WeightedScores[,2], index.return=TRUE, decreasing = TRUE)$ix,]
```
As demonstrated above, the k-NN process is relatively simple to implement, and the KnnDist package carries out the above operations in a simple manner. However, assessing the accuracy of k, and then efficiently optimising k when sample sizes are unbalanced, is a little less straightforward. Here the KnnDist package provides a couple of options.
A visualization of the differences between weighted and unweighted classification is provided below. Again this example is carried out with the provided dummy data (the example Rattini tribe dataset is hyperdimensional and representation of it in 2D can be misleading).
``` {r Visualisation of k-NN with weighted and unweighted differences}
Nearest <- sort(DummyData$DistMat[1,], index.return=TRUE)
plot(DummyData$Coords, pch=16, col=DummyData$Groups)
apply(X = DummyData$Coords[Nearest$ix[1:7],], MARGIN = 1, FUN = function(SpecLoc, X){lines(rbind(SpecLoc, X))}, SpecLoc = DummyData$Coords[1,])
```
When *k*-NN is applied with *k* set to 6 the outcome of the unweighted analysis is mixed, but when carried out with a weighted analysis the result is the red group due to the closer proximity of those specimens.

``` {r k-NN result with weighted and unweighted differences}
KnnIDingSingleInd(DummyData$DistMat[1,-1], K = 6, GroupMembership = DummyData$Groups[-1], TieBreaker = 'Report')
```
The previous demonstrations of k-NN identification are of little value unless they are placed in the context of how well the method performs overall on the total dataset from which those examples come. Therefore, the next step is to assess the accuracy of the k-NN procedure on a given training dataset. This can be achieved with a leave-one-out correct cross-validation analysis, where an individual is treated as an unknown and is identified based on the rest of the dataset; this is carried out repeatedly to cycle through all individuals in the dataset, and the percentage of times the identification is correct is calculated and considered the percentage correct cross-validation.
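The core of this procedure can be sketched in a few lines of base R before using the package functions (an unweighted sketch run on the dummy data from above; ties are simply resolved by first occurrence here for brevity):

``` {r Leave-one-out k-NN sketch on dummy data}
k <- 5
Preds <- sapply(seq_along(DummyData$Groups), function(i){
  d <- DummyData$DistMat[i, -i]       # distances from specimen i to all others
  g <- DummyData$Groups[-i]           # group labels of all other specimens
  Votes <- table(g[order(d)][1:k])    # votes among the k nearest neighbours
  names(Votes)[which.max(Votes)]      # which.max takes the first group on a tie
})
mean(Preds == DummyData$Groups) * 100 # percentage correct cross-validation
```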
We will start with k set to 10, as described above, and set the TieBreaker argument to 'Remove'; this means that if there is no majority vote the specimen will be removed and considered unidentified. The function will return the results of both a weighted and an unweighted k-NN analysis. Note that we set the IgnorePrompts argument to TRUE in the vignette; the default is FALSE, which is primarily for when resampling to equal sample size and Verbose outputs are desired, as this can lead to an extremely slow process. If both equal sample size and Verbose are required then the KnnDistCVPar function, which implements parallel processing, is likely to be much faster.
``` {r Simple correct cross validation}
KnnDistCV(DistMat = RatDistMat, GroupMembership = RatData$Info$Species, K = 10, Equal = FALSE, TieBreaker = 'Remove', IgnorePrompts = TRUE)
```
This dataset includes groups of unequal sample size and will require additional treatment to assess the accuracy of the *k*-NN process when applied to new unknown individuals. This is because with unequal sample sizes the method is more likely to correctly identify members of the larger group just by chance; for example, if one group has 95 individuals and the other group has 5, then if all specimens were identified to the larger group 95% of the identifications would be correct, misleadingly suggesting that the correct cross-validation is 95%. To get a better idea of how well the method will work with an unknown individual we can iteratively resample to equal sample size; this can be achieved by setting the Equal argument to TRUE and setting the EqualIter argument to the number of times the dataset will be resampled to equal sample size. Note that this process can be slow when a high EqualIter number is set; it can be greatly sped up using the parallel version of this function, KnnDistCVPar, which accepts the same arguments as the KnnDistCV function.

``` {r Correct cross validation with resampling to equal sample size}
EqualRatCVP <- KnnDistCV(DistMat = RatDistMat, GroupMembership = RatData$Info$Species, K = 10, Equal = TRUE, EqualIter = 10, TieBreaker = 'Remove', IgnorePrompts = TRUE)
apply(EqualRatCVP, MARGIN = 2, FUN = mean)
```
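Conceptually, each resampling iteration downsamples every group to the smallest group's sample size before cross-validating (a sketch of the idea only; the package handles this internally):

``` {r Sketch of resampling to equal sample size}
nMin <- min(table(RatData$Info$Species))
# Randomly keep nMin specimens of each species
Keep <- unlist(lapply(split(seq_along(RatData$Info$Species), RatData$Info$Species), sample, size = nMin))
table(RatData$Info$Species[Keep]) # all groups now share the smallest sample size
```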
The KnnDistCV function outputs the results of each iteration of the resampling exercise, so the user then needs to calculate the mean value (and chosen percentiles if they are of interest) from these to summarise the results and get a better indication of classification accuracy.
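For example, 5th and 95th percentiles across the resampling iterations can be obtained from the `EqualRatCVP` object returned above:

``` {r Percentiles of the resampled correct cross validation results}
apply(EqualRatCVP, MARGIN = 2, FUN = quantile, probs = c(0.05, 0.95))
```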
We have kept k set to 10 so far, but with a change in k the results might (and usually do!) change. Therefore users should optimise k for classification in their chosen dataset. KnnDist provides a specific tool for this in the form of KnnDistCVStepwise and KnnDistCVStepwisePar (the parallel processing equivalent).
The simplest way to optimise k is to start at k=1, increase k in a stepwise manner up to a given number (e.g. the smallest sample size in the dataset), and assess the classification accuracy at each step, as is done with the KnnDistCVStepwise functions. A sketch of the idea is shown below.
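Building on the leave-one-out sketch above, the stepwise search is conceptually just a loop over values of k (again run on the small dummy dataset, so only a short range of k is tried):

``` {r Stepwise k sketch on dummy data}
CVforK <- function(k){
  Preds <- sapply(seq_along(DummyData$Groups), function(i){
    d <- DummyData$DistMat[i, -i]
    g <- DummyData$Groups[-i]
    Votes <- table(g[order(d)][1:k])
    names(Votes)[which.max(Votes)]
  })
  mean(Preds == DummyData$Groups) * 100
}
sapply(1:5, CVforK) # percentage correct cross-validation for k = 1 to 5
```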
This can be done with the function KnnDistCVStepwise as follows (note that in the first instance this will be done with Equal set to FALSE, so the results are for unequal sample sizes as a demonstration; the default is to resample to equal sample size 100 times). Below is an example with the stepwise increase in k on the x axis and the percentage of correct classifications on the y axis. We can see that an unweighted k of 1 is optimal, a weighted k of 7 is optimal, and that the weighted optimal k performs better than the unweighted optimal k:
``` {r k optimisation with stepwise correct cross validation}
UnequalKOpt <- KnnDistCVStepwise(DistMat = RatDistMat, GroupMembership = RatData$Info$Species, Kmax = 10, Equal = FALSE, PrintProg = FALSE, PlotResults = TRUE, TieBreaker = 'Remove')
```
Again, using unequal sample sizes might bias the results and present an inaccurate indication of the confidence with which this method can classify specimens. Therefore, resampling to equal sample size is also advised for the *k* optimisation. This can be achieved by setting the Equal argument to TRUE. The result will then plot the mean correct cross-validation percentages with a range around them at the 5th and 95th percentiles. Note that resampling to equal sample size combined with stepwise *k* optimisation is an extremely slow process and it is therefore advised that KnnDistCVStepwisePar is used. Here KnnDistCVStepwise is used because of the limitations of providing a vignette with parallel processing functions. As a result, the EqualIter argument has been set low, at 10; in practice it is advisable to resample to equal sample size many more than 10 times, and the default is 100.

``` {r k optimisation with stepwise correct cross validation with resampling: Rats}
EqualKOpt <- KnnDistCVStepwise(DistMat = RatDistMat, GroupMembership = RatData$Info$Species, Kmax = 10, Equal = TRUE, EqualIter = 10, PrintProg = FALSE, Verbose = TRUE, PlotResults = TRUE, TieBreaker = 'Remove')
```
So far we have applied the KnnDist functions to geometric morphometric full Procrustes distances calculated from rat teeth, but these functions can be applied to any dataset from which a user might wish to calculate distances or dissimilarities. Comparison of bird songs, for example, is often carried out using a dissimilarity metric based on dynamic time warping, and user-defined groups such as regional/local dialects or age groups can then be assessed using k-NN analyses.
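One possible route to such a dissimilarity matrix is the `dtwDist` function of the `dtw` package, illustrated here on synthetic series (an assumption for illustration only; the exact settings used to produce `BirdsongData$SongDistMat` are not documented in this vignette):

``` {r Illustrative DTW dissimilarity matrix}
library(dtw)
set.seed(1)
# Four synthetic 'songs', one time series per row
Songs <- matrix(rnorm(4 * 50), nrow = 4)
dtwDist(Songs) # pairwise dynamic time warping dissimilarities
```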
``` {r k optimisation with stepwise correct cross validation with resampling: Blue chaffinch}
BirdTotalEqualKOpt <- KnnDistCVStepwise(DistMat = BirdsongData$SongDistMat, GroupMembership = BirdsongData$Groups$Regions, Kmax = 10, Equal = TRUE, EqualIter = 10, PrintProg = FALSE, Verbose = TRUE, PlotResults = TRUE, TieBreaker = 'Remove')
```
Note that when running the above we get a warning that the Kmax number is higher than the smallest sample size.

``` {r Blue chaffinch dataset sample sizes}
table(BirdsongData$Groups$Regions)
```
The 'Central' region specimens are a widely dispersed group of individuals that do not clearly belong to the other regional groups, and the sample size is very small. Therefore, we can try removing them.
``` {r k optimisation: Blue chaffinch central group removed}
Rem <- which(BirdsongData$Groups$Regions=='Central')
BirdSubsetEqualKOpt <- KnnDistCVStepwise(DistMat = BirdsongData$SongDistMat[-Rem,-Rem], GroupMembership = BirdsongData$Groups$Regions[-Rem], Kmax = 10, Equal = TRUE, EqualIter = 10, PrintProg = FALSE, Verbose = TRUE, PlotResults = TRUE, TieBreaker = 'Remove')
```
Finally, if the stepwise *k* optimisation has already been carried out we can plot the results again, together with other analyses, using the function PlotStepwise. This can be done as follows (note that additional stepwise results can be added to the plot using the Add argument):

``` {r Plotting stepwise k results for comparison among different analyses}
# First (in red) plotting the stepwise k-NN results for the total dataset
# Total unweighted
PlotStepwise(StepwiseResultsMat = BirdTotalEqualKOpt$Unweighted.Results, Percentiles = c(0.05,.95), PlotCol = 'pink')
# Total weighted
PlotStepwise(StepwiseResultsMat = BirdTotalEqualKOpt$Weighted.Results, Percentiles = c(0.05,.95), PlotCol = 'red', Add = TRUE)
# Then (in blue) plotting the stepwise k-NN results for the dataset with central Tenerife specimens removed
# Subset unweighted
PlotStepwise(StepwiseResultsMat = BirdSubsetEqualKOpt$Unweighted.Results, Percentiles = c(0.05,.95), PlotCol = 'lightblue', Add = TRUE)
# Subset weighted
PlotStepwise(StepwiseResultsMat = BirdSubsetEqualKOpt$Weighted.Results, Percentiles = c(0.05,.95), PlotCol = 'darkblue', Add = TRUE)
```