separationkNN: Proportion of same-class nearest neighbours
In montesmariana/semcloud: Post-processing of token-level clouds

separationkNN

R Documentation

Proportion of same-class nearest neighbours

Description

This function takes a square matrix dmx that contains item by item distances, and a factor classes (with as many items as there are rows, and thus columns, in dmx) that assigns a class to each item. The function returns a measure q of how well the distances in dmx 'capture' the classification in classes, where distances are taken to 'capture a classification' to the extent that items are (immediately) surrounded by other items from the same class, and not by items from some other class. In addition to an overall cluster quality for all the data taken together, the function returns the cluster quality of individual points, the cluster quality of individual classes, and the mean cluster quality over classes. All these measures are called q in the output of the function.

Usage

separationkNN(dmx, classes, k = NULL, weights = c("linear", "s-curve", "none"))

Arguments

`dmx`	A square matrix containing item by item distances
`classes`	A factor of the same length as the number of rows and columns in `dmx`; the class in position i in `classes` is the class assigned to the item of row i and column i in `dmx`
`k`	The number of nearest neighbours to consider. If `k` is not specified, it is set to the smaller of two values: the total number of items divided by ten, or the size of the smallest class minus one. This default is only a crude guess at a sensible value for `k`, and in most cases you will want to override it. If you specify `k` explicitly, all values from one up to the total number of items minus one are allowed.
`weights`	Determines how the cluster quality of a point is derived from the class membership of its k nearest neighbours. This cluster quality is 'the weighted mean of class membership values of these neighbours (1=same class as target item; 0=different class)', with the weights being determined by the `weights` argument. The weights are k numbers, the first of which is the weight of the closest neighbour, the second of which is the weight of the second closest neighbour, etc. The sum of the weights is always one. When `weights` is `"linear"`, which is the default, weights decrease linearly as one progresses through the set of neighbours (starting from the one that is closest to the target item). When `weights` is `"s-curve"`, weights decrease as one progresses through the set of neighbours (starting from the one that is closest to the target item) according to the s-shape of `y<-(40:-40)/10; plot(1:81, exp(y) / (1 + exp(y)), type="l")`, but with the actual weights rescaled so that they add up to one. Finally, when `weights` is `"none"`, all neighbours have equal weight. The actual weights used in a call to `separationkNN()` can be found in the `weights` component of its output.
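The default for `k` and the three weighting schemes can be sketched roughly as follows. This is an illustration only, not the package's own code: `sketch_default_k()` and `sketch_weights()` are hypothetical helpers, rounding the number of items divided by ten down to a whole number is an assumption, and so is spreading the s-curve over k evenly spaced points between 4 and -4.

```r
# Hypothetical helper: default k as described for the `k` argument.
# Rounding n / 10 down is an assumption; the documentation does not specify it.
sketch_default_k <- function(classes) {
  n <- length(classes)
  smallest <- min(table(factor(classes)))  # size of the smallest class
  min(floor(n / 10), smallest - 1)
}

# Hypothetical helper: k weights that sum to one, nearest neighbour first.
sketch_weights <- function(k, type = c("linear", "s-curve", "none")) {
  type <- match.arg(type)
  w <- switch(type,
    "linear"  = k:1,                             # k, k - 1, ..., 1
    "s-curve" = plogis(seq(4, -4, length.out = k)),  # exp(y) / (1 + exp(y))
    "none"    = rep(1, k)                        # equal weights
  )
  w / sum(w)                                     # rescale so the weights sum to one
}

sketch_default_k(factor(rep(c("a", "b"), c(50, 50))))  # min(10, 49), i.e. 10
sketch_weights(4, "linear")                            # 0.4 0.3 0.2 0.1
round(sketch_weights(4, "s-curve"), 2)
sketch_weights(4, "none")                              # 0.25 0.25 0.25 0.25
```

To see which weights a real call used, inspect the `weights` component of the returned object rather than relying on this sketch.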

Details

The q measures are calculated as follows. First, an item-specific cluster quality is calculated for each item, based on the share of 'same class items' among its k nearest neighbours. It is not a simple proportion, but the weighted mean of the values of the k nearest neighbours, where a 'same class neighbour' has value one, a 'different class neighbour' has value zero, and the weights of the neighbours depend on the weights argument. The higher the measure, the better the cluster quality for that item. In the default settings, weights decrease linearly with the neighbour's rank of 'distance from the item', and all weights add up to one. For instance, if k is 1, the weight is 1. If k is 2, the weights, starting from the nearest neighbour, are .67 and .33. If k is 3, the weights are .5, .33, and .17. If k is 4, they are .4, .3, .2, and .1. Etc.
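As a rough illustration of the item-specific measure with linear weights, the sketch below computes it by hand for four points on a line. The function `sketch_pointqual()` is hypothetical and is not the package's implementation; in particular, breaking distance ties by item order is an assumption.

```r
# Hypothetical sketch of the item-specific quality with linear weights.
sketch_pointqual <- function(dmx, classes, k) {
  dmx <- as.matrix(dmx)
  w <- (k:1) / sum(k:1)                  # linear weights, nearest neighbour first
  vapply(seq_len(nrow(dmx)), function(i) {
    others <- setdiff(seq_len(nrow(dmx)), i)          # exclude the item itself
    nn <- others[order(dmx[i, others])][seq_len(k)]   # indices of the k nearest
    sum(w * (classes[nn] == classes[i]))              # weighted same-class mean
  }, numeric(1))
}

x <- c(0, 1, 5, 6)                          # two tight pairs on a line
classes <- factor(c("a", "a", "b", "b"))
sketch_pointqual(dist(x), classes, k = 2)
```

With k = 2, every item here has a same-class nearest neighbour (weight 2/3) and a different-class second neighbour (weight 1/3), so each item-specific quality is 2/3.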

The overall cluster quality of the data is then calculated as the mean cluster quality of all items. Additionally, the cluster quality for every class in classes is calculated as the mean cluster quality of the items belonging to that class. The mean class quality, finally, is the mean of all class-specific cluster quality values.
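The aggregation step can be illustrated with a few made-up item-level values. The numbers below are invented for the example, and `tapply()` is used only as a convenient stand-in; the function itself is documented as returning a table for the class-specific values.

```r
# Invented item-specific qualities for five items in two unbalanced classes.
pointqual <- c(1, 0.5, 0.75, 0.25, 0.5)
classes   <- factor(c("a", "a", "b", "b", "b"))

globqual      <- mean(pointqual)                # mean over all items: 0.6
classqual     <- tapply(pointqual, classes, mean)  # per class: a = 0.75, b = 0.5
meanclassqual <- mean(classqual)                # mean over classes: 0.625

globqual
classqual
meanclassqual
```

The global quality and the mean class quality differ here because the classes have different sizes: the global value weights every item equally, while the mean class quality weights every class equally.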

Value

An object of the class clustqualkNN, which is a list containing at least the following components:

`globqual`	The global cluster quality q
`meanclassqual`	The mean of all class-specific cluster quality values q
`classqual`	A table with, for each class, its class-specific cluster quality q
`pointqual`	A numeric vector with for each item its item-specific cluster quality q
`weights`	A numeric vector with the weights that were used
`k`	A number indicating which `k` was used

Examples

# we create a 'point cloud', with points belonging to two classes
points <- rbind(matrix(rnorm(100, 2, 2), ncol=2),
                matrix(rnorm(100, 4, 2), ncol=2))
# dmx is documented as a square matrix, so we convert the dist object
dst <- as.matrix(dist(points))
classes <- as.factor(rep(c("a","b"), c(50, 50)))
# we analyse the cluster quality, letting the procedure choose k
fitkNN <- separationkNN(dst, classes)
summary(fitkNN)
fitkNN$globqual        # global cluster quality
fitkNN$meanclassqual   # mean class quality
fitkNN$classqual       # class-specific quality

# we analyse the cluster quality, setting k to 25
fitkNN <- separationkNN(dst, classes, k=25)
summary(fitkNN)

montesmariana/semcloud documentation built on April 15, 2022, 6:57 a.m.