knnExtract: Nearest Neighbors Features


Description

Performs feature engineering on the original dataset and extracts new features, generating a new dataset. Since KNN is a nonlinear learner, it produces a nonlinear mapping of the original dataset, making it possible to achieve great classification performance with a simple linear model on the new features, such as GLM or LDA.

Usage

knnExtract(xtr, ytr, xte, k = 1, normalize = NULL, folds = 5,
  nthread = 1)

Arguments

xtr

matrix containing the training instances.

ytr

factor array with the training labels.

xte

matrix containing the test instances.

k

number of neighbors considered (default is 1). This choice is directly related to the number of new features, so be careful with it. A large k may greatly increase the computing time for big datasets.

normalize

variable scaler as in fastknn.

folds

number of folds (default is 5), or an array of fold ids between 1 and the number of folds identifying which fold each observation is in. The smallest allowable value is folds = 3.

nthread

the number of CPU threads to use (default is 1).

Details

This feature engineering procedure generates k * c new features using the distances between each observation and its k nearest neighbors inside each class, where c is the number of class labels. The procedure can be summarized as follows:

  1. Generate the first feature as the distance to the nearest neighbor in the first class.

  2. Generate the second feature as the sum of the distances to the 2 nearest neighbors in the first class.

  3. Generate the third feature as the sum of the distances to the 3 nearest neighbors in the first class.

  4. And so on.

Repeat this for each class to generate the k * c new features. For the new training set, an n-fold CV approach is used to avoid overfitting.
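
A minimal sketch of this cumulative-distance idea, using base R and plain Euclidean distances. The helper name knn_features is made up for illustration; it is not the package implementation, and it skips the n-fold CV scheme mentioned above as well as the fast neighbor search used internally.

knn_features <- function(xtr, ytr, xnew, k = 3) {
  classes <- levels(ytr)
  feat <- matrix(0, nrow = nrow(xnew), ncol = k * length(classes))
  for (ci in seq_along(classes)) {
    # Training instances belonging to the current class
    # (assumes each class has at least k training instances)
    x.class <- xtr[ytr == classes[ci], , drop = FALSE]
    for (i in seq_len(nrow(xnew))) {
      # Euclidean distances from the i-th new observation to that class
      d <- sqrt(colSums((t(x.class) - xnew[i, ])^2))
      # Feature j = sum of the distances to the j nearest neighbors
      feat[i, (ci - 1) * k + seq_len(k)] <- cumsum(sort(d)[seq_len(k)])
    }
  }
  feat
}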

This procedure is not trivial to implement from scratch, but this function provides an easy interface to do it and is very fast.
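
To give an idea of the CV trick used for the training set, here is a hedged sketch building on the knn_features helper above (extract_oof is a hypothetical name, not part of the package): features for the observations in each fold are computed using only the training instances from the remaining folds.

extract_oof <- function(xtr, ytr, k = 3, folds = 5) {
  # Randomly assign each training observation to a fold
  fold.id <- sample(rep(seq_len(folds), length.out = nrow(xtr)))
  new.tr <- matrix(0, nrow = nrow(xtr), ncol = k * nlevels(ytr))
  for (f in seq_len(folds)) {
    in.fold <- fold.id == f
    # Features for fold f come only from the other folds, avoiding leakage
    new.tr[in.fold, ] <- knn_features(xtr[!in.fold, , drop = FALSE],
                                      ytr[!in.fold],
                                      xtr[in.fold, , drop = FALSE], k = k)
  }
  new.tr
}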

Value

list with the new data: new.tr, a matrix with the KNN features for the training instances, and new.te, a matrix with the KNN features for the test instances.
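
As a quick illustration of the returned structure (x.tr, y.tr and x.te come from the Examples section below):

new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)
dim(new.data$new.tr)   # training rows, k * c columns
dim(new.data$new.te)   # test rows, k * c columns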

Author(s)

David Pinto.

Examples

## Not run: 
library("mlbench")
library("caTools")
library("fastknn")
library("glmnet")

data("Ionosphere")

x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class

# Remove near zero variance columns
x <- x[, -c(1,2)]

set.seed(2048)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

# GLM with original features
glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, x.te, type = "class"))
yhat <- factor(yhat, levels = levels(y.tr))
classLoss(actual = y.te, predicted = yhat)

set.seed(2048)
new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)

# GLM with KNN features
glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, new.data$new.te, type = "class"))
yhat <- factor(yhat, levels = levels(y.tr))
classLoss(actual = y.te, predicted = yhat)

## End(Not run)
