Description
Do feature engineering on the original dataset and extract new features, generating a new dataset. Since KNN is a nonlinear learner, it performs a nonlinear mapping of the original dataset, making it possible to achieve good classification performance with a simple linear model on the new features, such as GLM or LDA.
Usage

knnExtract(xtr, ytr, xte, k = 1, normalize = NULL, folds = 5,
  nthread = 1)
Arguments

xtr: matrix containing the training instances.

ytr: factor array with the training labels.

xte: matrix containing the test instances.

k: number of neighbors considered (default is 1). This choice directly determines the number of new features, so be careful with it: a large k generates many new features.

normalize: variable scaler, as in fastknn.

folds: number of folds (default is 5), or an array with fold ids between 1 and n identifying what fold each training observation belongs to.

nthread: the number of CPU threads to use (default is 1).
Details

This feature engineering procedure generates k * c new features, using the distances between each observation and its k nearest neighbors inside each class, where c is the number of class labels. The procedure can be summarized as follows:

1. Generate the first feature as the distance to the nearest neighbor in the first class.
2. Generate the second feature as the sum of the distances to the 2 nearest neighbors in the first class.
3. Generate the third feature as the sum of the distances to the 3 nearest neighbors in the first class.
4. And so on, up to the k nearest neighbors.
Repeat this for each class to generate the k * c new features. For the new training set, an n-fold CV approach is used to avoid overfitting. This procedure is not trivial, but this method provides an easy interface to do it, and is very fast.
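The per-class distance features described above can be sketched in base R. This is an illustrative re-implementation, not the package's actual code; the helper name knn_feats is made up for this example, and it skips the n-fold CV step that knnExtract applies to the training set.

```r
# Illustrative sketch of the procedure above: for each class, sort the
# Euclidean distances from every test row to that class's training rows
# and take cumulative sums of the k smallest. NOT the package's
# implementation; assumes each class has at least k training rows.
knn_feats <- function(xtr, ytr, xte, k = 1) {
  classes <- levels(ytr)
  out <- matrix(0, nrow(xte), k * length(classes))
  U <- 1 * upper.tri(diag(k), diag = TRUE)   # cumulative-sum operator
  for (ci in seq_along(classes)) {
    xc <- xtr[ytr == classes[ci], , drop = FALSE]
    # Squared Euclidean distances: |a|^2 + |b|^2 - 2 a.b
    d2 <- outer(rowSums(xte^2), rowSums(xc^2), "+") - 2 * xte %*% t(xc)
    d <- sqrt(pmax(d2, 0))
    # k smallest distances per test row, in increasing order
    dk <- matrix(apply(d, 1, function(r) sort(r)[1:k]), ncol = k, byrow = TRUE)
    # Feature j for this class = sum of the j nearest distances
    out[, (ci - 1) * k + seq_len(k)] <- dk %*% U
  }
  out
}
```

The result is an nrow(xte) by k * c matrix, with the k features for the first class in the first k columns, and so on.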
Value

A list with the new data:

new.tr: matrix with the new training instances.

new.te: matrix with the new test instances.
Author(s)

David Pinto.
Examples

## Not run:
library("mlbench")
library("caTools")
library("fastknn")
library("glmnet")
data("Ionosphere")
x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class
# Remove near zero variance columns
x <- x[, -c(1,2)]
set.seed(2048)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
# GLM with original features
glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, x.te, type = "class"))
yhat <- factor(yhat, levels = levels(y.tr))
classLoss(actual = y.te, predicted = yhat)
set.seed(2048)
new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)
# GLM with KNN features
glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, new.data$new.te, type = "class"))
yhat <- factor(yhat, levels = levels(y.tr))
classLoss(actual = y.te, predicted = yhat)
## End(Not run)