library(learnr)
library(gradethis)
tutorial_options(exercise.timelimit = 5, exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)

# Load iris and recode Species as a binary factor: virginica vs. not virginica
data("iris")
iris$Species = as.factor(iris$Species == 'virginica')
levels(iris$Species) = c('not virginica', 'virginica')

library(GGally)
library(tidyverse)
library(MASS)
library(class)
library(cowplot)
When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.
Answer the following questions from Exercise 4 in Introduction to Statistical Learning, chapter 4.
quiz(
  question("Suppose that we have a set of observations, each with measurements on $p = 1$ feature, `X`. We assume that `X` is uniformly distributed on $[0,1]$. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of `X` closest to that test observation. For instance, in order to predict the response for a test observation with `X = 0.6`, we will use observations in the range `[0.55, 0.65]`. On average, what fraction of the available observations will we use to make the prediction?",
    answer("10%", correct = TRUE),
    answer("100%"),
    answer("1%"),
    answer("20%"),
    allow_retry = TRUE
  ),
  question("Now suppose that we have a set of observations, each with measurements on $p = 2$ features, `X1` and `X2`. We assume that `(X1, X2)` are uniformly distributed on $[0,1] \\times [0,1]$. We wish to predict a test observation's response using only observations that are within 10% of the range of `X1` _and_ within 10% of the range of `X2` closest to that test observation. For instance, in order to predict the response for a test observation with `X1 = 0.6` and `X2 = 0.35`, we will use observations in the range `[0.55, 0.65]` for `X1` and in the range `[0.3, 0.4]` for `X2`. On average, what fraction of the available observations will we use to make the prediction?",
    answer("10%"),
    answer("100%"),
    answer("1%", correct = TRUE),
    answer("20%"),
    allow_retry = TRUE
  ),
  question("Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?",
    answer("$0.1^{100} \\approx 0$%", correct = TRUE),
    answer(".01%"),
    answer(".1%"),
    answer("1%"),
    allow_retry = TRUE
  )
)
Take a minute to think about the following question before clicking to continue.
Using your answers to parts (1)–(3), argue that a drawback of KNN when $p$ is large is that there are very few training observations "near" any given test observation.
Solution: As $p$ increases, the fraction of training observations that are "near" a test observation decreases exponentially (from 10% when $p = 1$, to 1% when $p = 2$, to essentially 0% when $p = 100$). With so few truly nearby points, the $k$ nearest neighbors of a test observation are typically far away from it, so the "neighbors" used by KNN are not actually very similar to the test observation.
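The drop-off is easy to see numerically. A minimal sketch (not part of the graded exercises) computing the expected fraction $0.1^p$ of usable observations:

# Expected fraction of observations within 10% of the range on every feature,
# assuming the features are independent and uniform on [0, 1]
p = c(1, 2, 100)
data.frame(p = p, fraction_used = 0.1^p)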
Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10 % of the training observations. For $p = 1,2,$ and 100, what is the length of each side of the hypercube? Comment on your answer. Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When $p = 1$, a hypercube is simply a line segment, when $p = 2$ it is a square, and when $p = 100$ it is a 100-dimensional cube.
quiz(
  question("When p=1,",
    answer(".1", correct = TRUE),
    answer(".2"),
    answer(".3"),
    answer(".4"),
    allow_retry = TRUE
  ),
  question("When p=2,",
    answer("$\\sqrt{.1}$=0.316", correct = TRUE),
    answer("$\\sqrt{.2}$=0.447"),
    answer("$\\sqrt{.3}$=0.548"),
    answer("$\\sqrt{.4}$=0.632"),
    allow_retry = TRUE
  ),
  question("When p=100,",
    answer("$\\sqrt[100]{.1}$=0.977", correct = TRUE),
    answer("$\\sqrt[100]{.2}$=0.984"),
    answer("$\\sqrt[100]{.3}$=0.988"),
    answer("$\\sqrt[100]{.4}$=0.991"),
    allow_retry = TRUE
  )
)
Explanation: Because the features are uniform on $[0,1]$, the probability that an observation falls inside the hypercube equals the hypercube's volume. A $p$-dimensional hypercube with sides of length $l$ has volume $l^p$, so if we want the volume (and hence the expected fraction of observations captured) to equal 10%, we need $l^p = 0.1$, i.e. $l = 0.1^{1/p}$. That is $l = 0.1$, $\sqrt{.1} \approx 0.316$, and $\sqrt[100]{.1} \approx 0.977$ for $p = 1$, $2$, and $100$, respectively: to capture just 10% of the data when $p = 100$, the cube must span almost the entire range of every feature.
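A one-liner (a small sketch mirroring the calculation above) confirms these side lengths:

# Side length of a p-dimensional hypercube with volume 0.1
p = c(1, 2, 100)
signif(0.1^(1 / p), 3)   # 0.100 0.316 0.977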
For this analysis, we'll be using the `iris` dataset. This dataset has 5 variables: `Species`, `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width`. In the setup above, `Species` has been recoded into two classes, virginica and not virginica. The aim of this analysis is to classify, or identify, the species of a flower (virginica or not) using data on the other four variables.
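Because of that recoding, the two classes are unbalanced. A quick check (a small sketch, assuming the setup chunk above has been run):

table(iris$Species)   # 100 'not virginica', 50 'virginica'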
data("iris")
iris$Species = as.factor(iris$Species == 'virginica')
levels(iris$Species) = c('not virginica', 'virginica')
library(GGally)
library(tidyverse)
library(MASS)
library(class)
library(cowplot)

# Pairs plot of the four numeric features, colored by the binary Species label
ggpairs(iris, aes(color = Species), columns = 1:4) + theme_cowplot()
Estimate logistic regression and LDA using the `iris` data. Does logistic regression throw a warning? Why?
logit_iris = ____________________
coef(logit_iris)
logit_iris = coef(glm(Species ~ ., data = iris, family = 'binomial'))
lda_iris = coef(lda(Species ~ ., data = iris))
grade_result(
  pass_if(~ identical(.result, logit_iris), "Correct!"),
  fail_if(~ identical(.result, lda_iris), "Did you accidentally perform LDA instead?"),
  fail_if(~ TRUE, "Incorrect.")
)
lda_iris = ______________________
coef(lda_iris)
logit_iris = coef(glm(Species ~ ., data = iris, family = 'binomial'))
lda_iris = coef(lda(Species ~ ., data = iris))
grade_result(
  pass_if(~ identical(.result, lda_iris), "Correct!"),
  fail_if(~ identical(.result, logit_iris), "Did you accidentally perform Logistic Regression instead?"),
  fail_if(~ TRUE, "Incorrect.")
)
Solution: Yes, logistic regression throws a warning. The reason is that the two classes are perfectly separable, so the MLE is undefined: the likelihood can always be increased by scaling the coefficients up, so no finite maximizer exists. The reported result is one of infinitely many possible solutions.
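To see what the warning is telling you, one optional diagnostic (a sketch, not part of the graded exercise) is to refit the model and look at how extreme the fitted probabilities and coefficients are:

# Optional check of the separation issue: with (near-)separable classes,
# the fitted probabilities are pushed toward 0 and 1 and the coefficient
# estimates become very large in magnitude.
fit_check = glm(Species ~ ., data = iris, family = 'binomial')
range(fitted(fit_check))   # how close do the fitted probabilities get to 0 and 1?
coef(fit_check)            # note the size of the estimated coefficients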
Estimate KNN for a range of $k$, and choose the best $k$ using cross-validation (CV) as in the lecture.
set.seed(1234)
kmax = 20
err = double(kmax)
for (ii in 1:kmax) {
  pk = ___________________  # does leave one out CV
  err[ii] = _________________
}
ggplot(data.frame(k = 1:kmax, error = err), aes(k, error)) +
  geom_point(color = 'red') +
  geom_line(color = 'red') +
  theme_cowplot()
If your solution is correct, your plot should show the leave-one-out CV misclassification error at each value of $k$, and it should look like the plot produced by the solution code below.
A possible solution would look like:
set.seed(1234)
kmax = 20
err = double(kmax)
for (ii in 1:kmax) {
  pk = knn.cv(iris[, -5], iris$Species, k = ii)  # does leave-one-out CV
  err[ii] = mean(pk != iris$Species)             # LOO-CV misclassification error
}
ggplot(data.frame(k = 1:kmax, error = err), aes(k, error)) +
  geom_point(color = 'red') +
  geom_line(color = 'red') +
  theme_cowplot()
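With `err` in hand, the CV-chosen $k$ is simply the one with the smallest error (a small sketch; this value is reused in the comparison below):

best_k = which.min(err)   # k with the lowest leave-one-out CV error
best_k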
Which method has the lowest classification error? Run the following code.
logit_iris = glm(Species ~ ., data = iris, family = 'binomial')
lda_iris = lda(Species ~ ., data = iris)
set.seed(1234)
kmax = 20
err = double(kmax)
for (ii in 1:kmax) {
  pk = knn.cv(iris[, -5], iris$Species, k = ii)  # does leave one out CV
  err[ii] = mean(pk != iris$Species)
}
# Training-set misclassification error for each method
tibble(
  logit = mean(as.integer(predict(logit_iris, type = "response") > 0.5) != as.integer(iris$Species) - 1),
  lda   = mean(predict(lda_iris)$class != iris$Species),
  knn   = mean(iris$Species != knn(iris[, -5], iris[, -5], k = which.min(err), cl = iris$Species))
)
You should observe that in this case, logistic regression has the lowest classification error.
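If you want to dig a little deeper than a single error rate, a confusion table for the logistic fit is a natural follow-up (a sketch, using the same 0.5 threshold as above):

# Confusion table for the logistic regression fit on the training data
pred_logit = ifelse(predict(logit_iris, type = "response") > 0.5,
                    'virginica', 'not virginica')
table(predicted = pred_logit, actual = iris$Species)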