library(learnr)
library(gradethis)

tutorial_options(exercise.timelimit = 5, exercise.checker = gradethis::grade_learnr)

knitr::opts_chunk$set(echo = FALSE)

data("iris")
iris$Species = as.factor(iris$Species == 'virginica')
levels(iris$Species) = c('not virginica','virginica')
library(GGally)
library(tidyverse)
library(MASS)
library(class)
library(cowplot)

The Curse of Dimensionality

When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.

Exercises

Answer the following questions from Exercise 4 in Introduction to Statistical Learning, chapter 4.

quiz(
  question("Suppose that we have a set of observations, each with measurements on $p = 1$ feature, `X`. We assume that `X` is uniformly distributed on $[0,1]$. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10 % of the range of `X` closest to that test observation. For instance, in order to predict the response for a test observation with `X = 0.6`, we will use observations in the range `[0.55,0.65]`. On average, what fraction of the available observations will we use to make the prediction?",
    answer("10%", correct=TRUE),
    answer("100%"),
    answer("1%"),
    answer("20%"),
    allow_retry = TRUE
  ),
   question("Now suppose that we have a set of observations, each with measurements on $p = 2$ features, `X1` and `X2`. We assume that `(X1,X2)` are uniformly distributed on $[0,1] \\times [0,1]$. We wish to predict a test observation’s response using only observations that are within 10 % of the range of `X1` _and_ within 10 % of the range of `X2` closest to that test observation. For instance, in order to predict the response for a test observation with `X1 = 0.6` and `X2 = 0.35`, we will use observations in the range `[0.55, 0.65]` for `X1` and in the range `[0.3, 0.4]` for `X2`. On average, what fraction of the available observations will we use to make the prediction?",
    answer("10%"),
    answer("100%"),
    answer("1%", correct=TRUE),
    answer("20%"),
    allow_retry = TRUE
  ),
  question("Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10 % of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?",
    answer("$0.1^{100} \\approx 0$%", correct=TRUE),
    answer(".01%"),
    answer(".1%"),
    answer("1%"),
    allow_retry = TRUE
  )
)

Take a minute to think about the following question before clicking to continue.

Using your answers to parts (1)–(3), argue that a drawback of KNN when $p$ is large is that there are very few training observations "near" any given test observation.

Solution: As $p$ increases, the fraction of training observations that are "near" any given test observation decreases exponentially (here it is $0.1^p$). As a result, the observations KNN ends up using as nearest neighbors are typically not actually close to, or similar to, the test observation.
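To see this numerically, the short simulation below (an illustration only, not one of the graded exercises; the sample size and the choice of a test point at the center of the unit cube are arbitrary) estimates the fraction of uniform observations that fall within 10% of each feature's range of the test point for $p = 1, 2,$ and $3$. The estimates track the theoretical fraction $0.1^p$, which is essentially zero by the time $p = 100$.

set.seed(1234)
n = 1e5
sapply(c(1, 2, 3), function(p) {
  x = matrix(runif(n * p), n, p)            # n observations, uniform on [0,1]^p
  # fraction falling within +/- 0.05 of the test point (0.5, ..., 0.5) in every feature
  mean(apply(abs(x - 0.5) <= 0.05, 1, all))
})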

Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10 % of the training observations. For $p = 1,2,$ and 100, what is the length of each side of the hypercube? Comment on your answer. Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When $p = 1$, a hypercube is simply a line segment, when $p = 2$ it is a square, and when $p = 100$ it is a 100-dimensional cube.

quiz(
  question("When p=1,",
    answer(".1", correct=TRUE),
    answer(".2"),
    answer(".3"),
    answer(".4"),
    allow_retry = TRUE
  ),
   question("When p=2,",
    answer("$\\sqrt{.1}$=0.316", correct=TRUE),
    answer("$\\sqrt{.2}$=0.447"),
    answer("$\\sqrt{.3}$=0.548"),
    answer("$\\sqrt{.4}$=0.632"),
    allow_retry = TRUE
  ),
  question("When p=100,",
    answer("$\\sqrt[100]{.1}$=0.977", correct=TRUE),
    answer("$\\sqrt[100]{.2}$=0.984"),
    answer("$\\sqrt[100]{.3}$=0.988"),
    answer("$\\sqrt[100]{.4}$=0.991"),
    allow_retry = TRUE
  )
)

Explanation: Because the features are uniformly distributed on $[0,1]$, the fraction of observations captured by the hypercube equals its volume. A $p$-dimensional hypercube with sides of length $l$ has volume $l^p$, so to capture 10% of the observations we need $l^p = 0.1$, i.e. $l = .1, \sqrt{.1},$ and $\sqrt[100]{.1}$ for $p = 1, 2,$ and $100$. Numerically, that is $l = 0.1$, $0.316$, and $0.977$.
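You can check these numbers in R (a quick illustration, not part of the graded exercises):

p = c(1, 2, 100)
signif(0.1^(1/p), 3)  # side length l with l^p = 0.1, i.e. capturing 10% of the volume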

Data analysis

For this analysis, we'll be using the `iris` dataset. This dataset has 5 variables: `Species`, `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width`. The aim of this analysis is to classify, or identify, the `Species` of each flower using the other four variables. Note that for this lab, `Species` has been recoded as a binary factor: virginica versus not virginica.

data("iris")
iris$Species = as.factor(iris$Species == 'virginica')
levels(iris$Species) = c('not virginica','virginica')
library(GGally)
library(tidyverse)
library(MASS)
library(class)
library(cowplot)
ggpairs(iris, aes(color=Species), columns=1:4) + theme_cowplot()

Estimate logistic regression and LDA using the iris data. Does logistic regression throw a warning? Why?

logit_iris = ____________________
coef(logit_iris)
logit_iris = coef(glm(Species~., data=iris, family = 'binomial'))
lda_iris = coef(lda(Species~., data=iris))

grade_result(
  pass_if(~ identical(.result, logit_iris), "Correct!"),
  fail_if(~ identical(.result, lda_iris), "Did you accidentally perform LDA instead?"),
  fail_if(~ TRUE, "Incorrect.")
)
lda_iris = ______________________
coef(lda_iris)
logit_iris = coef(glm(Species~., data=iris, family = 'binomial'))
lda_iris = coef(lda(Species~., data=iris))

grade_result(
  pass_if(~ identical(.result, lda_iris), "Correct!"),
  fail_if(~ identical(.result, logit_iris), "Did you accidentally perform Logistic Regression instead?"),
  fail_if(~ TRUE, "Incorrect.")
)

Solution: Yes, logistic regression throws a warning. The reason is that the two classes are perfectly separable, so the MLE is undefined: the likelihood can always be increased by making the coefficients larger, and the coefficients returned are just one of infinitely many possible solutions.
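For reference, a fill-in consistent with the checking code above would look like the following (the `coef()` calls just display the estimated coefficients):

logit_iris = glm(Species ~ ., data = iris, family = 'binomial')  # warns, since the classes are separable
coef(logit_iris)
lda_iris = lda(Species ~ ., data = iris)  # lda() comes from MASS
coef(lda_iris)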

Estimate KNN

Estimate KNN for a range of $k$, and choose the best $k$ using cross-validation as in the lecture.

set.seed(1234)
kmax = 20
err = double(kmax)
for(ii in 1:kmax){
  pk = ___________________ # does leave one out CV
  err[ii] = _________________
}
ggplot(data.frame(k=1:kmax,error=err), aes(k,error)) + geom_point(color='red') +
  geom_line(color='red') + theme_cowplot()

If your solution is correct, your plot should look something like this:

set.seed(1234)
kmax = 20
err = double(kmax)
for(ii in 1:kmax){
  pk = knn.cv(iris[,-5],iris$Species, k=ii) # does leave one out CV
  err[ii] = mean(pk != iris$Species)
}
ggplot(data.frame(k=1:kmax,error=err), aes(k,error)) + geom_point(color='red') +
  geom_line(color='red') + theme_cowplot()

A possible solution would look like:

set.seed(1234)
kmax = 20
err = double(kmax)
for(ii in 1:kmax){
  pk = knn.cv(iris[,-5],iris$Species, k=ii) # does leave one out CV
  err[ii] = mean(pk != iris$Species)
}
ggplot(data.frame(k=1:kmax,error=err), aes(k,error)) + geom_point(color='red') +
  geom_line(color='red') + theme_cowplot()

Classification Error

Which method has the lowest classification error? Run the following code.

logit_iris = glm(Species~., data=iris, family = 'binomial')
lda_iris = lda(Species~., data=iris)
set.seed(1234)
kmax = 20
err = double(kmax)
for(ii in 1:kmax){
  pk = knn.cv(iris[,-5],iris$Species, k=ii) # does leave one out CV
  err[ii] = mean(pk != iris$Species)
}
tibble(
  # training error for logistic regression: threshold the fitted probabilities at 0.5
  logit = mean(as.integer(predict(logit_iris, type="response") > 0.5) !=
                 as.integer(iris$Species)-1),
  # training error for LDA
  lda = mean(predict(lda_iris)$class != iris$Species),
  # training error for KNN at the CV-selected k (fit and evaluated on the full data)
  knn = mean(iris$Species != knn(iris[,-5], iris[,-5], k=which.min(err), cl=iris$Species))
)

You should observe that in this case, logistic regression has the lowest classification error.


