library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, fig.align = 'center',
               fig.width = 10, fig.height = 6, cache = TRUE, autodep = TRUE)

Instructions

  1. Rename this document with your student ID (not the 10-digit number; use your IU username, e.g. dajmcdon). Include your buddy in the author field if you are working together.
  2. Follow the instructions in each section.

Trees and leaves

The leaf dataset contains data derived from images of 40 different species of leaves. Examine the file LeafDescription.pdf to see some example images and detailed descriptions of the different covariates. Use this description and the data to perform the analysis. Note: the included .csv file only has species 1-36.

Analysis

  1. Load the data. Note: only the "simple" leaves are included. You may need to rename the columns with:
names(leaf) = c('Species','Specimen_Number','Eccentricity','Aspect_Ratio',
                'Elongation','Solidity','Stochastic_Convexity','Isoperimetric_Factor',
                'Maximal_Indent_Depth','Lobedness','Average_Intensity',
                'Average_Contrast','Smoothness','Third_moment','Uniformity','Entropy')
  2. Create a new factor called lobes that collapses the leaf species into two groups: Species 5-7, 11, 15, 23, and 30 (many) versus the rest (one).
  3. Produce a pairs plot of all continuous predictors. Color the points by lobes.
  4. Train a tree classifier on this data to predict lobes. Use all predictors (not Species or Specimen_Number, obviously). Prune your tree using cross-validation to choose the depth and plot the tree (see Slide 13 from this week). Produce a confusion matrix and find the tree's in-sample error rate.
  5. Train a random forest using 400 trees. Produce a variable importance plot and a confusion matrix, and find the in-sample error rate.

Load data and pairs plot

library(tidyverse)
library(GGally)

# The file has no header row, so read it raw and supply the column names
leaf = read.csv('leaf.csv', header = FALSE)
names(leaf) = c('Species','Specimen_Number','Eccentricity','Aspect_Ratio',
                'Elongation','Solidity','Stochastic_Convexity','Isoperimetric_Factor',
                'Maximal_Indent_Depth','Lobedness','Average_Intensity',
                'Average_Contrast','Smoothness','Third_moment','Uniformity','Entropy')
# Species %in% c(...) is FALSE/TRUE, so the labels map FALSE -> 'one', TRUE -> 'many'
leaf = leaf %>% mutate(lobes = factor(Species %in% c(5:7, 11, 15, 23, 30),
                                      labels = c('one','many')))
ggpairs(leaf, aes(color = lobes), columns = 3:16)
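
As a quick sanity check (optional, not part of the assignment), it may be worth confirming the species coverage and how unbalanced the two lobes groups are before fitting anything:

# How unbalanced are the two groups, and are only species 1-36 present?
table(leaf$lobes)
range(leaf$Species)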

The tree

library(tree)
library(maptree)

# Grow the full classification tree, excluding the two ID columns
my_tree = tree(lobes ~ . - Species - Specimen_Number, data = leaf)
# 5-fold cross-validation over the cost-complexity pruning sequence
tree_cv = cv.tree(my_tree, K = 5)
# Prune at the penalty k that minimizes the cross-validated deviance
pruned_tree = prune.tree(my_tree, k = tree_cv$k[which.min(tree_cv$dev)])
draw.tree(pruned_tree)
# In-sample confusion matrix and misclassification rate
(tree_conf <- table(predict(pruned_tree, type = 'class'), leaf$lobes))
1 - sum(diag(tree_conf)) / sum(tree_conf)
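
To see why cross-validation settles on this tree, it can help to look at the whole deviance curve. A minimal sketch using the tree_cv object from above (base graphics; nothing beyond what cv.tree already returns):

# CV deviance as a function of tree size; the minimum is the size kept above
plot(tree_cv$size, tree_cv$dev, type = 'b',
     xlab = 'tree size (number of leaves)', ylab = 'CV deviance')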

The forest

library(randomForest)

# Random forest with 400 trees, again excluding the two ID columns
my_forest = randomForest(lobes ~ . - Species - Specimen_Number, data = leaf, ntree = 400)
varImpPlot(my_forest)
# With no newdata, predict() returns out-of-bag (OOB) class predictions
(forest_conf <- table(predict(my_forest), leaf$lobes))
1 - sum(diag(forest_conf)) / sum(forest_conf)
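
One caveat: because predict() on a randomForest object with no newdata returns out-of-bag predictions, the error above is really an OOB estimate rather than a literal in-sample one. If the in-sample (training) error is wanted, a minimal sketch, simply re-predicting on the training data:

# True in-sample error: predict on the training data itself
in_sample_pred <- predict(my_forest, newdata = leaf)
(train_conf <- table(in_sample_pred, leaf$lobes))
1 - sum(diag(train_conf)) / sum(train_conf)  # often (near) zero for a forest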

