```r
library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
library(rstanarm)
library(nnet)
library(tidymodels)
library(skimr)

knitr::opts_chunk$set(echo = FALSE, message = FALSE)

# These models take a while to build.

options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local")
```

## Confirm Correct Package Version

Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."

```r
packageVersion('PPBDS.data')
```

The answer should be ‘0.3.2.9012’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:

```r
remove.packages('PPBDS.data')
library(remotes)
remotes::install_github('davidkane9/PPBDS.data')
```

Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.

## Name

``` {r name, echo=FALSE}
question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
```

## Email

``` {r email, echo=FALSE}
question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
```

## EDA cces
###

<!-- DK: Better to build these pipes by just spitting them out until the end and then, last step, put the assignment at the front. -->

`cces` is a data set included in the **PPBDS.data** package. Let's start by familiarizing ourselves with this data set.


### Exercise 1

Let's first run `glimpse()` to see all the different variables in the `cces` dataset. 

```r
glimpse(cces)
```
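`glimpse()` shows every variable and its type. If you also want summary statistics, the **skimr** package (loaded in the setup code above) provides `skim()`. This is just an aside of ours, not part of the exercises:

```r
# skim() reports missingness and summary statistics for every variable.
skim(cces)
```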

### Exercise 2

As you can see, there are lots of observations! For our purposes, we are going to narrow our data.

Let's create an object named `ch_11`. To do so, start with `cces`. Use `%>%` to add `select()` to select `year`, `gender`, `race`, `approval`, and `age`.

```r
ch_11 <- 
  cces %>%
  select(...)
```

### Exercise 3

Extend the pipe using `filter()`. Set `year` equal to `2016`. Also, use `drop_na()` to get rid of any rows with missing data.

```r
ch_11 <-
  cces %>%
  select(year, gender, race, age, approval) %>%
  filter(year == ...) %>% 
  drop_na()
```

### Exercise 4

`ch_11` is still too big. So, let's add `sample_n(1000)` to the end of the pipe so that we only keep 1,000 observations. Let's also use `set.seed(9)` at the top so that we all get the same sample. Note that `set.seed()` is never part of a pipe. It is a stand-alone line of code.

```r
set.seed(9)
ch_11 <-
  cces %>%
  select(year, gender, race, age, approval) %>%
  filter(year == ...) %>% 
  drop_na() %>% 
  sample_n(1000)
```
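To see why the seed matters, here is a small illustration of our own (not part of the exercise): drawing twice with the same seed returns the identical sample.

```r
# Same seed, same sample: these two draws are identical.
set.seed(9)
first_draw <- sample_n(cces, 5)

set.seed(9)
second_draw <- sample_n(cces, 5)

identical(first_draw, second_draw)  # TRUE
```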

## Model using "stan"

Let's create the following graph, which compares the predicted values of age with the true values of age.

Note: In this section, we will be using the engine `"stan"` for our model.

```r
# Objects needed throughout this section.

set.seed(9)
ch_11 <-
  cces %>%
  select(year, gender, race, age, approval) %>%
  filter(year == 2016) %>% 
  drop_na() %>% 
  sample_n(1000) 

set.seed(10)
cces_split <- initial_split(ch_11, prop = 0.80)
cces_train <- training(cces_split)
cces_test  <- testing(cces_split)
cces_folds <- vfold_cv(cces_train, v = 10)

# Wanted to have state in the regression, but that led to a lack of
# convergence. Could not figure out how to pass the `iter` argument.

# Also, I used drop_na() above. Did we really need to do this? I thought not.

stan_wfl <- 
  workflow() %>% 
  add_model(linear_reg() %>% 
              set_engine("stan") %>% 
              set_mode("regression")) %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = cces_train) %>% 
             step_dummy(all_nominal())) 

stan_p <- stan_wfl %>% 
  fit(data = cces_train) %>% 
  predict(new_data = cces_train) %>% 
  bind_cols(cces_train %>% select(age)) %>% 
  ggplot(aes(y = age, x = .pred)) + 
    geom_jitter(height = 0.2, alpha = 0.1) + 
    labs(title = "Predicted vs Truth",
         subtitle = "Comparing the predicted values of age with the true values",  
         x = "Predicted Age", 
         y = "True Age") 

stan_p
```

### Exercise 1

First, use `set.seed()` with the argument `10` to make sure our results can be reproduced later.

```r
set.seed(10)
```

### Exercise 2

Create an object named `cces_split`. To do so, use `initial_split()`. The first argument should be `ch_11`. Also set `prop` to `0.80`.

```r
cces_split <- initial_split(ch_11, prop = ...)
```

### Exercise 3

Now create an object named `cces_train`. To do so, use `training()` with the argument `cces_split`. Also, create a new object named `cces_test`. To do so, use `testing()` with the argument `cces_split`.

```r
cces_train <- training(...)
cces_test  <- testing(...)
```

### Exercise 4

Let's now create cross-validation data sets from the training data. Create the object `cces_folds`. To do so, use `vfold_cv()`. The first argument should be `cces_train`. Also set `v` to `10`.

```r
cces_folds <- 
  vfold_cv(cces_train, v = ...)
```
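Each fold holds out a different tenth of the training data. If you are curious, the **rsample** functions `analysis()` and `assessment()` extract the two pieces of any one split. A quick sketch (the `first_split` name is our own):

```r
# One of the ten resamples: about 9/10 of cces_train is used for fitting,
# the remaining 1/10 for assessment.
first_split <- cces_folds$splits[[1]]
dim(analysis(first_split))
dim(assessment(first_split))
```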

### Exercise 5

Great. Now let's start a model pipeline with `workflow()` and then add a model object. To do so, use `add_model()`. Inside `add_model()`, use `linear_reg()`. Then, use `%>%` to add `set_engine()` with the argument `"stan"`. Extend the pipe one more time to add `set_mode()`. Its argument should be `"regression"`.

```r
workflow() %>%
  add_model(linear_reg() %>% 
              set_engine(...) %>% 
              set_mode(...))
```

### Exercise 6

Copy and paste your work from above. Extend the code using `%>%` to add `add_recipe()`. Inside `add_recipe()`, use `recipe()`. The first argument to `recipe()` should be the formula `age ~ approval + gender + race`. Also set `data` to `cces_train`.

```r
...  %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = ...))
```

### Exercise 7

Use `%>%` to add `step_dummy()` with the argument `all_nominal()` within the call to `add_recipe()`.

```r
...  %>%
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = cces_train) %>% 
             step_dummy(all_nominal()))
```

### Exercise 8

Copy and paste your work from above and assign it to an object named `stan_wfl`.

```r
stan_wfl <- ...
```

### Exercise 9

Let's plot the data. Start with `stan_wfl` and use `%>%` to add `fit()` and set `data` to `cces_train`. Also add `predict()` and set `new_data` to `cces_train`.

```r
stan_wfl %>% 
  fit(data = ...) %>% 
  predict(new_data = ...)
```

### Exercise 10

Extend the pipe again to use `bind_cols()`. Within its argument, pipe `cces_train` to `select(age)` in order to grab the true value of each age. Also use `%>%` to add `ggplot()`. Within `ggplot()`, map `age` to the y-axis and `.pred` to the x-axis.

```r
stan_wfl %>% 
  ... %>% 
  ... %>% 
  bind_cols(cces_train %>% select(...)) %>% 
  ggplot(aes(x = ..., y = ...)) 
```

### Exercise 11

Great. Now add the layer `geom_jitter()` and set `alpha` to `0.1` and `height` to `0.2`. Title your graph "Predicted vs Truth". Your subtitle should be "Comparing the predicted values of age with the true values". Also label your y-axis "True Age" and your x-axis "Predicted Age".

Reminder: This is what our graph looks like:

```r
stan_p
```

Your graph does not need to look exactly like ours.


## Model using "nnet"

In this section, we will create a neural network model. We will try to solve the same problem as before: can we predict someone's age given information about their approval, gender, and race?

The following graph compares the predicted values for age, using our neural network, with the true values for age.

```r
# Key object which users will need at various points.

nnet_wfl <- workflow() %>%
  add_model(mlp(hidden_units = 5) %>%
              set_engine("nnet") %>%
              set_mode("regression")) %>%
  add_recipe(recipe(age ~ approval + gender + race,
                    data = cces_train) %>%
             step_dummy(all_nominal())) 

nnet_p <- nnet_wfl %>%
  fit(data = cces_train) %>%
  predict(new_data = cces_train) %>%
  bind_cols(cces_train %>% select(age)) %>% 
  ggplot(aes(y = age, x = .pred)) +
    geom_jitter(alpha = 0.1, height = 0.2) +
    labs(title = "Predicted vs Truth",
         subtitle = "Comparing the predicted values of age with the true values", 
         x = "Predicted Age",
         y = "True Age") 

nnet_p
```

Note: We are going to jump straight into creating our workflow pipeline. We have already split our data and created a cross-validation data set in the previous section.

### Exercise 1

Let's start a model pipeline with `workflow()` and then add a model object. To do so, use `add_model()`. Inside `add_model()`, use `mlp()` and set `hidden_units` to `5`. Then, use `%>%` to add `set_engine()` with the argument `"nnet"`. Extend the pipe one more time to add `set_mode()`. Its argument should be `"regression"`.

```r
workflow() %>%
  add_model(mlp(hidden_units = ...) %>%
              set_engine("...") %>%
              set_mode("..."))
```

### Exercise 2

Copy and paste your work from above. Extend the code using `%>%` to add `add_recipe()`. Inside `add_recipe()`, use `recipe()`. The first argument to `recipe()` should be the formula `age ~ approval + gender + race`. Also set `data` to `cces_train`.

```r
...  %>%
  add_recipe(recipe(age ~ approval + gender + race,
                    data = ...))
```

### Exercise 3

Extend the pipe by using `%>%` to add `step_dummy()` with the argument `all_nominal()` within the call to `add_recipe()`.

```r
...  %>%
  add_recipe(recipe(age ~ approval + gender + race,
                    data = cces_train) %>%
             step_dummy(all_nominal()))
```

### Exercise 4

Awesome. Now let's assign this pipe to a single workflow object named `nnet_wfl`.

This process --- in which we create a named workflow object and then use it as the first step in subsequent analyses --- is very common.

```r
nnet_wfl <- ...
```

### Exercise 5

Let's plot the data. Start with `nnet_wfl` and use `%>%` to add `fit()`, setting the `data` argument to `cces_train`. Also add `predict()` and set `new_data` to `cces_train`.

```r
nnet_wfl %>%
  fit(data = ...) %>%
  predict(new_data = ...)
```

### Exercise 6

Extend the pipe again to use `bind_cols()`. Within its argument, pipe `cces_train` to `select(age)` in order to grab the true value of each age.

This is another common structure. It is very handy to have the true values and the predicted values lined up next to each other in a tibble.

```r
nnet_wfl %>%
  ... %>%
  ... %>%
  bind_cols(cces_train %>% select(age))
```
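Once the truth and the predictions sit in the same tibble, follow-up calculations are easy. For example, here is a sketch that computes residuals (the `resid` column name is our own):

```r
# Residual = true age minus predicted age; large values flag poor fits.
nnet_wfl %>%
  fit(data = cces_train) %>%
  predict(new_data = cces_train) %>%
  bind_cols(cces_train %>% select(age)) %>%
  mutate(resid = age - .pred)
```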

### Exercise 7

Next, use `%>%` to add `ggplot()`. Within `ggplot()`, map `age` to the y-axis and `.pred` to the x-axis. Then use `geom_jitter()` and set `alpha` to `0.1` and `height` to `0.2`. Title your graph "Predicted vs Truth". Your subtitle should be "Comparing the predicted values of age with the true values". Also label your y-axis "True Age" and your x-axis "Predicted Age".

Reminder: This is what our graph looks like:

```r
nnet_p
```

Your graph does not need to look exactly like ours. Indeed, because of the inherent randomness in the model-fitting algorithm, your plot will certainly look at least a bit different.

```r
nnet_wfl %>%
  fit(data = cces_train) %>%
  predict(new_data = cces_train) %>%
  bind_cols(cces_train %>% select(age)) %>% 
  ggplot(aes(y = ..., x = ...)) +
    geom_jitter(alpha = ..., height = ...) +
    labs(title = "Predicted vs Truth",
         subtitle = "Comparing the predicted values of age with the true values", 
         x = "Predicted Age",
         y = ...) 
```

## Model Choice

In this section, we will use cross-validation to evaluate the two models we created in the previous sections. Cross-validation is the most important tool for deciding which model we should choose.

Let's start by looking at the first model we created using the `"stan"` engine.

### Exercise 1

Start with your workflow object `stan_wfl`. Using `%>%`, add `fit_resamples()` and set the `resamples` argument to `cces_folds`, which we have already created. Extend the pipe again and add `collect_metrics()`. This might take up to 30 seconds to run.

```r
stan_wfl %>%
  fit_resamples(resamples = ...) %>%
  collect_metrics()
```

Awesome. Let's now look at the second model we created using the `"nnet"` engine.

### Exercise 2

Start with your workflow object `nnet_wfl`. Using `%>%`, add `fit_resamples()` and set the `resamples` argument to `cces_folds`. Extend the pipe again and add `collect_metrics()`.

```r
nnet_wfl %>%
  fit_resamples(resamples = ...) %>%
  collect_metrics()
```

This is the real power of tidymodels. In just a few lines of code, you are able to complete a thorough cross-validation exercise.
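If you would like the two sets of metrics side by side, one approach (our own arrangement, not required by the exercises) is to label and stack them:

```r
# Stack the cross-validated metrics from both workflows for comparison.
bind_rows(
  stan = stan_wfl %>% fit_resamples(resamples = cces_folds) %>% collect_metrics(),
  nnet = nnet_wfl %>% fit_resamples(resamples = cces_folds) %>% collect_metrics(),
  .id = "model"
)
```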

### Exercise 3

Given the performance you have seen from both models, write two sentences about which model you would select and why.

``` {r model-choice-written, echo=FALSE}
question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
```

###


When choosing between models, we use the RMSE values to guide our decision. We generally prefer the model with the lower RMSE value. Therefore, we choose the neural network model. Note, however, that these two models have *very* similar average RMSEs, taken over the 10 cross-validation data sets. (Note how large the standard errors are compared to the difference in RMSE.) 
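As a reminder of what RMSE measures, you could compute it by hand from any tibble that holds truth and predictions side by side, here called `results` (a hypothetical name for the output of the `bind_cols()` pipes above):

```r
# RMSE: square the errors, average them, then take the square root.
results %>%
  summarize(rmse = sqrt(mean((age - .pred)^2)))
```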

In any event, let's now see how well the nnet model does on the test data.


### Exercise 4

Start with `nnet_wfl` and use `%>%` to add `fit()`. Set `data` to `cces_train`. Extend the pipe to use `predict()` and set `new_data` to `cces_test`. Extend the pipe again to use `bind_cols()`. Within its argument, pipe `cces_test` to `select(age)` in order to grab the true value of each age.


```r
nnet_wfl %>%
  fit(data = ...) %>%
  predict(new_data = ...) %>%
  bind_cols(cces_test %>% select(age))
```

### Exercise 5

Extend the pipe using `%>%` to add `metrics()`. The first argument to `metrics()` should set `truth` to `age`. The second argument should set `estimate` to `.pred`.

```r
nnet_wfl %>%
  ... %>%
  ... %>%
  ... %>% 
  metrics(truth = age, estimate = .pred)
```

Trying our final model on the test data is usually the last step in the process. The resulting RMSE is our best guess as to the RMSE we will see on brand-new data.

Happy Data Sciencing!

## Submit

```r
submission_ui
submission_server()
```

