```{r setup, include = FALSE}
library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
library(rstanarm)
library(nnet)
library(tidymodels)
library(skimr)
knitr::opts_chunk$set(echo = FALSE, message = FALSE)

# These models take a while to build.
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local")
```
Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."
```r
packageVersion('PPBDS.data')
```
The answer should be ‘0.3.2.9012’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:
```r
remove.packages('PPBDS.data')
library(remotes)
remotes::install_github('davidkane9/PPBDS.data')
```
Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.
```{r name, echo=FALSE}
question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE)
```
```{r email, echo=FALSE}
question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE)
```
## EDA cces

###

<!-- DK: Better to build these pipes by just spitting them out until the end and then, last step, put the assignment at the front. -->

`cces` is a data set which is part of the **PPBDS.data** package. Let's start by familiarizing ourselves with this data set.

### Exercise 1

Let's first run `glimpse()` to see all the different variables in the `cces` data set.

```r
glimpse(cces)
```
As you can see, there are lots of observations! For our purposes, we are going to narrow our data.

### Exercise 2

Let's create an object named `ch_11`. To do so, start with `cces`. Use `%>%` to add `select()` to select `year`, `gender`, `race`, `approval`, and `age`.
```r
ch_11 <- cces %>% 
  select(...)
```
### Exercise 3

Extend the pipe using `filter()`. Set `year` equal to `2016`. Also, use `drop_na()` to get rid of any rows with missing data.
```r
ch_11 <- cces %>% 
  select(year, gender, race, age, approval) %>% 
  filter(year == ...) %>% 
  drop_na()
```
### Exercise 4

`ch_11` is still too big. So, let's add `sample_n(1000)` to the end of the pipe so that we only keep 1,000 observations. Let's also use `set.seed(9)` at the top so that we all get the same sample. Note that `set.seed()` is never part of a pipe. It is a stand-alone line of code.
```r
set.seed(9)
ch_11 <- cces %>% 
  select(year, gender, race, age, approval) %>% 
  filter(year == ...) %>% 
  drop_na() %>% 
  sample_n(1000)
```
## Stan Model

Let's create the following graph. The graph compares the predicted values for `age` with the true values for `age`.
Note: In this section, we will be using the `"stan"` engine for our model.
```r
# Objects needed throughout the section.
set.seed(9)
ch_11 <- cces %>% 
  select(year, gender, race, age, approval) %>% 
  filter(year == 2016) %>% 
  drop_na() %>% 
  sample_n(1000)

set.seed(10)
cces_split <- initial_split(ch_11, prop = .80)
cces_train <- training(cces_split)
cces_test <- testing(cces_split)
cces_folds <- vfold_cv(cces_train, v = 10)

# Wanted to have state in the regression but that led to lack of convergence.
# Could not figure out how to pass the `iter` argument.
# Also, I used drop_na() above. Did we really need to do this? I thought not.
stan_wfl <- workflow() %>% 
  add_model(linear_reg() %>% 
              set_engine("stan") %>% 
              set_mode("regression")) %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = cces_train) %>% 
               step_dummy(all_nominal()))
```
```r
stan_p <- stan_wfl %>% 
  fit(data = cces_train) %>% 
  predict(new_data = cces_train) %>% 
  bind_cols(cces_train %>% select(age)) %>% 
  ggplot(aes(y = age, x = `.pred`)) + 
    geom_jitter(height = 0.2, alpha = 0.1) + 
    labs(title = "Predicted vs Truth",
         subtitle = "Comparing the predicted values of age with the true values",
         x = "Predicted Age", 
         y = "True Age")

stan_p
```
### Exercise 1

First use `set.seed()` with the argument `10` to make sure our results can be reproduced later.

```r
set.seed(10)
```
### Exercise 2

Create an object named `cces_split`. To do so, use `initial_split()`. The first argument should be `ch_11`. Also set `prop` to `.80`.

```r
cces_split <- initial_split(ch_11, prop = ...)
```
### Exercise 3

Now create an object named `cces_train`. To do so, use `training()` with the argument `cces_split`. Also, create a new object named `cces_test`. To do so, use `testing()` with the argument `cces_split`.

```r
cces_train <- training(...)
cces_test <- testing(...)
```
### Exercise 4

Let's now create cross-validation data sets from the training data. Create the object `cces_folds`. To do so, use `vfold_cv()`. The first argument should be `cces_train`. Also set `v` to `10`.

```r
cces_folds <- vfold_cv(cces_train, v = ...)
```
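If you are curious about what `cces_folds` actually contains, here is a minimal sketch (not part of the exercise) using the `analysis()` and `assessment()` helpers from the **rsample** package:

```r
# Each element of cces_folds$splits holds one train/holdout split.
# analysis() returns the rows used for fitting; assessment() returns
# the held-out rows used to evaluate that fit.
first_fold <- cces_folds$splits[[1]]
analysis(first_fold)    # roughly 9/10 of cces_train
assessment(first_fold)  # the remaining 1/10
```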
### Exercise 5

Great. Now let's start a model pipeline with `workflow()` and then add a model object. To do so, use `add_model()`. Inside `add_model()`, use `linear_reg()`. Then, use `%>%` to add `set_engine()` with the argument `"stan"`. Extend the pipe one more time to add `set_mode()`. Its argument should be `"regression"`.

```r
workflow() %>% 
  add_model(linear_reg() %>% 
              set_engine(...) %>% 
              set_mode(...))
```
### Exercise 6

Copy and paste your work from above. Extend the code using `%>%` to add `add_recipe()`. Inside `add_recipe()`, use `recipe()`. The first argument to `recipe()` should be the formula `age ~ approval + gender + race`. Also set `data` to `cces_train`.

```r
... %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = ...))
```
### Exercise 7

Use `%>%` to add `step_dummy()` with the argument `all_nominal()` within the call to `add_recipe()`.

```r
... %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = cces_train) %>% 
               step_dummy(all_nominal()))
```
### Exercise 8

Copy and paste your work from above and assign it to an object named `stan_wfl`.

```r
stan_wfl <- ...
```
### Exercise 9

Let's plot the data. Start with `stan_wfl` and use `%>%` to add `fit()`, setting `data` to `cces_train`. Also add `predict()` and set `new_data` to `cces_train`.

```r
stan_wfl %>% 
  fit(data = ...) %>% 
  predict(new_data = ...)
```
### Exercise 10

Extend the pipe again to use `bind_cols()`. Within its argument, pipe `cces_train` to `select(age)` in order to grab the true value of each age. Also use `%>%` to add `ggplot()`. Using `ggplot()`, map `age` to the y-axis and `.pred` to the x-axis.

```r
stan_wfl %>% 
  ... %>% 
  ... %>% 
  bind_cols(cces_train %>% select(...)) %>% 
  ggplot(aes(x = ..., y = ...))
```
### Exercise 11

Great. Now add the layer `geom_jitter()` and set `alpha` to `0.1` and `height` to `0.2`. Title your graph "Predicted vs Truth". Your subtitle should be "Comparing the predicted values of age with the true values". Also label your y-axis "True Age" and your x-axis "Predicted Age".

Reminder: This is what our graph looks like:

```r
stan_p
```

Your graph does not need to look exactly like ours.
## Neural Network Model

In this section, we will create a neural network model. We will try to solve the same problem as before: Can we predict someone's age given information about their approval, gender, and race?

The following graph compares the predicted values for `age`, using our neural network, with the true values for `age`.
```r
# Key object which users will need at various points.
nnet_wfl <- workflow() %>% 
  add_model(mlp(hidden_units = 5) %>% 
              set_engine("nnet") %>% 
              set_mode("regression")) %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = cces_train) %>% 
               step_dummy(all_nominal()))
```
```r
nnet_p <- nnet_wfl %>% 
  fit(data = cces_train) %>% 
  predict(new_data = cces_train) %>% 
  bind_cols(cces_train %>% select(age)) %>% 
  ggplot(aes(y = age, x = `.pred`)) + 
    geom_jitter(alpha = 0.1, height = 0.2) + 
    labs(title = "Predicted vs Truth",
         subtitle = "Comparing the predicted values of age with the true values",
         x = "Predicted Age", 
         y = "True Age")

nnet_p
```
Note: We are going to jump straight into creating our workflow pipeline. We have already split our data and created a cross-validation data set in the previous section.
### Exercise 1

Let's start a model pipeline with `workflow()` and then add a model object. To do so, use `add_model()`. Inside `add_model()`, use `mlp()` and set `hidden_units` to `5`. Then, use `%>%` to add `set_engine()` with the argument `"nnet"`. Extend the pipe one more time to add `set_mode()`. Its argument should be `"regression"`.

```r
workflow() %>% 
  add_model(mlp(hidden_units = ...) %>% 
              set_engine("...") %>% 
              set_mode("..."))
```
### Exercise 2

Copy and paste your work from above. Extend the code using `%>%` to add `add_recipe()`. Inside `add_recipe()`, use `recipe()`. The first argument to `recipe()` should be the formula `age ~ approval + gender + race`. Also set `data` to `cces_train`.

```r
... %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = ...))
```
### Exercise 3

Extend the pipe by using `%>%` to add `step_dummy()` with the argument `all_nominal()` within the call to `add_recipe()`.

```r
... %>% 
  add_recipe(recipe(age ~ approval + gender + race, 
                    data = cces_train) %>% 
               step_dummy(all_nominal()))
```
### Exercise 4

Awesome. Now let's assign this pipe to a single workflow object named `nnet_wfl`.

This process --- in which we create a named workflow object and then use it as the first step in subsequent analyses --- is very common.

```r
nnet_wfl <- ...
```
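As a sketch of why this pattern is handy (assuming `nnet_wfl` has been assigned as above), the same named object can feed several later steps without being rebuilt:

```r
# One named workflow, reused in different analyses. These calls are
# illustrative; the exercises below and in the next section build
# them up step by step.
nnet_wfl %>% fit(data = cces_train)                 # fit on the training data
nnet_wfl %>% fit_resamples(resamples = cces_folds)  # cross-validate the same workflow
```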
### Exercise 5

Let's plot the data. Start with `nnet_wfl` and use `%>%` to add `fit()`, setting the `data` argument to `cces_train`. Also add `predict()` and set `new_data` to `cces_train`.

```r
nnet_wfl %>% 
  fit(data = ...) %>% 
  predict(new_data = ...)
```
### Exercise 6

Extend the pipe again to use `bind_cols()`. Within its argument, pipe `cces_train` to `select(age)` in order to grab the true value of each age.

This is another common structure. It is very handy to have the true values and the predicted values lined up next to each other in a tibble.

```r
nnet_wfl %>% 
  ... %>% 
  ... %>% 
  bind_cols(cces_train %>% select(age))
```
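Once the truth and the prediction sit side by side, follow-up calculations are easy. For example, here is a sketch (not part of the exercise) that adds a residual column:

```r
# With age and .pred in the same tibble, residuals are a one-line mutate().
nnet_wfl %>% 
  fit(data = cces_train) %>% 
  predict(new_data = cces_train) %>% 
  bind_cols(cces_train %>% select(age)) %>% 
  mutate(residual = age - .pred)
```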
### Exercise 7

Next, use `%>%` to add `ggplot()`. Using `ggplot()`, map `age` to the y-axis and `.pred` to the x-axis. Then use `geom_jitter()` and set `alpha` to `0.1` and `height` to `0.2`. Title your graph "Predicted vs Truth". Your subtitle should be "Comparing the predicted values of age with the true values". Also label your y-axis "True Age" and your x-axis "Predicted Age".

Reminder: This is what our graph looks like:

```r
nnet_p
```

Your graph does not need to look exactly like ours. Indeed, because of the inherent randomness in the model fitting algorithm, your plot will certainly look at least a bit different.
```r
nnet_wfl %>% 
  fit(data = cces_train) %>% 
  predict(new_data = cces_train) %>% 
  bind_cols(cces_train %>% select(age)) %>% 
  ggplot(aes(y = ..., x = ...)) + 
    geom_jitter(alpha = ..., height = ...) + 
    labs(title = "Predicted vs Truth",
         subtitle = "Comparing the predicted values of age with the true values",
         x = "Predicted Age", 
         y = ...)
```
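If you want the same fitted network on every run, one option (our suggestion; the exercise does not require it) is to set a seed immediately before fitting:

```r
# nnet starts from random initial weights, so each fit differs slightly.
# Setting a seed right before fit() makes the result reproducible.
# The value 27 is arbitrary.
set.seed(27)
nnet_wfl %>% fit(data = cces_train)
```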
## Model Choice

In this section, we will use cross-validation to evaluate the two models we created in the previous sections. Cross-validation is the most important tool for deciding which model we should choose.

Let's start by looking at the first model, which we created using the `"stan"` engine.
### Exercise 1

Start with your workflow object `stan_wfl`. Using `%>%`, add `fit_resamples()` and set the `resamples` argument to `cces_folds`, which we have already created. Extend the pipe again and add `collect_metrics()`. This might take up to 30 seconds to run.

```r
stan_wfl %>% 
  fit_resamples(resamples = ...) %>% 
  collect_metrics()
```
Awesome. Let's now look at the second model, which we created using the `"nnet"` engine.

### Exercise 2

Start with your workflow object `nnet_wfl`. Using `%>%`, add `fit_resamples()` and set the `resamples` argument to `cces_folds`. Extend the pipe again and add `collect_metrics()`.

```r
nnet_wfl %>% 
  fit_resamples(resamples = ...) %>% 
  collect_metrics()
```
This is the real power of tidymodels. In just a few lines of code, you are able to complete a thorough cross-validation exercise.
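If you would like the two sets of metrics in a single tibble for easier comparison, here is a sketch (our own construction, not one of the exercises):

```r
# Stack each model's cross-validation metrics and label the source model.
bind_rows(
  stan = stan_wfl %>% fit_resamples(resamples = cces_folds) %>% collect_metrics(),
  nnet = nnet_wfl %>% fit_resamples(resamples = cces_folds) %>% collect_metrics(),
  .id = "model")
```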
### Exercise 3

Given the performance you have seen from both models, write two sentences about which model you would select and why.
```{r model-choice-written, echo=FALSE}
question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE)
```
###

During the decision process for choosing a model, we use the RMSE values to guide our decision. We generally prefer the model with the lower RMSE value. Therefore, we choose the neural network model. Note, however, that these two models have *very* similar RMSE averages, taken over the 10 cross-validation data sets. (Note how large the standard errors are, compared to the difference in RMSE.) In any event, let's now see how well the nnet model does on the test data.

### Exercise 4

Start with `nnet_wfl` and use `%>%` to add `fit()`. Set `data` to `cces_train`. Extend the pipe to use `predict()` and set `new_data` to `cces_test`. Extend the pipe again to use `bind_cols()`. Within its argument, pipe `cces_test` to `select(age)` in order to grab the true value of each age.

```r
nnet_wfl %>% 
  fit(data = ...) %>% 
  predict(new_data = ...) %>% 
  bind_cols(cces_test %>% select(age))
```
### Exercise 5

Extend the pipe using `%>%` to add `metrics()`. The first argument to `metrics()` should set `truth` to `age`. The second argument should set `estimate` to `.pred`.

```r
nnet_wfl %>% 
  ... %>% 
  ... %>% 
  ... %>% 
  metrics(truth = age, estimate = .pred)
```
Trying our final model on the test data is usually the last step in the process. The resulting RMSE is our guess as to the RMSE we will see on brand new data.
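If you only care about the RMSE row of that output, here is a small sketch (reusing the pipe from the previous exercise):

```r
# metrics() reports rmse, rsq, and mae for a regression model.
# filter() keeps just the rmse row.
nnet_wfl %>% 
  fit(data = cces_train) %>% 
  predict(new_data = cces_test) %>% 
  bind_cols(cces_test %>% select(age)) %>% 
  metrics(truth = age, estimate = .pred) %>% 
  filter(.metric == "rmse")
```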
Happy Data Sciencing!
```r
submission_ui
```

```{r context = "server"}
submission_server()
```