In DataScienceSalon/mdb: Movie Rating Model and Predictor

Part 5: Modeling

load('../data/train.Rdata')
load('../data/test.Rdata')

The aim at this stage was to develop two prediction models: Model One, a multiregression model that predicts the log of IMDb votes, and Model Two, a simple linear regression model that predicts log box office revenue based upon the log of IMDb votes. The former was the best performing of four multiregression models, developed using both forward selection and backward elimination method selection methods. These four models and their model selection methods were:

r kfigr::figr(label = "models", prefix = TRUE, link = TRUE, type="Table"): Multiregression prediction models

models <- openxlsx::read.xlsx(xlsxFile = "../data/models.xlsx")
knitr::kable(models, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The remainder of this sections is organized as follows.

Model One: Multiregression Model
1.1. Model Selection Methods
1.2. Full Model
1.3. Model Alpha
1.4. Model Beta
1.5. Model Gamma
1.6. Model Delta
1.7. Model Comparison
1.8. Model Two: Final Multiregression Model
Model Two: Simple Linear Regression Model
2.1. Model Overview
2.2. Model Diagnostics
Model Summary

Model One: Multiple Linear Regression

Model One was the best performing of models Alpha, Beta, Gamma, and Delta. The following provides an overview of the model selection methods used, then each model is described and diagnosed vis-a-vis assumptions of linearity, homoscedasticity, normality of errors, multicollinearity, and the treatment of influential points.

Model Selection Methods

Both forward selection and backward elimination model selection techniques were used. The forward selection approach optimized adjusted r-squared; whereas the backward elimination method was based upon p-values.

Forward Selection

The forward selection process began with a null model then all variables were added to the model, one-by-one, and the model which provided the greatest improvement over the current best adjusted R-squared was selected. The process repeated with each variable that was not already in the model until all variables were analyzed. Only the models that improved adjusted r-squared were retained at each step.

Backward Elimination

The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process repeated, removing only the most least significant predictor at each step, until all predictors had p-values below the designated $\alpha = .05$ threshold.

Full Model Selection

The motivation for this analysis was to provide insight to inform the decisions studio executives must make at the early stages of a film project. That said, the variables selected for the full model were those that are "knowable" by studio executives before theatrical release. Consequently variables excluded from consideration as predictors consisted of:
variables not "knowable" at project inception. This would include film ratings, scores, academy awards, and box office results
variables that are redundant with other variables with higher correlations with the dependent variable
categorical variables with levels containing fewer than 5 observations, such as the actor, director and studio variables
dvd release dates as dvd sales are out of the scope of this analysis

As such, the full model is presented in r kfigr::figr(label = "fullModel", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "fullModel", prefix = TRUE, link = TRUE, type="Table"): Variables in Full Model

full <- openxlsx::read.xlsx("../data/features.xlsx")
full <- full  %>% filter(c == "yes" & Context == "Explanatory") %>% select(Type, Variable, Description) %>% arrange(Type, Variable)
knitr::kable(full, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The following sections explore various models, model selection techniques, and model diagnostics. Comparisons were conducted and the models were evaluated on test data for prediction accuracy and stability. Lastly, the best performing model is selected and described on detail.

Model Alpha

For this model, a forward selection procedure was undertaken based upon the full model described above. r kfigr::figr(label = "model_a_build", prefix = TRUE, link = TRUE, type="Table") lists the variables in the order in which they were added.

r kfigr::figr(label = "model_a_build", prefix = TRUE, link = TRUE, type="Table"): Model Alpha forward selection process

if (rerun == TRUE) {
  # Obtain Data
  mData <- process(train, stage = "m", y = "imdb_num_votes_log")  

  # Perform forward selection
  m <- forward(data = mData$full, y = "imdb_num_votes_log")

  # Conduct regression analysis
  modelA <- regressionAnalysis(mod = m, mName = "Model Alpha", yVar  = 'imdb_num_votes_log',
                                 yLab = "Log IMDb Votes")
  save(modelA, file = "../data/modelA.Rdata")
} else {
  load("../data/modelA.Rdata")
}

# Report regression steps
knitr::kable(modelA$build, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

As indicated in r kfigr::figr(label = "modelA_overview", prefix = TRUE, link = TRUE, type="Table") and graphically depicted in r kfigr::figr(label = "modelA_regression", prefix = TRUE, link = TRUE, type="Figure"), r modelA$comments$apa

r kfigr::figr(label = "modelA_overview", prefix = TRUE, link = TRUE, type="Table"): Model Alpha Summary Statistics

knitr::kable(modelA$glance, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

modelA$plots$regression

r kfigr::figr(label = "modelA_regression", prefix = TRUE, link = TRUE, type="Figure") Model Alpha Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelA_linearity", prefix = TRUE, link = TRUE, type="Figure").

library(gridExtra)
n <- length(modelA$plots$linearity)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(modelA$plots$linearity, ncol=nCol))

r kfigr::figr(label = "modelA_linearity", prefix = TRUE, link = TRUE, type="Figure") Model Alpha linearity plots

r modelA$comment$linearity

Homoscedasticity

The following plot (r kfigr::figr(label = "modelA_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

modelA$plots$res_fitted

r kfigr::figr(label = "modelA_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure") Model Alpha homoscedasticity plot

r modelA$comments$homoscedasticity

Residuals

The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelA_residuals", prefix = TRUE, link = TRUE, type="Figure") illustrate the distribution of residuals.

gridExtra::grid.arrange(modelA$plots$res_hist, modelA$plots$res_qq, ncol = 2)

r kfigr::figr(label = "modelA_residuals", prefix = TRUE, link = TRUE, type="Figure") Model Alpha residuals plot

r modelA$comments$normality

Multicollinearity

As shown in r kfigr::figr(label = "modelA_multicollinearity", prefix = TRUE, link = TRUE, type="Figure") and r kfigr::figr(label = "modelA_vif", prefix = TRUE, link = TRUE, type="Table"), r modelA$comments$collinearity

modelA$plots$multicollinearity$plot()

r kfigr::figr(label = "modelA_multicollinearity", prefix = TRUE, link = TRUE, type="Figure"): Model Alpha correlations among quantitative predictors

r kfigr::figr(label = "modelA_vif", prefix = TRUE, link = TRUE, type="Table") Model Alpha variance inflation Factors

knitr::kable(modelA$tests$collinearity, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Outliers

gridExtra::grid.arrange(modelA$plots$res_leverage, modelA$plots$cooks, ncol = 2)

r kfigr::figr(label = "modelA_outliers", prefix = TRUE, link = TRUE, type="Figure") Model Alpha Outliers

r modelA$comments$outliers The discern the effect of these outliers on the model, a new model (Model B) was created without the outliers removed.

Model Beta

This was also a forward selection model; however, it was based upon the full model with outliers from Model Alpha removed. The variables were added as described in r kfigr::figr(label = "model_b_build", prefix = TRUE, link = TRUE, type="Table")

r kfigr::figr(label = "model_b_build", prefix = TRUE, link = TRUE, type="Table"): Model Beta forward selection process

if (rerun == TRUE) {
  # Obtain Data
  mData <- process(train, stage = "m", "imdb_num_votes_log", outliers = modelA$tests$influential)  

  # Perform forward selection
  m <- forward(mData$full, y = "imdb_num_votes_log")

  # Conduct regression analysis
  modelB <- regressionAnalysis(m, mName = "Model Beta", yVar  = 'imdb_num_votes_log',
                                 yLab = "Log IMDb Votes")
  save(modelB, file = "../data/modelB.Rdata")
} else {
  load(file = "../data/modelB.Rdata")
}
# Report regression stesp
knitr::kable(modelB$build, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

As indicated in r kfigr::figr(label = "modelB_overview", prefix = TRUE, link = TRUE, type="Table") and graphically depicted in r kfigr::figr(label = "modelB_regression", prefix = TRUE, link = TRUE, type="Figure"), r modelB$comments$apa

r kfigr::figr(label = "modelB_overview", prefix = TRUE, link = TRUE, type="Table"): Model Beta Summary Statistics

knitr::kable(modelB$glance, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

modelB$plots$regression

r kfigr::figr(label = "modelB_regression", prefix = TRUE, link = TRUE, type="Figure") Model Beta Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelB_linearity", prefix = TRUE, link = TRUE, type="Figure").

n <- length(modelB$plots$linearity)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(modelB$plots$linearity, ncol=nCol))

r kfigr::figr(label = "modelB_linearity", prefix = TRUE, link = TRUE, type="Figure") Model Beta linearity plots

r modelB$comment$linearity

Homoscedasticity

The following plot (r kfigr::figr(label = "modelB_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

modelB$plots$res_fitted

r kfigr::figr(label = "modelB_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure") Model Beta homoscedasticity plot

r modelB$comments$homoscedasticity

Residuals

The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelB_residuals", prefix = TRUE, link = TRUE, type="Figure") illustrate the distribution of residuals.

gridExtra::grid.arrange(modelB$plots$res_hist, modelB$plots$res_qq, ncol = 2)

r kfigr::figr(label = "modelB_residuals", prefix = TRUE, link = TRUE, type="Figure") Model Beta residuals plot

r modelB$comments$normality

Multicollinearity

As shown in r kfigr::figr(label = "modelB_multicollinearity", prefix = TRUE, link = TRUE, type="Figure") and r kfigr::figr(label = "modelB_vif", prefix = TRUE, link = TRUE, type="Table"), r modelB$comments$collinearity

modelB$plots$multicollinearity$plot()

r kfigr::figr(label = "modelB_multicollinearity", prefix = TRUE, link = TRUE, type="Figure"): Correlations among quantitative predictors

r kfigr::figr(label = "modelB_vif", prefix = TRUE, link = TRUE, type="Table") Model Beta Variance Inflation Factors

knitr::kable(modelB$tests$collinearity, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Upon further analysis, the high VIF for genre and MPAA rating was a consequence of the reference categories having a small proportion of the overall cases. The reference category for genre, Action & Adventure, had just r nrow(subset(train, genre == "Action & Adventure")) films, r round(nrow(subset(train, genre == "Action & Adventure")) / nrow(train) * 100, 1)% of the cases. The reference category for MPAA rating was G, consisting of r nrow(subset(train, mpaa_rating == "G")) observations, r round(nrow(subset(train, mpaa_rating == "G")) / nrow(train) * 100, 1)% of the cases. Though the p-values for the indicator variables may be high, the overall test that all indicators have coefficients of zero is unaffected by the high VIFs

Outliers

gridExtra::grid.arrange(modelB$plots$res_leverage, modelB$plots$cooks, ncol = 2)

r kfigr::figr(label = "modelB_outliers", prefix = TRUE, link = TRUE, type="Figure") Model Beta Outliers

r modelB$comments$outliers A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.

Model Gamma

For this model, a backward elimination procedure was undertaken based upon the full model There were no variables removed from the model.

r kfigr::figr(label = "model_c_build", prefix = TRUE, link = TRUE, type="Table"): Model Gamma

if (rerun == TRUE) {
# Obtain Data
  mData <- process(train, stage = "m", "imdb_num_votes_log")  

  # Perform forward selection
  m <- back(mData$full, y = "imdb_num_votes_log", alpha = 0.05)

  # Conduct regression analysis
  modelC <- regressionAnalysis(m, mName = "Model Gamma", yVar  = 'imdb_num_votes_log',
                                 yLab = "Log IMDb Votes")
  save(modelC, file = "../data/modelC.Rdata")
} else {
  load(file = "../data/modelC.Rdata")
}
# Report regression stesp
knitr::kable(modelC$build, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The model therefore retained the following variables:

r kfigr::figr(label = "modelC_variables", prefix = TRUE, link = TRUE, type="Table") Model Gamma Variables

vars <- openxlsx::read.xlsx(xlsxFile = "../data/features.xlsx")
vars <- vars %>% filter(Variable %in% modelC$anova$Term) %>% select(Variable, Description)
knitr::kable(vars, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

As indicated in r kfigr::figr(label = "modelC_overview", prefix = TRUE, link = TRUE, type="Table") and graphically depicted in r kfigr::figr(label = "modelC_regression", prefix = TRUE, link = TRUE, type="Figure"), r modelC$comments$apa

r kfigr::figr(label = "modelC_overview", prefix = TRUE, link = TRUE, type="Table") Model Gamma Summary Statistics

knitr::kable(modelC$glance, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

modelC$plots$regression

r kfigr::figr(label = "modelC_regression", prefix = TRUE, link = TRUE, type="Figure") Model Gamma Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelC_linearity", prefix = TRUE, link = TRUE, type="Figure").

n <- length(modelC$plots$linearity)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(modelC$plots$linearity, ncol=nCol))

r kfigr::figr(label = "modelC_linearity", prefix = TRUE, link = TRUE, type="Figure") Model Gamma linearity plots

r modelC$comment$linearity

Homoscedasticity

The following plot (r kfigr::figr(label = "modelC_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

modelC$plots$res_fitted

r kfigr::figr(label = "modelC_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure") Model Gamma homoscedasticity plot

r modelC$comments$homoscedasticity

Residuals

The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelC_residuals", prefix = TRUE, link = TRUE, type="Figure") illustrate the distribution of residuals.

gridExtra::grid.arrange(modelC$plots$res_hist, modelC$plots$res_qq, ncol = 2)

r kfigr::figr(label = "modelC_residuals", prefix = TRUE, link = TRUE, type="Figure") Model Gamma residuals plot

r modelC$comments$normality

Multicollinearity

As shown in r kfigr::figr(label = "modelC_multicollinearity", prefix = TRUE, link = TRUE, type="Figure") and r kfigr::figr(label = "modelC_vif", prefix = TRUE, link = TRUE, type="Table"), r modelC$comments$collinearity

modelC$plots$multicollinearity$plot()

r kfigr::figr(label = "modelC_multicollinearity", prefix = TRUE, link = TRUE, type="Figure"): Correlations among quantitative predictors

r kfigr::figr(label = "modelC_vif", prefix = TRUE, link = TRUE, type="Table") Model Gamma Variance Inflation Factors

knitr::kable(modelC$tests$collinearity, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Outliers

gridExtra::grid.arrange(modelC$plots$res_leverage, modelC$plots$cooks, ncol = 2)

r kfigr::figr(label = "modelC_outliers", prefix = TRUE, link = TRUE, type="Figure") Model Gamma Outliers

r modelC$comments$outliers To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.

Model Delta

This was also a backward elimination model; however, it was based upon the full model with outliers from Model Gamma removed. There were no variables removed from the model.

if (rerun == TRUE) {
  # Obtain Data
  mData <- process(train, stage = "m", "imdb_num_votes_log", outliers = modelC$tests$influential)  

  # Perform forward selection
  m <- back(mData$full, y = "imdb_num_votes_log", alpha = 0.05)

  # Conduct regression analysis
  modelD <- regressionAnalysis(m, mName = "Model Delta", yVar  = 'imdb_num_votes_log',
                                 yLab = "Log IMDb Votes")
  save(modelD, file = "../data/modelD.Rdata")
} else {
  load(file = "../data/modelD.Rdata")
}
# Report regression stesp
knitr::kable(modelD$build, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The model therefore retained the following variables:

r kfigr::figr(label = "modelD_variables", prefix = TRUE, link = TRUE, type="Table") Model Delta Variables

vars <- openxlsx::read.xlsx(xlsxFile = "../data/features.xlsx")
vars <- vars %>% filter(Variable %in% modelD$anova$Term) %>% select(Variable, Description)
knitr::kable(vars, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

As indicated in r kfigr::figr(label = "modelD_overview", prefix = TRUE, link = TRUE, type="Table") and graphically depicted in r kfigr::figr(label = "modelD_regression", prefix = TRUE, link = TRUE, type="Figure"), r modelD$comments$apa

r kfigr::figr(label = "modelD_overview", prefix = TRUE, link = TRUE, type="Table")

knitr::kable(modelD$glance, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

modelD$plots$regression

r kfigr::figr(label = "modelD_regression", prefix = TRUE, link = TRUE, type="Figure") Model Delta Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelD_linearity", prefix = TRUE, link = TRUE, type="Figure").

n <- length(modelD$plots$linearity)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(modelD$plots$linearity, ncol=nCol))

r kfigr::figr(label = "modelD_linearity", prefix = TRUE, link = TRUE, type="Figure") Model Delta linearity plots

r modelD$comment$linearity

Homoscedasticity

The following plot (r kfigr::figr(label = "modelD_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

modelD$plots$res_fitted

r kfigr::figr(label = "modelD_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure") Model Delta homoscedasticity plot

r modelD$comments$homoscedasticity

Residuals

The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelD_residuals", prefix = TRUE, link = TRUE, type="Figure") illustrate the distribution of residuals.

gridExtra::grid.arrange(modelD$plots$res_hist, modelD$plots$res_qq, ncol = 2)

r kfigr::figr(label = "modelD_residuals", prefix = TRUE, link = TRUE, type="Figure") Model Delta residuals plot

r modelD$comments$normality

Multicollinearity

As shown in r kfigr::figr(label = "modelD_multicollinearity", prefix = TRUE, link = TRUE, type="Figure") and r kfigr::figr(label = "modelD_vif", prefix = TRUE, link = TRUE, type="Table"), r modelD$comments$collinearity

modelD$plots$multicollinearity$plot()

r kfigr::figr(label = "modelD_multicollinearity", prefix = TRUE, link = TRUE, type="Figure"): Correlations among quantitative predictors

r kfigr::figr(label = "modelD_vif", prefix = TRUE, link = TRUE, type="Table") Model Delta Variance Inflation Factors

knitr::kable(modelD$tests$collinearity, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Outliers

gridExtra::grid.arrange(modelD$plots$res_leverage, modelD$plots$cooks, ncol = 2)

r kfigr::figr(label = "modelD_outliers", prefix = TRUE, link = TRUE, type="Figure") Model Delta Outliers

r modelD$comments$outliers A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.

Model Comparisons

To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.

r kfigr::figr(label = "model_comparison", prefix = TRUE, link = TRUE, type="Table") Summary of models

# Compare models
abcd <- rbind(modelA$glance, modelB$glance, modelC$glance, modelD$glance)
knitr::kable(abcd, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Forward Selection vs. Backward Elimination

The differences in root mean square error for the models was not significant r round((abcd$RMSE[3] - abcd$RMSE[1]) / abcd$RMSE[3] * 100, 2)% and r round((abcd$RMSE[2] - abcd$RMSE[4]) / abcd$RMSE[4] * 100, 2)%. Similarly, the differences in adjusted R-squared were r round((abcd[1,9] - abcd[3,9]) / abcd[3,9] * 100, 2)% and r round((abcd[4,9] - abcd[2,9]) / abcd[2,9] * 100, 2)%, not a significant difference. However, the differences in the percent variance explained by the models was not insignificant (r round((abcd[3,11] - abcd[1,11]) / abcd[1,11] * 100, 2)% and r round((abcd[4,11] - abcd[2,11]) / abcd[2,11] * 100, 2)%). The backward selection models tended to perform slightly better in this regard.

Influential Points: Drop or Not

The Beta and Delta models were trained on data sans the influential points from Alpha and Gamma. The differences in RMSE (r round((abcd$RMSE[1] - abcd$RMSE[2]) / abcd$RMSE[2] * 100, 2)% and r round((abcd$RMSE[3] - abcd$RMSE[4]) / abcd$RMSE[4] * 100, 2)%) were insignificant, as were the differences in adjusted R-squared (r round((abcd[2,9] - abcd[1,9]) / abcd[1,9] * 100, 2)% and r round((abcd[4,9] - abcd[3,9]) / abcd[3,9] * 100, 2)%), and the percent of variance explained (r round((abcd[2,11] - abcd[1,11]) / abcd[1,11] * 100, 2)% and r round((abcd[4,11] - abcd[3,11]) / abcd[3,11] * 100, 2)%). However, a case-wise review of the influential points did not reveal any data quality issues; therefore, the points would not be removed.

Prediction Accuracy

The evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiregression models were evaluated for prediction accuracy on the test data. Four measures of prediction accuracy were used:

MAPE - Mean Absolute Percentage Error
MPE - Mean Percentage Error
MSE - Mean Squared Error
RMSE - Root Mean Squared Error

In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDb votes fell within the prediction interval.

r kfigr::figr(label = "prediction_comparison", prefix = TRUE, link = TRUE, type="Table") Model Predictive Accuracy Summary

mods <- list(modelA, modelB, modelC, modelD)
accuracy <- comparePredictions(mods = mods, test = test)
knitr::kable(accuracy, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

There were no significant differences in MAPE, MSE, and RMSE between the models. The negative MPE indicated that all models were biased with over predictions. The lack of justification for excluding the influential points, ruled out models Beta and Delta. Model Alpha was selected as the most parsimonious mode since two of the predictors accounted for r round(sum(modelA$anova[1:2,7]) / sum(modelA$anova[,7]) * 100, 1)% of total variation and r round(sum(modelA$anova[1:2,7]) / sum(modelA$anova[1:(nrow(modelA$anova) - 1),7]) * 100, 1)% of variation allocated to terms. As such Model Alpha would advance to the prediction stage.

Model Two: Final Multiregression Model

The final prediction equation was defined as follows: $y_i$ = r modelA$coefficients$estimate[1] + r round(modelA$coefficients$estimate[2], 3)$x_1$ + r round(modelA$coefficients$estimate[3], 3)$x_2$ + r round(modelA$coefficients$estimate[4], 3)$x_3$ + r round(modelA$coefficients$estimate[5], 3)$x_4$ + r round(modelA$coefficients$estimate[6], 3)$x_5$ + r round(modelA$coefficients$estimate[7], 3)$x_6$ + r round(modelA$coefficients$estimate[8], 3)$x_7$ + r round(modelA$coefficients$estimate[9], 3)$x_8$ + r round(modelA$coefficients$estimate[10], 3)$x_9$ + r round(modelA$coefficients$estimate[11], 3)$x_{10}$ + r round(modelA$coefficients$estimate[12], 3)$x_{11}$ + r round(modelA$coefficients$estimate[13], 3)$x_{12}$ + r round(modelA$coefficients$estimate[14], 3)$x_{13}$ + r round(modelA$coefficients$estimate[15], 3)$x_{14}$ + r round(modelA$coefficients$estimate[16], 3)$x_{15}$ + r round(modelA$coefficients$estimate[17], 3)$x_{16}$ + $\epsilon$

where: $x_1$ is r modelA$coefficients$term[2]
$x_2$ is r modelA$coefficients$term[3]
$x_3$ is r modelA$coefficients$term[4]
$x_4$ is r modelA$coefficients$term[5]
$x_5$ is r modelA$coefficients$term[6]
$x_6$ is r modelA$coefficients$term[7]
$x_7$ is r modelA$coefficients$term[8]
$x_8$ is r modelA$coefficients$term[9]
$x_9$ is r modelA$coefficients$term[10]
$x_{10}$ is r modelA$coefficients$term[11]
$x_{11}$ is r modelA$coefficients$term[12]
$x_{12}$ is r modelA$coefficients$term[13]
$x_{13}$ is r modelA$coefficients$term[14]
$x_{14}$ is r modelA$coefficients$term[15]
$x_{15}$ is r modelA$coefficients$term[16]
$x_{16}$ is r modelA$coefficients$term[17]

The genre and MPAA rating variables were code 0 or 1 in accordance with the genre and MPAA rating each observation.

Analysis of Variance

r kfigr::figr(label = "modelA_anova", prefix = TRUE, link = TRUE, type="Figure") summarizes the analysis of variance.

a <- modelA$anova
knitr::kable(a, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

r kfigr::figr(label = "modelA_anova", prefix = TRUE, link = TRUE, type="Figure") Model Alpha analysis of variance

r modelA$comments$anova

Interpretation of Coefficients

Although there were only r nrow(modelA$build) variables, there were some r nrow(modelA$coefficients) coefficients, a consequence of the number of levels in the categorical variables. The coefficients estimates are identified in r kfigr::figr(label = "modelA_coef", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "modelA_coef", prefix = TRUE, link = TRUE, type="Table"): Model Alpha Coefficients

knitr::kable(modelA$coefficients, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The intercept estimate, r round(modelA$coefficients$estimate[1], 3) , is the regression estimate for the mean log number of IMDb votes for a G-rated action and adventure film, with zeros for all of the other variables. The other coefficient estimates adjust the estimate accordingly. Therefore a prediction for the log number of IMDb votes was equal to:

the intercept value, r round(modelA$coefficients$estimate[1], 3),
plus a value associated with the genre of the film,
plus r round(subset(modelA$coefficients, term == "cast_scores", select = estimate), 3) percent change in IMDb votes for each percent change in the point scores for the cast members.
plus r round(subset(modelA$coefficients, term == "runtime_log", select = estimate), 3) percent change in IMDb votes for each percent change in runtime,
plus a percent change associated with the MPAA rating for the film.

Model Two: Simple Linear Regression

runmodelTwo <- function() {
  homoscedasticity <- FALSE
  out <- c()
  while(homoscedasticity == FALSE) {
    mData <- process(mdb2, stage = "a", y = "box_office_log", outliers = out)
    m <- slr(mData$full, y = "box_office_log") 
    modelTwo <- regressionAnalysis(mod = m, mName = "Model Two", yVar  = 'box_office_log',
                                 yLab = "Log Daily Box Office") 
    if (modelTwo$tests$homoscedasticity$p > .05) {
      homoscedasticity <- TRUE
      save(modelTwo, file = '../data/modelTwo.Rdata')
    } else {
      out <- c(out, modelTwo$tests$influential)
    }
  }
  return(modelTwo)
}

if (rerun == TRUE) {
  modelTwo <- runmodelTwo()
  rerun <- FALSE
} else {
  load(file = '../data/modelTwo.Rdata')
}

Model Overview

A simple linear regression was calculated to predict the log of box office revenue based upon the log of the number of IMDb votes. A significant regression equation was found (F(r modelTwo$glance$df,r modelTwo$glance[1,4]) = r round(modelTwo$glance[1,5], 3), p < .001), with an $R^2$ of r round(modelTwo$glance[1,8], 3). The prediction equation is as follows:

$y_i$ = r round(modelTwo$coefficients$estimate[1], 2) + r round(modelTwo$coefficients$estimate[2], 2)$x_1$ + $\epsilon$

where:
$x_1$ is r modelTwo$coefficients$term[2]

r kfigr::figr(label = "modelTwo_comparison", prefix = TRUE, link = TRUE, type="Table"): Best performing simple linear regression on log of box office revenue

knitr::kable(head(modelTwo$anova, 5), digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Model Diagnostics

Linearity

The linearity of the predictor with the log of box office is illustrated in r kfigr::figr(label = "modelTwo_linearity", prefix = TRUE, link = TRUE, type="Figure").

modelTwo$plots$linearity[[1]]

r kfigr::figr(label = "modelTwo_linearity", prefix = TRUE, link = TRUE, type="Figure") Model Two linearity plot

r modelTwo$comment$linearity

Homoscedasticity

The following plot (r kfigr::figr(label = "modelTwo_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

modelTwo$plots$res_fitted

r kfigr::figr(label = "modelTwo_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure") Model Two homoscedasticity plot

r modelTwo$comments$homoscedasticity

Residuals

The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelTwo_residuals", prefix = TRUE, link = TRUE, type="Figure") illustrate the distribution of residuals.

gridExtra::grid.arrange(modelTwo$plots$res_hist, modelTwo$plots$res_qq, ncol = 2)

r kfigr::figr(label = "modelTwo_residuals", prefix = TRUE, link = TRUE, type="Figure") Model Two residuals plot

r modelTwo$comments$normality

Outliers

gridExtra::grid.arrange(modelTwo$plots$res_leverage, modelTwo$plots$cooks, ncol = 2)

r kfigr::figr(label = "modelTwo_outliers", prefix = TRUE, link = TRUE, type="Figure") Model Two Outliers

r modelTwo$comments$outliers A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points were not removed from the model.

Model Summary

The purpose of this section was to develop models that would predict "box office success". Therefore, two regression models were fit in this section: Model One (F(r modelA$glance[1,3], r modelA$glance[1,4]) = r round(modelA$glance[1,5], 2), p < .001), predicted the log number of IMDb votes, the proxy for box office success. Model Two, a simple linear regression model (F(r modelTwo$glance[1,3], r modelTwo$glance[1,4]) = r round(modelTwo$glance[1,5],2), p < .001) predicted log box office revenue based upon the log IMDb votes.

Next, the models will be used to predict the number of log IMDb votes and the log box office for a randomly selected film.

DataScienceSalon/mdb documentation built on May 28, 2019, 12:23 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DataScienceSalon/mdb Movie Rating Model and Predictor

In DataScienceSalon/mdb: Movie Rating Model and Predictor

Part 5: Modeling

Model One: Multiple Linear Regression

Model Selection Methods

Forward Selection

Backward Elimination

Full Model Selection

Model Alpha

Model Diagnostics

Linearity

Homoscedasticity

Residuals

Multicollinearity

Outliers

Model Beta

Model Diagnostics

Linearity

Homoscedasticity

Residuals

Multicollinearity

Outliers

Model Gamma

Model Diagnostics

Linearity

Homoscedasticity

Residuals

Multicollinearity

Outliers

Model Delta

Model Diagnostics

Linearity

Homoscedasticity

Residuals

Multicollinearity

Outliers

Model Comparisons

Forward Selection vs. Backward Elimination

Influential Points: Drop or Not

Prediction Accuracy

Model Two: Final Multiregression Model

Analysis of Variance

Interpretation of Coefficients

Model Two: Simple Linear Regression

Model Overview

Model Diagnostics

Linearity

Homoscedasticity

Residuals

Outliers

Model Summary

R Package Documentation

Browse R Packages

We want your feedback!

DataScienceSalon/mdb
Movie Rating Model and Predictor