load('../data/train.Rdata') load('../data/test.Rdata')
The aim at this stage was to develop two prediction models: Model One, a multiregression model that predicts the log of IMDb votes, and Model Two, a simple linear regression model that predicts log box office revenue based upon the log of IMDb votes. The former was the best performing of four multiregression models, developed using both forward selection and backward elimination method selection methods. These four models and their model selection methods were:
r kfigr::figr(label = "models", prefix = TRUE, link = TRUE, type="Table")
: Multiregression prediction models
models <- openxlsx::read.xlsx(xlsxFile = "../data/models.xlsx") knitr::kable(models, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The remainder of this sections is organized as follows.
Model One: Multiregression Model
1.1. Model Selection Methods
1.2. Full Model
1.3. Model Alpha
1.4. Model Beta
1.5. Model Gamma
1.6. Model Delta
1.7. Model Comparison
1.8. Model Two: Final Multiregression Model
Model Two: Simple Linear Regression Model
2.1. Model Overview
2.2. Model Diagnostics
Model Summary
Model One was the best performing of models Alpha, Beta, Gamma, and Delta. The following provides an overview of the model selection methods used, then each model is described and diagnosed vis-a-vis assumptions of linearity, homoscedasticity, normality of errors, multicollinearity, and the treatment of influential points.
Both forward selection and backward elimination model selection techniques were used. The forward selection approach optimized adjusted r-squared; whereas the backward elimination method was based upon p-values.
The forward selection process began with a null model then all variables were added to the model, one-by-one, and the model which provided the greatest improvement over the current best adjusted R-squared was selected. The process repeated with each variable that was not already in the model until all variables were analyzed. Only the models that improved adjusted r-squared were retained at each step.
The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process repeated, removing only the most least significant predictor at each step, until all predictors had p-values below the designated $\alpha = .05$ threshold.
The motivation for this analysis was to provide insight to inform the decisions studio executives must make at the early stages of a film project. That said, the variables selected for the full model were those that are "knowable" by studio executives before theatrical release. Consequently variables excluded from consideration as predictors consisted of:
variables not "knowable" at project inception. This would include film ratings, scores, academy awards, and box office results
variables that are redundant with other variables with higher correlations with the dependent variable
categorical variables with levels containing fewer than 5 observations, such as the actor, director and studio variables
dvd release dates as dvd sales are out of the scope of this analysis
As such, the full model is presented in r kfigr::figr(label = "fullModel", prefix = TRUE, link = TRUE, type="Table")
.
r kfigr::figr(label = "fullModel", prefix = TRUE, link = TRUE, type="Table")
: Variables in Full Model
full <- openxlsx::read.xlsx("../data/features.xlsx") full <- full %>% filter(c == "yes" & Context == "Explanatory") %>% select(Type, Variable, Description) %>% arrange(Type, Variable) knitr::kable(full, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The following sections explore various models, model selection techniques, and model diagnostics. Comparisons were conducted and the models were evaluated on test data for prediction accuracy and stability. Lastly, the best performing model is selected and described on detail.
For this model, a forward selection procedure was undertaken based upon the full model described above. r kfigr::figr(label = "model_a_build", prefix = TRUE, link = TRUE, type="Table")
lists the variables in the order in which they were added.
r kfigr::figr(label = "model_a_build", prefix = TRUE, link = TRUE, type="Table")
: Model Alpha forward selection process
if (rerun == TRUE) { # Obtain Data mData <- process(train, stage = "m", y = "imdb_num_votes_log") # Perform forward selection m <- forward(data = mData$full, y = "imdb_num_votes_log") # Conduct regression analysis modelA <- regressionAnalysis(mod = m, mName = "Model Alpha", yVar = 'imdb_num_votes_log', yLab = "Log IMDb Votes") save(modelA, file = "../data/modelA.Rdata") } else { load("../data/modelA.Rdata") } # Report regression steps knitr::kable(modelA$build, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
As indicated in r kfigr::figr(label = "modelA_overview", prefix = TRUE, link = TRUE, type="Table")
and graphically depicted in r kfigr::figr(label = "modelA_regression", prefix = TRUE, link = TRUE, type="Figure")
, r modelA$comments$apa
r kfigr::figr(label = "modelA_overview", prefix = TRUE, link = TRUE, type="Table")
: Model Alpha Summary Statistics
knitr::kable(modelA$glance, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
modelA$plots$regression
r kfigr::figr(label = "modelA_regression", prefix = TRUE, link = TRUE, type="Figure")
Model Alpha Regression
The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelA_linearity", prefix = TRUE, link = TRUE, type="Figure")
.
library(gridExtra) n <- length(modelA$plots$linearity) nCol <- floor(sqrt(n)) do.call("grid.arrange", c(modelA$plots$linearity, ncol=nCol))
r kfigr::figr(label = "modelA_linearity", prefix = TRUE, link = TRUE, type="Figure")
Model Alpha linearity plots
r modelA$comment$linearity
The following plot (r kfigr::figr(label = "modelA_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
modelA$plots$res_fitted
r kfigr::figr(label = "modelA_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
Model Alpha homoscedasticity plot
r modelA$comments$homoscedasticity
The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelA_residuals", prefix = TRUE, link = TRUE, type="Figure")
illustrate the distribution of residuals.
gridExtra::grid.arrange(modelA$plots$res_hist, modelA$plots$res_qq, ncol = 2)
r kfigr::figr(label = "modelA_residuals", prefix = TRUE, link = TRUE, type="Figure")
Model Alpha residuals plot
r modelA$comments$normality
As shown in r kfigr::figr(label = "modelA_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
and r kfigr::figr(label = "modelA_vif", prefix = TRUE, link = TRUE, type="Table")
, r modelA$comments$collinearity
modelA$plots$multicollinearity$plot()
r kfigr::figr(label = "modelA_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
: Model Alpha correlations among quantitative predictors
r kfigr::figr(label = "modelA_vif", prefix = TRUE, link = TRUE, type="Table")
Model Alpha variance inflation Factors
knitr::kable(modelA$tests$collinearity, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(modelA$plots$res_leverage, modelA$plots$cooks, ncol = 2)
r kfigr::figr(label = "modelA_outliers", prefix = TRUE, link = TRUE, type="Figure")
Model Alpha Outliers
r modelA$comments$outliers
The discern the effect of these outliers on the model, a new model (Model B) was created without the outliers removed.
This was also a forward selection model; however, it was based upon the full model with outliers from Model Alpha removed. The variables were added as described in r kfigr::figr(label = "model_b_build", prefix = TRUE, link = TRUE, type="Table")
r kfigr::figr(label = "model_b_build", prefix = TRUE, link = TRUE, type="Table")
: Model Beta forward selection process
if (rerun == TRUE) { # Obtain Data mData <- process(train, stage = "m", "imdb_num_votes_log", outliers = modelA$tests$influential) # Perform forward selection m <- forward(mData$full, y = "imdb_num_votes_log") # Conduct regression analysis modelB <- regressionAnalysis(m, mName = "Model Beta", yVar = 'imdb_num_votes_log', yLab = "Log IMDb Votes") save(modelB, file = "../data/modelB.Rdata") } else { load(file = "../data/modelB.Rdata") } # Report regression stesp knitr::kable(modelB$build, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
As indicated in r kfigr::figr(label = "modelB_overview", prefix = TRUE, link = TRUE, type="Table")
and graphically depicted in r kfigr::figr(label = "modelB_regression", prefix = TRUE, link = TRUE, type="Figure")
, r modelB$comments$apa
r kfigr::figr(label = "modelB_overview", prefix = TRUE, link = TRUE, type="Table")
: Model Beta Summary Statistics
knitr::kable(modelB$glance, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
modelB$plots$regression
r kfigr::figr(label = "modelB_regression", prefix = TRUE, link = TRUE, type="Figure")
Model Beta Regression
The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelB_linearity", prefix = TRUE, link = TRUE, type="Figure")
.
n <- length(modelB$plots$linearity) nCol <- floor(sqrt(n)) do.call("grid.arrange", c(modelB$plots$linearity, ncol=nCol))
r kfigr::figr(label = "modelB_linearity", prefix = TRUE, link = TRUE, type="Figure")
Model Beta linearity plots
r modelB$comment$linearity
The following plot (r kfigr::figr(label = "modelB_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
modelB$plots$res_fitted
r kfigr::figr(label = "modelB_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
Model Beta homoscedasticity plot
r modelB$comments$homoscedasticity
The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelB_residuals", prefix = TRUE, link = TRUE, type="Figure")
illustrate the distribution of residuals.
gridExtra::grid.arrange(modelB$plots$res_hist, modelB$plots$res_qq, ncol = 2)
r kfigr::figr(label = "modelB_residuals", prefix = TRUE, link = TRUE, type="Figure")
Model Beta residuals plot
r modelB$comments$normality
As shown in r kfigr::figr(label = "modelB_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
and r kfigr::figr(label = "modelB_vif", prefix = TRUE, link = TRUE, type="Table")
, r modelB$comments$collinearity
modelB$plots$multicollinearity$plot()
r kfigr::figr(label = "modelB_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
: Correlations among quantitative predictors
r kfigr::figr(label = "modelB_vif", prefix = TRUE, link = TRUE, type="Table")
Model Beta Variance Inflation Factors
knitr::kable(modelB$tests$collinearity, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
Upon further analysis, the high VIF for genre and MPAA rating was a consequence of the reference categories having a small proportion of the overall cases. The reference category for genre, Action & Adventure, had just r nrow(subset(train, genre == "Action & Adventure"))
films, r round(nrow(subset(train, genre == "Action & Adventure")) / nrow(train) * 100, 1)
% of the cases. The reference category for MPAA rating was G, consisting of r nrow(subset(train, mpaa_rating == "G"))
observations, r round(nrow(subset(train, mpaa_rating == "G")) / nrow(train) * 100, 1)
% of the cases. Though the p-values for the indicator variables may be high, the overall test that all indicators have coefficients of zero is unaffected by the high VIFs
gridExtra::grid.arrange(modelB$plots$res_leverage, modelB$plots$cooks, ncol = 2)
r kfigr::figr(label = "modelB_outliers", prefix = TRUE, link = TRUE, type="Figure")
Model Beta Outliers
r modelB$comments$outliers
A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.
For this model, a backward elimination procedure was undertaken based upon the full model There were no variables removed from the model.
r kfigr::figr(label = "model_c_build", prefix = TRUE, link = TRUE, type="Table")
: Model Gamma
if (rerun == TRUE) { # Obtain Data mData <- process(train, stage = "m", "imdb_num_votes_log") # Perform forward selection m <- back(mData$full, y = "imdb_num_votes_log", alpha = 0.05) # Conduct regression analysis modelC <- regressionAnalysis(m, mName = "Model Gamma", yVar = 'imdb_num_votes_log', yLab = "Log IMDb Votes") save(modelC, file = "../data/modelC.Rdata") } else { load(file = "../data/modelC.Rdata") } # Report regression stesp knitr::kable(modelC$build, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The model therefore retained the following variables:
r kfigr::figr(label = "modelC_variables", prefix = TRUE, link = TRUE, type="Table")
Model Gamma Variables
vars <- openxlsx::read.xlsx(xlsxFile = "../data/features.xlsx") vars <- vars %>% filter(Variable %in% modelC$anova$Term) %>% select(Variable, Description) knitr::kable(vars, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
As indicated in r kfigr::figr(label = "modelC_overview", prefix = TRUE, link = TRUE, type="Table")
and graphically depicted in r kfigr::figr(label = "modelC_regression", prefix = TRUE, link = TRUE, type="Figure")
, r modelC$comments$apa
r kfigr::figr(label = "modelC_overview", prefix = TRUE, link = TRUE, type="Table")
Model Gamma Summary Statistics
knitr::kable(modelC$glance, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
modelC$plots$regression
r kfigr::figr(label = "modelC_regression", prefix = TRUE, link = TRUE, type="Figure")
Model Gamma Regression
The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelC_linearity", prefix = TRUE, link = TRUE, type="Figure")
.
n <- length(modelC$plots$linearity) nCol <- floor(sqrt(n)) do.call("grid.arrange", c(modelC$plots$linearity, ncol=nCol))
r kfigr::figr(label = "modelC_linearity", prefix = TRUE, link = TRUE, type="Figure")
Model Gamma linearity plots
r modelC$comment$linearity
The following plot (r kfigr::figr(label = "modelC_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
modelC$plots$res_fitted
r kfigr::figr(label = "modelC_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
Model Gamma homoscedasticity plot
r modelC$comments$homoscedasticity
The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelC_residuals", prefix = TRUE, link = TRUE, type="Figure")
illustrate the distribution of residuals.
gridExtra::grid.arrange(modelC$plots$res_hist, modelC$plots$res_qq, ncol = 2)
r kfigr::figr(label = "modelC_residuals", prefix = TRUE, link = TRUE, type="Figure")
Model Gamma residuals plot
r modelC$comments$normality
As shown in r kfigr::figr(label = "modelC_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
and r kfigr::figr(label = "modelC_vif", prefix = TRUE, link = TRUE, type="Table")
, r modelC$comments$collinearity
modelC$plots$multicollinearity$plot()
r kfigr::figr(label = "modelC_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
: Correlations among quantitative predictors
r kfigr::figr(label = "modelC_vif", prefix = TRUE, link = TRUE, type="Table")
Model Gamma Variance Inflation Factors
knitr::kable(modelC$tests$collinearity, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(modelC$plots$res_leverage, modelC$plots$cooks, ncol = 2)
r kfigr::figr(label = "modelC_outliers", prefix = TRUE, link = TRUE, type="Figure")
Model Gamma Outliers
r modelC$comments$outliers
To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.
This was also a backward elimination model; however, it was based upon the full model with outliers from Model Gamma removed. There were no variables removed from the model.
if (rerun == TRUE) { # Obtain Data mData <- process(train, stage = "m", "imdb_num_votes_log", outliers = modelC$tests$influential) # Perform forward selection m <- back(mData$full, y = "imdb_num_votes_log", alpha = 0.05) # Conduct regression analysis modelD <- regressionAnalysis(m, mName = "Model Delta", yVar = 'imdb_num_votes_log', yLab = "Log IMDb Votes") save(modelD, file = "../data/modelD.Rdata") } else { load(file = "../data/modelD.Rdata") } # Report regression stesp knitr::kable(modelD$build, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The model therefore retained the following variables:
r kfigr::figr(label = "modelD_variables", prefix = TRUE, link = TRUE, type="Table")
Model Delta Variables
vars <- openxlsx::read.xlsx(xlsxFile = "../data/features.xlsx") vars <- vars %>% filter(Variable %in% modelD$anova$Term) %>% select(Variable, Description) knitr::kable(vars, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
As indicated in r kfigr::figr(label = "modelD_overview", prefix = TRUE, link = TRUE, type="Table")
and graphically depicted in r kfigr::figr(label = "modelD_regression", prefix = TRUE, link = TRUE, type="Figure")
, r modelD$comments$apa
r kfigr::figr(label = "modelD_overview", prefix = TRUE, link = TRUE, type="Table")
knitr::kable(modelD$glance, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
modelD$plots$regression
r kfigr::figr(label = "modelD_regression", prefix = TRUE, link = TRUE, type="Figure")
Model Delta Regression
The linearity of each predictor with the log number of IMDb votes is illustrated in r kfigr::figr(label = "modelD_linearity", prefix = TRUE, link = TRUE, type="Figure")
.
n <- length(modelD$plots$linearity) nCol <- floor(sqrt(n)) do.call("grid.arrange", c(modelD$plots$linearity, ncol=nCol))
r kfigr::figr(label = "modelD_linearity", prefix = TRUE, link = TRUE, type="Figure")
Model Delta linearity plots
r modelD$comment$linearity
The following plot (r kfigr::figr(label = "modelD_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
modelD$plots$res_fitted
r kfigr::figr(label = "modelD_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
Model Delta homoscedasticity plot
r modelD$comments$homoscedasticity
The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelD_residuals", prefix = TRUE, link = TRUE, type="Figure")
illustrate the distribution of residuals.
gridExtra::grid.arrange(modelD$plots$res_hist, modelD$plots$res_qq, ncol = 2)
r kfigr::figr(label = "modelD_residuals", prefix = TRUE, link = TRUE, type="Figure")
Model Delta residuals plot
r modelD$comments$normality
As shown in r kfigr::figr(label = "modelD_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
and r kfigr::figr(label = "modelD_vif", prefix = TRUE, link = TRUE, type="Table")
, r modelD$comments$collinearity
modelD$plots$multicollinearity$plot()
r kfigr::figr(label = "modelD_multicollinearity", prefix = TRUE, link = TRUE, type="Figure")
: Correlations among quantitative predictors
r kfigr::figr(label = "modelD_vif", prefix = TRUE, link = TRUE, type="Table")
Model Delta Variance Inflation Factors
knitr::kable(modelD$tests$collinearity, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(modelD$plots$res_leverage, modelD$plots$cooks, ncol = 2)
r kfigr::figr(label = "modelD_outliers", prefix = TRUE, link = TRUE, type="Figure")
Model Delta Outliers
r modelD$comments$outliers
A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.
To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.
r kfigr::figr(label = "model_comparison", prefix = TRUE, link = TRUE, type="Table")
Summary of models
# Compare models abcd <- rbind(modelA$glance, modelB$glance, modelC$glance, modelD$glance) knitr::kable(abcd, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The differences in root mean square error for the models was not significant r round((abcd$RMSE[3] - abcd$RMSE[1]) / abcd$RMSE[3] * 100, 2)
% and r round((abcd$RMSE[2] - abcd$RMSE[4]) / abcd$RMSE[4] * 100, 2)
%. Similarly, the differences in adjusted R-squared were r round((abcd[1,9] - abcd[3,9]) / abcd[3,9] * 100, 2)
% and r round((abcd[4,9] - abcd[2,9]) / abcd[2,9] * 100, 2)
%, not a significant difference. However, the differences in the percent variance explained by the models was not insignificant (r round((abcd[3,11] - abcd[1,11]) / abcd[1,11] * 100, 2)
% and r round((abcd[4,11] - abcd[2,11]) / abcd[2,11] * 100, 2)
%). The backward selection models tended to perform slightly better in this regard.
The Beta and Delta models were trained on data sans the influential points from Alpha and Gamma. The differences in RMSE (r round((abcd$RMSE[1] - abcd$RMSE[2]) / abcd$RMSE[2] * 100, 2)
% and r round((abcd$RMSE[3] - abcd$RMSE[4]) / abcd$RMSE[4] * 100, 2)
%) were insignificant, as were the differences in adjusted R-squared (r round((abcd[2,9] - abcd[1,9]) / abcd[1,9] * 100, 2)
% and r round((abcd[4,9] - abcd[3,9]) / abcd[3,9] * 100, 2)
%), and the percent of variance explained (r round((abcd[2,11] - abcd[1,11]) / abcd[1,11] * 100, 2)
% and r round((abcd[4,11] - abcd[3,11]) / abcd[3,11] * 100, 2)
%). However, a case-wise review of the influential points did not reveal any data quality issues; therefore, the points would not be removed.
The evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiregression models were evaluated for prediction accuracy on the test data. Four measures of prediction accuracy were used:
In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDb votes fell within the prediction interval.
r kfigr::figr(label = "prediction_comparison", prefix = TRUE, link = TRUE, type="Table")
Model Predictive Accuracy Summary
mods <- list(modelA, modelB, modelC, modelD) accuracy <- comparePredictions(mods = mods, test = test) knitr::kable(accuracy, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
There were no significant differences in MAPE, MSE, and RMSE between the models. The negative MPE indicated that all models were biased with over predictions. The lack of justification for excluding the influential points, ruled out models Beta and Delta. Model Alpha was selected as the most parsimonious mode since two of the predictors accounted for r round(sum(modelA$anova[1:2,7]) / sum(modelA$anova[,7]) * 100, 1)
% of total variation and r round(sum(modelA$anova[1:2,7]) / sum(modelA$anova[1:(nrow(modelA$anova) - 1),7]) * 100, 1)
% of variation allocated to terms. As such Model Alpha would advance to the prediction stage.
The final prediction equation was defined as follows:
$y_i$ =
r modelA$coefficients$estimate[1]
+ r round(modelA$coefficients$estimate[2], 3)
$x_1$
+ r round(modelA$coefficients$estimate[3], 3)
$x_2$
+ r round(modelA$coefficients$estimate[4], 3)
$x_3$
+ r round(modelA$coefficients$estimate[5], 3)
$x_4$
+ r round(modelA$coefficients$estimate[6], 3)
$x_5$
+ r round(modelA$coefficients$estimate[7], 3)
$x_6$
+ r round(modelA$coefficients$estimate[8], 3)
$x_7$
+ r round(modelA$coefficients$estimate[9], 3)
$x_8$
+ r round(modelA$coefficients$estimate[10], 3)
$x_9$
+ r round(modelA$coefficients$estimate[11], 3)
$x_{10}$
+ r round(modelA$coefficients$estimate[12], 3)
$x_{11}$
+ r round(modelA$coefficients$estimate[13], 3)
$x_{12}$
+ r round(modelA$coefficients$estimate[14], 3)
$x_{13}$
+ r round(modelA$coefficients$estimate[15], 3)
$x_{14}$
+ r round(modelA$coefficients$estimate[16], 3)
$x_{15}$
+ r round(modelA$coefficients$estimate[17], 3)
$x_{16}$
+ $\epsilon$
where:
$x_1$ is r modelA$coefficients$term[2]
$x_2$ is r modelA$coefficients$term[3]
$x_3$ is r modelA$coefficients$term[4]
$x_4$ is r modelA$coefficients$term[5]
$x_5$ is r modelA$coefficients$term[6]
$x_6$ is r modelA$coefficients$term[7]
$x_7$ is r modelA$coefficients$term[8]
$x_8$ is r modelA$coefficients$term[9]
$x_9$ is r modelA$coefficients$term[10]
$x_{10}$ is r modelA$coefficients$term[11]
$x_{11}$ is r modelA$coefficients$term[12]
$x_{12}$ is r modelA$coefficients$term[13]
$x_{13}$ is r modelA$coefficients$term[14]
$x_{14}$ is r modelA$coefficients$term[15]
$x_{15}$ is r modelA$coefficients$term[16]
$x_{16}$ is r modelA$coefficients$term[17]
The genre and MPAA rating variables were code 0 or 1 in accordance with the genre and MPAA rating each observation.
r kfigr::figr(label = "modelA_anova", prefix = TRUE, link = TRUE, type="Figure")
summarizes the analysis of variance.
a <- modelA$anova knitr::kable(a, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
r kfigr::figr(label = "modelA_anova", prefix = TRUE, link = TRUE, type="Figure")
Model Alpha analysis of variance
r modelA$comments$anova
Although there were only r nrow(modelA$build)
variables, there were some r nrow(modelA$coefficients)
coefficients, a consequence of the number of levels in the categorical variables. The coefficients estimates are identified in r kfigr::figr(label = "modelA_coef", prefix = TRUE, link = TRUE, type="Table")
.
r kfigr::figr(label = "modelA_coef", prefix = TRUE, link = TRUE, type="Table")
: Model Alpha Coefficients
knitr::kable(modelA$coefficients, digits = 3) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The intercept estimate, r round(modelA$coefficients$estimate[1], 3)
, is the regression estimate for the mean log number of IMDb votes for a G-rated action and adventure film, with zeros for all of the other variables. The other coefficient estimates adjust the estimate accordingly. Therefore a prediction for the log number of IMDb votes was equal to:
r round(modelA$coefficients$estimate[1], 3)
, r round(subset(modelA$coefficients, term == "cast_scores", select = estimate), 3)
percent change in IMDb votes for each percent change in the point scores for the cast members.r round(subset(modelA$coefficients, term == "runtime_log", select = estimate), 3)
percent change in IMDb votes for each percent change in runtime, runmodelTwo <- function() { homoscedasticity <- FALSE out <- c() while(homoscedasticity == FALSE) { mData <- process(mdb2, stage = "a", y = "box_office_log", outliers = out) m <- slr(mData$full, y = "box_office_log") modelTwo <- regressionAnalysis(mod = m, mName = "Model Two", yVar = 'box_office_log', yLab = "Log Daily Box Office") if (modelTwo$tests$homoscedasticity$p > .05) { homoscedasticity <- TRUE save(modelTwo, file = '../data/modelTwo.Rdata') } else { out <- c(out, modelTwo$tests$influential) } } return(modelTwo) } if (rerun == TRUE) { modelTwo <- runmodelTwo() rerun <- FALSE } else { load(file = '../data/modelTwo.Rdata') }
A simple linear regression was calculated to predict the log of box office revenue based upon the log of the number of IMDb votes. A significant regression equation was found (F(r modelTwo$glance$df
,r modelTwo$glance[1,4]
) = r round(modelTwo$glance[1,5], 3)
, p < .001), with an $R^2$ of r round(modelTwo$glance[1,8], 3)
. The prediction equation is as follows:
$y_i$ = r round(modelTwo$coefficients$estimate[1], 2)
+ r round(modelTwo$coefficients$estimate[2], 2)
$x_1$
+ $\epsilon$
where:
$x_1$ is r modelTwo$coefficients$term[2]
r kfigr::figr(label = "modelTwo_comparison", prefix = TRUE, link = TRUE, type="Table")
: Best performing simple linear regression on log of box office revenue
knitr::kable(head(modelTwo$anova, 5), digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
The linearity of the predictor with the log of box office is illustrated in r kfigr::figr(label = "modelTwo_linearity", prefix = TRUE, link = TRUE, type="Figure")
.
modelTwo$plots$linearity[[1]]
r kfigr::figr(label = "modelTwo_linearity", prefix = TRUE, link = TRUE, type="Figure")
Model Two linearity plot
r modelTwo$comment$linearity
The following plot (r kfigr::figr(label = "modelTwo_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
modelTwo$plots$res_fitted
r kfigr::figr(label = "modelTwo_homoscedasticity", prefix = TRUE, link = TRUE, type="Figure")
Model Two homoscedasticity plot
r modelTwo$comments$homoscedasticity
The histogram and the normal Q-Q plot in r kfigr::figr(label = "modelTwo_residuals", prefix = TRUE, link = TRUE, type="Figure")
illustrate the distribution of residuals.
gridExtra::grid.arrange(modelTwo$plots$res_hist, modelTwo$plots$res_qq, ncol = 2)
r kfigr::figr(label = "modelTwo_residuals", prefix = TRUE, link = TRUE, type="Figure")
Model Two residuals plot
r modelTwo$comments$normality
gridExtra::grid.arrange(modelTwo$plots$res_leverage, modelTwo$plots$cooks, ncol = 2)
r kfigr::figr(label = "modelTwo_outliers", prefix = TRUE, link = TRUE, type="Figure")
Model Two Outliers
r modelTwo$comments$outliers
A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points were not removed from the model.
The purpose of this section was to develop models that would predict "box office success". Therefore, two regression models were fit in this section: Model One (F(r modelA$glance[1,3]
, r modelA$glance[1,4]
) = r round(modelA$glance[1,5], 2)
, p < .001), predicted the log number of IMDb votes, the proxy for box office success. Model Two, a simple linear regression model (F(r modelTwo$glance[1,3]
, r modelTwo$glance[1,4]
) = r round(modelTwo$glance[1,5],2)
, p < .001) predicted log box office revenue based upon the log IMDb votes.
Next, the models will be used to predict the number of log IMDb votes and the log box office for a randomly selected film.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.