Models

The principal question of this paper, the effect of default priors on Bayesian regression model selection, model size, posterior inclusion probabilities of regressors, inference, and predictive performance, was investigated in four steps. First, the model averages under BMA were explored in terms of model spaces, complexity, and parameter inclusion probabilities for each of the nine default parameter priors under consideration. Second, the model assumptions required for inference, specifically the distribution of residuals versus fitted values, were assessed. Third, the predictive performance of the BMA, highest probability model (HPM), best predictive model (BPM), and median probability model (MPM) estimators was evaluated for all nine priors, using mean squared error as the metric. Lastly, the parameter estimates for the top models were interpreted. The best of the models would then be used to predict audience scores for a heretofore unobserved set of feature films released in 2016.

Bayesian Model Averaging

Bayesian model averaging was conducted on the full data set of r nrow(yX) observations, with a log transformation applied to the imdb_num_votes variable. To mitigate heteroscedasticity, the imdb_rating variable was removed from the data set.
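
As a minimal sketch of this fitting step (not the authors' code; the movies data frame and its column names are assumptions), a single BMA fit under one of the candidate priors could look like the following:

library(BAS)

# Hypothetical sketch: fit BMA under one candidate parameter prior with a
# uniform prior over the model space. `movies` and its columns are assumed.
movies$imdb_num_votes <- log(movies$imdb_num_votes)  # log transformation
movies$imdb_rating <- NULL                           # dropped to mitigate heteroscedasticity

fit <- bas.lm(audience_score ~ ., data = movies,
              prior = "ZS-null",       # e.g., Zellner-Siow; "AIC", "BIC",
                                       # "g-prior", "hyper-g", etc. are alternatives
              modelprior = uniform(),  # uniform prior over the model space
              method = "BAS")          # enumerate/sample the model space
summary(fit)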

Model Spaces

The following figures show the model spaces for each model set under the nine default parameter priors. The rows correspond to the variables and the intercept term, the names of which are indicated along the y-axis. The columns correspond to the models used in the averaging process, with column width representing each model's posterior probability: a wider column indicates a higher posterior probability, the values of which are indicated along the x-axis. The models are sorted left to right, from highest posterior probability to lowest. Variables excluded from a model are shown in black in that column. The colored areas represent variables included in the model, with the color associated with, and proportional to, the log posterior probability of the model; models that have the same color have similar log posterior probabilities.

The largest models were those based upon the AIC prior, with model sizes ranging from r min(summary(models[[2]])[20,-1]) to r max(summary(models[[2]])[20,-1]) regressors for the top five models. The top five Zellner's g-prior models included between r min(summary(models[[5]])[20,-1]) and r max(summary(models[[5]])[20,-1]) parameters. The remaining model spaces, containing between r min(summary(models[[4]])[20,-1]) and r max(summary(models[[4]])[20,-1]) predictors, were almost identical.

graphics::par(mfrow = c(1, 3), family = "Open Sans")
# Plot the model space (variable inclusion by model rank) for each prior.
for (i in seq_along(models)) {
  bmaImage(models[[i]], rotate = FALSE,
           main = paste(models[[i]]$priorDesc, "Model Rank"))
}

r kfigr::figr(label = "model_spaces", prefix = TRUE, link = TRUE, type="Figure"): Model spaces under BMA for the nine candidate parameter priors: model prior = uniform

Highest Probability Models (HPM)

The posterior probabilities, coefficients of determination, and model sizes for the highest probability models under each prior in r kfigr::figr(label = "hpm", prefix = TRUE, link = TRUE, type="Figure") characterize the models vis-a-vis the data. Several aspects are worth noting. First, despite having the largest model, the AIC prior had the lowest posterior probability, yet the highest coefficient of determination, given the data. Second, the coefficients of determination were largely flat across all priors. Lastly, eight of the nine priors produced models of equal size, yet with posterior probabilities ranging from r round(min(analysis$topModels$data[,2]), 3) to r round(max(analysis$topModels$data[,2]), 3).

do.call("grid.arrange", c(analysis$topModels$plots, ncol = 3))

r kfigr::figr(label = "hpm", prefix = TRUE, link = TRUE, type="Figure"): Highest probability models by prior

Model Complexity

The relationships between model complexity and posterior probability (r kfigr::figr(label = "complexity", prefix = TRUE, link = TRUE, type="Figure")) align with, and to an extent explain, the model spaces above. Recall that the AIC and g-prior models were characterized by their larger sizes relative to those of the other models. Similarly, the distributions of posterior probabilities for the AIC and g-prior models tended to center at the higher dimensions. By contrast, the posterior probability mass for the other priors tended to concentrate on models containing between three and four predictors.

do.call("grid.arrange", c(analysis$complexity, ncol = 3))

r kfigr::figr(label = "complexity", prefix = TRUE, link = TRUE, type="Figure"): Model posterior probabilities vis-a-vis complexity

Parameter Posterior Inclusion Probabilities under BMA

r kfigr::figr(label = "pip", prefix = TRUE, link = TRUE, type="Table") reports the BMA posterior inclusion probabilities for all nine prior distributions. The number of predictors with an inclusion probability exceeding 50%, (not including the intercept) ranged from a low of r min(analysis$pip$nSig) to a high of r max(analysis$pip$nSig) regressors for the AIC prior.

r kfigr::figr(label = "pip", prefix = TRUE, link = TRUE, type="Table"): Posterior inclusion probabilities across parameter priors: model prior = uniform

knitr::kable(analysis$pip$plots$table, escape = F, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = "Posterior inclusion probabilities that exceed 50% are in bold font.")

With respect to the relative importance of the predictors, r kfigr::figr(label = "mean_pip", prefix = TRUE, link = TRUE, type="Figure") shows the average posterior inclusion probabilities for each parameter across all priors.

analysis$pip$plots$meanPIP

r kfigr::figr(label = "mean_pip", prefix = TRUE, link = TRUE, type="Figure"): Mean posterior inclusion probabilities across all nine priors.

Model Assumptions

To ascertain the degree to which inference assumptions were met, the distribution of errors vis-a-vis fitted values was evaluated for homoscedasticity. r kfigr::figr(label = "rvf", prefix = TRUE, link = TRUE, type="Figure") presents very similar distributions of errors versus fitted values for each of the nine priors. A greater concentration of residuals was observed at the higher audience scores, whereas the residuals at lower scores were more dispersed. The quality of the predictions tended to be positively associated with the quantity of observations in any particular range of audience scores.

Residuals vs Fitted

par(mfrow = c(3, 3))
# Residuals vs fitted values for each of the nine parameter priors.
for (i in seq_along(models)) {
  plot(models[[i]], which = 1,
       caption = paste0("Residuals vs Fitted (", models[[i]]$prior, ")"))
}

r kfigr::figr(label = "rvf", prefix = TRUE, link = TRUE, type="Figure"): Residual vs Fitted Values

Though the residuals appeared to center at zero, the variation in residual spread across the fitted values indicates a degree of heteroscedasticity. As such, the standard errors for the parameter estimates may be biased, although the ordinary least squares estimates of the coefficients themselves are unaffected. It was therefore concluded that inference under any of the nine priors and four estimators should be treated with caution; the 36 models were, however, on equal footing with respect to prediction.
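
As a supplementary check not performed above, a Breusch-Pagan test on an ordinary least squares fit of the same formula would quantify the heteroscedasticity visible in the residual plots (a sketch, assuming the movies data frame from the earlier example):

library(lmtest)

# A small p-value is evidence of non-constant error variance, consistent
# with the greater dispersion of residuals at lower fitted values.
ols <- lm(audience_score ~ ., data = movies)
bptest(ols)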

Model Performance

The predictive performance of the competing default priors was compared on the basis of mean squared error (MSE) on hold-out samples. Predictions were rendered using the BAS package [@Clyde2017] for four model estimators: (1) the BMA model, (2) the best predictive model (BPM), (3) the highest probability model (HPM), and (4) the median probability model (MPM). The movie data set was randomly split into a training set, $D_{train}$ (80% of observations), used to fit the models, and a test set, $D_{test}$ (20% of observations), used to assess the quality of the resulting predictive distributions. The MSE was computed for each combination of the nine priors and four estimators, for a total of 36 models. This analysis was repeated r trials times, once for each of r trials different random splits. The average MSE of predictions for each model was computed as follows: $$\frac{1}{t}\displaystyle\sum_{j = 1}^{t}\frac{1}{n_{test}^{j}}\displaystyle\sum_{i = 1}^{n_{test}^{j}}\left(y_{i,test} - \hat{y}_{i,test}\right)^2$$ where:
$t$ is the total number of trials
$n_{test}^{j}$ is the number of observations in $D_{test}^j$
$y_{i,test}$ is the observed audience score for observation $i$ in $D_{test}^j$
$\hat{y}_{i,test}$ is the predicted audience score for observation $i$ in $D_{test}^j$
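
A sketch of a single trial (hypothetical names; the actual evaluation code is not shown here):

set.seed(2019)                             # reproducibility of the split (assumed)
n <- nrow(movies)
train <- sample(n, size = floor(0.8 * n))  # 80/20 train/test split

fit_j <- BAS::bas.lm(audience_score ~ ., data = movies[train, ],
                     prior = "ZS-null", modelprior = BAS::uniform())

# Out-of-sample predictions under one of the four estimators.
pred_j <- predict(fit_j, newdata = movies[-train, ], estimator = "BPM")
mse_j <- mean((movies$audience_score[-train] - pred_j$fit)^2)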

The predictive performance of the nine parameter priors and four estimators, in conjunction with the uniform model prior and evaluated by MSE, is shown in r kfigr::figr(label = "performance_table", prefix = TRUE, link = TRUE, type="Table"). The five best scores (lowest average MSE) are highlighted in bold, and the lowest average MSE is noted in red font. The r report$best$PriorDesc[1] prior using the r report$best$Estimator[1] estimator outperformed the other 35 models, but not decisively.

r kfigr::figr(label = "performance_table", prefix = TRUE, link = TRUE, type="Table"): Parameter priors and predictive performance by prior and estimator (movie dataset, model prior: uniform); r trials subsamples

knitr::kable(report$table, escape = F, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::footnote(general = "The five lowest MSE scores are in bold font. The lowest MSE is in red font.")

To ascertain the significance of the differences in mean MSE among the models, t-tests of the MSE means were conducted at an $\alpha = .05$ significance level. r kfigr::figr(label = "performance_compare", prefix = TRUE, link = TRUE, type="Table") reports the top ten models in order of ascending mean MSE. The p-values indicate the probability of observing a difference between each model's mean MSE and the lowest mean MSE at least as large as that observed, under the null hypothesis of no difference. As the p-values indicate, there was no statistically significant difference among eight of the top nine models.

r kfigr::figr(label = "performance_compare", prefix = TRUE, link = TRUE, type="Table"): Mean MSE by prior and estimator for top ten models with p.values for the $MSE -MSE_{best} = 0$.

scores <- report$best[c(1:10),c(2,3,4,5)]
colnames(scores) <- c("Prior", "Estimator", "Mean MSE", "p.value")
knitr::kable(scores, escape = F, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
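
These comparisons could be reproduced with t-tests on the per-trial MSE vectors, for example paired by trial since competing models are scored on the same random splits (a sketch; mse_best and mse_other are hypothetical vectors of length r trials):

# H0: the mean MSE of a competing model equals that of the best model.
t.test(mse_other, mse_best, paired = TRUE)$p.value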

Parameter Estimates

The parameter estimates for the top ten models are listed in r kfigr::figr(label = "pe", prefix = TRUE, link = TRUE, type="Table"). Since the estimates for models two through six and model ten are almost identical to those for model one, the following analyses are limited to models 1, 7, 8, and 9.

r kfigr::figr(label = "pe", prefix = TRUE, link = TRUE, type="Table"): Parameter estimates for top 10 models

n <- names(eval$pdc)
knitr::kable(eval$pe, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
  kableExtra::add_header_above(c(" " = 1, "(1)" = 1, "(2)" = 1,"(3)" = 1,"(4)" = 1,"(5)" = 1,"(6)" = 1,"(7)" = 1,"(8)" = 1,"(9)" = 1,"(10)" = 1))

r n[1] Prior

The regression equation for the r n[1] prior is stated as follows:

$Y$ = r round(eval$pdc[[1]][1,2], 3) + r round(eval$pdc[[1]][2,2], 3)$x_1$ + r round(eval$pdc[[1]][3,2], 3)$x_2$ + r round(eval$pdc[[1]][4,2], 3)$x_3$ + r round(eval$pdc[[1]][5,2], 3)$x_4$ + r round(eval$pdc[[1]][6,2], 3)$x_5$

where:
$x_1$ is the feature film indicator variable (1 = yes, 0 = no)
$x_2$ is the drama genre indicator variable (1 = yes, 0 = no)
$x_3$ is the year of theatrical release
$x_4$ is the critics score
$x_5$ is the log number of IMDb votes

The single most influential parameter was the feature film indicator. All else equal, feature films received audience scores approximately r abs(round(eval$pdc[[1]][2,2], 1)) points lower than non-feature films. The next most influential variables were the drama indicator, which added r round(eval$pdc[[1]][3,2], 3) points to the audience score, and the log number of IMDb votes, which added r round(eval$pdc[[1]][6,2], 3) points for each unit increase in the log of the vote count. Critics score had a slight positive effect of r round(eval$pdc[[1]][5,2], 3) points for each critic point received, all else equal. The year of theatrical release had a slight negative effect on audience scores (r abs(round(eval$pdc[[1]][4,2], 3)) points per year). r kfigr::figr(label = "pdc1", prefix = TRUE, link = TRUE, type="Table"), which reports the standard deviations and 95% credible intervals, reflects the uncertainty around these estimates.

r kfigr::figr(label = "pdc1", prefix = TRUE, link = TRUE, type="Table"): Coefficient Estimates for r n[1] prior.

print(knitr::kable(eval$pdc[[1]], digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
kableExtra::group_rows(n[1], 1,nrow(eval$pdc[[1]])))
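
Coefficient summaries such as the table above can be obtained directly from a fitted bas object (a sketch; fit is the hypothetical object from the earlier example, and the estimator choice is an assumption):

# Posterior means and standard deviations of the coefficients, followed by
# 95% credible intervals, under the highest probability model.
beta <- coef(fit, estimator = "HPM")
beta
confint(beta, level = 0.95)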

r n[7] Prior

The regression equation for the r n[7] prior is stated as follows:

$Y$ = r round(eval$pdc[[7]][1,2], 3) + r round(eval$pdc[[7]][2,2], 3)$x_1$ + r round(eval$pdc[[7]][3,2], 3)$x_2$ + r round(eval$pdc[[7]][4,2], 3)$x_3$ + r round(eval$pdc[[7]][5,2], 3)$x_4$ + r round(eval$pdc[[7]][6,2], 3)$x_5$ + r round(eval$pdc[[7]][7,2], 3)$x_6$

where:
$x_1$ is the feature film indicator variable (1 = yes, 0 = no)
$x_2$ is the drama genre indicator variable (1 = yes, 0 = no)
$x_3$ is the year of theatrical release
$x_4$ is the critics score
$x_5$ is the Best Picture nomination indicator (1 = yes, 0 = no)
$x_6$ is the log number of IMDb votes

Again, the single most influential parameter was the feature film indicator: all else equal, feature films received audience scores approximately r abs(round(eval$pdc[[7]][2,2], 1)) points lower than non-feature films. The next most influential parameter was the Best Picture nomination indicator; films nominated for Best Picture earned an additional r round(eval$pdc[[7]][6,2], 3) audience score points, all else held equal. Next, the drama indicator effect was r round(eval$pdc[[7]][3,2], 3) points in the positive direction. The log number of IMDb votes also had a slight positive effect on audience scores (r round(eval$pdc[[7]][7,2], 3) points per unit of log votes), as did critics score (r round(eval$pdc[[7]][5,2], 3) points per critic point). The year of theatrical release had a slight negative effect on audience scores (r abs(round(eval$pdc[[7]][4,2], 3)) points per year). The standard deviations and 95% credible intervals for the parameters are listed in r kfigr::figr(label = "pdc7", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "pdc7", prefix = TRUE, link = TRUE, type="Table"): Coefficient Estimates for r n[7] prior.

print(knitr::kable(eval$pdc[[7]], digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
kableExtra::group_rows(n[7], 1,nrow(eval$pdc[[7]])))

r n[8] Prior

The regression equation for the r n[8] prior is stated as follows:

$Y$ = r round(eval$pdc[[8]][1,2], 3) + r round(eval$pdc[[8]][2,2], 3)$x_1$ + r round(eval$pdc[[8]][3,2], 3)$x_2$ + r round(eval$pdc[[8]][4,2], 3)$x_3$ + r round(eval$pdc[[8]][5,2], 3)$x_4$ + r round(eval$pdc[[8]][6,2], 3)$x_5$ + r round(eval$pdc[[8]][7,2], 3)$x_6$ + r round(eval$pdc[[8]][8,2], 3)$x_7$

where:
$x_1$ is the feature film indicator variable (1 = yes, 0 = no)
$x_2$ is the drama genre indicator variable (1 = yes, 0 = no)
$x_3$ is the year of theatrical release
$x_4$ is the critics score
$x_5$ is the Best Picture nomination indicator (1 = yes, 0 = no)
$x_6$ is the Best Actor indicator (1 = yes, 0 = no)
$x_7$ is the log number of IMDb votes

Once again, feature films earned audience scores approximately r abs(round(eval$pdc[[8]][2,2], 1)) points lower than non-feature films, all else equal. The next most influential parameter was the Best Picture nomination indicator; films nominated for Best Picture earned an additional r round(eval$pdc[[8]][6,2], 3) audience score points, all else held equal. Next, the drama indicator effect was r round(eval$pdc[[8]][3,2], 3) points in the positive direction. The log number of IMDb votes also had a slight positive effect on audience scores (r round(eval$pdc[[8]][8,2], 3) points per unit of log votes), whereas films earning the Best Actor award were associated with a r abs(round(eval$pdc[[8]][7,2], 3)) point reduction in audience scores. Critics score had a slight positive effect of r abs(round(eval$pdc[[8]][5,2], 3)) audience score points for each point from the critics. The year of theatrical release had a slight negative effect on audience scores (r abs(round(eval$pdc[[8]][4,2], 3)) points per year). The standard deviations and 95% credible intervals for the parameters are listed in r kfigr::figr(label = "pdc8", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "pdc8", prefix = TRUE, link = TRUE, type="Table"): Coefficient Estimates for r n[8] prior.

print(knitr::kable(eval$pdc[[8]], digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
kableExtra::group_rows(n[8], 1,nrow(eval$pdc[[8]])))

r n[9] Prior

The regression equation for the r n[9] prior is stated as follows:

$Y$ = r round(eval$pdc[[9]][1,2], 3) + r round(eval$pdc[[9]][2,2], 3)$x_1$ + r round(eval$pdc[[9]][3,2], 3)$x_2$ + r round(eval$pdc[[9]][4,2], 3)$x_3$ + r round(eval$pdc[[9]][5,2], 3)$x_4$ + r round(eval$pdc[[9]][6,2], 3)$x_5$

where:
$x_1$ is the feature film indicator variable (1 = yes, 0 = no)
$x_2$ is the drama genre indicator variable (1 = yes, 0 = no)
$x_3$ is the year of theatrical release
$x_4$ is the critics score
$x_5$ is the log number of IMDb votes

As with the other priors, feature films under this model earned audience scores approximately r abs(round(eval$pdc[[9]][2,2], 1)) points lower than non-feature films. Drama films were associated with a r round(eval$pdc[[9]][3,2], 3) point increase in audience score, all else held equal. The log number of IMDb votes also had a slight positive effect on audience scores (r round(eval$pdc[[9]][6,2], 3) points per unit of log votes), as did critics score (r round(eval$pdc[[9]][5,2], 3) points per critic point). The year of theatrical release had a slight negative effect on audience scores (r abs(round(eval$pdc[[9]][4,2], 3)) points per year). The standard deviations and 95% credible intervals for the parameters are listed in r kfigr::figr(label = "pdc9", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "pdc9", prefix = TRUE, link = TRUE, type="Table"): Coefficient Estimates for r n[9] prior.

print(knitr::kable(eval$pdc[[9]], digits = 3) %>%
kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") %>%
kableExtra::group_rows(n[9], 1,nrow(eval$pdc[[9]])))

The performance of the above four models was further evaluated on five heretofore unseen films released in 2016.
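
A sketch of that final prediction step (the movies2016 data frame and its columns are assumed, as is the choice of estimator):

# Predict audience scores for the unseen 2016 films using the best-performing
# prior/estimator combination from the comparison above.
pred2016 <- predict(fit, newdata = movies2016, estimator = "BPM", se.fit = TRUE)
pred2016$fit                    # point predictions
confint(pred2016, parm = "pred")  # 95% prediction intervals (requires se.fit = TRUE)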



