Part 7: Conclusion

load('../data/modelTwo.Rdata')
load('../data/modelA.Rdata')
load('../data/train.Rdata')
load(file = '../data/rLogBoxOffice.Rdata')
load(file = '../data/rLogIMDbVotes.Rdata')
conclusions <- conclusion(slr = modelTwo, mlr = modelA, data = train)

votesPlots <- conclusions$votes$plots
genrePlots <- conclusions$genre$plots
castPlots <- conclusions$cast$plots
runtimePlots <- conclusions$runtime$plots
mpaaPlots <- conclusions$mpaa$plots

votesData <- conclusions$votes$data
genreData <- conclusions$genre$data %>% arrange(desc(pctChgGenreRev))
mpaaData <- conclusions$mpaa$data %>% arrange(desc(pctChgMPAARev))

votesPctChg <- conclusions$votes$pctChg
castPctChg <- conclusions$cast$pctChg
rtPctChg <- conclusions$runtime$pctChg
mpaaPctChg <- conclusions$mpaa$pctChg

The purpose of this analysis was to determine the factors that make a movie popular and to generate insights that would inform the key decisions studio executives and producers must make at the earliest stages of a project. Defining movie popularity strictly as box office success came with its challenges. Since box office revenue was not among the available features, a proxy variable had to be designated for prediction. This raised the following research question.

Given the available data, which factor(s) is/are most associated with box office revenue?

Once this proxy variable was identified, it was designated as the dependent variable for multiple regression modeling, raising the second research question.

Given the available data, which combination of factors provides the best predictions of the designated proxy dependent variable?

Having identified a model to predict the proxy variable, the next challenge was to relate the predictions to box office revenue, leading to the third research question.

Given the variables and parameters identified above, what are the implications for box office success?

Question 1: Predictor of Box Office Revenue

The best predictor of box office revenue wasn't audience score, IMDb rating or critics score. In fact, each of those variables was negatively correlated with (log) box office revenue. The correlation coefficients with the log of box office revenue were r = `r round(subset(rLogBoxOffice$tests, Variable == 'audience_score', select = Correlation), 3)`, r = `r round(subset(rLogBoxOffice$tests, Variable == 'imdb_rating', select = Correlation), 3)`, and r = `r round(subset(rLogBoxOffice$tests, Variable == 'critics_score', select = Correlation), 3)` for audience score, IMDb rating and critics score, respectively. Rather, the (log) number of IMDb votes was most highly correlated with (log) box office revenue, with a correlation coefficient of `r round(subset(rLogBoxOffice$tests, Variable == 'imdb_num_votes_log', select = Correlation), 3)`.

library(gridExtra)
n <- length(votesPlots)
nCol <- max(floor(sqrt(n)), 2)
do.call("grid.arrange", c(votesPlots, ncol=nCol))

`r kfigr::figr(label = "c_votes", prefix = TRUE, link = TRUE, type="Figure")`: Effect of votes on box office revenue

As shown in `r kfigr::figr(label = "c_votes", prefix = TRUE, link = TRUE, type="Figure")`, the log of box office revenue grows linearly with the log of IMDb votes; on the linear scale, however, box office revenue grows nonlinearly with the number of IMDb votes. A one percent change in IMDb votes corresponds to a `r round(votesPctChg$vb, 2)` percent change in box office revenue.
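The percent-change reading of a log-log fit can be written out as follows. This is a sketch under assumed notation, not the exact computation inside `conclusion()`: $R$ stands for box office revenue, $V$ for the number of IMDb votes, and $\beta_0$, $\beta_1$ for the intercept and slope of the simple linear model on the log scale.

$$
\ln R = \beta_0 + \beta_1 \ln V
\quad\Longrightarrow\quad
\frac{R(1.01\,V)}{R(V)} = 1.01^{\beta_1}
$$

$$
\%\Delta R = 100\left(1.01^{\beta_1} - 1\right) \approx \beta_1
$$

The approximation holds when $\beta_1$ is of moderate size, which is why the slope of a log-log model is commonly read as the percent change in the response per one percent change in the predictor.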

So, studio executives and producers can worry less about the critics and focus rather on making films that generate the buzz that compels the audience to render an opinion, good or bad.

Question 2: Predictors of Buzz (IMDb Votes)

The four predictors of IMDb votes, in order of the percent of non-residual response variance explained, were cast scores (`r round(subset(modelA$anova, Term == 'cast_scores', select = 7) / sum(modelA$anova[1:(nrow(modelA$anova)-1), 7]) * 100, 2)`%), genre (`r round(subset(modelA$anova, Term == 'genre', select = 7) / sum(modelA$anova[1:(nrow(modelA$anova)-1), 7]) * 100, 2)`%), MPAA rating (`r round(subset(modelA$anova, Term == 'mpaa_rating', select = 7) / sum(modelA$anova[1:(nrow(modelA$anova)-1), 7]) * 100, 2)`%), and (log) runtime (`r round(subset(modelA$anova, Term == 'runtime_log', select = 7) / sum(modelA$anova[1:(nrow(modelA$anova)-1), 7]) * 100, 2)`%). A one percent change in cast scores is associated with a `r round(castPctChg$csVotes, 3)` percent change in the number of IMDb votes. Similarly, a one percent change in runtime corresponds to a `r round(rtPctChg$rtVotes, 3)` percent change in IMDb votes. The effects of genre on IMDb votes are as follows:

`r kfigr::figr(label = "c_genre_votes", prefix = TRUE, link = TRUE, type="Table")`: Effect of genre on IMDb votes

`r kfigr::figr(label = "c_mpaa_votes", prefix = TRUE, link = TRUE, type="Table")`: Effect of MPAA Rating on IMDb votes

`r gd$Genre[1]`, `r tolower(gd$Genre[2])`, and `r tolower(gd$Genre[3])` films are associated with significant changes in the predicted number of IMDb votes. As indicated, `r tolower(gd$Genre[1])`, `r tolower(gd$Genre[2])`, and `r tolower(gd$Genre[3])` increase predictions of IMDb votes by `r round(gd[1,2], 0)`%, `r round(gd[2,2], 0)`%, and `r round(gd[3,2], 0)`%, respectively.

`r mpaa[1,1]` and `r mpaa[2,1]` rated films are associated with significant increases in the predicted number of IMDb votes, `r round(mpaa[1,2], 0)`% and `r round(mpaa[2,2], 0)`%, respectively. As such, genre and MPAA rating are clearly among those factors with serious box office implications.
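The two kinds of variance share quoted in this report (a term's share of total variance, and its share of the variance attributed to the model terms only) can be illustrated with a small sketch. It uses the built-in `mtcars` data and a plain `anova()` table as stand-ins; the model formula and variable names are assumptions for illustration and are not the report's `modelA`.

```r
# Illustrative stand-in model on built-in data (NOT the report's modelA).
fit <- lm(log(mpg) ~ log(wt) + factor(cyl) + log(hp), data = mtcars)
tab <- anova(fit)

# Keep the model terms and drop the residual row.
terms <- tab[rownames(tab) != "Residuals", ]

# Share of total variance: term sum of squares over all sums of squares.
pct_of_total <- 100 * terms[["Sum Sq"]] / sum(tab[["Sum Sq"]])

# Share of non-residual variance: term sum of squares over the sum of
# squares attributed to the model terms only.
pct_of_terms <- 100 * terms[["Sum Sq"]] / sum(terms[["Sum Sq"]])

shares <- data.frame(
  term         = rownames(terms),
  pct_of_total = round(pct_of_total, 2),
  pct_of_terms = round(pct_of_terms, 2)
)
print(shares)
```

Because the sums of squares here are sequential, the shares depend on the order in which terms enter the formula.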

Question 3: Implications for Box Office Success

With respect to box office success, four observations emerged as noteworthy:

  1. Casting is likely the single most important decision studio executives, producers and directors must make. Incorporating actor experience and popularity measures into a quantitative selection process can have significant implications at the box office.
  2. Some genres are more successful than others. Animation, science fiction & fantasy, and action & adventure tend to be associated with higher box office returns.
  3. An MPAA rating can be a significant predictor of box office success.
  4. The significant effect of runtime on box office returns indicates the extent to which it proxies for unobserved variables, such as production budget. Large production films tend to attract the best actors, directors and crew. Though runtime significantly affects predicted IMDb votes and box office revenue, it likely stands for the influence of one or more unobserved variables.

Casting: Top Predictor of Box Office Success

Casting is certainly one of the most important decisions a studio executive or casting director must make, and the selection process, one would suspect, is quite complex, with a variety of factors influencing the decision. This analysis has revealed the box office value of empirically determining actor popularity, based upon ratings such as those provided by IMDb and Rotten Tomatoes, and using that information to maximize total cast popularity, given other considerations. For this analysis, actor popularity was derived from a composite total score for a film, equal to 10 times the IMDb rating plus the audience score. For each film, the total film score was apportioned among the five credited actors as follows:
* Actor 1: 40%
* Actor 2: 30%
* Actor 3: 15%
* Actor 4: 10%
* Actor 5: 5%

Scores were computed for each actor and the cast score was the sum of the actor scores for that film. This measure accounted for `r round(modelA$anova %>% filter(Term == 'cast_scores') %>% select(7), 2)`% of total variance and `r round(modelA$anova %>% filter(Term == 'cast_scores') %>% select(7) / sum(modelA$anova[1:(nrow(modelA$anova)-1),7]) * 100, 2)`% of the variance attributed to the terms.

`r kfigr::figr(label = "c_cast", prefix = TRUE, link = TRUE, type="Figure")` illustrates the effect of cast scores on IMDb votes and box office revenue.
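The cast score construction described above can be sketched in a few lines of base R. The toy films, actor names, and column names below are hypothetical, and the sketch assumes that an actor's score is the sum of that actor's apportioned shares across all films in the data; the package's own implementation may differ.

```r
# Hypothetical toy data: two films with five credited actors each.
films <- data.frame(
  title          = c("Film A", "Film B"),
  imdb_rating    = c(7.0, 6.0),
  audience_score = c(80, 50),
  stringsAsFactors = FALSE
)
credits <- data.frame(
  title   = rep(c("Film A", "Film B"), each = 5),
  billing = rep(1:5, times = 2),
  actor   = c("Ann", "Bob", "Cy", "Di", "Ed",
              "Bob", "Ann", "Flo", "Gus", "Hal"),
  stringsAsFactors = FALSE
)

# Billing-order weights from the text: 40%, 30%, 15%, 10%, 5%.
weights <- c(0.40, 0.30, 0.15, 0.10, 0.05)

# Composite film score: 10 times the IMDb rating plus the audience score.
films$film_score <- 10 * films$imdb_rating + films$audience_score

# Apportion each film's score among its credited actors.
credits <- merge(credits, films[, c("title", "film_score")], by = "title")
credits$share <- weights[credits$billing] * credits$film_score

# Actor score (assumption): sum of an actor's shares over all films.
actor_scores <- aggregate(share ~ actor, data = credits, FUN = sum)
names(actor_scores)[2] <- "actor_score"

# Cast score: sum of the actor scores of a film's credited actors.
credits <- merge(credits, actor_scores, by = "actor")
cast_scores <- aggregate(actor_score ~ title, data = credits, FUN = sum)
names(cast_scores)[2] <- "cast_score"
print(cast_scores)
```

Under this reading, a film's cast score rewards actors who appear prominently in other well-rated films, not only the film at hand.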

library(gridExtra)
n <- length(castPlots)
nCol <- max(floor(sqrt(n)), 2)
do.call("grid.arrange", c(castPlots, ncol=nCol))

`r kfigr::figr(label = "c_cast", prefix = TRUE, link = TRUE, type="Figure")`: Impact of cast scores on box office

As shown in `r kfigr::figr(label = "c_cast", prefix = TRUE, link = TRUE, type="Figure")`, a one percent change in cast scores is associated with a `r round(castPctChg$csVotes, 3)`% change in IMDb votes. As shown in `r kfigr::figr(label = "c_votes", prefix = TRUE, link = TRUE, type="Figure")`, a one percent increase in IMDb votes equates to a `r round(votesPctChg$vb, 3)` percent change in revenue. As such, a one percent change in cast scores equates to a `r round(castPctChg$csRev, 3)`% change in box office revenue. Introducing such analytics to the casting process could yield significant gains at the box office.
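The chaining of the two percent changes can be sketched as follows, assuming both relationships are log-log fits. The coefficient values and the helper `pct_change()` are hypothetical and stand in for whatever `conclusion()` computes internally.

```r
# Percent change in the response of a log-log model for a given
# percent change in the predictor (hypothetical helper).
pct_change <- function(beta, pct = 1) {
  ((1 + pct / 100)^beta - 1) * 100
}

# Hypothetical slopes, for illustration only.
beta_votes_on_cast <- 0.5  # log(votes) on log(cast score)
beta_rev_on_votes  <- 0.8  # log(revenue) on log(votes)

# Step 1: a one percent change in cast score moves votes by this percent.
votes_pct <- pct_change(beta_votes_on_cast, pct = 1)

# Step 2: that change in votes moves revenue by this percent.
rev_pct_chained <- pct_change(beta_rev_on_votes, pct = votes_pct)

# Equivalent shortcut: on the log scale the two slopes multiply.
rev_pct_direct <- pct_change(beta_votes_on_cast * beta_rev_on_votes, pct = 1)

print(c(votes = votes_pct, chained = rev_pct_chained, direct = rev_pct_direct))
```

The chained and direct figures agree because compounding the two multiplicative effects is the same as multiplying the slopes on the log scale.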

Genre: `r genreData$Genre[1]`, `r tolower(genreData$Genre[2])`, and `r tolower(genreData$Genre[3])`

The multiple regression analysis showed that genre accounts for `r round(modelA$anova %>% filter(Term == 'genre') %>% select(7), 2)`% of total variance and `r round(modelA$anova %>% filter(Term == 'genre') %>% select(7) / sum(modelA$anova[1:(nrow(modelA$anova)-1),7]) * 100, 2)`% of the variance attributed to the terms. `r kfigr::figr(label = "c_genres", prefix = TRUE, link = TRUE, type="Figure")` illuminates the effect of genre on IMDb votes and box office revenue.

library(gridExtra)
n <- length(genrePlots)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(genrePlots, ncol=nCol))

`r kfigr::figr(label = "c_genres", prefix = TRUE, link = TRUE, type="Figure")`: Genre Analysis

The statistics in `r kfigr::figr(label = "c_genres", prefix = TRUE, link = TRUE, type="Figure")` assume an average cast score of `r round(mean(train$cast_scores), 0)`, a mean runtime of `r round(mean(train$runtime), 0)` minutes, and an MPAA rating of "R". `r genreData$Genre[1]`, `r tolower(genreData$Genre[2])`, and `r tolower(genreData$Genre[3])` are the top three box office earners according to this analysis. Using comedy as the reference level, `r tolower(genreData$Genre[1])` earns `r round(genreData$pctChgGenreVotes[1], 0)`% more IMDb votes than comedy and `r round(genreData$pctChgGenreRev[1], 0)`% more revenue, all other factors held constant. `r genreData$Genre[2]` receives `r round(genreData$pctChgGenreVotes[2], 0)`% and `r round(genreData$pctChgGenreRev[2], 0)`% more IMDb votes and box office revenue than comedy, respectively. Similarly, `r tolower(genreData$Genre[3])` produces `r round(genreData$pctChgGenreVotes[3], 0)`% and `r round(genreData$pctChgGenreRev[3], 0)`% more IMDb votes and box office revenue than comedy, respectively. So sayeth the numbers.
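Percent differences relative to a reference level, such as the genre comparisons against comedy, are typically obtained from the coefficients of a categorical term in a model with a log response. The sketch below illustrates that conversion on the built-in `mtcars` data; the model and variable names are assumptions for illustration, and the report's own figures come from `conclusion()`.

```r
# Illustrative stand-in: log response with a categorical predictor.
# The first factor level (4 cylinders) acts as the reference level.
fit <- lm(log(mpg) ~ factor(cyl) + log(wt), data = mtcars)

# Coefficients of the non-reference levels.
coefs <- coef(fit)
level_coefs <- coefs[grepl("^factor\\(cyl\\)", names(coefs))]

# With a log response, a level coefficient b implies the response is
# exp(b) times the reference level's, all other terms held constant.
pct_vs_reference <- 100 * (exp(level_coefs) - 1)
print(round(pct_vs_reference, 1))
```

A negative value means the level sits below the reference level, and a positive value means it sits above.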

Runtime: Proxy Variable

Runtime, or rather its log transformation, accounts for just `r modelA$anova[modelA$anova$Term == "runtime_log", 7]`% of the variance in the model, yet it repeatedly emerged as a significant predictor of IMDb votes. In fact, relative to the other factors, its impact was not insignificant. A one percent change in runtime was associated with a `r round(rtPctChg$rtVotes, 2)`% change in the number of IMDb votes, equating to a `r round(rtPctChg$rtRev, 2)`% change in box office revenue!

library(gridExtra)
n <- length(runtimePlots)
nCol <- max(floor(sqrt(n)), 2)
do.call("grid.arrange", c(runtimePlots, ncol=nCol))

`r kfigr::figr(label = "c_runtime", prefix = TRUE, link = TRUE, type="Figure")`: Impact of runtime on box office

Runtime was likely a proxy for a variety of hidden factors such as production budget. High budget films tend to attract the best producers, directors, actors and crew. So, runtime is less a predictor of box office success than a proxy for factors that were unavailable for this analysis.

MPAA Ratings: `r mpaaData$mpaa[1]` (not PG)

MPAA rating accounts for `r round(modelA$anova %>% filter(Term == 'mpaa_rating') %>% select(7), 2)`% of total variance and `r round(modelA$anova %>% filter(Term == 'mpaa_rating') %>% select(7) / sum(modelA$anova[1:(nrow(modelA$anova)-1),7]) * 100, 2)`% of the variance attributed to the terms. `r kfigr::figr(label = "c_mpaa", prefix = TRUE, link = TRUE, type="Figure")` illuminates the effect of MPAA Rating on IMDb votes and box office revenue.

library(gridExtra)
n <- length(mpaaPlots)
nCol <- max(floor(sqrt(n)), 2)
do.call("grid.arrange", c(mpaaPlots, ncol=nCol))

`r kfigr::figr(label = "c_mpaa", prefix = TRUE, link = TRUE, type="Figure")`: MPAA Analysis

The statistics in `r kfigr::figr(label = "c_mpaa", prefix = TRUE, link = TRUE, type="Figure")` assume an average cast score of `r round(mean(train$cast_scores), 0)` and a mean runtime of `r round(mean(train$runtime), 0)` minutes for an action and adventure film. `r mpaaData$mpaa[1]`, `r mpaaData$mpaa[2]`, and `r mpaaData$mpaa[3]` rated films are the top three box office earners according to this analysis. Using G rated films as the reference, `r mpaaData$mpaa[1]` rated films earn `r round(mpaaData$pctChgMPAAVotes[1], 0)`% more IMDb votes than G rated films and `r round(mpaaData$pctChgMPAARev[1], 0)`% more revenue, all other factors held constant. `r mpaaData$mpaa[2]` rated films receive `r round(mpaaData$pctChgMPAAVotes[2], 0)`% and `r round(mpaaData$pctChgMPAARev[2], 0)`% more IMDb votes and box office revenue than G rated films, respectively. On the other hand, `r mpaaData$mpaa[4]` rated films are associated with `r abs(round(mpaaData$pctChgMPAAVotes[4], 0))`% fewer IMDb votes and a reduction in revenue (relative to G rated films) of `r abs(round(mpaaData$pctChgMPAARev[4], 0))`%. `r mpaaData$mpaa[5]` rated films attract `r abs(round(mpaaData$pctChgMPAAVotes[5], 0))`% fewer votes and earn `r abs(round(mpaaData$pctChgMPAARev[5], 0))`% less than G rated films at the box office.

Summary

The purpose of this analysis was to determine the factors that make a movie popular and to provide insights that would inform decisions made at the early stages of a film project. The (log) number of IMDb votes surfaced as the best predictor of box office success. Multiple regression modeling identified cast scores, genre, runtime and MPAA rating as most highly associated with (log) IMDb votes. Simple linear modeling related these predictors to box office revenue. Casting using quantitative methods surfaced as the key observation from this study. Genre and MPAA ratings also have implications for box office success. Runtime, a significant predictor of box office revenue, was likely a proxy for unobserved data such as production budget, studio / director experience and investor confidence.

Limitations of this Study

The data set offered only a limited amount of information. Restricting features to those that are knowable at or near project inception ruled out the use of audience and critics ratings as predictors of film success. As such, the findings put forth are those supported by a very limited data set. Other factors that influence box office success, such as distribution, production budget, studio, director and cast history, were not available for this study.

Future Study

Future studies would benefit from significantly more depth and breadth of data. An adequate volume of data would refine predictions vis-a-vis actor / director experience and popularity. Expanding the feature set to include production budget, genre vis-a-vis demographic data, distribution, historical box office revenue, social media information, and topic data would provide for a more sophisticated model with greater generalizability. Moreover, box office success doesn't imply profitability. Future studies that examine both sides of the equation would provide the types of insights that ensure not only popularity, but profitability.




DataScienceSalon/mdb documentation built on May 28, 2019, 12:23 p.m.