if (rerun == TRUE) {
  fin1 <- openxlsx::read.xlsx(xlsxFile = "../data/sample.xlsx")
  fin2 <- openxlsx::read.xlsx(xlsxFile = "../data/financials.xlsx")
  genres <- openxlsx::read.xlsx(xlsxFile = "../data/genres.xlsx")
  save(fin1, file = "../data/fin1.Rdata")
  save(fin2, file = "../data/fin2.Rdata")
  save(genres, file = "../data/genres.Rdata")
}
load("../data/movies.Rdata")
load("../data/fin1.Rdata")
load("../data/fin2.Rdata")
if (rerun == TRUE) {
  dataSets  <- preprocess(movies, fin1, fin2)
  train <- dataSets$train
  test <- dataSets$test
  case <- dataSets$case
  mdb2 <- dataSets$mdb2
  actorScores <- dataSets$actorScores
  save(train, file = "../data/train.RData")
  save(test, file = "../data/test.RData")
  save(case, file = "../data/case.RData")
  save(mdb2, file = "../data/mdb2.RData")
  save(actorScores, file = "../data/actorScores.RData")
}
load("../data/train.Rdata")
load("../data/test.Rdata")
load("../data/case.Rdata")
load("../data/mdb2.Rdata")
mdb <- rbind(train, test)

Part 2: Data

The data were comprised of audience and critics opinions, actor, director, and studio information, as well as notable academy award nominations and winnings, for a random sample of 651 movies obtained from the IMDb [@Needham1990], Rotten Tomatoes [@Flixter], and BoxOfficeMojo websites [@Fritz2008]. To relate the available features to box office success, box office revenue was obtained for a random sample of r nrow(mdb[complete.cases(mdb),]) films from The Numbers website [@Services2014]. Information for ten movies from 2016 were obtained from these four sites. These cases served as the test set for the prediction models.

Data Sources

Launched in October 1990 by Col Needham, the Internet Movie Database (abbreviated IMDb) is an online database of film information, audience and critics ratings, plot summaries and reviews. As of November 2017, the site contained over 4.6 million titles, 8.2 million personalities, and hosts 80 million registered users [@Needham1990]. Rotten Tomatoes.com, so named from the practice of audiences throwing rotten tomatoes when disapproving of a poor stage performance, was launched officially in April 2000 by Berkeley student, Senh Duong. It provides audience and critics ratings for some 26 million users worldwide [@Flixter]. BoxOfficeMojo was founded in 1999 tracks box office information and publishes it on its site [@Fritz2008]. The Numbers website, provided by Nash Information Services, compiles and publishes statistics for the movie industry [@Services2014].

Generalizability & Causality

The data were randomly sampled from available IMDb and Rotten Tomatoes website apis, and so the inference should be generalizable to the population. However, a priori minimum sample size calculation was conducted, assuming 80% power and a significance level of .05, over the proportions of films by genre. Frequency counts by genre were obtained for 6 million movies from the IMDb website [@IMDbstats2015].

r kfigr::figr(label = "sampleSize", prefix = TRUE, link = TRUE, type="Table"): Sample size analysis of proportions by genre

# Determine effect size
genres <- openxlsx::read.xlsx(xlsxFile = '../data/genres.xlsx')
ss <- sampleSize(movies, genres)
knitr::kable(ss$genres, digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

As indicated in r kfigr::figr(label = "sampleSize", prefix = TRUE, link = TRUE, type="Table"), the minimum required sample sizes were not met for most of the genres, according to the proportions obtained from IMDb. This could reduce generalizability of models that include genre as a predictor.

Since this was an observational study, random assignment was not performed, as such causality is not indicated by this analysis.

Data Preprocessing

As a first step, observations for TV movies were removed as they were out of the scope of this analysis. Lacking the minimum observations for this categorical level, NC-17 films were also removed.

Variables Removed
Qualitative variables with fewer than 5 observations per level were removed. This included:
film title
the url addresses
* studio, actor and director variables

Variables containing redundant or unrelated information were removed, such as :
DVD release dates, theatre release year and day
critics rating, redundant with critics score
* audience rating, redundant with audience score

Variables Added
The following variables were obtained or derived from the existing variables:
Box office revenue was obtained for a random sampling of r nrow(mdb[complete.cases(mdb),]) movies, and added to the data set
A total score variable captured a films total score and was calculated as 10 * the IMDb rating + plus the audience score from Rotten Tomatoes A director experience variable captured the number of other films a director had directed.
A director votes variable counted the number of votes for other films in which the director had a role.
A director scores variable captured the sum of total scores for other films in which the director had a role.
A cast experience variable was the sum over the top 5 cast members, of the number of other movies in which the actor had played.
A cast scores variables characterized the cast popularity by summing over the actor popularity scores for each of the top 5 actors. See appendix B for the derivation.
A cast votes variable was computed analogously to ast scores.

Cross Validation
The data were split into a training set (80%) and a validation set (20%). The following exploratory data analysis and modeling was performed on the training set. Several regression models were evaluated using the validation set and the final prediction would be performed on the ten movies from 2016.

The full resultant codebook for the data set is listed in r kfigr::figr(label = "codebook", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "codebook", prefix = TRUE, link = TRUE, type="Table"): Movie data set codebook

codebook <- openxlsx::read.xlsx("../data/features.xlsx", sheet = 1)
codebook <- codebook %>% filter(final == "yes") %>% select(Type, Variable, Description) %>% arrange(Type, Variable)
knitr::kable(codebook, align = 'l') %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center") 



DataScienceSalon/mdb documentation built on May 28, 2019, 12:23 p.m.