load(file = '../data/train.Rdata') load(file = '../data/test.Rdata') load(file = '../data/mdb2.Rdata') load(file = '../data/genres.Rdata') edaUni <- univariate(mdb = train)
This exploratory univariate and bivariate data analysis was to examine the characteristics and attributes of the data before including the variables in the multiple regression models, to confirm that categorical variables met the success/failure condition at each level, to review the distributions of quantitative variables, to identify and inspect outliers, and to measure the correlations among the variables to be included in regression models.
As indicated in r kfigr::figr(label = "best_actor", prefix = TRUE, link = TRUE, type="Figure")
, the percentages of films with best director, actor and actress oscars were r edaUni$bestDirWin$stats$Proportion[2] * 100
%, r edaUni$bestActorWin$stats$Proportion[2] * 100
%, and r edaUni$bestActressWin$stats$Proportion[2] * 100
%, respectively.
gridExtra::grid.arrange(edaUni$bestDirWin$plot, edaUni$bestActorWin$plot, edaUni$bestActressWin$plot, ncol = 3)
r kfigr::figr(label = "best_actor", prefix = TRUE, link = TRUE, type="Figure")
: Best director/actor/actress
The decision in this case was to assume that with random sampling, these ratios reflected the true population proportions. Since the minimum required number of observations required for statistical inference was met at each categorical level, these variables were retained for further analysis and modeling.
The proportions of oscar nominations and winnings are summarized in r kfigr::figr(label = "best_picture", prefix = TRUE, link = TRUE, type="Figure")
.
gridExtra::grid.arrange(edaUni$bestPicNom$plot, edaUni$bestPicWin$plot, ncol = 2)
r kfigr::figr(label = "best_picture", prefix = TRUE, link = TRUE, type="Figure")
: Best picture nominations and wins
The severe inbalance in the observations at each level was troubling; however, random sampling suggests that the sample proportions are within a reasonable margin of error of the true population. The variable would be included in further analysis without prejudice.
The drama genre represented a plurality of the releases in the sample, followed by comedy action & adventure then mystery & suspense. The top four genres account for nearly r round(sum(head(edaUni$genre$stats %>% arrange(desc(Proportion)) %>% select(Proportion), 4)) * 100, -1)
% of the films in the sample.
edaUni$genre$plot
r kfigr::figr(label = "genre", prefix = TRUE, link = TRUE, type="Figure")
: Films by genre
sampleProportions <- train %>% filter(genre != "Art House & International") %>% group_by(genre) %>% summarize(N = n()) %>% mutate(p = N /sum(N)) %>% select(genre, p) %>% arrange(genre) popProportions <- genres %>% select(genre, proportion) %>% arrange(genre) t <- chisq.test(x = sampleProportions$p, popProportions$proportion)
A chi-square test for goodness-of-fit was conducted to determine the extent to which the sample proportions modeled the assumed population proportions obtained from the IMDb site. The hypotheses were as follows:
$H_0$: $p = \hat{p}$
$H_a$: $p \neq \hat{p}$
where:
$p$ is the population proportions of films by genre
$\hat{p}$ is the sample proportions of films by genre
The results were not significant, $\chi^2$(r t$parameter
, N = r length(sampleProportions$p)
) = r round(t$statistic, 2)
, $p$ = r round(t$p.value, 2)
, as such one can infer that the sample proportions of films by genre models the population proportion.
Rated R films accounted for over r round(edaUni$mpaa$stats$Proportion[5] * 100, -1)
% of the releases, followed by PG and PG-13. Collectively, R, PG, and PG-13 rated films represent r round(sum(head(edaUni$mpaa$stats %>% arrange(desc(Proportion)) %>% select(Proportion), 3)) * 100, -1)
% of the films in the sample. NC-17 films were excluded from this analysis.
edaUni$mpaa$plot
r kfigr::figr(label = "mpaa", prefix = TRUE, link = TRUE, type="Figure")
: Films by MPAA Rating
Since the number of films per MPAA rating exceeded the minimum required for statistical inference, this variable was included in the modeling process.
Though the plurality of features in the sample (r sum(edaUni$month$stats$Proportion[c(1,6,10,12)])*100
%) were released during the months of January, June, October and December, the distribution of theatrical release months appeared fairly balanced within the sample.
edaUni$month$plot
r kfigr::figr(label = "month", prefix = TRUE, link = TRUE, type="Figure")
: Theatrical releases by month
Since the number of films per month of release exceeded the minimum required for statistical inference, this variable was included in the modeling process.
This variable captured the audience scores from Rotten Tomatoes for each film.
r kfigr::figr(label = "audience_scores_stats", prefix = TRUE, link = TRUE, type="Table")
: Audience score summary statistics
knitr::kable(edaUni$audienceScores$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$audienceScores$hist, edaUni$audienceScores$qq, ncol = 2)
r kfigr::figr(label = "audience_scores_dist", prefix = TRUE, link = TRUE, type="Figure")
: Audience score histogram and QQ Plot
edaUni$audienceScores$box
r kfigr::figr(label = "audience_scores_box", prefix = TRUE, link = TRUE, type="Figure")
: Audience score boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "audience_scores_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$audienceScores$central
Dispersion: r edaUni$audienceScores$disp
Shape of Distribution: r edaUni$audienceScores$skew
r edaUni$audienceScores$kurt
The histogram and QQ plot in r kfigr::figr(label = "audience_scores_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a left skewed distribution that departs from normality. This variable will be retained and monitored for the distribution of errors vis-a-vis the dependent variable.
Outliers: The boxplot in r kfigr::figr(label = "audience_scores_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$audienceScores$outliers) == 0, "no", " ")
outliers were extant. r edaUni$audienceScores$out
Box office revenue was obtained for a subset of r nrow(mdb[complete.cases(mdb),])
randomly selected films from the sample. This analysis is based upon r
r nrow(train[complete.cases(train),])` cases from the training set.
r kfigr::figr(label = "box_office_stats", prefix = TRUE, link = TRUE, type="Table")
: Box office revenue summary statistics
knitr::kable(edaUni$boxOffice$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$boxOffice$hist, edaUni$boxOffice$qq, ncol = 2)
r kfigr::figr(label = "box_office_dist", prefix = TRUE, link = TRUE, type="Figure")
: Box office revenue histogram and QQ Plot
edaUni$boxOffice$box
r kfigr::figr(label = "box_office_box", prefix = TRUE, link = TRUE, type="Figure")
: Box office revenue boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "box_office_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$boxOffice$central
Dispersion: r edaUni$boxOffice$disp
Shape of Distribution: r edaUni$boxOffice$skew
r edaUni$boxOffice$kurt
The histogram and QQ plot in r kfigr::figr(label = "box_office_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a left skewed distribution that departs significantly from normality. A log transformation was performed and is evaluated below.
Outliers: The boxplot in r kfigr::figr(label = "box_office_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$boxOffice$outliers) == 0, "no", " ")
outliers were extant. r edaUni$boxOffice$out
This is a log transformation of the box office variable.
r kfigr::figr(label = "box_office_log_stats", prefix = TRUE, link = TRUE, type="Table")
: Log box office revenue summary statistics
knitr::kable(edaUni$boxOfficeLog$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$boxOfficeLog$hist, edaUni$boxOfficeLog$qq, ncol = 2)
r kfigr::figr(label = "box_office_log_dist", prefix = TRUE, link = TRUE, type="Figure")
: Log box office revenue histogram and QQ Plot
edaUni$boxOfficeLog$box
r kfigr::figr(label = "box_office_log_box", prefix = TRUE, link = TRUE, type="Figure")
: Log box office revenue boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "box_office_log_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$boxOfficeLog$central
Dispersion: r edaUni$boxOfficeLog$disp
Shape of Distribution: r edaUni$boxOfficeLog$skew
r edaUni$boxOfficeLog$kurt
The histogram and QQ plot in r kfigr::figr(label = "box_office_log_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a left skewed distribution that approximates normality.
Outliers: The boxplot in r kfigr::figr(label = "box_office_log_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$boxOfficeLog$outliers) == 0, "no", " ")
outliers were extant. r edaUni$boxOfficeLog$out
This variable captured the total number of scores allocated to each cast member for a film.
r kfigr::figr(label = "cast_scores_stats", prefix = TRUE, link = TRUE, type="Table")
: Cast scores summary statistics
knitr::kable(edaUni$castScores$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$castScores$hist, edaUni$castScores$qq, ncol = 2)
r kfigr::figr(label = "cast_scores_dist", prefix = TRUE, link = TRUE, type="Figure")
: Cast scores histogram and QQ Plot
edaUni$castScores$box
r kfigr::figr(label = "cast_scores_box", prefix = TRUE, link = TRUE, type="Figure")
: Cast scores boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "cast_scores_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$castScores$central
Dispersion: r edaUni$castScores$disp
Shape of Distribution: r edaUni$castScores$skew
r edaUni$castScores$kurt
The histogram and QQ plot in r kfigr::figr(label = "cast_scores_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a rigth skewed distribution that approximates normality.
Outliers: The boxplot in r kfigr::figr(label = "cast_scores_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$castScores$outliers) == 0, "no", " ")
outliers were extant. r edaUni$castScores$out
This variable captured the critics scores for each film
r kfigr::figr(label = "critics_scores_stats", prefix = TRUE, link = TRUE, type="Table")
: Critics score summary statistics
knitr::kable(edaUni$criticsScores$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$criticsScores$hist, edaUni$criticsScores$qq, ncol = 2)
r kfigr::figr(label = "critics_scores_dist", prefix = TRUE, link = TRUE, type="Figure")
: Critics score histogram and QQ Plot
edaUni$criticsScores$box
r kfigr::figr(label = "critics_scores_box", prefix = TRUE, link = TRUE, type="Figure")
: Critics score boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "critics_scores_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$criticsScores$central
Dispersion: r edaUni$criticsScores$disp
Shape of Distribution: r edaUni$criticsScores$skew
r edaUni$criticsScores$kurt
The histogram and QQ plot in r kfigr::figr(label = "critics_scores_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a left skewed distribution that departs from normality. It will be retained for observation of errors vis-a-vis the dependent variable.
Outliers: The boxplot in r kfigr::figr(label = "critics_scores_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$criticsScores$outliers) == 0, "no", " ")
outliers were extant. r edaUni$criticsScores$out
This variable captured the number of IMDb votes cast for each film.
r kfigr::figr(label = "imdb_votes_stats", prefix = TRUE, link = TRUE, type="Table")
: IMDb votes summary statistics
knitr::kable(edaUni$imdbVotes$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$imdbVotes$hist, edaUni$imdbVotes$qq, ncol = 2)
r kfigr::figr(label = "imdb_votes_dist", prefix = TRUE, link = TRUE, type="Figure")
: IMDb votes histogram and QQ Plot
edaUni$imdbVotes$box
r kfigr::figr(label = "imdb_votes_box", prefix = TRUE, link = TRUE, type="Figure")
: IMDb votes boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_votes_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$imdbVotes$central
Dispersion: r edaUni$imdbVotes$disp
Shape of Distribution: r edaUni$imdbVotes$skew
r edaUni$imdbVotes$kurt
The histogram and QQ plot in r kfigr::figr(label = "imdb_votes_dist", prefix = TRUE, link = TRUE, type="Figure")
reveal a distribution which departs significantly from normality.
Outliers: The boxplot in r kfigr::figr(label = "imdb_votes_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$imdbVotes$outliers) == 0, "no", " ")
outliers were extant. r edaUni$imdbVotes$out
This was a log transformation of the IMDb votes variable.
r kfigr::figr(label = "imdb_votes_log_stats", prefix = TRUE, link = TRUE, type="Table")
: Log IMDb votes summary statistics
knitr::kable(edaUni$imdbVotesLog$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$imdbVotesLog$hist, edaUni$imdbVotesLog$qq, ncol = 2)
r kfigr::figr(label = "imdb_votes_log_dist", prefix = TRUE, link = TRUE, type="Figure")
: Log IMDb votes histogram and QQ Plot
edaUni$imdbVotesLog$box
r kfigr::figr(label = "imdb_votes_log_box", prefix = TRUE, link = TRUE, type="Figure")
: Log IMDb votes boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_votes_log_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$imdbVotesLog$central
Dispersion: r edaUni$imdbVotesLog$disp
Shape of Distribution: r edaUni$imdbVotesLog$skew
r edaUni$imdbVotesLog$kurt
The histogram and QQ plot in r kfigr::figr(label = "imdb_votes_log_dist", prefix = TRUE, link = TRUE, type="Figure")
reveal a nearly normal distribution.
Outliers: The boxplot in r kfigr::figr(label = "imdb_votes_log_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$imdbVotesLog$outliers) == 0, "no", " ")
outliers were extant. r edaUni$imdbVotesLog$out
This variable captured the IMDb rating for each film
r kfigr::figr(label = "imdb_rating_stats", prefix = TRUE, link = TRUE, type="Table")
: IMDb rating summary statistics
knitr::kable(edaUni$imdbRating$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$imdbRating$hist, edaUni$imdbRating$qq, ncol = 2)
r kfigr::figr(label = "imdb_rating_dist", prefix = TRUE, link = TRUE, type="Figure")
: IMDb rating histogram and QQ Plot
edaUni$imdbRating$box
r kfigr::figr(label = "imdb_rating_box", prefix = TRUE, link = TRUE, type="Figure")
: IMDb rating boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_rating_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$imdbRating$central
Dispersion: r edaUni$imdbRating$disp
Shape of Distribution: r edaUni$imdbRating$skew
r edaUni$imdbRating$kurt
The histogram and QQ plot in r kfigr::figr(label = "imdb_rating_dist", prefix = TRUE, link = TRUE, type="Figure")
reveal a nearly normal distribution.
Outliers: The boxplot in r kfigr::figr(label = "imdb_rating_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$imdbRating$outliers) == 0, "no", " ")
outliers were extant. r edaUni$imdbRating$out
This is an analysis of moving runtimes.
r kfigr::figr(label = "runtime_stats", prefix = TRUE, link = TRUE, type="Table")
: Runtime summary statistics
knitr::kable(edaUni$runtime$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$runtime$hist, edaUni$runtime$qq, ncol = 2)
r kfigr::figr(label = "runtime_dist", prefix = TRUE, link = TRUE, type="Figure")
: Runtime histogram and QQ Plot
edaUni$runtime$box
r kfigr::figr(label = "runtime_box", prefix = TRUE, link = TRUE, type="Figure")
: Runtime boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "runtime_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$runtime$central
Dispersion: r edaUni$runtime$disp
Shape of Distribution: r edaUni$runtime$skew
r edaUni$runtime$kurt
The histogram and QQ plot in r kfigr::figr(label = "runtime_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a left skewed distribution that appears reasonably normal.
Outliers: The boxplot in r kfigr::figr(label = "runtime_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$runtime$outliers) == 0, "no", " ")
outliers were extant. r edaUni$runtime$out
This is an analysis of the log of moving runtimes.
r kfigr::figr(label = "runtime_log_stats", prefix = TRUE, link = TRUE, type="Table")
: Log runtime summary statistics
knitr::kable(edaUni$runtimeLog$stats, digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$runtimeLog$hist, edaUni$runtimeLog$qq, ncol = 2)
r kfigr::figr(label = "runtime_log_dist", prefix = TRUE, link = TRUE, type="Figure")
: Log runtime histogram and QQ Plot
edaUni$runtimeLog$box
r kfigr::figr(label = "runtime_log_box", prefix = TRUE, link = TRUE, type="Figure")
: Log runtime boxplot
Central Tendency: The summary statistics (r kfigr::figr(label = "runtime_log_stats", prefix = TRUE, link = TRUE, type="Table")
) r edaUni$runtimeLog$central
Dispersion: r edaUni$runtimeLog$disp
Shape of Distribution: r edaUni$runtimeLog$skew
r edaUni$runtimeLog$kurt
The histogram and QQ plot in r kfigr::figr(label = "runtime_log_dist", prefix = TRUE, link = TRUE, type="Figure")
reveals a slightly right skewed distribution that approximates normality.
Outliers: The boxplot in r kfigr::figr(label = "runtime_log_box", prefix = TRUE, link = TRUE, type="Figure")
, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$runtimeLog$outliers) == 0, "no", " ")
outliers were extant. r edaUni$runtimeLog$out
As mentioned above, the first objective was to identify an available variable that would proxy for box office success.
r kfigr::figr(label = "correlation1", prefix = TRUE, link = TRUE, type="Table")
: Variables most highly correlated with the log box office revenue
mData <- process(mdb2, stage = "a", y = "box_office_log") rLogBoxOffice <- correlate(x = mData$numeric, y = mData$dependent, yLab = "Log Box Office", plot = FALSE) save(x = rLogBoxOffice, file = '../data/rLogBoxOffice.Rdata') knitr::kable(head(rLogBoxOffice$tests, 50), digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
Pearson product-moment correlation coefficients were computed to determine which of the available variables most highly correlated with the log of box office revenue. r kfigr::figr(label = "correlation1", prefix = TRUE, link = TRUE, type="Table")
reveals the log number of IMDb votes, r = r rLogBoxOffice$tests$Correlation[1]
, n = r nrow(mData$numeric)
, p = r rLogBoxOffice$tests[1,5]
as having the highest correlation with log box office revenue. A scatterplot summarizes the results (r kfigr::figr(label = "scatterplot", prefix = TRUE, link = TRUE, type="Figure")
) Overall, there was a strong, positive correlation between the log number of IMDb votes and the log box office.
plotData <- data.frame(y = mData$dependent, x = mData$numeric$imdb_num_votes_log) plotScatter(data = plotData, xLab = "Log IMDb Votes", yLab = "Log Box Office")
r kfigr::figr(label = "scatterplot", prefix = TRUE, link = TRUE, type="Figure")
: Log Box Office vs. Log IMDb Votes
Having designated the log number of IMDb votes as the dependent variable, an analysis was conducted on the training set to determine the correlations between available predictors and log IMDb votes.
r kfigr::figr(label = "correlations", prefix = TRUE, link = TRUE, type="Table")
: Predictor correlations with log IMDb votes
mData <- process(train, stage = "b", y = "imdb_num_votes_log") rLogIMDbVotes <- correlate(x = mData$numeric, y = mData$dependent, yLab = "Log IMDb Votes", plot = FALSE) save(x = rLogIMDbVotes, file = '../data/rLogIMDbVotes.Rdata') knitr::kable(head(rLogIMDbVotes$tests, 10), digits = 2) %>% kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
r kfigr::figr(label = "correlations", prefix = TRUE, link = TRUE, type="Table")
summarizes the results of several Pearson product-moment correlation tests. Indicators of cast popularity in terms of votes and scores were the most highly correlated predictors of log IMDb votes. The relatively high correlation with runtime was unexpected.
The next section describes two linear models: (1) a multiregression model to predict log IMDb votes, and (2) a simple linear regression model to predict log box office revenue based upon the log of IMDb votes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.