load(file = '../data/train.Rdata')
load(file = '../data/test.Rdata')
load(file = '../data/mdb2.Rdata')
load(file = '../data/genres.Rdata')
edaUni <- univariate(mdb = train)

Part 4: Exploratory Data Analysis

This exploratory univariate and bivariate data analysis was to examine the characteristics and attributes of the data before including the variables in the multiple regression models, to confirm that categorical variables met the success/failure condition at each level, to review the distributions of quantitative variables, to identify and inspect outliers, and to measure the correlations among the variables to be included in regression models.

Univariate Analysis

Best Director / Actor / Actress

As indicated in r kfigr::figr(label = "best_actor", prefix = TRUE, link = TRUE, type="Figure"), the percentages of films with best director, actor and actress oscars were r edaUni$bestDirWin$stats$Proportion[2] * 100%, r edaUni$bestActorWin$stats$Proportion[2] * 100%, and r edaUni$bestActressWin$stats$Proportion[2] * 100%, respectively.

gridExtra::grid.arrange(edaUni$bestDirWin$plot, edaUni$bestActorWin$plot, edaUni$bestActressWin$plot, ncol = 3)

r kfigr::figr(label = "best_actor", prefix = TRUE, link = TRUE, type="Figure"): Best director/actor/actress

The decision in this case was to assume that with random sampling, these ratios reflected the true population proportions. Since the minimum required number of observations required for statistical inference was met at each categorical level, these variables were retained for further analysis and modeling.

Best Picture

The proportions of oscar nominations and winnings are summarized in r kfigr::figr(label = "best_picture", prefix = TRUE, link = TRUE, type="Figure").

gridExtra::grid.arrange(edaUni$bestPicNom$plot, edaUni$bestPicWin$plot, ncol = 2)

r kfigr::figr(label = "best_picture", prefix = TRUE, link = TRUE, type="Figure"): Best picture nominations and wins

The severe inbalance in the observations at each level was troubling; however, random sampling suggests that the sample proportions are within a reasonable margin of error of the true population. The variable would be included in further analysis without prejudice.

Genre

The drama genre represented a plurality of the releases in the sample, followed by comedy action & adventure then mystery & suspense. The top four genres account for nearly r round(sum(head(edaUni$genre$stats %>% arrange(desc(Proportion)) %>% select(Proportion), 4)) * 100, -1)% of the films in the sample.

edaUni$genre$plot

r kfigr::figr(label = "genre", prefix = TRUE, link = TRUE, type="Figure"): Films by genre

sampleProportions <- train %>% filter(genre != "Art House & International") %>% group_by(genre) %>% summarize(N = n()) %>% 
  mutate(p = N /sum(N)) %>% select(genre, p)  %>% arrange(genre)
popProportions <- genres %>% select(genre, proportion) %>% arrange(genre)
t <- chisq.test(x = sampleProportions$p, popProportions$proportion)

A chi-square test for goodness-of-fit was conducted to determine the extent to which the sample proportions modeled the assumed population proportions obtained from the IMDb site. The hypotheses were as follows:
$H_0$: $p = \hat{p}$
$H_a$: $p \neq \hat{p}$
where:
$p$ is the population proportions of films by genre $\hat{p}$ is the sample proportions of films by genre

The results were not significant, $\chi^2$(r t$parameter, N = r length(sampleProportions$p)) = r round(t$statistic, 2), $p$ = r round(t$p.value, 2), as such one can infer that the sample proportions of films by genre models the population proportion.

MPAA Rating

Rated R films accounted for over r round(edaUni$mpaa$stats$Proportion[5] * 100, -1)% of the releases, followed by PG and PG-13. Collectively, R, PG, and PG-13 rated films represent r round(sum(head(edaUni$mpaa$stats %>% arrange(desc(Proportion)) %>% select(Proportion), 3)) * 100, -1)% of the films in the sample. NC-17 films were excluded from this analysis.

edaUni$mpaa$plot

r kfigr::figr(label = "mpaa", prefix = TRUE, link = TRUE, type="Figure"): Films by MPAA Rating

Since the number of films per MPAA rating exceeded the minimum required for statistical inference, this variable was included in the modeling process.

Theatrical Release Month

Though the plurality of features in the sample (r sum(edaUni$month$stats$Proportion[c(1,6,10,12)])*100%) were released during the months of January, June, October and December, the distribution of theatrical release months appeared fairly balanced within the sample.

edaUni$month$plot

r kfigr::figr(label = "month", prefix = TRUE, link = TRUE, type="Figure"): Theatrical releases by month

Since the number of films per month of release exceeded the minimum required for statistical inference, this variable was included in the modeling process.

Audience Scores

This variable captured the audience scores from Rotten Tomatoes for each film.

r kfigr::figr(label = "audience_scores_stats", prefix = TRUE, link = TRUE, type="Table"): Audience score summary statistics

knitr::kable(edaUni$audienceScores$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$audienceScores$hist, edaUni$audienceScores$qq, ncol = 2)

r kfigr::figr(label = "audience_scores_dist", prefix = TRUE, link = TRUE, type="Figure"): Audience score histogram and QQ Plot

edaUni$audienceScores$box

r kfigr::figr(label = "audience_scores_box", prefix = TRUE, link = TRUE, type="Figure"): Audience score boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "audience_scores_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$audienceScores$central

Dispersion: r edaUni$audienceScores$disp

Shape of Distribution: r edaUni$audienceScores$skew r edaUni$audienceScores$kurt The histogram and QQ plot in r kfigr::figr(label = "audience_scores_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a left skewed distribution that departs from normality. This variable will be retained and monitored for the distribution of errors vis-a-vis the dependent variable.

Outliers: The boxplot in r kfigr::figr(label = "audience_scores_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$audienceScores$outliers) == 0, "no", " ") outliers were extant. r edaUni$audienceScores$out

Box Office

Box office revenue was obtained for a subset of r nrow(mdb[complete.cases(mdb),]) randomly selected films from the sample. This analysis is based upon rr nrow(train[complete.cases(train),])` cases from the training set.

r kfigr::figr(label = "box_office_stats", prefix = TRUE, link = TRUE, type="Table"): Box office revenue summary statistics

knitr::kable(edaUni$boxOffice$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$boxOffice$hist, edaUni$boxOffice$qq, ncol = 2)

r kfigr::figr(label = "box_office_dist", prefix = TRUE, link = TRUE, type="Figure"): Box office revenue histogram and QQ Plot

edaUni$boxOffice$box

r kfigr::figr(label = "box_office_box", prefix = TRUE, link = TRUE, type="Figure"): Box office revenue boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "box_office_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$boxOffice$central

Dispersion: r edaUni$boxOffice$disp

Shape of Distribution: r edaUni$boxOffice$skew r edaUni$boxOffice$kurt The histogram and QQ plot in r kfigr::figr(label = "box_office_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a left skewed distribution that departs significantly from normality. A log transformation was performed and is evaluated below.

Outliers: The boxplot in r kfigr::figr(label = "box_office_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$boxOffice$outliers) == 0, "no", " ") outliers were extant. r edaUni$boxOffice$out

Box Office (Log)

This is a log transformation of the box office variable.

r kfigr::figr(label = "box_office_log_stats", prefix = TRUE, link = TRUE, type="Table"): Log box office revenue summary statistics

knitr::kable(edaUni$boxOfficeLog$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$boxOfficeLog$hist, edaUni$boxOfficeLog$qq, ncol = 2)

r kfigr::figr(label = "box_office_log_dist", prefix = TRUE, link = TRUE, type="Figure"): Log box office revenue histogram and QQ Plot

edaUni$boxOfficeLog$box

r kfigr::figr(label = "box_office_log_box", prefix = TRUE, link = TRUE, type="Figure"): Log box office revenue boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "box_office_log_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$boxOfficeLog$central

Dispersion: r edaUni$boxOfficeLog$disp

Shape of Distribution: r edaUni$boxOfficeLog$skew r edaUni$boxOfficeLog$kurt The histogram and QQ plot in r kfigr::figr(label = "box_office_log_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a left skewed distribution that approximates normality.

Outliers: The boxplot in r kfigr::figr(label = "box_office_log_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$boxOfficeLog$outliers) == 0, "no", " ") outliers were extant. r edaUni$boxOfficeLog$out

Cast Scores

This variable captured the total number of scores allocated to each cast member for a film.

r kfigr::figr(label = "cast_scores_stats", prefix = TRUE, link = TRUE, type="Table"): Cast scores summary statistics

knitr::kable(edaUni$castScores$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$castScores$hist, edaUni$castScores$qq, ncol = 2)

r kfigr::figr(label = "cast_scores_dist", prefix = TRUE, link = TRUE, type="Figure"): Cast scores histogram and QQ Plot

edaUni$castScores$box

r kfigr::figr(label = "cast_scores_box", prefix = TRUE, link = TRUE, type="Figure"): Cast scores boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "cast_scores_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$castScores$central

Dispersion: r edaUni$castScores$disp

Shape of Distribution: r edaUni$castScores$skew r edaUni$castScores$kurt The histogram and QQ plot in r kfigr::figr(label = "cast_scores_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a rigth skewed distribution that approximates normality.

Outliers: The boxplot in r kfigr::figr(label = "cast_scores_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$castScores$outliers) == 0, "no", " ") outliers were extant. r edaUni$castScores$out

Critics Scores

This variable captured the critics scores for each film

r kfigr::figr(label = "critics_scores_stats", prefix = TRUE, link = TRUE, type="Table"): Critics score summary statistics

knitr::kable(edaUni$criticsScores$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$criticsScores$hist, edaUni$criticsScores$qq, ncol = 2)

r kfigr::figr(label = "critics_scores_dist", prefix = TRUE, link = TRUE, type="Figure"): Critics score histogram and QQ Plot

edaUni$criticsScores$box

r kfigr::figr(label = "critics_scores_box", prefix = TRUE, link = TRUE, type="Figure"): Critics score boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "critics_scores_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$criticsScores$central

Dispersion: r edaUni$criticsScores$disp

Shape of Distribution: r edaUni$criticsScores$skew r edaUni$criticsScores$kurt The histogram and QQ plot in r kfigr::figr(label = "critics_scores_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a left skewed distribution that departs from normality. It will be retained for observation of errors vis-a-vis the dependent variable.

Outliers: The boxplot in r kfigr::figr(label = "critics_scores_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$criticsScores$outliers) == 0, "no", " ") outliers were extant. r edaUni$criticsScores$out

IMDb Number of Votes

This variable captured the number of IMDb votes cast for each film.

r kfigr::figr(label = "imdb_votes_stats", prefix = TRUE, link = TRUE, type="Table"): IMDb votes summary statistics

knitr::kable(edaUni$imdbVotes$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$imdbVotes$hist, edaUni$imdbVotes$qq, ncol = 2)

r kfigr::figr(label = "imdb_votes_dist", prefix = TRUE, link = TRUE, type="Figure"): IMDb votes histogram and QQ Plot

edaUni$imdbVotes$box

r kfigr::figr(label = "imdb_votes_box", prefix = TRUE, link = TRUE, type="Figure"): IMDb votes boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_votes_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$imdbVotes$central

Dispersion: r edaUni$imdbVotes$disp

Shape of Distribution: r edaUni$imdbVotes$skew r edaUni$imdbVotes$kurt The histogram and QQ plot in r kfigr::figr(label = "imdb_votes_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a distribution which departs significantly from normality.

Outliers: The boxplot in r kfigr::figr(label = "imdb_votes_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$imdbVotes$outliers) == 0, "no", " ") outliers were extant. r edaUni$imdbVotes$out

IMDb Number of Votes (Log)

This was a log transformation of the IMDb votes variable.

r kfigr::figr(label = "imdb_votes_log_stats", prefix = TRUE, link = TRUE, type="Table"): Log IMDb votes summary statistics

knitr::kable(edaUni$imdbVotesLog$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$imdbVotesLog$hist, edaUni$imdbVotesLog$qq, ncol = 2)

r kfigr::figr(label = "imdb_votes_log_dist", prefix = TRUE, link = TRUE, type="Figure"): Log IMDb votes histogram and QQ Plot

edaUni$imdbVotesLog$box

r kfigr::figr(label = "imdb_votes_log_box", prefix = TRUE, link = TRUE, type="Figure"): Log IMDb votes boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_votes_log_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$imdbVotesLog$central

Dispersion: r edaUni$imdbVotesLog$disp

Shape of Distribution: r edaUni$imdbVotesLog$skew r edaUni$imdbVotesLog$kurt The histogram and QQ plot in r kfigr::figr(label = "imdb_votes_log_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a nearly normal distribution.

Outliers: The boxplot in r kfigr::figr(label = "imdb_votes_log_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$imdbVotesLog$outliers) == 0, "no", " ") outliers were extant. r edaUni$imdbVotesLog$out

IMDb Ratings

This variable captured the IMDb rating for each film

r kfigr::figr(label = "imdb_rating_stats", prefix = TRUE, link = TRUE, type="Table"): IMDb rating summary statistics

knitr::kable(edaUni$imdbRating$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$imdbRating$hist, edaUni$imdbRating$qq, ncol = 2)

r kfigr::figr(label = "imdb_rating_dist", prefix = TRUE, link = TRUE, type="Figure"): IMDb rating histogram and QQ Plot

edaUni$imdbRating$box

r kfigr::figr(label = "imdb_rating_box", prefix = TRUE, link = TRUE, type="Figure"): IMDb rating boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_rating_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$imdbRating$central

Dispersion: r edaUni$imdbRating$disp

Shape of Distribution: r edaUni$imdbRating$skew r edaUni$imdbRating$kurt The histogram and QQ plot in r kfigr::figr(label = "imdb_rating_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a nearly normal distribution.

Outliers: The boxplot in r kfigr::figr(label = "imdb_rating_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$imdbRating$outliers) == 0, "no", " ") outliers were extant. r edaUni$imdbRating$out

Runtime

This is an analysis of moving runtimes.

r kfigr::figr(label = "runtime_stats", prefix = TRUE, link = TRUE, type="Table"): Runtime summary statistics

knitr::kable(edaUni$runtime$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$runtime$hist, edaUni$runtime$qq, ncol = 2)

r kfigr::figr(label = "runtime_dist", prefix = TRUE, link = TRUE, type="Figure"): Runtime histogram and QQ Plot

edaUni$runtime$box

r kfigr::figr(label = "runtime_box", prefix = TRUE, link = TRUE, type="Figure"): Runtime boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "runtime_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$runtime$central

Dispersion: r edaUni$runtime$disp

Shape of Distribution: r edaUni$runtime$skew r edaUni$runtime$kurt The histogram and QQ plot in r kfigr::figr(label = "runtime_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a left skewed distribution that appears reasonably normal.

Outliers: The boxplot in r kfigr::figr(label = "runtime_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$runtime$outliers) == 0, "no", " ") outliers were extant. r edaUni$runtime$out

Runtime (Log)

This is an analysis of the log of moving runtimes.

r kfigr::figr(label = "runtime_log_stats", prefix = TRUE, link = TRUE, type="Table"): Log runtime summary statistics

knitr::kable(edaUni$runtimeLog$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$runtimeLog$hist, edaUni$runtimeLog$qq, ncol = 2)

r kfigr::figr(label = "runtime_log_dist", prefix = TRUE, link = TRUE, type="Figure"): Log runtime histogram and QQ Plot

edaUni$runtimeLog$box

r kfigr::figr(label = "runtime_log_box", prefix = TRUE, link = TRUE, type="Figure"): Log runtime boxplot

Central Tendency: The summary statistics (r kfigr::figr(label = "runtime_log_stats", prefix = TRUE, link = TRUE, type="Table")) r edaUni$runtimeLog$central

Dispersion: r edaUni$runtimeLog$disp

Shape of Distribution: r edaUni$runtimeLog$skew r edaUni$runtimeLog$kurt The histogram and QQ plot in r kfigr::figr(label = "runtime_log_dist", prefix = TRUE, link = TRUE, type="Figure") reveals a slightly right skewed distribution that approximates normality.

Outliers: The boxplot in r kfigr::figr(label = "runtime_log_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$runtimeLog$outliers) == 0, "no", " ") outliers were extant. r edaUni$runtimeLog$out

Bivariate Analysis

Dependent Variable

As mentioned above, the first objective was to identify an available variable that would proxy for box office success.

r kfigr::figr(label = "correlation1", prefix = TRUE, link = TRUE, type="Table"): Variables most highly correlated with the log box office revenue

mData <- process(mdb2, stage = "a", y = "box_office_log")
rLogBoxOffice <- correlate(x = mData$numeric, y = mData$dependent, yLab = "Log Box Office", plot = FALSE)
save(x = rLogBoxOffice, file = '../data/rLogBoxOffice.Rdata')
knitr::kable(head(rLogBoxOffice$tests, 50), digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Pearson product-moment correlation coefficients were computed to determine which of the available variables most highly correlated with the log of box office revenue. r kfigr::figr(label = "correlation1", prefix = TRUE, link = TRUE, type="Table") reveals the log number of IMDb votes, r = r rLogBoxOffice$tests$Correlation[1], n = r nrow(mData$numeric), p = r rLogBoxOffice$tests[1,5] as having the highest correlation with log box office revenue. A scatterplot summarizes the results (r kfigr::figr(label = "scatterplot", prefix = TRUE, link = TRUE, type="Figure")) Overall, there was a strong, positive correlation between the log number of IMDb votes and the log box office.

plotData <- data.frame(y = mData$dependent, x = mData$numeric$imdb_num_votes_log)
plotScatter(data = plotData, xLab = "Log IMDb Votes", yLab = "Log Box Office")

r kfigr::figr(label = "scatterplot", prefix = TRUE, link = TRUE, type="Figure"): Log Box Office vs. Log IMDb Votes

Predictors

Having designated the log number of IMDb votes as the dependent variable, an analysis was conducted on the training set to determine the correlations between available predictors and log IMDb votes.

r kfigr::figr(label = "correlations", prefix = TRUE, link = TRUE, type="Table"): Predictor correlations with log IMDb votes

mData <- process(train, stage = "b", y = "imdb_num_votes_log")
rLogIMDbVotes  <- correlate(x = mData$numeric, y = mData$dependent, yLab = "Log IMDb Votes", plot = FALSE)
save(x = rLogIMDbVotes, file = '../data/rLogIMDbVotes.Rdata')
knitr::kable(head(rLogIMDbVotes$tests, 10), digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

r kfigr::figr(label = "correlations", prefix = TRUE, link = TRUE, type="Table") summarizes the results of several Pearson product-moment correlation tests. Indicators of cast popularity in terms of votes and scores were the most highly correlated predictors of log IMDb votes. The relatively high correlation with runtime was unexpected.

The next section describes two linear models: (1) a multiregression model to predict log IMDb votes, and (2) a simple linear regression model to predict log box office revenue based upon the log of IMDb votes.




DataScienceSalon/mdb documentation built on May 28, 2019, 12:23 p.m.