Exploratory Data Analysis

This exploratory data analysis (EDA) comprises three parts: (1) the univariate analysis, which examines the frequencies and proportions of categorical variables and the centrality, variability, and shape of the distributions of quantitative variables; (2) the bivariate analysis, which explores the relationships between the independent variables and audience scores; and (3) an association/correlation analysis, which reveals any potential collinearity arising from the relationships among the independent variables.

Univariate Analysis

Qualitative Analysis

Summary Variables

r kfigr::figr(label = "films_summary", prefix = TRUE, link = TRUE, type="Figure") shows the counts and percentages for three summary indicator variables: drama, feature film, and MPAA R rating. Dramas constituted nearly half of the observations in the sample. A plurality of films were R rated, and nearly all films surveyed were feature films.

edaUni$qual$summary$plot

r kfigr::figr(label = "films_summary", prefix = TRUE, link = TRUE, type="Figure"): Drama, features and R Rated Films

Film Performance Variables

r kfigr::figr(label = "performance", prefix = TRUE, link = TRUE, type="Figure") summarizes the counts and proportions of films in the sample that have achieved notability at the box office or with the Academy. Best Picture winners, top 200 box office earners, and Best Picture nominees accounted for roughly one, two, and three percent of the films, respectively. Films earning Best Director, Best Actress, and Best Actor Oscars were slightly less rarefied at 7%, 11%, and 14% of the sample, respectively.

edaUni$qual$performance$plot

r kfigr::figr(label = "performance", prefix = TRUE, link = TRUE, type="Figure"): Oscar Awards and Top 200 Box Office Class

Theatrical Release Season

The Oscar season, which starts in October and lasts until 31 December, marks the period in which Hollywood studios release their more critically acclaimed films. As indicated in r kfigr::figr(label = "season", prefix = TRUE, link = TRUE, type="Figure"), approximately 30% of films in the dataset were released during the Oscar season.

edaUni$qual$season$plot

r kfigr::figr(label = "season", prefix = TRUE, link = TRUE, type="Figure"): Oscar and Summer Season Releases

The summer season, which starts the first weekend of May and ends on Labor Day, accounts for a disproportionate share of Hollywood studios' annual box office revenue. Some 32% of the feature films in the dataset were released during the summer months.
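The two season indicators could be derived from release dates roughly as follows. This is a minimal sketch, not the package's own preprocessing: the `release` vector is made up, and the summer season is approximated by whole months (May through August) rather than by the exact first-weekend-of-May and Labor Day boundaries described above.

```r
# Hypothetical release dates; the real dataset's column names may differ.
release <- as.Date(c("2006-05-05", "2007-09-10", "2010-11-19",
                     "2012-07-20", "1999-03-31"))

# Month of release as an integer from 1 to 12
month <- as.integer(format(release, "%m"))

# Oscar season: 1 October through 31 December
oscar_season <- month >= 10

# Summer season, approximated by whole months (May through August).
# Exact bounds would need day-level logic for each year.
summer_season <- month >= 5 & month <= 8

data.frame(release, oscar_season, summer_season)

# Share of films released in each season
mean(oscar_season)
mean(summer_season)
```

The mean of a logical vector is the proportion of `TRUE` values, which is how shares such as the 30% and 32% quoted above would be computed.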

Year of Theatrical Release

# Films per release year: fewest, mean, median, most, and the final year (2014)
years <- edaUni$qual$year$data
low <- years %>% filter(N == min(N))
ave <- years %>% summarize(Mean = mean(N))
med <- years %>% summarize(Median = median(N))
high <- years %>% filter(N == max(N))
last <- years %>% filter(Category == "2014")

The dataset contained r nrow(preprocessed) feature films released between 1970 and 2014. As presented in r kfigr::figr(label = "year", prefix = TRUE, link = TRUE, type="Figure"), the number of films in the sample by year of release tended to grow roughly linearly from r low$N film in 1970 to a peak of approximately r high$N[1] films in 2006 and 2007, then declined, settling at r last$N films in 2014. The number of films per year had a mean of r round(ave$Mean,1) and a median of r med$Median.

edaUni$qual$year$plot

r kfigr::figr(label = "year", prefix = TRUE, link = TRUE, type="Figure"): Theatrical Releases by Year

Quantitative Analysis

Critics Score

Moving on to the quantitative variables, the critics score, ranging from 1 to 100, was obtained from the Rotten Tomatoes website. Its summary statistics are reported in r kfigr::figr(label = "critics_score_stats", prefix = TRUE, link = TRUE, type="Table").

r kfigr::figr(label = "critics_score_stats", prefix = TRUE, link = TRUE, type="Table"): Critics score summary statistics

knitr::kable(edaUni$quant$critics_score$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The distribution of critics scores shown in r kfigr::figr(label = "critics_score_dist", prefix = TRUE, link = TRUE, type="Figure") and r kfigr::figr(label = "critics_score_box", prefix = TRUE, link = TRUE, type="Figure") departs substantially from normality. That said, Bayesian inference does not rely upon an assumption of normality with respect to the distribution of predictors.

gridExtra::grid.arrange(edaUni$quant$critics_score$hist, edaUni$quant$critics_score$qq, ncol = 2)

r kfigr::figr(label = "critics_score_dist", prefix = TRUE, link = TRUE, type="Figure"): Critics score histogram and QQ Plot

edaUni$quant$critics_score$box

r kfigr::figr(label = "critics_score_box", prefix = TRUE, link = TRUE, type="Figure"): Critics score box plot

Central Tendency: r kfigr::figr(label = "critics_score_stats", prefix = TRUE, link = TRUE, type="Table") reports that r edaUni$quant$critics_score$central

Dispersion: r edaUni$quant$critics_score$disp

Shape of Distribution: r edaUni$quant$critics_score$skew r edaUni$quant$critics_score$kurt The histogram and QQ plot in r kfigr::figr(label = "critics_score_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a left skewed distribution that departs from normality. Fortunately, Bayesian inference is not based upon an assumption of normality of predictors.
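The skewness and kurtosis statements are generated elsewhere in the package, and their exact estimators are not shown here. As a hedged illustration, the moment-based versions can be computed in base R as below; the helper names `skewness` and `excess_kurtosis` and the simulated `scores` are assumptions for this sketch, not the film data.

```r
# Moment-based sample skewness: third central moment over sd cubed
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment over variance squared, minus 3
# (so that a normal distribution scores approximately 0)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Made-up, left-skewed scores on a 0-100 scale
set.seed(1)
scores <- round(100 * rbeta(500, shape1 = 5, shape2 = 2))

skewness(scores)         # negative for a left-skewed sample
excess_kurtosis(scores)
```

A negative skewness corresponds to the long left tail visible in a histogram, and points bending away from the reference line in a QQ plot tell the same story graphically.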

Outliers: The box plot in r kfigr::figr(label = "critics_score_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$quant$critics_score$outliers) == 0, "no", " ") outliers were extant. r edaUni$quant$critics_score$out
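How the package flags outliers is not shown in this section. One common convention, sketched here under that assumption, is Tukey's rule: a value is an outlier if it lies more than 1.5 times the IQR below the first quartile or above the third. The function name `tukey_outliers` and the `runtimes` values are hypothetical.

```r
# Return the values of x lying outside the Tukey fences
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Made-up runtimes in minutes, with one unusually long film
runtimes <- c(88, 92, 95, 99, 101, 104, 107, 110, 115, 240)
tukey_outliers(runtimes)
```

Box plot implementations may compute the quartiles slightly differently (for example, from hinges), so the points drawn beyond the whiskers can differ marginally from this result on small samples.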

IMDb Number of Votes

This variable, obtained from the IMDb website, represents the number of IMDb votes cast for each film.

r kfigr::figr(label = "imdb_votes_stats", prefix = TRUE, link = TRUE, type="Table"): IMDb votes summary statistics

knitr::kable(edaUni$quant$imdb_num_votes$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$quant$imdb_num_votes$hist, edaUni$quant$imdb_num_votes$qq, ncol = 2)

r kfigr::figr(label = "imdb_votes_dist", prefix = TRUE, link = TRUE, type="Figure"): IMDb votes histogram and QQ Plot

edaUni$quant$imdb_num_votes$box

r kfigr::figr(label = "imdb_votes_box", prefix = TRUE, link = TRUE, type="Figure"): IMDb votes box plot

Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_votes_stats", prefix = TRUE, link = TRUE, type="Table")) show that r edaUni$quant$imdb_num_votes$central

Dispersion: r edaUni$quant$imdb_num_votes$disp

Shape of Distribution: r edaUni$quant$imdb_num_votes$skew r edaUni$quant$imdb_num_votes$kurt The histogram and QQ plot in r kfigr::figr(label = "imdb_votes_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a distribution which departs significantly from normality.

Outliers: The box plot in r kfigr::figr(label = "imdb_votes_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$quant$imdb_num_votes$outliers) == 0, "no", " ") outliers were extant. r edaUni$quant$imdb_num_votes$out

IMDb Number of Votes (Log)

This was a log transformation of the IMDb votes variable.
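The effect of the transformation can be illustrated on simulated data. This sketch assumes a natural log (the base is not stated here) and uses made-up, strictly positive vote counts rather than the film data.

```r
# Made-up, heavily right-skewed vote counts (strictly positive)
set.seed(42)
votes <- round(rlnorm(500, meanlog = 9, sdlog = 1.5)) + 1

# Natural log transform; requires votes > 0
votes_log <- log(votes)

# On the raw scale the mean sits far above the median (right skew)
c(mean = mean(votes), median = median(votes))

# On the log scale the mean and median nearly coincide
c(mean = mean(votes_log), median = median(votes_log))
```

A mean close to the median is only a rough symptom of symmetry; the histogram and QQ plot remain the primary check.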

r kfigr::figr(label = "imdb_votes_log_stats", prefix = TRUE, link = TRUE, type="Table"): Log IMDb votes summary statistics

knitr::kable(edaUni$quant$imdb_num_votes_log$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$quant$imdb_num_votes_log$hist, edaUni$quant$imdb_num_votes_log$qq, ncol = 2)

r kfigr::figr(label = "imdb_votes_log_dist", prefix = TRUE, link = TRUE, type="Figure"): Log IMDb votes histogram and QQ Plot

edaUni$quant$imdb_num_votes_log$box

r kfigr::figr(label = "imdb_votes_log_box", prefix = TRUE, link = TRUE, type="Figure"): Log IMDb votes box plot

Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_votes_log_stats", prefix = TRUE, link = TRUE, type="Table")) report that r edaUni$quant$imdb_num_votes_log$central

Dispersion: r edaUni$quant$imdb_num_votes_log$disp

Shape of Distribution: r edaUni$quant$imdb_num_votes_log$skew r edaUni$quant$imdb_num_votes_log$kurt The histogram and QQ plot in r kfigr::figr(label = "imdb_votes_log_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a nearly normal distribution.

Outliers: The box plot in r kfigr::figr(label = "imdb_votes_log_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$quant$imdb_num_votes_log$outliers) == 0, "no", " ") outliers were extant. r edaUni$quant$imdb_num_votes_log$out

IMDb Ratings

This variable captured the IMDb rating for each film.

r kfigr::figr(label = "imdb_rating_stats", prefix = TRUE, link = TRUE, type="Table"): IMDb rating summary statistics

knitr::kable(edaUni$quant$imdb_rating$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$quant$imdb_rating$hist, edaUni$quant$imdb_rating$qq, ncol = 2)

r kfigr::figr(label = "imdb_rating_dist", prefix = TRUE, link = TRUE, type="Figure"): IMDb rating histogram and QQ Plot

edaUni$quant$imdb_rating$box

r kfigr::figr(label = "imdb_rating_box", prefix = TRUE, link = TRUE, type="Figure"): IMDb rating box plot

Central Tendency: The summary statistics (r kfigr::figr(label = "imdb_rating_stats", prefix = TRUE, link = TRUE, type="Table")) show that r edaUni$quant$imdb_rating$central

Dispersion: r edaUni$quant$imdb_rating$disp

Shape of Distribution: r edaUni$quant$imdb_rating$skew r edaUni$quant$imdb_rating$kurt The histogram and QQ plot in r kfigr::figr(label = "imdb_rating_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a nearly normal distribution.

Outliers: The box plot in r kfigr::figr(label = "imdb_rating_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$quant$imdb_rating$outliers) == 0, "no", " ") outliers were extant. r edaUni$quant$imdb_rating$out

Runtime

This is an analysis of movie runtimes.

r kfigr::figr(label = "runtime_stats", prefix = TRUE, link = TRUE, type="Table"): Runtime summary statistics

knitr::kable(edaUni$quant$runtime$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$quant$runtime$hist, edaUni$quant$runtime$qq, ncol = 2)

r kfigr::figr(label = "runtime_dist", prefix = TRUE, link = TRUE, type="Figure"): Runtime histogram and QQ Plot

edaUni$quant$runtime$box

r kfigr::figr(label = "runtime_box", prefix = TRUE, link = TRUE, type="Figure"): Runtime box plot

Central Tendency: The summary statistics (r kfigr::figr(label = "runtime_stats", prefix = TRUE, link = TRUE, type="Table")) show that r edaUni$quant$runtime$central

Dispersion: r edaUni$quant$runtime$disp

Shape of Distribution: r edaUni$quant$runtime$skew r edaUni$quant$runtime$kurt The histogram and QQ plot in r kfigr::figr(label = "runtime_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a left skewed distribution that appears reasonably normal.

Outliers: The box plot in r kfigr::figr(label = "runtime_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$quant$runtime$outliers) == 0, "no", " ") outliers were extant. r edaUni$quant$runtime$out

Audience Score

Finally, the dependent variable, "audience_score", is examined.

r kfigr::figr(label = "audience_score_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Score Summary Statistics

knitr::kable(edaUni$quant$audience_score$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")
gridExtra::grid.arrange(edaUni$quant$audience_score$hist, edaUni$quant$audience_score$qq, ncol = 2)

r kfigr::figr(label = "audience_score_dist", prefix = TRUE, link = TRUE, type="Figure"): Audience Score Histogram and QQ Plot

edaUni$quant$audience_score$box

r kfigr::figr(label = "audience_score_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Score Box plot

Central Tendency: The summary statistics (r kfigr::figr(label = "audience_score_stats", prefix = TRUE, link = TRUE, type="Table")) show that r edaUni$quant$audience_score$central

Dispersion: r edaUni$quant$audience_score$disp

Shape of Distribution: r edaUni$quant$audience_score$skew r edaUni$quant$audience_score$kurt The histogram and QQ plot in r kfigr::figr(label = "audience_score_dist", prefix = TRUE, link = TRUE, type="Figure") reveal a slightly right skewed distribution that approximates normality.

Outliers: The box plot in r kfigr::figr(label = "audience_score_box", prefix = TRUE, link = TRUE, type="Figure"), which graphically depicts the median, the IQR, and maximum and minimum values, suggested that r ifelse(nrow(edaUni$quant$audience_score$outliers) == 0, "no", " ") outliers were extant. r edaUni$quant$audience_score$out

Bivariate Analysis

Next, the relationships between the independent variables and audience scores are studied. The analysis continues with an exploration of the categorical variables vis-a-vis audience score, then an examination of the quantitative variables and the dependent variable.

Qualitative Analysis

Best Actor Oscar

The summary statistics in r kfigr::figr(label = "bivariate_best_actor_stats", prefix = TRUE, link = TRUE, type="Table") evince very similar distributions of audience scores between films that won the best actor Oscar and those that did not.

r kfigr::figr(label = "bivariate_best_actor_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Best Actor Oscar Win Summary Statistics

knitr::kable(edaBi$best_actor_win$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot shown in r kfigr::figr(label = "bivariate_best_actor_box", prefix = TRUE, link = TRUE, type="Figure") supports an initial impression that best actor Oscar wins have little to no association with audience scores.

edaBi$best_actor_win$boxPlot

r kfigr::figr(label = "bivariate_best_actor_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Best Actor Oscar Win

r edaBi$best_actor_win$statement As such, the data do not support an association between the best actor Oscar and audience score.
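The test behind the generated statement is not shown in this section. A Welch two-sample t-test is one common way to compare mean audience scores between two groups, and it is sketched below purely as an assumption about the kind of comparison involved. The `films` data frame is simulated, not the film data.

```r
# Simulated scores: 90 films without and 15 with a best actor win
set.seed(7)
films <- data.frame(
  audience_score = c(rnorm(90, mean = 62, sd = 20),
                     rnorm(15, mean = 64, sd = 20)),
  best_actor_win = rep(c("no", "yes"), times = c(90, 15))
)

# Group-wise summary, similar in spirit to the summary tables above
aggregate(audience_score ~ best_actor_win, data = films,
          FUN = function(s) c(mean = mean(s), median = median(s), sd = sd(s)))

# Welch two-sample t-test (does not assume equal group variances)
result <- t.test(audience_score ~ best_actor_win, data = films)
result$estimate   # group means
result$p.value    # a large p-value gives no evidence of a difference
```

The same pattern applies to each indicator variable in this subsection; only the grouping column changes.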

Best Actress Oscar

Similarly, the summary statistics in r kfigr::figr(label = "bivariate_best_actress_stats", prefix = TRUE, link = TRUE, type="Table") reveal almost identical distributions for audience score for both groups.

r kfigr::figr(label = "bivariate_best_actress_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Best Actress Oscar Win Summary Statistics

knitr::kable(edaBi$best_actress_win$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Again, the box plot shown in r kfigr::figr(label = "bivariate_best_actress_box", prefix = TRUE, link = TRUE, type="Figure") graphically supports an assertion of little to no association between best actress Oscar wins and audience scores.

edaBi$best_actress_win$boxPlot

r kfigr::figr(label = "bivariate_best_actress_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Best Actress Oscar Win

r edaBi$best_actress_win$statement

Best Director Oscar

The summary statistics shown in r kfigr::figr(label = "bivariate_best_director_stats", prefix = TRUE, link = TRUE, type="Table") suggest that films which have been awarded the best director Oscar are also associated with higher average audience scores.

r kfigr::figr(label = "bivariate_best_director_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Best Director Oscar Win Summary Statistics

knitr::kable(edaBi$best_dir_win$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_best_director_box", prefix = TRUE, link = TRUE, type="Figure") reveals a slightly higher center for audience scores among those films with best director acclaim.

edaBi$best_dir_win$boxPlot

r kfigr::figr(label = "bivariate_best_director_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Best Director Oscar Win

r edaBi$best_dir_win$statement Though winning films tended to have slightly higher audience scores, the data do not indicate a statistically significant difference.

Best Picture Nomination

The association between Oscar performance and audience scores first becomes apparent with the best picture nomination. The summary statistics shown in r kfigr::figr(label = "bivariate_best_pic_nom_stats", prefix = TRUE, link = TRUE, type="Table") reveal a sizable difference in average and median audience scores between the films so nominated and those that were not.

r kfigr::figr(label = "bivariate_best_pic_nom_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Best Picture Nomination Summary Statistics

knitr::kable(edaBi$best_pic_nom$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_best_pic_nom_box", prefix = TRUE, link = TRUE, type="Figure") supports the a priori hypothesis that best picture nominations are associated with higher audience scores.

edaBi$best_pic_nom$boxPlot

r kfigr::figr(label = "bivariate_best_pic_nom_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Best Picture Nomination

r edaBi$best_pic_nom$statement Indeed, the data support an association between best picture nomination and audience scores.

Best Picture Oscar

As one might expect, given prior results, the summary statistics shown in r kfigr::figr(label = "bivariate_best_pic_win_stats", prefix = TRUE, link = TRUE, type="Table") suggest an association between Oscar best picture acclaim and higher audience scores.

r kfigr::figr(label = "bivariate_best_pic_win_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Best Picture Oscar Summary Statistics

knitr::kable(edaBi$best_pic_win$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

Similarly, the box plot in r kfigr::figr(label = "bivariate_best_pic_win_box", prefix = TRUE, link = TRUE, type="Figure") clarifies the potential association between Oscar performance and audience scores.

edaBi$best_pic_win$boxPlot

r kfigr::figr(label = "bivariate_best_pic_win_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Best Picture Oscar

r edaBi$best_pic_win$statement Surely, the data do in fact support an association between best picture award and audience scores.

Drama

The summary statistics in r kfigr::figr(label = "bivariate_drama_stats", prefix = TRUE, link = TRUE, type="Table") report a slight difference in the central audience scores between dramas and non-drama films.

r kfigr::figr(label = "bivariate_drama_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Genre Summary Statistics

knitr::kable(edaBi$drama$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_drama_box", prefix = TRUE, link = TRUE, type="Figure") also shows a slight tendency towards higher audience scores for dramas, but is it significant?

edaBi$drama$boxPlot

r kfigr::figr(label = "bivariate_drama_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Genre

r edaBi$drama$statement Dramas are indeed associated with slightly higher audience scores.

Feature Films

The summary statistics in r kfigr::figr(label = "bivariate_feature_film_stats", prefix = TRUE, link = TRUE, type="Table") expose a rather sizable difference in audience scores between feature films and other film types. In fact, feature films appear to be associated with significantly lower average audience scores.

r kfigr::figr(label = "bivariate_feature_film_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by Film Type Summary Statistics

knitr::kable(edaBi$feature_film$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_feature_film_box", prefix = TRUE, link = TRUE, type="Figure") visually characterizes the difference in the distribution of audience scores between the film types.

edaBi$feature_film$boxPlot

r kfigr::figr(label = "bivariate_feature_film_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by Film Type

r edaBi$feature_film$statement The data show that non-feature films are associated with higher audience scores.

Rated R

The summary statistics in r kfigr::figr(label = "bivariate_mpaa_rating_R_stats", prefix = TRUE, link = TRUE, type="Table") suggest very similar distributions of audience scores between rated R and other films.

r kfigr::figr(label = "bivariate_mpaa_rating_R_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores by MPAA Rating Summary Statistics

knitr::kable(edaBi$mpaa_rating_R$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_mpaa_rating_R_box", prefix = TRUE, link = TRUE, type="Figure") shows a slightly lower central audience score for R rated films.

edaBi$mpaa_rating_R$boxPlot

r kfigr::figr(label = "bivariate_mpaa_rating_R_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores by MPAA Rating

r edaBi$mpaa_rating_R$statement As such, the data do not support an association between audience scores and MPAA R ratings.

Oscar Season Release

Again, the summary statistics in r kfigr::figr(label = "bivariate_Oscar_season_stats", prefix = TRUE, link = TRUE, type="Table") report nearly identical distributions of audience scores between films released during the Oscar season and those that were not.

r kfigr::figr(label = "bivariate_Oscar_season_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores and Oscar Season Release Summary Statistics

knitr::kable(edaBi$oscar_season$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_Oscar_season_box", prefix = TRUE, link = TRUE, type="Figure") echoes the summary statistics.

edaBi$oscar_season$boxPlot

r kfigr::figr(label = "bivariate_Oscar_season_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores and Oscar Season Release

r edaBi$oscar_season$statement As such, the data do not support an association between audience scores and Oscar season release dates.

Summer Season Release

Similarly, the summary statistics in r kfigr::figr(label = "bivariate_summer_season_stats", prefix = TRUE, link = TRUE, type="Table") report nearly identical distributions of audience scores between films released during the summer season and those that were not.

r kfigr::figr(label = "bivariate_summer_season_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores and Summer Season Release Summary Statistics

knitr::kable(edaBi$summer_season$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_summer_season_box", prefix = TRUE, link = TRUE, type="Figure") backs the summary statistics.

edaBi$summer_season$boxPlot

r kfigr::figr(label = "bivariate_summer_season_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores and Summer Season Release

r edaBi$summer_season$statement Therefore, the data do not indicate an association between audience scores and summer release dates.

Top 200 Box Office

As one might conjecture, the top 200 box office films might be associated with higher audience scores, almost by definition. This is suggested by the summary statistics in r kfigr::figr(label = "bivariate_top200_box_stats", prefix = TRUE, link = TRUE, type="Table"), which show noticeably higher central audience scores for the highest grossing films.

r kfigr::figr(label = "bivariate_top200_box_stats", prefix = TRUE, link = TRUE, type="Table"): Audience Scores and Top 200 Box Office Summary Statistics

knitr::kable(edaBi$top200_box$stats, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = T, position = "center")

The box plot in r kfigr::figr(label = "bivariate_top200_box_box", prefix = TRUE, link = TRUE, type="Figure") illuminates this difference.

edaBi$top200_box$boxPlot

r kfigr::figr(label = "bivariate_top200_box_box", prefix = TRUE, link = TRUE, type="Figure"): Audience Scores and Top 200 Box Office

r edaBi$top200_box$statement Notwithstanding, the data do not support the assertion that the highest grossing films are more popular from an audience score perspective.

Quantitative Analysis

Critics Score

The scatter plot (r kfigr::figr(label = "bivariate_critics_score", prefix = TRUE, link = TRUE, type="Figure")) indicates a positive correlation between critics score and audience score.

edaBi$critics_score$scatterPlot

r kfigr::figr(label = "bivariate_critics_score", prefix = TRUE, link = TRUE, type="Figure"): Audience Score and Critics Score

r edaBi$critics_score$statement
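The generated statement presumably reports the strength and significance of this correlation. As a hedged sketch of how such figures are commonly obtained, base R's `cor.test` returns Pearson's r together with a confidence interval and p-value. The data below are simulated to be positively correlated and are not the film data.

```r
# Simulated scores with a built-in positive relationship
set.seed(11)
critics_score <- runif(200, min = 1, max = 100)
audience_score <- 30 + 0.5 * critics_score + rnorm(200, sd = 12)

# Pearson correlation test between the two scores
ct <- cor.test(critics_score, audience_score, method = "pearson")
ct$estimate   # Pearson's r
ct$conf.int   # 95% confidence interval for r
ct$p.value    # tests the null hypothesis that r = 0
```

The same call, with a different predictor in place of `critics_score`, would apply to each quantitative variable in this subsection.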

IMDb Number of Votes

The scatter plot (r kfigr::figr(label = "bivariate_imdb_num_votes", prefix = TRUE, link = TRUE, type="Figure")) indicates a moderate positive correlation between the number of IMDb votes and audience score.

edaBi$imdb_num_votes$scatterPlot

r kfigr::figr(label = "bivariate_imdb_num_votes", prefix = TRUE, link = TRUE, type="Figure"): Audience Score and IMDb Number of Votes

r edaBi$imdb_num_votes$statement

IMDb Number of Votes (Log)

The scatter plot (r kfigr::figr(label = "bivariate_imdb_num_votes_log", prefix = TRUE, link = TRUE, type="Figure")) reveals a slight positive correlation between the log of the number of IMDb votes and audience score.

edaBi$imdb_num_votes_log$scatterPlot

r kfigr::figr(label = "bivariate_imdb_num_votes_log", prefix = TRUE, link = TRUE, type="Figure"): Audience Score and IMDb Number of Votes (Log)

r edaBi$imdb_num_votes_log$statement

IMDb Rating

The scatter plot (r kfigr::figr(label = "bivariate_imdb_rating", prefix = TRUE, link = TRUE, type="Figure")) suggests a strong positive correlation between IMDb rating and audience score.

edaBi$imdb_rating$scatterPlot

r kfigr::figr(label = "bivariate_imdb_rating", prefix = TRUE, link = TRUE, type="Figure"): Audience Score and IMDb Rating

r edaBi$imdb_rating$statement

Runtime

The scatter plot (r kfigr::figr(label = "bivariate_runtime", prefix = TRUE, link = TRUE, type="Figure")) suggests a weak positive correlation between runtime and audience score.

edaBi$runtime$scatterPlot

r kfigr::figr(label = "bivariate_runtime", prefix = TRUE, link = TRUE, type="Figure"): Audience Score and Runtime

r edaBi$runtime$statement

Year of Theatrical Release

The scatter plot (r kfigr::figr(label = "bivariate_thtr_rel_year", prefix = TRUE, link = TRUE, type="Figure")) suggests the lack of a correlation between the year of theatrical release and audience score.

edaBi$thtr_rel_year$scatterPlot

r kfigr::figr(label = "bivariate_thtr_rel_year", prefix = TRUE, link = TRUE, type="Figure"): Audience Score and Year of Theatrical Release

r edaBi$thtr_rel_year$statement

Association and Correlation

Next, the significance and strength of associations between categorical variables were examined. Pairwise chi-squared tests were conducted to assess the significance (p-value) of each association, and Cramer's V [@Cramer1946] was computed to measure its strength. The chi-squared results summarized in r kfigr::figr(label = "chi", prefix = TRUE, link = TRUE, type="Table") reveal several associations that could present collinearity issues for regression. Focusing on those regressors most highly associated with audience score, the associations among the Academy award variables were significant. There was also a strong association between films that won Best Picture and those that were nominated. Both significant and strong associations were observed among films with theatrical releases in the Oscar and summer seasons.

r kfigr::figr(label = "chi", prefix = TRUE, link = TRUE, type="Table"): Chi-squared test of independence between categorical variables.

# Pairwise chi-squared p-values; cells with p < .05 are rendered in bold
chi <- x2(preprocessed)
vars <- data.frame(Terms = as.character(rownames(chi)))
chi <- lapply(chi, function(x) {
  kableExtra::cell_spec(x, "html", bold = x < .05)
})
chi <- cbind(vars, chi)

knitr::kable(chi, escape = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = TRUE, position = "center")

r kfigr::figr(label = "cramer", prefix = TRUE, link = TRUE, type="Table"): Cramer's V measure of association between categorical variables.

# Pairwise Cramer's V; cells with V > .3 are rendered in bold
cv <- cramers(preprocessed)
vars <- data.frame(Terms = as.character(rownames(cv)))
cv <- lapply(cv, function(x) {
  kableExtra::cell_spec(x, "html", bold = x > .3)
})
cv <- cbind(vars, cv)
knitr::kable(cv, escape = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = TRUE, position = "center")
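The package helpers that produce the two tables above are not shown in this section. For a single pair of categorical variables, the underlying quantities can be sketched in base R as follows, using the standard definition of Cramer's V, the square root of the chi-squared statistic divided by n times the smaller of (rows - 1) and (columns - 1). The function name `cramers_v` and the `nominated` and `won` vectors are made up for illustration.

```r
# Cramer's V for two categorical vectors (no continuity correction)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Made-up indicators: every winner is also a nominee
nominated <- rep(c("yes", "no"), times = c(40, 60))
won <- c(rep(c("yes", "no"), times = c(20, 20)), rep("no", 60))

table(nominated, won)

# Significance of the association
chisq.test(table(nominated, won))$p.value

# Strength of the association, between 0 (none) and 1 (perfect)
cramers_v(nominated, won)
```

Because winners are a subset of nominees, the two indicators are strongly associated by construction, which is the kind of structural dependence that can surface as collinearity in a regression.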

As shown in r kfigr::figr(label = "cp", prefix = TRUE, link = TRUE, type="Figure"), the correlations among the quantitative variables were unsurprising. As expected, a moderate correlation between critics scores and IMDb rating was observed.

cp <- pearsons(preprocessed)
cp

r kfigr::figr(label = "cp", prefix = TRUE, link = TRUE, type="Figure"): Correlations among quantitative predictors
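A correlation matrix of this kind can also be screened programmatically for pairs of predictors that may be collinear. The sketch below uses simulated predictors and an arbitrary cutoff of 0.5; neither the data nor the threshold comes from the analysis above.

```r
# Simulated predictors: critics_score is built to correlate with imdb_rating,
# while runtime is independent of both
set.seed(3)
n <- 300
imdb_rating <- rnorm(n, mean = 6.5, sd = 1)
critics_score <- 10 * imdb_rating + rnorm(n, sd = 12)
runtime <- rnorm(n, mean = 105, sd = 20)
quant <- data.frame(imdb_rating, critics_score, runtime)

# Pearson correlation matrix
r <- cor(quant, method = "pearson")
round(r, 2)

# Pairs in the upper triangle whose absolute correlation exceeds the cutoff
idx <- which(abs(r) > 0.5 & upper.tri(r), arr.ind = TRUE)
data.frame(var1 = rownames(r)[idx[, "row"]],
           var2 = colnames(r)[idx[, "col"]],
           r = r[idx])
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal of ones.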

EDA Summary

Whereas Academy recognition of individual achievement had no statistically significant association with audience scores, films that were nominated for or won Best Picture were associated with significantly higher audience scores. It would appear that audiences prefer teamwork. Non-feature films were also associated with higher audience scores. As one might expect, high positive correlations were extant among critics scores, IMDb ratings, and audience scores. Weak positive correlations of audience scores with IMDb number of votes and with runtime were observed, and audience scores showed a slight but statistically significant downward trend over the period observed. The association analysis revealed potential collinearity among the Oscar award variables, and there was a moderate to strong correlation between critics scores and IMDb ratings.




DataScienceSalon/Bayesian-Regression documentation built on May 29, 2019, 12:06 a.m.