knitr::opts_chunk$set( echo = TRUE, tidy = 'styler', fig.align = 'center', cache = TRUE, cache.path = '.cache/', include = FALSE )
This package leverages The Numbers database to obtain information about movies. It can be used to extract a lot of interesting numbers such as box-office successes (or failures!) and contains functions to plot some results for individual movies or to compare multiple movies to each other.
Firstly, we need to get a list of all available movies in the database. We
can scrape their movies page using the availableTitles()
function,
indicating how many pages we want to scrape (each page contains 100 movies).
Importantly, the pages here are ranked by the movies budgets. If you are
looking for some low-budget indie movies, you might be facing some issues...
The availableTitles()
function return an object of moviesDB class, which is a
list of length 2, containing a list of available titles in the first element,
and the associated URLs in the second element.
## Required libraries require(magrittr) require(rvest) ## Get the 10.000 first movies in The Numbers database top100.titles <- availableTitles(n.page = 1) head(top100.titles$movies.titles) head(top100.titles$movies.urls)
For convenience, a list of 5200 available movies has already been compiled
and is accessible in the moviesDB
object. This has been done in May
2019, so bare in mind that it will not contain information on more recent movies.
#moviesDB <- availableTitles(n.page = 52) ## These URLs are not formatted the same way than the others... #faulty.urls <- c( # "https://www.the-numbers.com/movie/Waterworld#tab=box-office", # "https://www.the-numbers.com/movie/Asterix-aux-Jeux-Olympiques#tab=box-office", # "https://www.the-numbers.com/movie/Dantes-Peak#tab=box-office", # "https://www.the-numbers.com/movie/Army-of-the-Dead-(TBA-ID)#tab=box-office", # "https://www.the-numbers.com/movie/Asterix-et-Obelix-Au-service-de-Sa-Majeste#tab=box-office", # "https://www.the-numbers.com/movie/Renegades-(2017)#tab=box-office", # "https://www.the-numbers.com/movie/Ridiculous-6-The#tab=box-office" #) #moviesDB <- moviesDB[-match(faulty.urls, moviesDB$movies.url)]
One can easily lookup some titles using the searchTitle()
function. The function
only returns a proper result if it finds a unique movie matching the input pattern,
otherwise it just prints the matched titles. For ambiguous cases as depicted below,
the handy exact.match
logical argument can be used.
# Search for a title searchTitle(pattern = 'expendables', moviesDB = moviesDB) searchTitle(pattern = 'The Expendables', moviesDB = moviesDB) searchTitle(pattern = 'The Expendables', moviesDB = moviesDB, exact.match = TRUE)
Once the exact title of a movie is known, information can be retrieved. Note
that the exact title must be entered, otherwise the fetchMovie()
function
will fail.
# Won't work with the wrong title #theexpendables <- fetchMovie(title = 'The expendables', moviesDB = moviesDB) # With the right title - Note the case sensitivity theexpendables <- fetchMovie(title = 'The Expendables', moviesDB = moviesDB) class(theexpendables) str(theexpendables)
As you can see, the output of the fetchMovie()
is a movieMetrics
object.
Just entering the name of the movieMetrics in R console will output a brief
summary of the data associated with the movie, and will show the first five
days of box-office results. The metrics()
function allows to get the full
box-office results.
metrics(theexpendables) %>% head identical(metrics(theexpendables), theexpendables$metrics)
Several functions methods have been adapted for the movieMetrics class. For instance, one can subset a movieMetrics object as follows:
# Get the first 2 days theexpendables[1:2] # Get the second week in boxoffice theexpendables[8:14] # Bit more complex: filter by date theexpendables %>% filter.movieMetrics(as.Date(c("2010-09-01", "2010-09-30")))
One can also get the number of days recorded in the movieMetrics boxoffice
using the length()
function:
theexpendables %>% length() theexpendables %>% filter(as.Date(c("2010-09-01", "2010-09-30"))) %>% length()
require(ggplot2) # Let's compare The Expendables (2010) and The Greatest Showman (2017) theexpendables <- fetchMovie(title = 'The Expendables', moviesDB = moviesDB) greatestshowman <- fetchMovie(title = 'The Greatest Showman', moviesDB = moviesDB) p1 <- plot(theexpendables, plotly = FALSE) p2 <- plot(greatestshowman, plotly = FALSE) dev.new(width = 10, height = 5) gridExtra::grid.arrange(p1, p2, ncol = 2) # To plot graphs side-by-side # Wow. looks like The Expendables had a better start. # But let's scale the x axis to the same time window. p1 <- plot(theexpendables, plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 10)) p2 <- plot(greatestshowman, plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 10)) gridExtra::grid.arrange(p1, p2, ncol = 2) # Now we can see that The Greatest Showman, despite starting lower, stayed in the box-office for much longer! # What about if we plot the total gross? p1 <- plot(theexpendables, yaxis = 'domestic_total_gross', plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 200)) p2 <- plot(greatestshowman, yaxis = 'domestic_total_gross', plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 200)) gridExtra::grid.arrange(p1, p2, ncol = 2) # And we can also plot the rank in box-office over 90 days from the release date p1 <- plot(theexpendables, yaxis = 'rank', plotly = FALSE, time.range = 90) + ylim(c(0, 20)) p2 <- plot(greatestshowman, yaxis = 'rank', plotly = FALSE, time.range = 90) + ylim(c(0, 20)) gridExtra::grid.arrange(p1, p2, ncol = 2)
We can easily expend our analysis to lists of movies. To fetch data from lots
of movies (> 100), a decent amount of time is needed. We provide an already
generated list of ~1000 movies with the highest budget, saved in the top1000.moviesList
movieMetricsLongList
object. This list has been generated using the following code.
The movieMetricsLongList
class is adapted from the movieMetrics
class and
contains information for multiple movies.
top1000.moviesList <- fetchMoviesList(moviesDB$movies.titles[1:1000], moviesDB) str(top1000.moviesList) infos(top1000.moviesList) %>% head() metrics(top1000.moviesList) %>% head()
The structure of the movieMetricsLongList
object makes it really convenient
to manipulate and enrich. Data can quickly be appended and plots can be done
in a breeze!
MCU.movies <- c( "Iron Man", "The Incredible Hulk", "Iron Man 2", "Thor", "Captain America: The First Avenger", "The Avengers (2012)", "Iron Man 3", "Thor: The Dark World", "Captain America: The Winter Soldier", "Guardians of the Galaxy", "Avengers: Age of Ultron", "Ant-Man", "Captain America: Civil War", "Doctor Strange", "Guardians of the Galaxy Vol 2", "Spider-Man: Homecoming", "Thor: Ragnarok", "Black Panther", "Avengers: Infinity War", "Ant-Man and the Wasp", "Captain Marvel" ) # Fetch data marvelLongList <- fetchMoviesList(MCU.movies, moviesDB = moviesDB) # Or, if we have the entire database already here, you can subset it using # title names! marvelLongList <- top1000.moviesList[MCU.movies] # Add MCU phase MCU.phase <- factor(c(rep(1, 6), rep(2, 6), rep(3, 9))) marvelLongList %<>% addFeature(MCU.phase, "phase") # Add year of release years <- infos(marvelLongList)$release_date %>% substr(start = 1, stop = 4) %>% as.numeric() %>% factor(., levels = 2008:2019) marvelLongList %<>% addFeature(years, "year_of_release") # Plot results p <- plot.movieMetricsLongList(marvelLongList, x = "day", y = "rank", facet = "year_of_release", xlim = c(0, 120)) p <- plot.movieMetricsLongList(marvelLongList, x = "day", y = "domestic_total_gross", facet = "year_of_release", xlim = c(0, 120)) # Compare two movies form the MCU p1 <- plot(marvelLongList['The Incredible Hulk'], yaxis = "domestic_total_gross", plotly = FALSE, time.range = length(marvelLongList["Avengers: Infinity War"])) + ylim(c(0, 700)) p2 <- plot(marvelLongList["Avengers: Infinity War"], yaxis = "domestic_total_gross", plotly = FALSE, time.range = length(marvelLongList["Avengers: Infinity War"])) + ylim(c(0, 700)) gridExtra::grid.arrange(p1, p2, ncol = 2) ```p1 <- plot.movieMetrics(top1000.moviesList['Gifted'], yaxis = "domestic_total_gross", plotly = TRUE, time.range = length(top1000.moviesList["Gifted"])) + ylim(c(0, 700)) ```r # append.to.metrics = FALSE is used to only add the gains to the movies infos data.frame gains <- infos(top1000.moviesList)$total_gross - infos(top1000.moviesList)$budget years <- infos(top1000.moviesList)$release_date %>% substr(start = 1, stop = 4) %>% as.numeric() %>% factor() ratios <- infos(top1000.moviesList)$total_gross / infos(top1000.moviesList)$budget %>% as.numeric() top1000.moviesList %<>% addFeature(years, "year_of_release", append.to.metrics = FALSE, overwrite = TRUE) %>% addFeature(gains, "gain", append.to.metrics = FALSE, overwrite = TRUE) %>% addFeature(ratios, "ratio", append.to.metrics = FALSE, overwrite = TRUE) p <- ggplot(infos(top1000.moviesList), aes_string(x = "budget", y = "gains", col = "distributor")) + geom_point() + facet_wrap("distributor") + theme_bw() + theme(legend.position = "bottom") df <- infos(top1000.moviesList) %>% '['(.$distributor == 'Walt Disney',) p <- ggplot(df, aes_string(x = "year_of_release", y = "gain", fill = "year_of_release", label = "title")) + geom_violin(scale = "width") + geom_jitter(position = position_jitter(0.2), size = 0.5, aes_string(col = "distributor")) + theme_bw() + theme(legend.position = "bottom") + ggrepel::geom_label_repel() + scale_fill_grey(start = 0.2, end = 0.8)
results <- list() for (title in infos(top1000.moviesList)$title) { message(title) results[title] <- findIMDbIndex(title, top1000.moviesList) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.