In js2264/moviestats: A package to get infos about movies

Introduction to moviestats

    knitr::opts_chunk$set(
        echo = TRUE, 
        tidy = 'styler', 
        fig.align = 'center', 
        cache = TRUE, 
        cache.path = '.cache/', 
        include = FALSE
    )

This package leverages The Numbers database to obtain information about movies. It can be used to extract a lot of interesting numbers such as box-office successes (or failures!) and contains functions to plot some results for individual movies or to compare multiple movies to each other.

Building a database

Firstly, we need to get a list of all available movies in the database. We can scrape their movies page using the availableTitles() function, indicating how many pages we want to scrape (each page contains 100 movies). Importantly, the pages here are ranked by the movies budgets. If you are looking for some low-budget indie movies, you might be facing some issues...

The availableTitles() function return an object of moviesDB class, which is a list of length 2, containing a list of available titles in the first element, and the associated URLs in the second element.

    ## Required libraries
    require(magrittr)
    require(rvest)
    ## Get the 10.000 first movies in The Numbers database
    top100.titles <- availableTitles(n.page = 1)
    head(top100.titles$movies.titles)
    head(top100.titles$movies.urls)

For convenience, a list of 5200 available movies has already been compiled and is accessible in the moviesDB object. This has been done in May 2019, so bare in mind that it will not contain information on more recent movies.

    #moviesDB <- availableTitles(n.page = 52)
    ## These URLs are not formatted the same way than the others...
    #faulty.urls <- c(
    #    "https://www.the-numbers.com/movie/Waterworld#tab=box-office",
    #    "https://www.the-numbers.com/movie/Asterix-aux-Jeux-Olympiques#tab=box-office", 
    #    "https://www.the-numbers.com/movie/Dantes-Peak#tab=box-office",
    #    "https://www.the-numbers.com/movie/Army-of-the-Dead-(TBA-ID)#tab=box-office", 
    #    "https://www.the-numbers.com/movie/Asterix-et-Obelix-Au-service-de-Sa-Majeste#tab=box-office",
    #    "https://www.the-numbers.com/movie/Renegades-(2017)#tab=box-office", 
    #    "https://www.the-numbers.com/movie/Ridiculous-6-The#tab=box-office"
    #)
    #moviesDB <- moviesDB[-match(faulty.urls, moviesDB$movies.url)]

One can easily lookup some titles using the searchTitle() function. The function only returns a proper result if it finds a unique movie matching the input pattern, otherwise it just prints the matched titles. For ambiguous cases as depicted below, the handy exact.match logical argument can be used.

    # Search for a title
    searchTitle(pattern = 'expendables', moviesDB = moviesDB)
    searchTitle(pattern = 'The Expendables', moviesDB = moviesDB)
    searchTitle(pattern = 'The Expendables', moviesDB = moviesDB, exact.match = TRUE)

Fetching data about a movie

Once the exact title of a movie is known, information can be retrieved. Note that the exact title must be entered, otherwise the fetchMovie() function will fail.

    # Won't work with the wrong title
    #theexpendables <- fetchMovie(title = 'The expendables', moviesDB = moviesDB)
    # With the right title - Note the case sensitivity
    theexpendables <- fetchMovie(title = 'The Expendables', moviesDB = moviesDB)
    class(theexpendables)
    str(theexpendables)

As you can see, the output of the fetchMovie() is a movieMetrics object. Just entering the name of the movieMetrics in R console will output a brief summary of the data associated with the movie, and will show the first five days of box-office results. The metrics() function allows to get the full box-office results.

    metrics(theexpendables) %>% head
    identical(metrics(theexpendables), theexpendables$metrics)

Adapted functions

Several functions methods have been adapted for the movieMetrics class. For instance, one can subset a movieMetrics object as follows:

    # Get the first 2 days
    theexpendables[1:2]
    # Get the second week in boxoffice
    theexpendables[8:14]
    # Bit more complex: filter by date
    theexpendables %>% 
        filter.movieMetrics(as.Date(c("2010-09-01", "2010-09-30")))

One can also get the number of days recorded in the movieMetrics boxoffice using the length() function:

    theexpendables %>% 
        length()
    theexpendables %>% 
        filter(as.Date(c("2010-09-01", "2010-09-30"))) %>% 
        length()

Plotting options

    require(ggplot2)
    # Let's compare  The Expendables (2010) and The Greatest Showman (2017)
    theexpendables <- fetchMovie(title = 'The Expendables', moviesDB = moviesDB)
    greatestshowman <- fetchMovie(title = 'The Greatest Showman', moviesDB = moviesDB)
    p1 <- plot(theexpendables, plotly = FALSE)
    p2 <- plot(greatestshowman, plotly = FALSE)
    dev.new(width = 10, height = 5)
    gridExtra::grid.arrange(p1, p2, ncol = 2) # To plot graphs side-by-side

    # Wow. looks like The Expendables had a better start. 
    # But let's scale the x axis to the same time window. 
    p1 <- plot(theexpendables, plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 10))
    p2 <- plot(greatestshowman, plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 10))
    gridExtra::grid.arrange(p1, p2, ncol = 2)
    # Now we can see that The Greatest Showman, despite starting lower, stayed in the box-office for much longer!

    # What about if we plot the total gross? 
    p1 <- plot(theexpendables, yaxis = 'domestic_total_gross', plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 200))
    p2 <- plot(greatestshowman, yaxis = 'domestic_total_gross', plotly = FALSE, time.range = length(greatestshowman)) + ylim(c(0, 200))
    gridExtra::grid.arrange(p1, p2, ncol = 2)

    # And we can also plot the rank in box-office over 90 days from the release date
    p1 <- plot(theexpendables, yaxis = 'rank', plotly = FALSE, time.range = 90) + ylim(c(0, 20))
    p2 <- plot(greatestshowman, yaxis = 'rank', plotly = FALSE, time.range = 90) + ylim(c(0, 20))
    gridExtra::grid.arrange(p1, p2, ncol = 2)

movieMetricsLongList class

We can easily expend our analysis to lists of movies. To fetch data from lots of movies (> 100), a decent amount of time is needed. We provide an already generated list of ~1000 movies with the highest budget, saved in the top1000.moviesList movieMetricsLongList object. This list has been generated using the following code. The movieMetricsLongList class is adapted from the movieMetrics class and contains information for multiple movies.

    top1000.moviesList <- fetchMoviesList(moviesDB$movies.titles[1:1000], moviesDB)
    str(top1000.moviesList)
    infos(top1000.moviesList) %>% head()
    metrics(top1000.moviesList) %>% head()

The structure of the movieMetricsLongList object makes it really convenient to manipulate and enrich. Data can quickly be appended and plots can be done in a breeze!

    MCU.movies <- c(
        "Iron Man", 
        "The Incredible Hulk", 
        "Iron Man 2", 
        "Thor", 
        "Captain America: The First Avenger", 
        "The Avengers (2012)",
        "Iron Man 3", 
        "Thor: The Dark World", 
        "Captain America: The Winter Soldier", 
        "Guardians of the Galaxy", 
        "Avengers: Age of Ultron", 
        "Ant-Man", 
        "Captain America: Civil War", 
        "Doctor Strange", 
        "Guardians of the Galaxy Vol 2", 
        "Spider-Man: Homecoming", 
        "Thor: Ragnarok", 
        "Black Panther",
        "Avengers: Infinity War", 
        "Ant-Man and the Wasp", 
        "Captain Marvel"
    )
    # Fetch data
    marvelLongList <- fetchMoviesList(MCU.movies, moviesDB = moviesDB) 
    # Or, if we have the entire database already here, you can subset it using
    #   title names!
    marvelLongList <- top1000.moviesList[MCU.movies]
    # Add MCU phase
    MCU.phase <- factor(c(rep(1, 6), rep(2, 6), rep(3, 9)))
    marvelLongList %<>% addFeature(MCU.phase, "phase")
    # Add year of release
    years <- infos(marvelLongList)$release_date %>% substr(start = 1, stop = 4) %>% as.numeric() %>% factor(., levels = 2008:2019)
    marvelLongList %<>% addFeature(years, "year_of_release")
    # Plot results
    p <- plot.movieMetricsLongList(marvelLongList, x = "day", y = "rank", facet = "year_of_release", xlim = c(0, 120))
    p <- plot.movieMetricsLongList(marvelLongList, x = "day", y = "domestic_total_gross", facet = "year_of_release", xlim = c(0, 120))
    # Compare two movies form the MCU
    p1 <- plot(marvelLongList['The Incredible Hulk'], yaxis = "domestic_total_gross", plotly = FALSE, time.range = length(marvelLongList["Avengers: Infinity War"])) + ylim(c(0, 700))
    p2 <- plot(marvelLongList["Avengers: Infinity War"], yaxis = "domestic_total_gross", plotly = FALSE, time.range = length(marvelLongList["Avengers: Infinity War"])) + ylim(c(0, 700))
    gridExtra::grid.arrange(p1, p2, ncol = 2)
```p1 <- plot.movieMetrics(top1000.moviesList['Gifted'], yaxis = "domestic_total_gross", plotly = TRUE, time.range = length(top1000.moviesList["Gifted"])) + ylim(c(0, 700))

```r
    # append.to.metrics = FALSE is used to only add the gains to the movies infos data.frame
    gains <- infos(top1000.moviesList)$total_gross - infos(top1000.moviesList)$budget
    years <- infos(top1000.moviesList)$release_date %>% substr(start = 1, stop = 4) %>% as.numeric() %>% factor()
    ratios <- infos(top1000.moviesList)$total_gross / infos(top1000.moviesList)$budget %>% as.numeric()
    top1000.moviesList %<>% 
        addFeature(years, "year_of_release", append.to.metrics = FALSE, overwrite = TRUE) %>% 
        addFeature(gains, "gain", append.to.metrics = FALSE, overwrite = TRUE) %>% 
        addFeature(ratios, "ratio", append.to.metrics = FALSE, overwrite = TRUE)
    p <- ggplot(infos(top1000.moviesList), aes_string(x = "budget", y = "gains", col = "distributor")) +
        geom_point() + 
        facet_wrap("distributor") + 
        theme_bw() +
        theme(legend.position = "bottom")

    df <- infos(top1000.moviesList) %>%
        '['(.$distributor == 'Walt Disney',)
    p <- ggplot(df, aes_string(x = "year_of_release", y = "gain", fill = "year_of_release", label = "title")) +
        geom_violin(scale = "width") + 
        geom_jitter(position = position_jitter(0.2), size = 0.5, aes_string(col = "distributor")) +
        theme_bw() +
        theme(legend.position = "bottom") +
        ggrepel::geom_label_repel() +
        scale_fill_grey(start = 0.2, end = 0.8)

Adding IMDb data

    results <- list()
    for (title in infos(top1000.moviesList)$title) {
        message(title)
        results[title] <- findIMDbIndex(title, top1000.moviesList)
    }