Roll your own IMDB SQL database

library(imdb)

Since the IMDB is very large (many gigabytes), it is best to store the data in a persistent SQL database. By default, etl will create an RSQLite database for you in a temporary directory -- but that is not a safe place to store these data. Instead, we will connect to an existing (but empty) PostgreSQL database.

if (require(RPostgreSQL)) {
  # must have a pre-existing database "imdb"
  db <- src_postgres(host = "localhost", user = "postgres", password = "postgres", dbname = "imdb")
}
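If you don't have a PostgreSQL server handy, a persistent SQLite file in a directory you control works too. This is a minimal sketch; the file path is hypothetical and the `src_sqlite()` call uses the dplyr API of the same vintage as the `src_postgres()` example above:

```r
# Alternative: a persistent SQLite database stored at a path you choose
# (the path here is illustrative -- pick any writable location)
if (require(RSQLite)) {
  db <- src_sqlite(path = "~/dumps/imdb.sqlite3", create = TRUE)
}
```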

Since you will be downloading lots of data, you will probably want to specify a directory in which to store the raw data (which will take up several gigabytes on disk). Again, etl will create a directory for you if you don't specify one, but it will be a temporary directory that is not a safe place to keep these files.

imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")
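If the directory doesn't exist yet, you can create it before handing it to etl. A minimal sketch, using the same path as the example above:

```r
# Ensure the dump directory exists (no warning if it already does)
dir.create("~/dumps/imdb/", recursive = TRUE, showWarnings = FALSE)
imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")
```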

Performing the ETL steps

The first phase is to Extract the data from IMDB. This may take a while. There are 49 files that take up approximately 2 GB on disk. By default, only the movies, actors, actresses, and directors files will be downloaded, but even these take up more than 500 MB of disk space.

imdb %>%
  etl_extract(tables = c("movies"))
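To fetch the full default set rather than just the movies file, you can list the other tables explicitly. A sketch, assuming the table names match the four default files mentioned above:

```r
# Extract the four files that are downloaded by default
imdb %>%
  etl_extract(tables = c("movies", "actors", "actresses", "directors"))
```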

Mercifully, there is no Transform phase for these data. However, the Load phase can take a loooooong time.

The Load phase leverages the Python module IMDbPY, which has its own external dependencies. Please see the .travis.yml file for a list of those dependencies (on Ubuntu -- your configuration may be different).

You'll want to leave this running overnight -- loading the full set of files took about 54 hours!

imdb %>%
  etl_load()
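Alternatively, the etl package's generic `etl_update()` chains the extract, transform, and load steps in one call (the transform step is a no-op for these data). A sketch; whether `tables` is forwarded to each step depends on the imdb methods:

```r
# One-shot pipeline: extract, (no-op) transform, and load
imdb %>%
  etl_update(tables = c("movies"))
```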

Query the database

Once everything is complete, you can query your fresh copy of the IMDB.

movies <- imdb %>%
  tbl("title")
movies %>%
  filter(title == "Star Wars")
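Note that `tbl()` returns a lazy remote table, so the filter above runs inside the database. To pull the matching rows into an R data frame, add a `collect()` call:

```r
# Execute the query and bring the results into R
star_wars <- movies %>%
  filter(title == "Star Wars") %>%
  collect()
```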


beanumber/imdb documentation built on May 12, 2019, 9:43 a.m.