```r
library(imdb)
```
Since the IMDB is very large (many gigabytes), it is best to store the data in a persistent SQL database. By default, `etl` will create an RSQLite database for you in a temporary directory -- but this is not a very safe place to store these data. Instead, we will connect to an existing (but empty) PostgreSQL database.
```r
if (require(RPostgreSQL)) {
  # must have a pre-existing PostgreSQL database "imdb"
  db <- src_postgres(host = "localhost", user = "postgres",
                     password = "postgres", dbname = "imdb")
}
```
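For comparison, here is a minimal sketch of what the default setup looks like if you skip the `db` argument (the `imdb_lite` name is illustrative):

```r
# With no db argument, etl() creates an RSQLite database in a
# temporary directory -- convenient for experimenting, but the
# data will not survive across R sessions.
imdb_lite <- etl("imdb")
```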
Since you will be downloading lots of data, you will probably want to specify a directory in which to store the raw data (which will take up several gigabytes on disk). Again, `etl` will create a directory for you if you don't specify one, but it will live in a temporary directory that is not a safe place to keep these data.
```r
imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")
```
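A quick way to confirm where the raw files will land; this sketch assumes the `etl` convention of splitting `dir` into `raw` and `load` subdirectories, with the paths recorded as attributes on the object:

```r
# The download location is stored in the object's attributes
# (assuming the etl raw_dir/load_dir convention).
attr(imdb, "raw_dir")

# Nothing has been downloaded yet, so this should be empty.
list.files(attr(imdb, "raw_dir"))
```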
The first phase is to Extract the data from IMDB. This may take a while. There are 49 files that take up approximately 2 GB on disk. By default, only the `movies`, `actors`, `actresses`, and `directors` files will be downloaded, but even these take up more than 500 MB of disk space.
```r
imdb %>% etl_extract(tables = c("movies"))
```
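To fetch all four of the default files explicitly rather than just `movies`, you could pass the full set; a sketch equivalent to the defaults described above:

```r
# Extract the four files that etl_extract() fetches by default.
imdb %>%
  etl_extract(tables = c("movies", "actors", "actresses", "directors"))
```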
Mercifully, there is no Transform phase for these data. However, the Load phase can take a loooooong time.
The Load phase leverages the Python module IMDbPY, which has external dependencies of its own. Please see the .travis.yml file for a list of those dependencies (on Ubuntu -- your configuration may be different).
You'll want to leave this running overnight: loading the full set of files took about 54 hours!
```r
imdb %>% etl_load()
```
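If you would rather run the whole pipeline in one call, the `etl` framework also provides `etl_update()`, which chains the phases together; a sketch:

```r
# etl_update() is shorthand for
# etl_extract() %>% etl_transform() %>% etl_load()
imdb %>% etl_update(tables = c("movies"))
```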
Once everything is complete, you can query your fresh copy of the IMDB.
```r
movies <- imdb %>% tbl("title")
movies %>% filter(title == "Star Wars")
```
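Since `movies` is a remote dplyr table, further verbs are translated to SQL, and `collect()` pulls the result into R. A short sketch; the `production_year` column name is assumed from the IMDbPY schema for the `title` table:

```r
# Count titles per production year and bring the result into a local
# data frame (production_year is assumed from IMDbPY's title table).
movies %>%
  group_by(production_year) %>%
  summarize(num_titles = n()) %>%
  collect()
```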