
Getting Started

imdb is a lightweight package that leverages the Python module IMDbPy and the etl framework to make mirroring the IMDB in SQL painless, with all user interaction taking place within R.

Prerequisites

You must install the Python module IMDbPy, which has external dependencies of its own. Binaries for Mac OS X and Windows are available from the project's Download page. You may want to consult the .travis.yml file for a list of those dependencies. On Ubuntu, the following command should install everything you need.

```bash
sudo apt-get install python-imdbpy python-sqlalchemy python-sqlobject python-psycopg2 python-mysqldb
```

You will also need to install the `etl` package, which is available from CRAN.

```r
install.packages("etl")
```

Installation

imdb itself must be installed from GitHub.

```r
devtools::install_github("beanumber/imdb")
library(imdb)
```

Instantiate an object

Since the IMDB is very large (many gigabytes), it is best to store the data in a persistent SQL database. By default, etl will create an RSQLite database for you in a temp directory, but that is not a safe place to store these data. Instead, we will connect to an existing (but empty) MySQL database using a local option file.

```r
# must have pre-existing database "imdb"
db <- src_mysql_cnf(dbname = "imdb")
```

Since you will be downloading lots of data, you will probably want to specify a directory in which to store the raw data (which will take up several gigabytes on disk). Again, etl will create a directory for you if you don't, but it will be inside a temp directory, which is not a safe place for files you want to keep.

```r
imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")
```
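It can be convenient to create the raw-data directory up front so that its location is explicit. The following is a minimal base-R sketch; `ensure_dir()` is a hypothetical helper introduced here for illustration and is not part of etl or imdb.

```r
# Hypothetical helper (not part of etl or imdb): create a directory if it
# is missing, including any parent directories, and return its full path.
ensure_dir <- function(path) {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  normalizePath(path)
}

# Example use (the path is the one used in this README):
# raw_dir <- ensure_dir("~/dumps/imdb")
```

The returned path could then be passed as the `dir` argument to `etl()`.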

Performing the ETL steps

The first phase is to Extract the data from IMDB. This may take a while. There are 47 files that take up approximately 2 GB on disk. By default, only the movies, actors, actresses, and directors files will be downloaded, but even these take up more than 500 MB of disk space.

```r
imdb %>%
  etl_extract()
```
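To check how much space the extracted files actually occupy, the file sizes can be summed in base R. This is a sketch only: `dir_size_mb()` is a hypothetical helper, and it assumes the raw files live somewhere under the directory you passed to `etl()` (the exact subdirectory layout is not described here, so the search is recursive).

```r
# Hypothetical helper: total size, in megabytes, of all files found
# anywhere under `path`.
dir_size_mb <- function(path) {
  files <- list.files(path.expand(path), recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^2
}

# Example use (the path is the one used in this README):
# dir_size_mb("~/dumps/imdb")
```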

Mercifully, there is no Transform phase for these data. However, the Load phase can take a loooooong time.

The load phase leverages the Python module IMDbPy, which also has external dependencies. Please see the .travis.yml file for a list of those dependencies (on Ubuntu -- your configuration may be different).

You may want to leave this running. Loading the full set of files took about 90 minutes and occupied about 9.5 gigabytes on disk.

```r
imdb %>%
  etl_load()
# TIME FINAL : 88min, 20sec (wall) 25min, 46sec (user) 0min, 11sec (system)
summary(imdb)
```
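The timing line above is printed by the load step itself. If you want to time a step from within R, base R's `system.time()` can wrap the call. In this sketch, `Sys.sleep()` is only a stand-in for the real load so that the snippet runs on its own.

```r
# Time an expression; replace the Sys.sleep() stand-in with the real call,
# e.g. imdb %>% etl_load()
timing <- system.time({
  Sys.sleep(0.1)
})

# Elapsed wall-clock time in seconds
timing[["elapsed"]]
```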

Query the database

Once everything is completed, you can query your fresh copy of the IMDB, for example to find the movies titled exactly "Star Wars":

```r
movies <- imdb %>%
  tbl("title")
movies %>%
  filter(title == "Star Wars" & kind_id == 1)
```
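A query like this is lazy: dplyr translates it to SQL, and the rows are only brought into R when you ask for them, for example with `collect()`. The sketch below shows the pattern on a toy in-memory SQLite table rather than the real database. It assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed; the rows are made up for illustration, and the `production_year` column is an assumption about the `title` table's schema.

```r
library(dplyr)

# Toy stand-in for the real database: an in-memory SQLite table whose
# rows are made up for illustration only.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "title", data.frame(
  id = 1:3,
  title = c("Star Wars", "Star Wars", "Example Film"),
  kind_id = c(1L, 6L, 1L),
  production_year = c(1977L, 1983L, 2000L),
  stringsAsFactors = FALSE
))

movies <- tbl(con, "title")

# Build the query lazily, then pull the matching rows into a local tibble.
star_wars <- movies %>%
  filter(title == "Star Wars", kind_id == 1) %>%
  select(title, production_year) %>%
  collect()

DBI::dbDisconnect(con)
```

Calling `show_query()` on the lazy table before `collect()` prints the SQL that dplyr generates, which can help when a query is slow on the full database.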


beanumber/imdb documentation built on May 12, 2019, 9:43 a.m.