etl_extract.etl_imdb: Set up local IMDB

Description

Download the raw data files from IMDB

Usage

## S3 method for class 'etl_imdb'
etl_extract(obj, tables = c("movies", "actors",
  "actresses", "directors"), all.tables = FALSE, ...)

## S3 method for class 'etl_imdb'
etl_load(obj, path_to_imdbpy2sql = NULL,
  password = "", ...)

etl_load_data(obj, ...)

## S3 method for class 'src_mysql'
etl_load_data(obj, ...)

Arguments

obj

an etl object

tables

a character vector of files from IMDB to download. The default is movies, actors, actresses, and directors. These four files alone will occupy more than 500 MB of disk space. There are 49 total files available on IMDB. See ftp://ftp.fu-berlin.de/pub/misc/movies/database/ for the complete list.

all.tables

a logical indicating whether you want to download all of the tables. Default is FALSE.

...

arguments passed to methods

path_to_imdbpy2sql

a path to the imdbpy2sql.py Python script provided by IMDbPy. If NULL (the default), the function will attempt to find it using findimdbpy2sql.

password

the database password. You must re-enter it here unless your password is blank. The real password will not be shown in messages.
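
As a rough illustration of how the tables and all.tables arguments might interact, the sketch below resolves a requested set of tables into download URLs. The helper imdb_file_urls, the ".list.gz" file-name suffix, the rule that all.tables overrides tables, and the shortened list of available tables are all assumptions made for illustration; none of them is taken from the package source.

```r
# Hypothetical helper: NOT part of the imdb package.
imdb_file_urls <- function(tables = c("movies", "actors", "actresses", "directors"),
                           all.tables = FALSE,
                           # illustrative subset; the mirror lists 49 files
                           available = c("movies", "actors", "actresses",
                                         "directors", "genres", "ratings"),
                           base_url = "ftp://ftp.fu-berlin.de/pub/misc/movies/database/") {
  # Assumption: all.tables = TRUE overrides whatever was passed in `tables`
  if (all.tables) {
    tables <- available
  }
  # Drop any requested names that are not in the known list
  tables <- intersect(tables, available)
  # Assumed naming scheme: <table>.list.gz
  urls <- sprintf("%s%s.list.gz", base_url, tables)
  names(urls) <- tables
  urls
}

# Resolve a single table to its (assumed) download URL
print(imdb_file_urls(tables = "movies"))
```

Keeping the default to four tables reflects the disk-space warning above: requesting everything is an explicit opt-in.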

Details

For best performance, set the MySQL default collation to utf8_unicode_ci. See the IMDbPy2sql documentation at http://imdbpy.sourceforge.net/docs/README.sqldb.txt for more details.
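
One way to follow the collation advice is to issue an ALTER DATABASE statement before loading. The sketch below only builds the SQL string. The helper name collation_sql is hypothetical, the choice of the utf8 character set is an assumption, and the commented-out DBI call assumes an open MySQL connection named con.

```r
# Hypothetical helper: builds a MySQL statement that sets the default
# character set and collation for a database. Not part of the imdb package.
collation_sql <- function(dbname, charset = "utf8", collation = "utf8_unicode_ci") {
  # Only allow simple identifiers, since the name is pasted into the SQL text
  stopifnot(grepl("^[A-Za-z0-9_]+$", dbname))
  sprintf("ALTER DATABASE `%s` CHARACTER SET %s COLLATE %s",
          dbname, charset, collation)
}

sql <- collation_sql("imdb")
print(sql)

# With an open connection `con` (an assumption), this could be run as:
# DBI::dbExecute(con, sql)
```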

Please be aware that IMDB contains information about *all* types of movies.

Source

IMDB: ftp://ftp.fu-berlin.de/pub/misc/movies/database/temporaryaccess/

IMDbPy: http://imdbpy.sourceforge.net/

Examples

# Connect using default RSQLite database
imdb <- etl("imdb")

# Connect using pre-configured PostgreSQL database
## Not run: 
 if (require(RPostgreSQL)) {
   # must have pre-existing database "imdb"
   db <- src_postgres(host = "localhost", user="postgres", 
                      password="postgres", dbname = "imdb")
  }
  imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")
  imdb %>%
    etl_extract(tables = "movies") %>%
    etl_load()

## End(Not run)
## Not run: 
 if (require(RMySQL)) {
   # must have pre-existing database "imdb"
   db <- src_mysql_cnf(dbname = "imdb")
  }
  imdb <- etl("imdb", db = db, dir = "~/dumps/imdb/")
  imdb %>%
    etl_extract(tables = "movies") %>%
    etl_load()
    
  movies <- imdb %>%
    tbl("title") 
  movies %>%
    filter(title == 'star wars')
    
  people <- imdb %>%
    tbl("name") 
  roles <- imdb %>%
    tbl("cast_info") 
  movies %>%
    inner_join(roles, by = c("id" = "movie_id")) %>%
    inner_join(people, by = c("person_id" = "id")) %>%
    filter(title == 'star wars') %>%
    filter(production_year == 1977) %>%
    arrange(nr_order)
  

## End(Not run)

beanumber/imdb documentation built on May 12, 2019, 9:43 a.m.