etl_create: ETL functions for working with medium sized data

View source: R/etl_create.R

etl_cleanupR Documentation

ETL functions for working with medium sized data

Description

These generic functions provide a systematic approach for performing ETL (exchange-transform-load) operations on medium sized data.

Usage

etl_cleanup(obj, ...)

## Default S3 method:
etl_cleanup(
  obj,
  delete_raw = FALSE,
  delete_load = FALSE,
  pattern = "\\.(csv|zip)$",
  ...
)

etl_create(obj, ...)

## Default S3 method:
etl_create(obj, ...)

etl_update(obj, ...)

## Default S3 method:
etl_update(obj, ...)

etl_extract(obj, ...)

## Default S3 method:
etl_extract(obj, ...)

## S3 method for class 'etl_mtcars'
etl_extract(obj, ...)

## S3 method for class 'etl_cities'
etl_extract(obj, ...)

etl_load(obj, ...)

## Default S3 method:
etl_load(obj, ...)

etl_transform(obj, ...)

## Default S3 method:
etl_transform(obj, ...)

## S3 method for class 'etl_cities'
etl_transform(obj, ...)

Arguments

obj

an etl object

...

arguments passed to methods

delete_raw

should files be deleted from the raw_dir?

delete_load

should files be deleted from the load_dir?

pattern

regular expression matching file names to be deleted. By default, this matches filenames ending in .csv and .zip.

Details

The purposes of these functions are to download data from a particular data source from the Internet, process it, and load it into a SQL database server.

There are five primary functions:

etl_init

Initialize the database schema.

etl_extract

Download data from the Internet and store it locally in its raw form.

etl_transform

Manipulate the raw data such that it can be loaded into a database table. Usually, this means converting the raw data to (a series of) CSV files, which are also stored locally.

etl_load

Load the transformed data into the database.

etl_cleanup

Perform housekeeping, such as deleting unnecessary raw data files.

Additionally, two convenience functions chain these operations together:

etl_create

Run all five functions in succession. This is useful when you want to create the database from scratch.

etl_update

Run the etl_extract-etl_transform-etl_load functions in succession. This is useful when the database already exists, but you want to insert some new data.

Value

Each one of these functions returns an etl object, invisibly.

See Also

etl, etl_init

Examples


## Not run: 
if (require(RPostgreSQL)) {
  db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
  cars <- etl("mtcars", db)
}
if (require(RMySQL) && mysqlHasDefault()) {
  db <- src_mysql(dbname = "mtcars", user = "r-user",
                  host = "localhost", password = "mypass")
  cars <- etl("mtcars", db)
}

## End(Not run)
cars <- etl("mtcars")
cars %>%
 etl_extract() %>%
 etl_transform() %>%
 etl_load() %>%
 etl_cleanup()
cars

cars %>%
 tbl(from = "mtcars") %>%
 group_by(cyl) %>%
 summarise(N = n(), mean_mpg = mean(mpg))

 # do it all in one step, and peek at the SQL creation script
 cars %>%
   etl_create(echo = TRUE)
 # specify a directory for the data
 ## Not run: 
 cars <- etl("mtcars", dir = "~/dumps/mtcars/")
 str(cars)
 
## End(Not run)
cars <- etl("mtcars")
# Do it step-by-step
cars %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load()

# Note the somewhat imprecise data types for the columns. These are the default.
tbl(cars, "mtcars")

# But you can also specify your own schema if you want
schema <- system.file("sql", "init.sqlite", package = "etl")
cars %>%
  etl_init(schema) %>%
  etl_load()

etl documentation built on Oct. 13, 2023, 1:08 a.m.