The `etl` package provides a framework for working with medium data: data that are typically too large to fit comfortably in memory, but small enough to fit on disk.
The `etl` package provides a series of S3 generics and convenience functions that facilitate consistent development of `etl`-style packages, and several such `etl`-dependent packages have already been published. Please see those packages for examples of how to develop an `etl`-dependent package. This vignette documents how these extensions can be crafted.
Suppose that you want to write the `etl`-dependent package `foo`. In order for `foo` to work, it should implement at least one of these methods (and sometimes all three):

- `etl_extract.etl_foo()` - downloads data from the Internet
- `etl_transform.etl_foo()` - takes downloaded data and puts it into CSV format suitable for import into SQL
- `etl_load.etl_foo()` - imports data into a SQL database

That's basically it. The rest of the machinery should be taken care of for you. In particular, `etl_extract()`, `etl_transform()`, and `etl_load()` are all generic functions that have sensible default methods:
- `etl_extract.default()` - pulls all of the `data.frame`s available through the `data()` function for `foo`, and writes them as CSVs
- `etl_transform.default()` - simply copies all of the CSVs to the load directory
- `etl_load.default()` - imports all of those CSVs into a SQL database

Note that you don't have to write an `etl` method to handle `foo`. You do, however, have to have the `foo` package installed in order for the `etl` instantiation function to work.
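To make this concrete, here is a minimal sketch of what an extract method for a hypothetical `foo` package might look like. The URL, the file name, and the use of the `raw_dir` attribute as the download location are illustrative assumptions, not part of any real `foo` package.

```r
library(etl)

# Sketch only: a hypothetical extract method for an etl_foo object.
etl_extract.etl_foo <- function(obj, ...) {
  # placeholder URL -- a real package would point at its actual data source
  src <- "https://example.com/foo_data.csv"
  # etl objects carry a raw_dir attribute pointing to the download directory
  dest <- file.path(attr(obj, "raw_dir"), basename(src))
  download.file(src, destfile = dest)
  # return the etl object invisibly so the method remains pipeable
  invisible(obj)
}
```

Transform and load methods follow the same pattern: perform the side-effect, then invisibly return the `etl_foo` object.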
```r
library(etl)
foo <- etl("foo")
```
To see the default methods in action, pick a package with some data and import it.
```r
ggplots <- etl("ggplot2") %>% etl_update()
src_tbls(ggplots)
```
## Main etl methods

Each of the three main `etl` methods must take an `etl_foo` object as its first argument, and (should invisibly) return an `etl_foo` object. These methods are pipeable and predictable, but not pure, since by design they have side-effects (i.e., downloading files, etc.). Your major task in writing the `foo` package will be to write these functions. How you write them is entirely up to you, and the particular implementation will of course depend on what the purpose of `foo` is.
All three of the main `etl` methods should take the same set of arguments. Most commonly these define the span of time for the files that you want to extract, transform, or load. For example, in the `airlines` package, these functions take optional `year` and `month` arguments.
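As a sketch, a shared signature in the spirit of the `airlines` example might look like the following; the exact argument names and the elided bodies are illustrative, not taken from a real package.

```r
# Sketch: all three methods accept the same optional time-span arguments,
# so that etl_update() can pass them through unchanged.
etl_extract.etl_foo <- function(obj, year = 2019, month = 1:12, ...) {
  # ...download only the files for the requested year/month...
  invisible(obj)
}

etl_transform.etl_foo <- function(obj, year = 2019, month = 1:12, ...) {
  # ...transform only those files...
  invisible(obj)
}
```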
We illustrate with `cities`, which unfortunately takes only `...`. Also, `etl_cities` uses `etl_load.default()`, so there is no `etl:::etl_load.etl_cities()` method.
```r
etl_extract.etl_cities %>% args()
etl_transform.etl_cities %>% args()
etl_load.etl_cities %>% args()
```
## Additional etl methods

There are four additional functions in the `etl` toolchain:

- `etl_init()` - initializes the database
- `etl_cleanup()` - deletes unnecessary files
- `etl_update()` - runs `etl_extract()`, `etl_transform()`, and `etl_load()` in succession with the same arguments
- `etl_create()` - runs `etl_init()`, `etl_update()`, and `etl_cleanup()` in succession

These functions can generally be used without modification and thus are not commonly extended by `foo`.
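Put together, the relationships above imply that the following two pipelines do the same work (hypothetical `foo` package, as before):

```r
library(etl)
foo <- etl("foo")

# etl_create() bundles the whole toolchain...
foo %>% etl_create()

# ...and is equivalent to running each step explicitly:
foo %>%
  etl_init() %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load() %>%
  etl_cleanup()
```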
The `etl_init()` function will initialize the SQL database.

If you want to contribute your own hard-coded SQL initialization script, it must be placed in `inst/sql/`. The `etl_init()` function will look there and find files whose file extensions match the database type. For example, scripts written for MySQL should have the `.mysql` file extension, while scripts written for PostgreSQL should have the `.postgresql` file extension.

If no such file exists, all of the tables and views in the database will be deleted, and new table schemas will be created on the fly by `dplyr`.
## etl_foo object attributes

Every `etl_foo` object has a directory where it can store files and a `DBIConnection` where it can write to a database. By default, these come from `tempdir()` and `RSQLite::SQLite()`, but the user can alternatively specify other locations.
```r
cities <- etl("cities")
str(cities)
```
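Because the defaults are `tempdir()` and an in-session RSQLite connection, both the files and the database vanish when the R session ends. A sketch of overriding them with persistent locations (the directory path and database file name here are illustrative):

```r
library(etl)

# A persistent on-disk SQLite database and a local directory for files
db <- DBI::dbConnect(RSQLite::SQLite(), "cities.sqlite3")
cities <- etl("cities", db = db, dir = "~/data/cities")
```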
Note that an `etl_foo` object is also a `src_dbi` object and a `src_sql` object. Please see the `dbplyr` vignette for more information about these database connections.
```r
citation("etl")
citation("dplyr")
citation("dbplyr")
```