etl is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.
etlInstantiate an etl object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars database is built into etl.
library(etl) cars <- etl("mtcars") class(cars)
Pay careful attention to where the SQLite database is stored. The default location is a temporary directory, but you will want to move this to a more secure location if you want this storage to be persistent. See file.copy() for examples on how to move a file.
etl works with a local or remote database to store your data. Every etl object extends a dplyr::src_dbi object. If, as in the example above, you do not specify a SQL source, a local RSQLite database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi.
Note: If you want to use a database other than a local RSQLite, you must create the
mtcarsdatabase and have permission to write to it first!
# For PostgreSQL library(RPostgreSQL) db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost") # Alternatively, for MySQL library(RMariaDB) db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost") cars <- etl("mtcars", db)
At the heart of etl are three functions: etl_extract(), etl_transform(), and etl_load().
The first step is to acquire data from an online source.
cars |> etl_extract()
This creates a local store of raw data.
These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).
cars |> etl_transform()
Populate the SQL database with the transformed data.
cars |> etl_load()
To populate the whole database from scratch, use etl_create.
cars |> etl_create()
You can also update an existing database without re-initializing, but watch out for primary key collisions.
cars |> etl_update()
Under the hood, there are three functions that etl_update chains together:
getS3method("etl_update", "default")
etl_create is simply a call to etl_update that forces the SQL database to be written from scratch.
getS3method("etl_create", "default")
Now that your database is populated, you can work with it as a src data table just like any other dplyr source.
cars |> tbl("mtcars") |> group_by(cyl) |> summarise(N = n(), mean_mpg = mean(mpg))
etlSuppose you want to create your own ETL package called pkgname. All you have to do is write a package that requires etl, and then you have to write one S3 methods:
etl_extract.etl_pkgname()
You may also wish to write
etl_transform.etl_pkgname() etl_load.etl_pkgname()
All of these functions must take and return an object of class etl_pkgname that inherits from etl. Please see the "Extending etl" vignette for more information.
Packages that use the etl framework are available on CRAN and/or GitHub:
tools::dependsOnPkgs("etl")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.