ptools
is a package to help you organize your data pipeline project.
Since setting up a project follows recurrent steps, a default procedure
is suggested here to save time. The purpose is also to ease project
upgrades and allow unit testing.
This package will not be deployed to CRAN; you need to install it from GitHub.
# To install from GitHub you need the devtools package first
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
# Then you may install ptools
devtools::install_github("ND-open/ptools")
The first step of a project is to define what you want to do with your data. If your data has several uses, you may want to split the project into pieces. Straightforward runs will give you quicker wins and also keep your teams (end users, devs, …) motivated.
By default the following structure is created to keep track of your data processing :
# Assuming HDFS organisation
hdfs
|- landing
|- data
   |- raw
   |- intermediate
   |- final
- `landing` : contains only outside raw data that will be processed then archived. If things go wrong you will be able to start again from scratch from there.
- `data` : your operations start here; the default is to process only csv files before building Impala or Hive tables.
- `raw` : for primary operations, such as converting the format of your data (e.g. from json to csv).
- `intermediate` : optional; depending on your pipeline you may want to reshape the data or add more cleaning steps.
- `final` : stores the cleaned data upon which you want to build Impala or Hive tables. The final clean data will be automatically converted to Impala (or Hive) tables from the final folder(s) on HDFS with the corresponding types. You may then aggregate it using Hadoop.

There are several operations that you cannot perform easily, or at all, in Impala/Hive (such as complex data transformations, reshaping, …). But you should perform joins using Hadoop to ensure that all the data is matched as intended no matter how late that data was uploaded, and to ensure that your project stays viable in time (e.g. 5 years from now your R code might not be optimized to join on much more data).
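The folder layout above can be created once per project. A minimal sketch in R, assuming a local mirror of the HDFS tree (on a real cluster you would use `hdfs dfs -mkdir -p` instead; the helper name is hypothetical):

```r
# Hypothetical helper: create the landing/data folder layout locally.
create_pipeline_dirs <- function(root = "hdfs") {
  dirs <- file.path(root, c("landing",
                            file.path("data", c("raw", "intermediate", "final"))))
  # recursive = TRUE also creates the intermediate `data` directory
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  invisible(dirs)
}

create_pipeline_dirs()
```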
Thus you can focus on cleaning your data without touching the
landing folder. The structure of your pipeline should look like:
project_name
|- .gitignore
|- data
|- references.csv
|- documents
|- meeting_notes.md
|- R
|- raw_*.R # e.g : raw_to_csv.R
|- inter_*.R # e.g : inter_reshape.R
|- final_*.R # e.g : final_types.R
|- project_name.Rproj
|- README.md
|- README.Rmd
|- vignette
|- report.Rmd
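The skeleton above could be scaffolded with a few lines of R; this is only a sketch, and the `create_project()` helper name is an assumption, not a documented ptools function:

```r
# Hypothetical scaffold for the project structure shown above.
create_project <- function(name) {
  dir.create(name, showWarnings = FALSE)
  for (d in c("data", "documents", "R", "vignette")) {
    dir.create(file.path(name, d), showWarnings = FALSE)
  }
  # empty placeholder files, to be filled in later
  for (f in c(".gitignore", "README.md", "README.Rmd",
              paste0(name, ".Rproj"))) {
    file.create(file.path(name, f))
  }
  invisible(name)
}

create_project("project_name")
```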
- `inter_reshape.R` is an intermediate job that takes data from the `raw` folder, reshapes it (e.g. from long to wide format) and writes to the `intermediate` folder.
- In the `data` folder one can put a variables dictionary.
- The `README.Rmd` file is the minimal documentation for you or a peer to pick the project up later on.

It is your choice to build a package from this structure to ensure reproducibility and stability of your code in time. You can also source your scripts for each pipeline job.
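As an illustration, an `inter_*.R` job could look like the sketch below. File and column names are illustrative assumptions, not part of ptools; the long-to-wide step uses base `stats::reshape`:

```r
# inter_reshape.R -- sketch of an intermediate pipeline job.
# Read the output of the raw_* step (hypothetical file name).
raw <- read.csv(file.path("data", "raw", "measures.csv"))

# Reshape from long to wide: one row per `id`, one column per `variable`
wide <- reshape(raw,
                idvar     = "id",
                timevar   = "variable",
                direction = "wide")

# Write the result for the final_* step to pick up
write.csv(wide,
          file.path("data", "intermediate", "measures_wide.csv"),
          row.names = FALSE)
```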
If you rely often on this pipeline or plan to keep the code stable in
time, the best thing to do is to package your code (starting from the
structure above, add a DESCRIPTION
file, then build and debug) and add a
Dockerfile.