ptools


Overview

ptools is a package to help you organize your data pipeline project. Since setting up a project follows recurring steps, a default procedure is suggested here to save time. The purpose is also to ease project upgrades and to allow unit testing.

Installation

This package will not be deployed to CRAN; you need to install it from GitHub.

# To install from GitHub you first need the devtools package
if (!require("devtools", character.only = TRUE)) { install.packages("devtools") }

# Then you may install ptools
devtools::install_github("ND-open/ptools")
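
Once installed, the package is loaded like any other R package:

# Load the package for the current session
library(ptools)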

Organize your projects

The first step of any project is to define what you want to do with your data. If your data serves several uses, you may want to split the project into pieces. Straightforward runs will give you quicker wins and also keep your teams (end users, devs, …) motivated.

By default, the following structure is created to keep track of your data processing:

# Assuming HDFS organisation
hdfs
|- landing
|- data
    |- raw
    |- intermediate
    |- final
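
As a minimal sketch (not part of ptools), this layout can be created from R by calling the hdfs command-line client, assuming it is available on the PATH; the paths below are only examples:

# Illustrative only: create the HDFS layout with the hdfs CLI (paths are examples)
dirs <- c("/project/landing", "/project/data/raw",
          "/project/data/intermediate", "/project/data/final")
for (d in dirs) {
  system2("hdfs", c("dfs", "-mkdir", "-p", d))
}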

There are several operations that you cannot perform easily, or at all, in Impala/Hive (such as complex data transformations, reshaping, …). Joins, however, should be performed using Hadoop, to ensure that all the data is matched as intended no matter how late it was uploaded, and to ensure that your project remains viable over time (e.g. five years from now your R code might not be able to join on much larger data).
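
For example, a heavy join can be pushed down to Impala/Hive from R through DBI. The sketch below assumes an ODBC DSN named "Impala" and uses hypothetical database and table names; adapt them to your cluster:

# Illustrative only: run the heavy join on the cluster, not in R
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "Impala")   # DSN name is an assumption
dbExecute(con, "
  CREATE TABLE db.raw_joined AS
  SELECT a.*, b.extra_col
  FROM db.landing_a a
  JOIN db.landing_b b ON a.id = b.id
")
dbDisconnect(con)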

Focus on cleaning the data

This way you can focus solely on cleaning your data without touching the landing folder. The structure of your pipeline should look like:

project_name
|- .gitignore
|- data
   |- references.csv
|- documents
   |- meeting_notes.md
|- R
   |- raw_*.R # e.g. raw_to_csv.R
   |- inter_*.R # e.g. inter_reshape.R
   |- final_*.R # e.g. final_types.R
|- project_name.Rproj
|- README.md
|- README.Rmd
|- vignette
   |- report.Rmd
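
To illustrate the naming convention, a raw_* script only moves data from the landing area into the raw layer without transforming it. The file names and paths below are hypothetical:

# R/raw_to_csv.R -- illustrative sketch, file names and paths are examples
landing_file <- "landing/extract_2019.txt"   # hypothetical landing extract
raw <- read.delim(landing_file, stringsAsFactors = FALSE)

# Write an untouched copy to the raw layer
write.csv(raw, "data/raw/extract_2019.csv", row.names = FALSE)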

To package or not to package

It is your choice whether to build a package from this structure, to ensure the reproducibility and stability of your code over time, or simply to source your scripts for each pipeline job.
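
If you source the scripts rather than building a package, the raw_/inter_/final_ prefixes give a natural execution order. A minimal runner sketch, assuming the scripts live in R/ and can be run in file-name order:

# Illustrative runner: source each pipeline stage in order
for (stage in c("raw_", "inter_", "final_")) {
  scripts <- list.files("R", pattern = paste0("^", stage, ".*\\.R$"),
                        full.names = TRUE)
  for (s in scripts) source(s)
}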

If you rely on this pipeline often, or plan to keep the code stable over time, the best thing to do is to package your code (starting from the structure above, this means adding a DESCRIPTION file, then compiling and debugging) and to add a Dockerfile.
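
With the devtools/usethis tooling, turning the structure above into a package takes only a few calls. This is a sketch, assuming it is run from the project root:

# Illustrative: add packaging metadata, then document, check and build
usethis::use_description()   # creates the DESCRIPTION file
devtools::document()         # generate roxygen documentation, if any
devtools::check()            # the compile + debug loop
devtools::build()            # produces the installable .tar.gz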

What could be useful as a follow-up


