knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  tidy = FALSE,
  cache = FALSE
)

options("getSymbols.warning4.0"=FALSE)

“If I could do it all again, I'd be a plumber.”

-- Albert Einstein

Preliminary Remarks

Document Version

Introduction

datap is a lightweight DSL (Domain Specific Language) to define configurable, modular, and re-usable data processes for use in the R programming language. datap contexts can be used to acquire, pre-process, quality-assure, and merge data in a way that is completely transparent to the user.

In practice, each datap setup will consist of the following elements:

  1. A datap context
  2. R functions (your own or from packages)
  3. The datap interpreter, i.e. te R datap package

Event though this document is about the first part mainly (the datap context file definition), let us briefly illustrate what is behind each of the three elements.

datap context

Each context is defined in in a yaml file and contains a series of hierarchically organised taps. Each tap represents a specific dataset, together with its source and pre-processing steps. Consider the following, very simple context that is provided with the datap package as context2.yaml:

filePath <- system.file("extdata", "context2.yaml", package="datap")
yamlString <- paste0(readLines(filePath), collapse = "\n")
cat(yamlString)

It defines three taps (Apple, Tesla, and S&P500), and organises stocks and indices neatly in a hierarchical structure.

R functions

The R functions do the actual units of work of the pre-processing steps defined in (1), like e.g. downloading data from the internet, data cleaning, merging, etc. The packages are typically datap-agnostic.

datap interpreter

The interpreter parses the datap context, and maps pre-processing steps defined in (1) to actual library functions available in (2), so as to provide - for each tap - an R function that can be called by the user of the library. If you have the datap package installed, you can load the context into memory using the datap::Load function:

library(datap)
filePath <- system.file("extdata", "context2.yaml", package="datap")
context <- Load(filePath)

The context looks like this:

context

And you can directly navigate to a tap and tap into the data:

teslaBars <- context$stocks$Tesla$tap()
head(teslaBars)

For the user of the context, it is completely transparent where the data is coming from and how it is pre-processed. For example, the S&P500 index is downloaded from Yahoo finance and not from Quandl. Yet, the user accesses the dataset in exactly the same way:

spx <- context$indices$`S&P500`$tap()
head(spx$GSPC.Open)

However, in a real world scenario additional pre-processing steps are necessary to make sure that the structure of the data is indeed the same for datasets from different sources.

In general, however, your data can be anything, and it can come from any source (the internet, a file, from memory, by calling an R function, or generated on the fly by your context, etc.).

Syntax Description Conventions

In this document, the datap syntax is described using the following conventions:

datap Syntax

context

A datap >context is defined in a single YAML document. A YAML document can contain at most one >context.

A >context spans a tree whose nodes are each one the following types of joints:

Each joint consists of the following:

  1. a mandatory type (tap, structure, pipe, junction, processor, warning, error)
  2. named elements, namely (depending on the >joint type);
  3. other, nested >joints

The flow of data is from leafs towards the root, and ends at a >tap. Thus, each sub-tree below a >tap defines the processing steps of a >tap. In line with data flow, we use the term upstream to denote joints that are processed before a given joint. We use downstream to denote joints that are processed after a given joint.

variables

Variables can be defined in a >structure, >tap, >pipe, and >junction.

A >variables section is an associative list, called "variables". Each variable is an entry in that list, with the key defining the variable name, and the value defining the variable value:

>structure|>tap|>pipe|>junction
  variables:
    n* $variableName: $value

The names of special references cannot be used as variable name (namely: "inflow", "joint", "context").

The scope of a variable is the sub-tree spanned by the joint in which the variable is defined. A variable value can be overwritten by an upstream joint.

Example:

Closing Prices:
  type: structure
  variables:
    series: Close
    startDate: 2000-01-01

reference

variable and parameter reference

A >reference has an $ prefix, and refers to a downstream >variable, a >parameter, or a >special reference.

You can use a >reference in a >parameter or in a >variable.

>parameters|>variables:
  $name: $@variableReferenceName

For example:

AAPL:
  type: tap
  variables:
    #variable reference
    #maxNaRatioDefault must be defined upstream
    maxNaRatio: $maxNaRatioDefault

You can also use a >reference in a >function.

For example:

AAPL:
  type: tap
  variables:
    ticker: "'YAHOO/AAPL'"
  download:
    type: processor
    function: Quandl::Quandl(code = $ticker, type = 'xts')

function|

special reference

The following variable references can be used without defining the variables downstream:

They are reserved words and cannot be used as variable names.

$inflow

The $inflow reference refers to the output of the upstream joints. For a pipe, there is a single upstream joint. For a junction, there can be more than one. In that case, the $inflow refers to the set of upstream outputs.

Example:

MinLength:
  type: error
  function: MinLength
  arguments:
    timeseries: '$inflow'
    minLength: 10

$joint

The $joint reference refers to the upstream joints. This is particularly useful in connection with factory joints.

Example:

Cache:
  type: factory
  function: Cache
  arguments:
    f: '$joint'
    timeout: 3600

$context

The $context reference refers to its surrounding >context.

It is useful to source data from within a >context, and to re-use it as an input into another >tap.

For example:

Tap:
  type: processor
  function: Tap
  arguments:
    context: '$context'
    tapPath: 'Closing Prices/Indices/SPX'

attributes

Attributes can contain information and/or meta data that is not part of the datap processing. For example, for each data series you can store a long name, description, etc. The datap interpreter then provide additional functionality, e.g. to find a >tap by attribute.

>pipe|>junction|>processor|>factory|>warning|>error|>structure
  attributes:
    n* @attributeName: @value

function

Definition

Functions are mapped to normal R functions.

In case of a >processor, the function actually does the unit of work and passes on the result to the downstream joint.

In case of an >error, or >warning, the function contains the logic to test for the error condition.

In case of an >attribute, >variable or >parameter the function

>processor|>error|>warning
  function: @functionName(n* arg1 [= @default1])

and

>attribute|>variable|>parameter
  @name: @functionName(n* arg1 [= @default1])

Syntax

The function syntax is similar to R, with a few differences:

  1. datap variables can be used, but they must be referenced using '$'. e.g. sum(2, $param1)
  2. eclipsis (three dots / ...) are not supported

However, you may nest functions, e.g. sum(2, sum(3, 5)), or sum(seq(1, 10)). You may also use named parameters, e.g. sum(2, 3, na.rm = TRUE).

Package Reference

You can use package notation to refer to a function in a package. For example:

Fill NAs:
  type: processor
  function: zoo::na.locf($inflow)

Execution Time

By default, functions are executed on tap time, i.e. when a user calls a tap. However, you may provide the interpreter with a directive to execute the function already at build time, i.e. when reading the context file and creating the context. This is defined by a . preceding the function name.

For example, the following variable takes on a new value each time the downstream tap is called:

variables:
  time: Sys.time()

To avoid that, you can use the dot-directive:

variables:
  time: .Sys.time()

However, if the function expression contains references to elements that are available at tap time only, then the dot-directive is ignored. For example:

MyStructure:
  type: structure
  variables:
    time: Sys.time()
  Substructure:
    type: structure
    variables:
      # $time cannot be resolved at build time, so . is
      # ignored and the execution time is tap
      yesterday: .subtract($time, -24*60*60)

Joints

Joints are the building blocks of any datap context, as explained in the Context section.

structure

>structure joints fulfil two purposes:

In terms of data processing, structures are of no relevance.

library(datap)
s <- datap:::GetSyntaxDefinition()
print(s$structure)

Consequentially:

  • a structure may never be upstream from a >tap
  • a structure has no other recognizable type declaration than being a named associative list. Thus, any named associative list inside a structure is itself a structure.
  • a pipe may be defined directly on a structure, without a tap. Such a pipe will not be accessible through the context, and its only purpose is to define a re-usable module

tap

A >tap defines an entry point to specific data, within a context.

Conceptually, you can think of a tap as a public function: when you open a tap (think "call the function"), data pours out (think: "data is returned as an output/return value").

print(s$tap)

There are only >structure joints downstream from a tap. There are no other >tap joints upstream from a tap.

parameters

A >tap may have 0 to n >parameters, allowing the caller of the tap to provide tap arguments.

By default, when you define a >tap in a >context, all non-resolved upstream function parameters are added as >parameter to the >tap.

However, you may also wish to define >parameters explicitly, mainly for the following reasons:

>tap
  parameters:
    n* $parameterName: [$defaultArgument]

For example:

AAPL: #tap name
  type: tap
  parameters:
    startDate: 2000-01-01
    endDate: Sys.Date()
    includeWeekends:

processor

Processors are the work-horse of a >tap. Each >processor defines a unit of work, such as data acquisition, cleaning, or other forms of pre-processing.

print(s$tap)

error and warning

>error and >warning joints allow testing the results of the upstream >processor joint.

>error and >warning joints are pass-through: the downstream $inflow and $joint variable references the joint's upstream joint.

An >error condition is a directive to the interpreter to stop execution and display an error message. A >warning condition is a directive to continue execution, and display a warning message.

print(s$error)
print(s$warning)

Example:

MinLength:
  type: error
  function: MinLength
  arguments:
    timeseries: '$inflow'
    minLength: 10

factory

>factory adds functional programming elements to datap.

A >factory is similar to a >processor. The difference is that:

  1. a factory's >function is executed only once, at >context creation time (and not at >tap call time)
  2. the result of the >function is expected to be itself a >function. That >function will then be invoked at >tap call time.
print(s$factory)

Example:

Cache:
  type: factory
  function: Cache
  arguments:
    f: '$joint'
    timeout: 3600

Interpretation: The function of the upstream joint is passed into the Cache function as its f argument. Cache is expected to be a function factory that returns, as an output a memoised version of $joint.

pipe

A >pipe joint lets you arrange a number of upstream joints sequentially.

print(s$pipe)

For example, the following >pipe first checks if the number of NAs in a series is below an inacceptable threshold (NA Ratio), then it backfills missing values (Fill NAs):

NA handling: &NaHandling
  type: pipe
  Fill NAs:
    type: processor
    function: zoo::na.locf
    arguments:
      object: '$inflow'
  NA Ratio:
    type: warning
    function: NaRatio
    arguments:
      timeseries: '$inflow'
      variable: '@series'
      maxRatio: '@maxNaRatio'

junction

A >junction merges multiple upstream joints into a single stream.

Unlike the >pipe, the >junction has a >function, which is a directive how to merge the upstream joints.

print(s$junction)

module

Modularization is achieved with YAML anchors and references. Modules that are not used in a tap can be put in a module section.

print(s$module)

For example:

modules:
  type: module
  #this module has no tap
  #it only serves as anchors for other taps
  NA handling: &NaHandling
    type: pipe
    Fill NAs:
      type: processor
      function: zoo::na.locf
      arguments:
        object: '$inflow'
    NA Ratio:
      type: warning
      function: NaRatio
      arguments:
        timeseries: '$inflow'
        variable: '@series'
        maxRatio: '@maxNaRatio'

Example

filePath <- system.file("extdata", "context1.yaml", package="datap")
yamlString <- paste0(readLines(filePath), collapse = "\n")
cat(yamlString)


gluc/datapR documentation built on May 17, 2019, 6:41 a.m.