knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  tidy = FALSE,
  cache = FALSE
)
options("getSymbols.warning4.0" = FALSE)
“If I could do it all again, I'd be a plumber.”
-- Albert Einstein
datap is a lightweight DSL (Domain Specific Language) to define configurable, modular, and re-usable data processes for use in the R programming language. datap contexts can be used to acquire, pre-process, quality-assure, and merge data in a way that is completely transparent to the user.
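In practice, each datap setup consists of the following three elements:
1. a datap context, which defines the taps;
2. R functions, typically from ordinary packages, which do the actual work;
3. the datap interpreter, which connects the two.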
Even though this document is mainly about the first element (the datap context file definition), let us briefly illustrate what is behind each of the three elements.
Each context is defined in a YAML file and contains a series of hierarchically organised taps. Each tap represents a specific dataset, together with its source and pre-processing steps. Consider the following, very simple context that is provided with the datap package as context2.yaml:
filePath <- system.file("extdata", "context2.yaml", package = "datap")
yamlString <- paste0(readLines(filePath), collapse = "\n")
cat(yamlString)
It defines three taps (Apple, Tesla, and S&P500), and organises stocks and indices neatly in a hierarchical structure.
The R functions do the actual units of work of the pre-processing steps defined in (1), e.g. downloading data from the internet, data cleaning, merging, etc. The packages that provide them are typically datap-agnostic.
The interpreter parses the datap context, and maps pre-processing steps defined in (1) to actual library functions available in (2), so as to provide - for each tap - an R function that can be called by the user of the library.
If you have the datap package installed, you can load the context into memory using the datap::Load function:
library(datap)
filePath <- system.file("extdata", "context2.yaml", package = "datap")
context <- Load(filePath)
The context looks like this:
context
And you can directly navigate to a tap and tap into the data:
teslaBars <- context$stocks$Tesla$tap()
head(teslaBars)
For the user of the context, it is completely transparent where the data is coming from and how it is pre-processed. For example, the S&P500 index is downloaded from Yahoo finance and not from Quandl. Yet, the user accesses the dataset in exactly the same way:
spx <- context$indices$`S&P500`$tap()
head(spx$GSPC.Open)
However, in a real world scenario additional pre-processing steps are necessary to make sure that the structure of the data is indeed the same for datasets from different sources.
In general, however, your data can be anything, and it can come from any source (the internet, a file, from memory, by calling an R function, or generated on the fly by your context, etc.).
In this document, the datap syntax is described using the following conventions:
- >: a reference to a specific datap element
- []: optional elements
- @: replace the following string with an appropriate name
- n*: repeat the element n times, where n can be any positive number
- |: or

A datap >context is defined in a single YAML document. A YAML document can contain at most one >context.
A >context spans a tree whose nodes are each one of the following types of joints:
- >tap: entry point to data, can have parameters
- >processor: unit of work (data acquisition and pre-processing)
- >module: define re-usable pipes
- >structure: organise taps into hierarchies

Each joint consists of the following:
- a type (>joint type);
- >joints
The flow of data is from the leaves towards the root, and ends at a >tap. Thus, each sub-tree below a >tap defines the processing steps of that >tap. In line with the data flow, we use the term upstream to denote joints that are processed before a given joint. We use downstream to denote joints that are processed after a given joint.
Variables can be defined in a >structure, >tap, >pipe, and >junction.
A >variables section is an associative list, called "variables". Each variable is an entry in that list, with the key defining the variable name, and the value defining the variable value:
>structure|>tap|>pipe|>junction
  variables:
    n* $variableName: $value
The names of special references cannot be used as variable names (namely: "inflow", "joint", "context").
The scope of a variable is the sub-tree spanned by the joint in which the variable is defined. A variable value can be overwritten by an upstream joint.
Example:
Closing Prices:
  type: structure
  variables:
    series: Close
    startDate: 2000-01-01
A >reference has a $ prefix, and refers to a downstream >variable, a >parameter, or a >special reference.
You can use a >reference in a >parameter or in a >variable.
>parameters|>variables:
  $name: $@variableReferenceName
For example:
AAPL:
  type: tap
  variables:
    #variable reference
    #maxNaRatioDefault must be defined downstream
    maxNaRatio: $maxNaRatioDefault
You can also use a >reference in a >function.
For example:
AAPL:
  type: tap
  variables:
    ticker: "'YAHOO/AAPL'"
  download:
    type: processor
    function: Quandl::Quandl(code = $ticker, type = 'xts')
The following variable references can be used without defining the variables downstream: $inflow, $joint, and $context.
They are reserved words and cannot be used as variable names.
$inflow
The $inflow reference refers to the output of the upstream joints. For a pipe, there is a single upstream joint. For a junction, there can be more than one. In that case, $inflow refers to the set of upstream outputs.
Example:
MinLength:
  type: error
  function: MinLength
  arguments:
    timeseries: '$inflow'
    minLength: 10
$joint
The $joint reference refers to the upstream joints. This is particularly useful in connection with factory joints.
Example:
Cache:
  type: factory
  function: Cache
  arguments:
    f: '$joint'
    timeout: 3600
$context
The $context reference refers to its surrounding >context.
It is useful to source data from within a >context, and to re-use it as an input into another >tap.
For example:
Tap:
  type: processor
  function: Tap
  arguments:
    context: '$context'
    tapPath: 'Closing Prices/Indices/SPX'
Attributes can contain information and/or metadata that is not part of the datap processing. For example, for each data series you can store a long name, description, etc. The datap interpreter then provides additional functionality, e.g. to find a >tap by attribute.
>pipe|>junction|>processor|>factory|>warning|>error|>structure
  attributes:
    n* @attributeName: @value
Functions are mapped to normal R functions.
In case of a >processor, the function actually does the unit of work and passes on the result to the downstream joint.
In case of an >error or >warning, the function contains the logic to test for the error condition.
In case of an >attribute, >variable or >parameter, the function returns the value.
>processor|>error|>warning
  function: @functionName(n* arg1 [= @default1])
and
>attribute|>variable|>parameter
  @name: @functionName(n* arg1 [= @default1])
The function syntax is similar to R, with a few differences:
sum(2, $param1)
However, you may nest functions, e.g. sum(2, sum(3, 5)), or sum(seq(1, 10)).
You may also use named parameters, e.g. sum(2, 3, na.rm = TRUE).
You can use package notation to refer to a function in a package. For example:
Fill NAs:
  type: processor
  function: zoo::na.locf($inflow)
By default, functions are executed at tap time, i.e. when a user calls a tap. However, you may provide the interpreter with a directive to execute the function already at build time, i.e. when reading the context file and creating the context. This is defined by a . preceding the function name.
For example, the following variable takes on a new value each time the downstream tap is called:
variables:
  time: Sys.time()
To avoid that, you can use the dot-directive:
variables:
  time: .Sys.time()
However, if the function expression contains references to elements that are available at tap time only, then the dot-directive is ignored. For example:
MyStructure:
  type: structure
  variables:
    time: Sys.time()
  Substructure:
    type: structure
    variables:
      # $time cannot be resolved at build time, so . is
      # ignored and the execution time is tap
      yesterday: .subtract($time, -24*60*60)
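The difference between build time and tap time can be illustrated in plain R. The following is only a minimal sketch and not how the datap interpreter is implemented: MakeVariable and nextValue are hypothetical helpers introduced for illustration. The build-time variant evaluates its expression once, when the variable is created, while the tap-time variant re-evaluates it on every call.

```r
# Hypothetical helper, not part of datap: wraps an expression so that it is
# evaluated either once at creation (build time) or on every call (tap time).
MakeVariable <- function(expr, buildTime = FALSE) {
  expr <- substitute(expr)   # capture the expression without evaluating it
  env <- parent.frame()      # environment in which to evaluate it later
  if (buildTime) {
    value <- eval(expr, env) # evaluated exactly once, right now
    function() value
  } else {
    function() eval(expr, env) # re-evaluated on every call
  }
}

# A function with a visible side effect, standing in for Sys.time()
counter <- 0
nextValue <- function() {
  counter <<- counter + 1
  counter
}

tapTime   <- MakeVariable(nextValue())                   # like `time: Sys.time()`
buildTime <- MakeVariable(nextValue(), buildTime = TRUE) # like `time: .Sys.time()`
```

Calling buildTime() repeatedly returns the same value, whereas each call to tapTime() yields a new one.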
Joints are the building blocks of any datap context, as explained in the Context section.
>structure joints fulfil two purposes:
- organise >tap joints into hierarchies
- define >variables

In terms of data processing, structures are of no relevance.
library(datap)
s <- datap:::GetSyntaxDefinition()
print(s$structure)
Consequently:
- a structure may never be upstream from a >tap
- a structure has no other recognizable type declaration than being a named associative list. Thus, any named associative list inside a structure is itself a structure.
- a pipe may be defined directly on a structure, without a tap. Such a pipe will not be accessible through the context, and its only purpose is to define a re-usable module
A >tap defines an entry point to specific data, within a context.
Conceptually, you can think of a tap as a public function: when you open a tap (think "call the function"), data pours out (think: "data is returned as an output/return value").
print(s$tap)
There are only >structure joints downstream from a tap. There are no other >tap joints upstream from a tap.
A >tap may have 0 to n >parameters, allowing the caller of the tap to provide tap arguments.
By default, when you define a >tap in a >context, all non-resolved upstream function parameters are added as >parameters to the >tap.
However, you may also wish to define >parameters explicitly, mainly for the following reasons:
>tap
  parameters:
    n* $parameterName: [$defaultArgument]
For example:
AAPL: #tap name
  type: tap
  parameters:
    startDate: 2000-01-01
    endDate: Sys.Date()
    includeWeekends:
Processors are the workhorses of a >tap. Each >processor defines a unit of work, such as data acquisition, cleaning, or other forms of pre-processing.
print(s$processor)
>error and >warning joints allow testing the results of the upstream >processor joint.
>error and >warning joints are pass-through: downstream, the $inflow and $joint references refer to the joint's upstream joint.
An >error condition is a directive to the interpreter to stop execution and display an error message.
A >warning condition is a directive to continue execution, and display a warning message.
print(s$error)
print(s$warning)
Example:
MinLength:
  type: error
  function: MinLength
  arguments:
    timeseries: '$inflow'
    minLength: 10
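The pass-through behaviour can be sketched in plain R. This is an illustration under assumptions, not datap's implementation: IsTooShort and Check are hypothetical names, and the real MinLength function may have a different signature. The condition function only tests the data. The wrapper raises an error or a warning and otherwise hands the inflow on unchanged.

```r
# Hypothetical condition test: TRUE if the series has too few observations.
IsTooShort <- function(timeseries, minLength) {
  NROW(timeseries) < minLength
}

# Hypothetical wrapper mimicking a pass-through >error or >warning joint.
Check <- function(inflow, test, message, type = c("error", "warning"), ...) {
  type <- match.arg(type)
  if (isTRUE(test(inflow, ...))) {
    # an error stops execution, a warning lets it continue
    if (type == "error") stop(message, call. = FALSE)
    warning(message, call. = FALSE)
  }
  inflow  # pass-through: the data is returned unchanged
}

prices <- c(101.2, 102.5, 99.8)

# Passes, because three observations are enough for minLength = 2
checked <- Check(prices, IsTooShort, "series too short", minLength = 2)
```

With minLength = 10, the same call would stop with an error, or continue with a warning if type = "warning" is given.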
>factory adds functional programming elements to datap.
A >factory is similar to a >processor. The difference is that:
- the >function is executed only once, at >context creation time (and not at >tap call time)
- the result of the >function is expected to be itself a >function. That >function will then be invoked at >tap call time.

print(s$factory)
Example:
Cache:
  type: factory
  function: Cache
  arguments:
    f: '$joint'
    timeout: 3600
Interpretation: The function of the upstream joint is passed into the Cache function as its f argument. Cache is expected to be a function factory that returns, as an output, a memoised version of $joint.
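To make the idea of a function factory concrete, here is a minimal sketch in plain R. It is an assumption about what such a factory could look like, not the Cache function that datap ships: CacheSketch and SlowDownload are hypothetical names, and this simplified version keeps a single cached value regardless of the arguments passed.

```r
# Hypothetical function factory: takes a function f and returns a new
# function that re-uses the last result until `timeout` seconds have passed.
CacheSketch <- function(f, timeout) {
  value <- NULL
  fetchedAt <- NULL
  function(...) {
    now <- Sys.time()
    expired <- is.null(fetchedAt) ||
      as.numeric(difftime(now, fetchedAt, units = "secs")) > timeout
    if (expired) {
      value <<- f(...)   # call the wrapped function and remember the result
      fetchedAt <<- now
    }
    value
  }
}

# Stand-in for an expensive upstream joint; counts how often it really runs.
calls <- 0
SlowDownload <- function() {
  calls <<- calls + 1
  c(1.5, 2.5, 3.5)
}

# The factory runs once (think: context creation time) ...
cachedDownload <- CacheSketch(SlowDownload, timeout = 3600)
# ... and the returned function is what gets invoked at tap call time.
```

Calling cachedDownload() twice within the timeout runs SlowDownload only once.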
A >pipe joint lets you arrange a number of upstream joints sequentially.
print(s$pipe)
For example, the following >pipe first checks if the ratio of NAs in a series is below an unacceptable threshold (NA Ratio), then it fills missing values by carrying the last observation forward (Fill NAs):
NA handling: &NaHandling
  type: pipe
  Fill NAs:
    type: processor
    function: zoo::na.locf
    arguments:
      object: '$inflow'
  NA Ratio:
    type: warning
    function: NaRatio
    arguments:
      timeseries: '$inflow'
      variable: '@series'
      maxRatio: '@maxNaRatio'
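Conceptually, a pipe behaves like function composition. The sketch below mirrors the pipe above in base R only. It is an illustration, not datap code: Pipe, WarnIfTooManyNAs and FillForward are hypothetical stand-ins, and FillForward is a simplified base R substitute for zoo::na.locf.

```r
# Hypothetical stand-in for the NA Ratio warning joint (pass-through).
WarnIfTooManyNAs <- function(x, maxRatio) {
  if (mean(is.na(x)) > maxRatio) {
    warning("NA ratio above threshold", call. = FALSE)
  }
  x
}

# Hypothetical stand-in for the Fill NAs processor:
# last observation carried forward; leading NAs stay NA.
FillForward <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of last non-NA value so far
  idx[idx == 0] <- NA                      # nothing to carry forward yet
  x[idx]
}

# Hypothetical pipe: feeds the output of each step into the next one.
Pipe <- function(...) {
  steps <- list(...)
  function(inflow) Reduce(function(data, step) step(data), steps, inflow)
}

naHandling <- Pipe(
  function(x) WarnIfTooManyNAs(x, maxRatio = 0.5),
  FillForward
)

filled <- naHandling(c(1, NA, NA, 4))
```

Because the warning step returns its input unchanged, it can sit anywhere in the sequence without altering the data.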
A >junction merges multiple upstream joints into a single stream.
Unlike the >pipe, the >junction has a >function, which is a directive on how to merge the upstream joints.
print(s$junction)
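As a rough illustration of a junction, the following base R sketch calls several upstream functions and hands the list of their outputs to a merge function. All names (Junction, MergeByDate, apple, tesla) and all numbers are hypothetical and made up for the example; datap's own junction mechanics may differ.

```r
# Hypothetical junction: calls every upstream joint and merges the outputs.
Junction <- function(mergeFunction, ...) {
  upstream <- list(...)
  function() {
    inflow <- lapply(upstream, function(joint) joint())  # set of upstream outputs
    mergeFunction(inflow)
  }
}

# Two made-up upstream joints returning small data frames
apple <- function() {
  data.frame(date = as.Date("2020-01-01") + 0:2, AAPL = c(75, 74, 76))
}
tesla <- function() {
  data.frame(date = as.Date("2020-01-01") + 1:3, TSLA = c(88, 90, 98))
}

# The junction's function: how to combine the inflow (inner join on date)
MergeByDate <- function(inflow) {
  Reduce(function(a, b) merge(a, b, by = "date"), inflow)
}

stocks <- Junction(MergeByDate, apple, tesla)
merged <- stocks()
```

Here the merge function receives the whole set of upstream outputs at once, which matches the way $inflow is described for junctions.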
Modularization is achieved with YAML anchors and references. Modules that are not used in a tap can be put in a module section.
print(s$module)
For example:
modules:
  type: module
  #this module has no tap
  #it only serves as anchors for other taps
  NA handling: &NaHandling
    type: pipe
    Fill NAs:
      type: processor
      function: zoo::na.locf
      arguments:
        object: '$inflow'
    NA Ratio:
      type: warning
      function: NaRatio
      arguments:
        timeseries: '$inflow'
        variable: '@series'
        maxRatio: '@maxNaRatio'
filePath <- system.file("extdata", "context1.yaml", package = "datap")
yamlString <- paste0(readLines(filePath), collapse = "\n")
cat(yamlString)