knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Installing the package from GitHub
library(devtools)
install_github("castels/interpTools")

# Loading library
library(interpTools)

Introduction

The intended use of this R package is to provide a set of tools for the systematic testing of time series interpolators subject to changes in gap structure. The basic framework is as follows:

  1. Generate artificial time series
  2. Impose a gap structure
  3. Interpolate the gappy data
  4. Compare the interpolated series with the original

This vignette demonstrates the appropriate workflow for such analyses. The example code is the same code that was used to generate the built-in 'toy data' included in this package.

Step 1: Generate artificial time series with simXt()

The simXt() function generates time series, $x_t$, according to the general additive model:

\begin{equation} x_t = m_t + t_t + \xi_t \end{equation}

where $m_t$ is the mean function, $t_t$ is the trend function, and $\xi_t$ is the noise function. Subroutines of simXt() are functions simMt(), simTt(), and simWt(), which generate the mean, trend, and noise components, respectively, before their combination is returned in simXt(). The following section provides a brief description of each component, along with its set of defining parameters.

$m_t$: The mean component

The mean component $m_t$ is composed of a constant, non-varying mean element (the grand mean) and a varying polynomial trend element: \begin{align} m_t &= \mu + \mu_t \\ &= \mu + \sum_{i=1}^{\phi} a_i \left(\frac{t-c}{N}\right)^i \end{align} where $\mu \sim U\left(\frac{-N}{100},\frac{N}{100}\right)$ and $\textbf{a} = \left[a_1,a_2,...,a_\phi\right]$ with $a_i \sim N\left(0,\frac{N}{20i}\right)$; both are randomly sampled unless the user specifies otherwise.
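
For concreteness, the polynomial mean above can be evaluated by hand. The snippet below is an illustrative sketch (using a single, arbitrarily chosen centering value), not the internals of simMt():

# Illustration only: evaluating the polynomial mean by hand
N  <- 100
t  <- 0:(N - 1)
mu <- 10                  # grand mean
a  <- c(1, 2, 3, 4)       # coefficients a_1, ..., a_phi (phi = 4)
c0 <- 50                  # a single centering value, for illustration
mt_manual <- mu + rowSums(sapply(seq_along(a), function(i) a[i] * ((t - c0) / N)^i))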

The table below describes each parameter. The "user control in simMt(), simXt()" column identifies the parameters that can be set by the user.

| parameter | range | description | user control in simMt(), simXt() |
|:---------:|:-----:|:------------|:---------------------------------:|
| $\mu$ | $\frac{-N}{100}\leq\mu\leq\frac{N}{100}$ | grand mean | mu |
| $\phi$ | $\phi\in\mathbb{Z}$, $\phi\geq0$ | degree of polynomial $\mu_t$ | numTrend |
| $a_i$ | $a_i\in\mathbb{R},\text{ }i=1,...,\phi$ | coefficient on $i^\text{th}$ polynomial | a |
| $c$ | $c\in1,...,N$, fixed $\forall i$ | centering parameter | center |
| $N$ | $N\in\mathbb{N}$ | length of time series | N |

This component can be generated independently using the function simMt(). For example:

# Simulating a mean vector
mt <- simMt(N = 100, numTrend = 4, a = c(1,2,3,4), mu = 10, center = c(1,2,3,4))

The resulting vector of values can be accessed with mt$value:

head(mt$value, 10)

and the functional definition (as a character string) of $m_t$ can be shown with mt$fn:

mt$fn

$t_t$: The trend component

The trend component $t_t$ is constructed as a finite linear combination of sinusoids: \begin{equation} t_t = \sum_{i=1}^{\psi} b_i \sin(\omega_i t) \end{equation} where $\textbf{b} = \left[b_1,b_2,...,b_\psi\right]$ with $b_i \sim N(1,\frac{N}{200})$, and $\boldsymbol{\omega} = \left[\omega_1,\omega_2,...,\omega_\psi\right]$, where $\omega_i$ is the angular frequency of the $i$th sinusoid; both are randomly sampled unless otherwise specified by the user. The default is to sample $\psi$ unique values of $\omega_i$ from $\left[\frac{2\pi}{N},\pi \right]$, where $\frac{2\pi}{N}$ is the fundamental Fourier (angular) frequency and $\pi$ is the Nyquist (angular) frequency. The ordinary frequency is related to $\omega$ through $f = \frac{\omega}{2\pi}$, so the lowest possible frequency that could be sampled is $f = \frac{2\pi/N}{2\pi} = \frac{1}{N}$, and the highest possible frequency is $f = \frac{\pi}{2\pi} = \frac{1}{2}$.
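
Written out directly, the sum of sinusoids above amounts to the following sketch (with user-supplied b and w mirroring the simTt() example further below; illustrative only, not the internals of simTt()):

# Illustration only: evaluating the sum of sinusoids by hand
N <- 100
t <- 0:(N - 1)
b <- c(1, 2, 3, 4)        # amplitudes b_1, ..., b_psi
w <- c(1, 2, 3, 4)        # angular frequencies omega_1, ..., omega_psi
tt_manual <- rowSums(sapply(seq_along(b), function(i) b[i] * sin(w[i] * t)))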

An additional parameter, $\eta$, defines a bandwidth of order $10^{-\eta}$: the $\omega_i$ are then sampled from $\frac{\psi}{2}$ random, non-overlapping subintervals of this width within $\left[\frac{2\pi}{N},\pi \right]$, unless the user has already defined these values explicitly. In bandwidth-specified default simulations, two $\omega$'s are sampled from each subinterval, so $\psi$ must be an even integer.

The table below describes each parameter. The "user control in simTt(), simXt()" column identifies the parameters that can be set by the user.

| parameter | range | description | user control in simTt(), simXt() |
|:---------:|:-----:|:------------|:---------------------------------:|
| $\psi$ | $\psi\geq0,\text{ }\psi\in2\mathbb{Z}$ | total number of sinusoids | numFreq |
| $b_i$ | $b_i\in\mathbb{R},\text{ }i=1,...,\psi$ | amplitude of $i^\text{th}$ sinusoid | b |
| $\omega_i$ | $\frac{2\pi}{N}\leq\omega_i\leq\pi,\text{ }i=1,...,\psi$ | angular frequency of $i^\text{th}$ sinusoid | w |
| $\eta$ | $\eta\geq1$ | bandwidth $= 10^{-\eta}$ | bandwidth |
| $N$ | $N\in\mathbb{N}$ | length of time series | N |

This component can be generated on its own using the function simTt(). For example:

# Simulating a trend vector
tt <- simTt(N = 100, numFreq = 4, b = c(1,2,3,4), w = c(1,2,3,4))

The resulting vector of values can be accessed with tt$value:

head(tt$value, 10)

and the functional definition (as a character string) of $t_t$ can be shown with tt$fn:

tt$fn

$\xi_t$: The noise component

The noise component $\xi_t$ is constructed as an $ARMA(P,Q)$ stochastic process: \begin{equation} \xi_t = \alpha_1\xi_{t-1} + ... + \alpha_P \xi_{t-P} + Z_t + \beta_1Z_{t-1} + ... + \beta_Q Z_{t-Q} \end{equation} where $P$ is the autoregressive order, $Q$ is the moving-average order, and $Z_t$ is a white noise innovation.
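
For intuition, this kind of process can be simulated directly in base R with stats::arima.sim(); the snippet below is an illustrative sketch with arbitrary coefficients, not the internals of simWt():

# Illustration only: an ARMA(1,1) noise series via stats::arima.sim()
set.seed(1)
wt_manual <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 100, sd = sqrt(5))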

The table below describes each parameter. The "user control in simWt(), simXt()" column identifies the parameters that can be set by the user.

| parameter | range | description | user control in simWt(), simXt() |
|:---------:|:-----:|:------------|:---------------------------------:|
| $P$ | $P\in\mathbb{Z},\text{ }P\geq 0$ | $AR$ order | p |
| $\alpha_i$ | $\alpha_i\in\mathbb{R},\text{ }i = 1,...,P$ | coefficient on $i$th component of $AR(P)$ | |
| $Q$ | $Q\in\mathbb{Z},\text{ }Q\geq 0$ | $MA$ order | q |
| $\beta_j$ | $\beta_j\in\mathbb{R},\text{ }j=1,...,Q$ | coefficient on $j$th component of $MA(Q)$ | |
| $\sigma^2_{\xi_t}$ | $\sigma^2_{\xi_t}\in\mathbb{R}$, $\sigma^2_{\xi_t}>0$ | variance of $\xi_t$ | var |
| $N$ | $N\in\mathbb{N}$ | length of time series | N |

This component can be generated on its own using the function simWt(). For example:

# Simulating a noise vector
wt <- simWt(N = 100, p = 0, q = 0 , var = 5) # white noise process

The resulting vector of values can be accessed with wt$value:

head(wt$value, 10)

If the user so chooses, these components can be added together manually to produce the time series $x_t$:

xt <- mt$value + tt$value + wt$value

head(xt, 10)

However, it is recommended to use simXt(), which will be demonstrated in the next section.

Using simXt()

It is recommended to use simXt() to generate time series, since its output is an object of the special class simList. Such objects contain the time series themselves, as well as other important metadata such as the parametric definitions of each component. \emph{Note that these are the only types of objects compatible with the plotXt() function}.

Here we demonstrate the use of simXt() to produce a time series all in one step:

xt <- simXt(N = 100, 
            numTrend = 4, a = c(1,2,3,4), mu = 10, center = c(1,2,3,4),  # mt component
            numFreq = 4, b = c(1,2,3,4), w = c(1,2,3,4), # tt component
            p = 0, q = 0, snr = 1.5) # wt component

class(xt)

The user should carefully note the snr argument in simXt(), which represents the desired signal-to-noise ratio of $x_t$, where: \begin{equation} \mathbf{SNR} = \frac{\sigma^2_{t_t}}{\sigma^2_{\xi_t}} \end{equation} When the user specifies the snr, the var argument in simWt() will be overridden and need not be set, since $\sigma^2_{\xi_t}$ is calculated based on the variance of $t_t$.
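
In terms of this definition, fixing snr amounts to choosing the noise variance relative to the variance of the trend component. A rough sketch of the relationship (illustrative; not necessarily the exact internal calculation):

# Illustration only: noise variance implied by a target signal-to-noise ratio
snr       <- 1.5
var_trend <- var(tt$value)    # variance of the trend component from simTt()
var_noise <- var_trend / snr  # implied variance of the noise component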

The actual numeric vector representing $x_t$ can be extracted with xt$Xt. To proceed with analysis, each time series that the user simulates should be stored in a list object.

OriginalData <- list(my_data = xt$Xt)

The toy original time series are saved as ~/toyData/toy_OriginalData.rda.

Step 2: Imposing a gap structure with simulateGaps()

Gap structure is defined by two parameters: $p$, the overall percentage of data missing, and $g$, the width of the gaps. The result is a total of $I=p\cdot N$ missing observations, appearing structurally as $\leq\frac{p\cdot N}{g}$ randomly-spaced non-overlapping holes of width $g$, where a 'hole' is defined as a sequence of adjacent missing observations.

Since the other principal objective of this package is to assess statistical performance in light of changes to gap structure, the user should initialize vectors containing several values of $p$ and $g$, such that each $(p,g)$ combination can be applied. Under the constraint that only whole gaps can be removed, the actual number of missing observations may not always match the theoretical number, due to rounding. Thus, care should be taken when selecting these values.
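
As a quick illustration of this rounding effect (numbers chosen purely for illustration):

# Illustration only: theoretical vs. achievable number of missing observations
N_ex <- 100
p_ex <- 0.1; g_ex <- 3
p_ex * N_ex                       # theoretical number of missing observations: 10
floor(p_ex * N_ex / g_ex) * g_ex  # achievable with whole gaps of width 3: 9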

# Initializing vectors of p and g
p <- c(0.1, 0.20, 0.30)
g <- c(1, 5, 10)

A single time series can be passed into simulateGaps(), along with arguments p and g to remove observations according to the desired gap specification. The user can also define K to set the number of gappy series to generate for each original dataset. Note that in each iteration, gap locations are chosen randomly under the specified gap parameterization.

set.seed(23)

# Imposing each (p,g) gap specification on the original time series
GappyData <- simulateGaps(OriginalData = OriginalData, p = p, g = g, K = 10)

The result is a nested list of dimension $D\times P \times G \times K$, saved as ~/toyData/toy_GappyData.rda.

Step 3: Interpolating the gappy data with parInterpolate()

The next step is to interpolate the list of gappy data using parInterpolate(). The user can choose from the 18 built-in algorithms shown in the table below (most of which wrap functions from external packages), and/or pass their own functions into FUN_CALL. Calling multiple methods is useful for facilitating comparative assessment.

Built-in interpolation algorithms

| Package | Function | Algorithm name | Abbreviation |
|:--------|:---------|:---------------|-------------:|
| interpTools | nearestNeighbor() | Nearest Neighbor | NN |
| zoo | na.approx() | Linear Interpolation | LI |
| zoo | na.spline() | Natural Cubic Spline | NCS |
| zoo | na.spline() | FMM Cubic Spline | FMM |
| zoo | na.spline() | Hermite Cubic Spline | HCS |
| imputeTS | na_interpolation() | Stineman Interpolation | SI |
| imputeTS | na_kalman() | Kalman - ARIMA | KAF |
| imputeTS | na_kalman() | Kalman - StructTS | KKSF |
| imputeTS | na.locf() | Last Observation Carried Forward | LOCF |
| imputeTS | na.locf() | Next Observation Carried Backward | NOCB |
| imputeTS | na_ma() | Simple Moving Average | SMA |
| imputeTS | na_ma() | Linear Weighted Moving Average | LWMA |
| imputeTS | na_ma() | Exponential Weighted Moving Average | EWMA |
| imputeTS | na_mean() | Replace with Mean | RMEA |
| imputeTS | na_mean() | Replace with Median | RMED |
| imputeTS | na_mean() | Replace with Mode | RMOD |
| imputeTS | na_random() | Replace with Random | RRND |
| tsinterp | interpolate() | Hybrid Wiener Interpolator | HWI |

Passing in user-defined algorithms with FUN_CALL

Should the user decide to pass in their own algorithms via FUN_CALL, they must ensure that the argument is a character vector in the format FUN_CALL = c("foo1(x = ", "foo2(x = " ) , and that the output of each function is a single numeric vector. The interpolations will occur in parallel over the specified index (parallel = "p", parallel = "g", or parallel = "k"), where the number of cores to use can be defined via num_cores (the default is to use as many as are available, as determined by parallel::detectCores()).

Here is an example using two user-defined toy functions: plus() replaces each value of x with its index position, and minus() replaces each value with its negative index position.

# EXAMPLE: User-defined functions to pass to FUN_CALL

## Toy function 1: Convert each value of x to its index position
plus <- function(x){
  vec <- numeric(length(x))
  for(i in 1:length(vec)){
    vec[i] <- i
  }
  return(vec)
}

## Toy function 2: Convert each value of x to its negative index position
minus <- function(x){
  vec <- numeric(length(x))
  for(i in 1:length(vec)){
    vec[i] <- -i
  }
  return(vec)
}

The appropriate use of FUN_CALL in parInterpolate() would then be:

parInterpolate(GappyData = GappyData, FUN_CALL = c("plus(x = ", "minus(x = "))

A toy example, using built-in methods

To generate the toy data found in the package files, we will interpolate using the Linear Interpolator (LI), the Nearest Neighbor (NN), and the Exponential Weighted Moving Average (EWMA).

methods <- c("LI", "NN", "EWMA")
IntData <- parInterpolate(GappyData = GappyData, methods = methods)

The resulting object is a nested list of dimension $D \times M \times P \times G \times K$, saved as ~/toyData/toy_IntData.rda.

Reducing the computational burden: splitData() and stitchData()

Considering the dimensionality of the data and complexity of some interpolation algorithms (especially that of the Hybrid Wiener Interpolator), the system requirements for this number of computations can be quite large.

Splitting the gappy data with splitData()

The function splitData() can be used to "break apart" the gappy data at the $K$-level into more manageable chunks (the number of sets specified by the split argument) prior to being passed into parInterpolate().

See the usage of the function in the following example, where the gappy data are divided into sets of size $K/2 = 5$:

split = 2
GappyData_split <- splitData(GappyData = GappyData, split = split)

The toy split data is saved as ~/toyData/toy_GappyData_split.rda.

Interpolating the split data with parInterpolate()

The 'split' object can then be passed into parInterpolate() as normal, except with an additional nested loop, iterating over the number of sets. It is also wise to save each set of interpolated data with each iteration in an appropriately organized directory, where the parent folder should contain only .rda files organized into subfolders by dataset. See the example below:

methods <- c("LI", "NN", "EWMA")

D <- length(GappyData_split)  # number of original datasets
IntData_split <- list()

for(d in 1:D){
  for(n in 1:split){

    IntData_split <- parInterpolate(GappyData = GappyData_split[[d]][[n]], methods = methods,
                                    numCores = parallel::detectCores())

    filename <- paste0("D",d,"_",letters[n],".rda")
    setwd(paste0("~/toyData/toy_IntData_split/D",d))
    save(IntData_split, file = filename)
    gc()
  }
}

Recommended directory structure

The files are saved in the subfolders of ~/toyData/toy_IntData_split, which have the recommended structure:
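
For the toy example above (split = 2), the implied layout is roughly as follows, with one subfolder per dataset and one .rda file per split ("a" and "b"). The tree below is a sketch of the layout implied by the loop, not package output:

toyData/toy_IntData_split/
  D1/
    D1_a.rda
    D1_b.rda
  D2/
    D2_a.rda
    D2_b.rda
  ...
  D5/
    D5_a.rda
    D5_b.rda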

Re-stitching the interpolated data with stitchData()

Once the interpolations have completed, the split data can be recombined into a single $D \times M \times P \times G \times K$ list using the stitchData() function and referencing the directory containing the files:

# Pasting together the segmented interpolated data
IntData_stitch1 <- stitchData("toyData/toy_IntData_split/D1")
IntData_stitch2 <- stitchData("toyData/toy_IntData_split/D2")
IntData_stitch3 <- stitchData("toyData/toy_IntData_split/D3")
IntData_stitch4 <- stitchData("toyData/toy_IntData_split/D4")
IntData_stitch5 <- stitchData("toyData/toy_IntData_split/D5")

IntData_stitch <- list(D1 = IntData_stitch1,
                       D2 = IntData_stitch2,
                       D3 = IntData_stitch3,
                       D4 = IntData_stitch4,
                       D5 = IntData_stitch5)

The resulting object should be the same as the one processed without splitting, and is saved to the local directory as ~/toyData/toy_IntData_split/toy_IntData_stitch.rda.

Step 4: Compare the interpolated series with the original

The performance of a single interpolation can be assessed by comparing the original time series, $x_t$, with the interpolated series, $X_t$. Here, statistical performance is defined as some measure of the bias, represented by some function: \begin{equation} C(\mathbf{X}, \mathbf{x}), \end{equation} called a performance criterion. Generally speaking, $C$ quantifies how well the interpolated series, $\mathbf{X} = \left\{X_t\right\}_{t=0}^{N-1}$, captures the essence of the original series, $\mathbf{x} = \left\{x_t\right\}_{t=0}^{N-1}$.
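
As a concrete example, one such criterion, the Mean Square Error used throughout the examples below, could be written by hand as follows. This is a minimal sketch of the idea, not the package's implementation:

# Illustration only: one possible performance criterion, C(X, x) = MSE
# X = interpolated series, x = original series (same length)
mse <- function(X, x) mean((X - x)^2)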

Generating performance matrices with performance()

A set of performance criteria are computed for each $(x_t,X_t)$ pair using the performance() function. The table below shows the metrics that are built-in to the package.

| Criterion | Abbreviation | Optimal |
|:----------|:-------------|:--------|
| Correlation Coefficient | $r$ | max |
| Coefficient of Determination | $r^2$ | max |
| Absolute Differences | $\mathbf{AD}$ | min |
| Mean Bias Error | $\mathbf{MBE}$ | min |
| Mean Error | $\mathbf{ME}$ | min |
| Mean Absolute Error | $\mathbf{MAE}$ | min |
| Mean Relative Error | $\mathbf{MRE}$ | min |
| Mean Absolute Relative Error | $\mathbf{MARE}$ | min |
| Mean Absolute Percentage Error | $\mathbf{MAPE}$ | min |
| Sum of Squared Errors | $\mathbf{SSE}$ | min |
| Mean Square Error | $\mathbf{MSE}$ | min |
| Root Mean Squares | $\mathbf{RMS}$ | min |
| Normalized Mean Square Error | $\mathbf{NMSE}$ | min |
| Nash-Sutcliffe Coefficient | $\mathbf{RE}$ | max |
| Root Mean Square Error | $\mathbf{RMSE}$ | min |
| Normalized Root Mean Square Deviations | $\mathbf{NRMSD}$ | min |
| Root Mean Square Standardized Error | $\mathbf{RMSS}$ | min |
| Median Absolute Percentage Error | $\mathbf{MdAPE}$ | min |

The full set of performance criteria can be generated as follows:

# Generating performance matrices
pf <- performance(OriginalData = OriginalData, GappyData = GappyData, IntData = IntData)

The resulting output is a list of dimension $D \times M \times P \times G \times K$ of class pmat, where the terminal node is a matrix of performance metrics, saved as ~/toyData/toy_pmats.rda.

Aggregating the performance data with aggregate_pf()

Consider that for any $(d,m,p,g)$ combination there is a set of $K$ values for each criterion; that is, each performance metric has a sampling distribution containing $K$ elements. The performance matrices can be condensed by aggregating over $K$, which reduces dimensionality by summarizing each sampling distribution with a set of sample statistics (such as the mean, median, and quantiles).
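
Conceptually, the aggregation summarizes each length-$K$ vector of criterion values with a handful of sample statistics, along the following lines (illustrative sketch only; the stand-in values below are not real results):

# Illustration only: summarizing K = 10 criterion values for one (d, m, p, g) combination
mse_K <- c(2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.5, 2.0, 1.9)  # stand-in values
c(mean   = mean(mse_K),
  median = median(mse_K),
  q025   = unname(quantile(mse_K, 0.025)),
  q975   = unname(quantile(mse_K, 0.975)))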

The aggregated performance matrices are computed via the aggregate_pf() function:

# Aggregating the performance matrices
ag_pf <- aggregate_pf(pf = pf)

The result is an object of class aggregate_pf, and is saved as ~/toyData/toy_ag_pf.rda. This is the object that will be piped into all of the functions used to conduct the analyses, to follow.

Data visualization tools

This package provides a number of different data visualization tools, which are described in this section.

Plotting the original data with plotXt()

The original time series can be visualized using plotXt() by passing in an object of class simList. Below is a plot of $x_{t,1}$.

plotXt(simData = xt, cptwise = F, axisLabels = T)

Since this object stores all of the constituent pieces, the original data can also be visualized componentwise by toggling cptwise to TRUE. The user can select which component to display via return. Note that if the user wishes to leave the default return = NULL and return all components in a grid, then axisLabels must be set to FALSE to prevent redundant labelling.

Here is an example of such a grid, showing $m_{t,1}$, $t_{t,1}$, $\xi_{t,1}$, and the constituent frequencies:

plotXt(simData = xt, cptwise = T, axisLabels = F, plot.title = T, return = NULL)

An example of extracting and visualizing the noise component only (toggling axisLabels = TRUE):

plotXt(simData = xt, cptwise = T, axisLabels = T, return = "Wt")

An example of extracting and visualizing the frequency distribution (toggling axisLabels = TRUE):

plotXt(simData = xt, cptwise = T, axisLabels = T, return = "freq")

Possible values of return are NULL, "Mt", "Tt", "Wt", and "freq".

Generating surface plots with plotSurface()

A three-dimensional surface is used to visualize the performance of an interpolator as the structure of the gappy data changes. We will refer to such a surface as $f(p,g)$ in the following, with $f$ an undefined bivariate function describing the value of a criterion. 'Optimal' performance corresponds to an extreme point on the surface; either a maximum or minimum, depending on the definition of 'optimal' (see above table of performance criteria). Multiple interpolations can be compared by layering surfaces on top of one another, where the 'best' interpolation for a particular gap structure will be at an extremum at the corresponding $(p,g)$ coordinate point.

This visualization is generated using the plotSurface() family of functions, where the user can specify the algorithms and datasets of interest, as well as the sample statistic and criterion to model. The user can highlight certain surfaces via the highlight argument. These functions accept only the aggregated performance object produced in the previous step.

The first example is for the visualization of the performance of the Linear Interpolation (LI) method on the first dataset according to the Mean Squared Error (MSE) criterion, where $f(p,g)$ represents the median value:

p <- plotSurface2(agObject = ag_pf,
                  m = "LI", 
                  metric = "MSE",
                  toggle = "method",
                  f = "median", 
                  highlight = "LI", 
                  highlight_color = "#33FFF9")

Surface plots with multiple layers

Multiple surfaces can be layered one on top of the other to facilitate comparison. The user can specify whether to layer surfaces by dataset (for a single method), or by method (for a single dataset).

Layering by method

Below is an example of a surface plot, layered by method, for the performance of the Linear Interpolation (LI, highlighted), Nearest Neighbor (NN), and Exponential Weighted Moving Average (EWMA) on the first dataset according to the Mean Squared Error (MSE) criterion, where $f(p,g)$ represents the median value:

p <- plotSurface(d = 1, 
            m = c("LI","NN","EWMA"), 
            layer_type = "method", 
            crit = "MSE", 
            agEval = agEval, 
            f = "median", 
            highlight = "LI", 
            highlight_color = "#33FFF9")
p$MSE$D1

Layering by dataset

Here is another example of a surface plot, layered by dataset, for the performance of the Linear Interpolation (LI) on all five datasets (dataset 3 highlighted) according to the Mean Squared Error (MSE) criterion, where $f(p,g)$ represents the median value:

p <- plotSurface(d = 1:5, 
            m = "LI", 
            layer_type = "dataset", 
            crit = "MSE", 
            agEval = agEval, 
            f = "median", 
            highlight = 3, 
            highlight_color = "#33FFF9")
p$MSE$LI
Arranging multiple surface plots with multiSurface()

The function multiSurface() allows the user to combine multiple surface plot objects into one grid, adding another dimension for comparison. The following example shows a set of method-layered surface plots for datasets 1-5.

multiSurface(d = 1:5, 
             agEval = agEval, 
             layer_type = "method", 
             m = c("LI","NN","EWMA"), 
             crit = "MSE", 
             f = "median", 
             highlight = "LI", 
             highlight_color = "#33FFF9")

Grouping by dataset was achieved by setting layer_type = "method". To group the performances by method instead, the layer_type argument can be set to "dataset", where each plot represents the performance on any combination of original datasets, by a particular interpolation algorithm. See the example below:

multiSurface(d = 1:5, 
             agEval = agEval, 
             layer_type = "dataset", 
             m = c("LI","NN","EWMA"), 
             crit = "MSE", 
             f = "median", 
             highlight = 3, 
             highlight_color = "#33FFF9")

Generating heatmaps with heatmapGrid()

Through R's interface, the user can interact with the surface plot widgets by manipulating the camera perspective, adjusting the zoom, and hovering over data points for precise numerical information. When constrained to static visualizations, such as when writing papers, heatmaps are more effective at communicating the data. Using heatmapGrid(), a three-dimensional surface can be collapsed into a heatmap by mapping the third dimension to colour, with colour proportional to the value of the metric.

See below an example of a heatmap of the performance of the Linear Interpolation (LI) method on the first dataset according to the median value of the Mean Squared Error (MSE) criterion. Convince yourself that this graphic corresponds to the first surface plot shown in this vignette.

heatmapGrid(d = 1, 
             agEval = agEval, 
             f = "median", 
             crit = c("MSE"), 
             m = "LI")

The heatmapGrid() function allows for d to be vectorized to show heatmaps across any number of datasets:

heatmapGrid(d = c(1,3,5), 
             agEval = agEval, 
             f = "median", 
             crit = "MSE", 
             m = "LI")

Arranging multiple heatmaps with multiHeatmap()

The multiHeatmap() function enables the user to arrange multiple heatmaps into a grid to facilitate cross-comparison between criteria:

multiHeatmap(crit = c("abs_differences","MSE","MAPE"), 
             agEval = agEval, 
             m = "LI", 
             by = "crit", 
             f = "median", 
             d = 1:5)

or methods:

multiHeatmap(crit = "MSE", 
             agEval = agEval, 
             m = c("LI","NN","EWMA"), 
             by = "method", 
             f = "median", 
             d = 1:5)

The user can specify which kind of comparison they would like to see via the by argument.

Generating cross-section plots with plotCS()

We can take a closer look at the surface plots by examining cross-sections, generated by plotCS(). We encourage the reader to imagine changing their perspective angle on the surface plots by rotating them such that the surface is viewed perpendicular to the $p-z$ or $g-z$ plane and the $p$-axis or $g$-axis is collapsed. Effectively, this allows us to examine changes in one variable across all values of the other. Each ribbon represents the performance of a particular interpolation method in a dataset $d$, where the upper and lower bounds are the largest and smallest values observed across the set of sample statistics contained in the collapsed variable (the highest and lowest points on the corresponding surface plot), and the central line is the median value of this set.

Cross-section plot of gap width

When the $p$-axis is collapsed, we are looking at a cross-section of gap width. From this perspective, we can observe how the overall performance of an algorithm on a particular dataset changes as the width of the gaps increases ($g\rightarrow G$), as well as get a rough sense of how sensitive the algorithm is to the proportion of data missing. For example, 'thicker' ribbons indicate a greater disparity in performance between the best and worst interpolations, and 'thinner' ribbons correspond to algorithms that perform similarly regardless of the proportion missing.

Below is an example of a $g$ cross-section plot of dataset 1, where each ribbon represents the sampling distribution $\left[ Q_{2.5\%}, \text{median}, Q_{97.5\%} \right]$ of performance, according to the Mean Squared Error (MSE) criterion, of interpolation method (LI, NN, EWMA) with respect to gap width, across all values of $p$.

p <- plotCS(d = 1,
       m = c("LI","NN","EWMA"),
       crit = "MSE", 
       agEval = agEval, 
       layer_type = "method", 
       f = "median", 
       cross_section = "g",
       highlight = "LI", 
       highlight_color = "#33FFF9")
p$MSE$D1

Cross-section plot of proportion missing

When the $g$-axis is collapsed, the plot is a cross-section of proportion missing. Here, we can see how the performance of an algorithm on a particular dataset changes as the sheer number of missing data points increases ($p\rightarrow P$), as well as get a rough sense of how sensitive the algorithm is to gap width. For example, 'thicker' ribbons indicate a greater disparity in performance between the best and worst interpolations, and 'thinner' ribbons correspond to algorithms that perform similarly regardless of gap width.

Below is an example of a $p$ cross-section plot of dataset 1, where each ribbon represents the sampling distribution $\left[ Q_{2.5\%}, \text{median}, Q_{97.5\%} \right]$ of performance, according to the Mean Squared Error (MSE) criterion, of interpolation method (LI, NN, EWMA) with respect to proportion missing, across all values of $g$.

p <- plotCS(d = 1,
       m = c("LI","NN","EWMA"),
       crit = "MSE", 
       agEval = agEval, 
       layer_type = "method", 
       f = "median", 
       cross_section ="p",
       highlight = "LI",
       highlight_color = "#33FFF9")
p$MSE$D1

Arranging multiple cross-section plots with multiCS()

The multiCS() function enables the user to arrange a grid of plots such that cross-sectional performance across either dataset or method can be assessed simultaneously. To obtain a grid of plots grouped by dataset, each interior layer must represent a method (layer_type = "method"), as in the following example, where we have a set of $g$ cross-section plots describing the performance of each method in datasets 1-5:

multiCS(d = 1:5,
        agEval = agEval,
        m = c("LI","NN","EWMA"), 
        crit = "MSE", 
        layer_type = "method", 
        f = "median", 
        cross_section = "g",
        highlight = "LI",
        highlight_color = "#33FFF9")

To obtain a grid of plots grouped by method, each interior layer must represent a dataset (layer_type = "dataset"), as in the following example, where we have a set of $g$ cross-section plots describing the performance on each dataset by LI, NN, and EWMA:

multiCS(d = 1:5,
        agEval = agEval, 
        m = c("LI","NN","EWMA"), 
        crit = "MSE", 
        layer_type = "dataset", 
        f = "median", 
        cross_section = "g",
        highlight = 3,
        highlight_color = "#33FFF9")

Generating uncollapsed cross-section plots with plotCS_un()

The collapsed cross-section plots can be further deconstructed using plotCS_un() such that performance can be assessed with respect to changes in one variable ($p$ or $g$) across each individual value of the other. Interpretable as 'slices' of the surface plot, these uncollapsed cross-section plots give insight as to whether there are specific combinations of ($p,g$) for which a particular method is especially sensitive. Each ribbon represents the distribution of a metric across the $K$ simulations, where the upper and lower bounds represent the 2.5% and 97.5% quantiles, and the interior line is formed by the set of sample statistics from the corresponding points on the parent surface plot. An advantage of viewing the data in this way is that it allows us to view the error bars at each ($p,g$) coordinate point without creating an over-crowded, hard-to-read surface plot.

Below is an example of a $p$ cross-section plot of dataset 1, where each ribbon represents the sampling distribution $\left[ Q_{2.5\%}, \text{median}, Q_{97.5\%} \right]$ of performance, according to the Mean Squared Error (MSE) criterion, of interpolation method (LI, NN, EWMA) with respect to each value of $g$.

plotCS_un(d = 1,
          agEval = agEval,
          cross_section = "p", 
          crit = "MSE", 
          m = c("LI","NN","EWMA"), 
          f = "median", 
          layer_type = "method", 
          highlight = "LI",
          highlight_color = "#33FFF9")

Computing the gradient with gradient()

The gradient of a surface plot describes the degree of resilience of an algorithm to changes in gap structure. A shallow slope in the neighborhood of a point $(p,g)$ gives evidence that an algorithm is more stable at that point, whereas a steeper slope may indicate that the algorithm is more sensitive to changes in the same region. Computing the gradient, $\nabla f(p,g)$, will help identify the particular gap configurations at which statistical performance breaks down, according to a particular criterion.

Since the data lies on a discrete mesh, the gradient can be approximated by calculating the slope of the line that connects all pairs of adjacent points on the surface. For any $i,j$ pair of coordinate points ($p_i,g_i,f(p_i,g_i)$) and ($p_j,g_j,f(p_j,g_j)$), the slope between them is given by: \begin{equation} \partial_{i,j} = \frac{f(p_j,g_j)-f(p_i,g_i)}{\sqrt{(p_j-p_i)^2 + (g_j-g_i)^2}} \end{equation}
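
Translated directly into R, the slope between two adjacent mesh points is simply the following (an illustrative sketch of the formula above; not the package's internal code, and the example values are arbitrary):

# Illustration only: slope between two adjacent points on the (p, g) mesh
slope <- function(p1, g1, z1, p2, g2, z2){
  (z2 - z1) / sqrt((p2 - p1)^2 + (g2 - g1)^2)
}
slope(p1 = 0.1, g1 = 1, z1 = 2.0, p2 = 0.2, g2 = 1, z2 = 3.5)  # example values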

The complete set of slopes for each adjacent pair of points, generated by gradient(), defines our approximation of $\nabla f$. The user must specify the datasets, methods, criteria, and statistic of interest. The output is a list of tables, of dimension $C \times M \times D$, where $C$ represents the number of criteria of interest. Note that since the slopes are directional, half of them will have redundant magnitude, but opposite sign.

Here is an example of a gradient calculation for the first surface plot generated in this vignette: the performance of the Linear Interpolation (LI) method on the first dataset according to the Mean Squared Error (MSE) criterion, where $f(p,g)$ represents the median value:

grad <- gradient(d = 1,
                 m = "LI",
                 crit = "MSE",
                 agEval = agEval,
                 f = "median")

The output is a list of tables of values, in the following format:

| g1 | p1 | z1 | g2 | p2 | z2 | slope |
|:---|:---|:---|:---|:---|:---|:------|
| ... | ... | ... | ... | ... | ... | ... |

Evaluating areas of statistical instability with gradEval()

The same gradient calculation underlies gradEval(), which identifies regions of statistical instability on a particular surface; these can then be compared to the performance of other selected algorithms in that same ($p,g$) region using a ratio. The user can choose which method to compare to the others via cm, and the dataset of interest via cd.

Below is an example of the use of this tool, where the performance of the Linear Interpolation method on the third dataset, according to the Mean Squared Error (MSE) criterion, is assessed with respect to its region of maximum instability (i.e., the values of $p$ and $g$ at which it occurs), and compared to the other methods in that same region.

gradEval(agEval = agEval, 
         crit = "MSE", 
         m = c("LI","NN","EWMA"), 
         d = 1:5, 
         cm = "LI", 
         cd = 3, 
         f = "median")

Determine skewness measures of the sampling distributions with plotSkew()

The user may wish to evaluate the degree of skewness of the sampling distributions of each performance criterion. This may give helpful insight as to which statistic will best measure central tendency (i.e., the sample mean or the sample median), assuming of course that the distributions are unimodal (this can be verified via the p-value of the dip test reported in the aggregated performance object).

On any set of $K$ interpolations, the skewness of the sampling distribution of a particular performance metric can be calculated as: \begin{equation} \frac{\sum_{i=1}^{n} (x_i-\bar{x})^3}{n\sigma_x^3} \end{equation} where $n = K$ and $\sigma_x$ is the standard deviation of the $x_i$.
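
Written directly in R, this amounts to the following sketch of the formula above (with sd() standing in for $\sigma_x$; illustrative only, not the package's implementation):

# Illustration only: sample skewness of a vector of K criterion values
skewness <- function(x){
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)  # sd() used as the estimate of sigma_x
}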

By plotting the distribution of the skewness values across all possible ($d,m,p,g$) sets, grouped by performance metric, the user can determine the average skewness of each, relative to some threshold. The convention in this package is as follows:

- Distributions of skewness values with a sample mean roughly centered at zero ($\left|\bar{X}_{\text{skew}}\right|\leq\left|s_{\text{skew}}/3\right|$) indicate that, on average, the criterion has a reliably symmetric sampling distribution, and the mean can be used to describe the behaviour of the performance.

- Distributions of skewness values whose center is not in the neighbourhood of zero ($\left|\bar{X}_{\text{skew}}\right|>\left|s_{\text{skew}}/3\right|$) indicate that, on average, the criterion has a skewed sampling distribution, and the median will better capture the essence of the data.

Plotting the distribution of the skewness measures

The plotSkew() function can be used to plot the distributions of the skewness values for the chosen performance metrics by toggling output = "plots". The user can toggle the symmetric argument to view subsets of skewness distributions: symmetric only, asymmetric only, or the full set.

plotSkew(agEval = agEval, cptwise = T, symmetric = NULL, output = "plots")

Dotted lines represent the sample medians and solid lines represent the sample means. A darker colour is indicative of criteria which have a symmetric sampling distribution.

Producing a table of average skewness measures

The output argument can be changed such that a table of values is shown instead, where bolded values are indicative of criteria that have a reasonably symmetric sampling distribution, on average.

plotSkew(agEval = agEval, cptwise = T, symmetric = NULL, output = "table")

Retrieving the raw data

As a supplement to the surface plots, the user can easily retrieve the comprehensive list of raw data in table-form using the compileMatrix() function.

compileMatrix(agEval = agEval)

$\LaTeX$ tables of $f(p,g)$ with ztable()

The ztable() function can be used to produce $\LaTeX$ data tables corresponding to the entire set of values $f(p,g)$ defining the performance surface. Note that in order for this action to be performed, the user must toggle sdist = F and define the sample statistic of interest with f. For example, the table of values corresponding to each $(p,g)$ point on the first surface plot generated in this vignette:

ztable(agEval = agEval, d = 1, crit = "MSE", m = "LI", sdist = F, f = "median")

$\LaTeX$ tables of sampling distributions with ztable()

The ztable() function can also be used to generate a summary of the sampling distribution of the chosen metric at each $(p,g)$, with respect to the 2.5% quantile, the median, and the 97.5% quantile. This action is performed by toggling sdist = T. For example, that of the Linear Interpolation method, on the first dataset according to the Mean Squared Error (MSE) criterion:

ztable(agEval = agEval, d = 1 , crit = "MSE", m = "LI", sdist = T)

Generate tables of cross-sectional data with CStable()

The CStable() function shows the values used to construct the cross-section plots, with the optimal methods bolded.

To view the data for a particular collapsed cross-section plot, the user can set the argument collapse = T. For example, the data for the collapsed $g$ cross-section plot for the first dataset according to the median Mean Squared Error (MSE) metric would be computed as follows:

CStable(d = 1, 
        agEval = agEval, 
        m = c("LI","NN","EWMA"), 
        crit="MSE",
        f = "median",
        cross_section = "g",
        collapse = T)

To view the data for a particular uncollapsed cross-section plot, the user can set the argument collapse = F and fixedIndex to be the index on which to fix the other variable. For example, if the user wished to see the data for the uncollapsed $g$ cross-section plot in the same specification as before, except fixed at $p = 20\%$, then fixedIndex = 2, as shown:

CStable(d = 1, 
        agEval = agEval, 
        m = c("LI","NN","EWMA"), 
        crit="MSE",
        f = "median",
        cross_section = "g",
        fixedIndex = 2,
        collapse = F)

The tables are produced in $\LaTeX$ format.

Need support?

Please email me at stmcastel@gmail.com if you have any additional questions.


