interpol_splines: interpolated splines algorithm to fill missing values
In proto4426/ValUSunSSN: Val-U-Sun project functions


This main function fills gaps in monovariate or multivariate data by SVD imputation combined with spline interpolation. SVD imputation is closely related to the expectation-maximization (EM) algorithm.
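To make the idea of SVD imputation concrete, here is a minimal base-R sketch of the generic iterative scheme: fill the gaps with a first guess, approximate the matrix by a few SVD components, overwrite only the gaps with that approximation, and repeat. It is not the code of interpol_splines; the helper name svd_impute_sketch, the column-mean starting guess, and the stopping rule (relative change of the filled values) are assumptions made for illustration, and the spline-based scale separation is left out.

```r
# Hypothetical helper, not part of ValUSunSSN: iterative rank-ncomp SVD imputation.
svd_impute_sketch <- function(y, ncomp = 1, threshold = 1e-5, niter = 30) {
  y <- as.matrix(y)
  miss <- is.na(y)
  # First guess (assumption): replace each gap by the mean of its column
  filled <- y
  filled[miss] <- colMeans(y, na.rm = TRUE)[col(y)[miss]]
  for (i in seq_len(niter)) {
    s <- svd(filled, nu = ncomp, nv = ncomp)
    # Reconstruction from the first ncomp singular triplets
    low_rank <- s$u %*% diag(s$d[seq_len(ncomp)], ncomp) %*% t(s$v)
    previous <- filled[miss]
    filled[miss] <- low_rank[miss]  # observed values are never modified
    # Relative change of the filled values (stand-in for the stopping rule)
    change <- sum((filled[miss] - previous)^2) /
      max(sum(filled[miss]^2), .Machine$double.eps)
    if (change < threshold) break
  }
  filled
}

# Toy usage: a rank-1 matrix with two gaps coded as NA
truth <- outer(1:10, 1:4)
y_gappy <- truth
y_gappy[3, 2] <- NA
y_gappy[5, 1] <- NA
y_filled <- svd_impute_sketch(y_gappy, ncomp = 1)
```

The loop alternates between estimating a low-rank model from the current filled matrix and re-estimating the gaps from that model, which is where the resemblance to EM comes from.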

interpol_splines(y, nembed = 1, nsmo = 0, ncomp = 0, threshold1 = 1e-05, niter = 30, displ = F)

`y`	a numeric data.frame or matrix of data with gaps
`nembed`	integer value controlling embedding dimension (must be > 1 for monovariate data)
`nsmo`	integer value controlling cutoff time scale in number of samples. Set it to 0 if only one single time scale is desired.
`ncomp`	integer value controlling the number of significant components. It has to be specified to run in automatic mode. The default (0) leads to manual selection during the algorithm.
`threshold1`	numeric value controlling when the iterations stop: they end once the relative energy change is < threshold1
`niter`	numeric value controlling the maximum number of iterations
`displ`	logical value controlling whether some information is displayed in the console during the algorithm
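To picture what nembed does for a single series, the sketch below builds a standard delay-embedding (lagged) matrix with stats::embed(). Whether interpol_splines calls embed() internally is not stated in this page, so treat this only as an illustration of the concept; the toy series x is made up.

```r
# A short univariate series with one gap coded as NA (made-up values)
x <- c(1.0, 2.0, NA, 4.0, 5.0, 6.0)
nembed <- 2

# stats::embed() returns a matrix whose row t holds
# x[t + nembed - 1], ..., x[t], i.e. the series next to its lagged copies
x_embedded <- embed(x, dimension = nembed)

dim(x_embedded)         # number of rows is length(x) - nembed + 1
sum(is.na(x_embedded))  # the single gap now appears once per lagged column
```

With nembed = 1 the matrix would have a single column, so there would be no second column from which an SVD could borrow information; this is consistent with the requirement that nembed be > 1 for monovariate data.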

The method decomposes the data into two time scales, which are processed separately and then merged at the end. The cutoff time scale (nsmo) is expressed in number of samples. A spline "filter" is used for the filtering. Monovariate data must be embedded first (nembed > 1). In the initial data set, gaps must be coded as NA.
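The separation into two time scales can be sketched with a smoothing spline from base R: the smooth fit is the slow component and the residual is the fast component, and adding them restores the series. This is only an illustration under assumptions. The page does not say which spline routine interpol_splines uses or how nsmo maps to the amount of smoothing, so the df value below is arbitrary. Note also that stats::smooth.spline() rejects missing values, so the toy series here is complete.

```r
set.seed(1)
t_idx <- 1:200
# Toy series: a slow oscillation plus fast noise (made-up data, no gaps)
series <- sin(2 * pi * t_idx / 100) + 0.2 * rnorm(200)

# Slow component: smoothing spline; df = 10 is an arbitrary choice,
# lower df gives a smoother curve (longer time scale)
fit <- smooth.spline(t_idx, series, df = 10)
slow <- predict(fit, t_idx)$y

# Fast component: what the spline filter leaves behind
fast <- series - slow

# The two scales can be processed separately and merged by addition
merged <- slow + fast
```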

The three tunable (hyper)parameters are:

ncomp
nsmo
nembed

Only the first one, ncomp, really affects the outcome. A separation into two scales only (with a threshold between 50 and 100 days) is enough to properly capture both short- and long-term evolutions, and embedding dimensions of D = 2 to 5 are usually adequate for reconstructing daily averages. The optimum parameters are preferably determined, and the results validated, by cross-validation.

A list with the following elements:

y.filled: The same dataset as y but with gaps filled
w.distSVD: The distribution of the weights of the SVD
errorByComp: Numeric vector containing the errors. The source leaves open whether it has length niter (one error per iteration) or holds one error per component.
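The cross-validation recommended for choosing the parameters can be sketched as follows: hide some values that are actually observed, fill them, and compare the filled values with the hidden ones for each candidate setting. The sketch is self-contained, so it uses a hypothetical stand-in imputer (impute_rank_k, a plain rank-k SVD refill) and synthetic data instead of interpol_splines and real observations. In practice the call to the stand-in would be replaced by a call to interpol_splines with the candidate ncomp.

```r
# Hypothetical stand-in imputer (NOT interpol_splines): rank-k SVD refill
impute_rank_k <- function(y, k, niter = 30) {
  miss <- is.na(y)
  filled <- y
  filled[miss] <- colMeans(y, na.rm = TRUE)[col(y)[miss]]
  for (i in seq_len(niter)) {
    s <- svd(filled, nu = k, nv = k)
    low_rank <- s$u %*% diag(s$d[seq_len(k)], k) %*% t(s$v)
    filled[miss] <- low_rank[miss]
  }
  filled
}

set.seed(42)
# Synthetic data: two underlying components plus small noise
truth <- outer(sin(1:60 / 5), 1:6) + outer(cos(1:60 / 9), 6:1) +
  matrix(rnorm(360, sd = 0.05), 60, 6)

# Hide 10% of the known values; they form the validation set
held_out <- sample(length(truth), size = 36)
y_cv <- truth
y_cv[held_out] <- NA

# Score each candidate number of components on the hidden values only
candidates <- 1:4
rmse <- sapply(candidates, function(k) {
  filled <- impute_rank_k(y_cv, k)
  sqrt(mean((filled[held_out] - truth[held_out])^2))
})
best_ncomp <- candidates[which.min(rmse)]
```

With real data that already contain gaps, the held-out cells should be drawn only from cells that are observed, so that a reference value exists for each of them.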

Antoine Pissoort, antoine.pissoort@student.uclouvain.be

# Take this for input, as advised in the test.m file
y <- sqrt(data.mat2.fin + 1) # Selected randomly here, for testing

options(mc.cores = parallel::detectCores()) # use all available cores
z_splines <- interpol_splines(y, nembed = 2, nsmo = 8, ncomp = 4,
                              niter = 30, displ = FALSE)
# About 80 sec for the whole dataset
z_splines <- z_splines$y.filled
z_splines <- z_splines^2 - 1   # invert the sqrt(x + 1) transform applied to the input
z_splines[z_splines < 0] <- 0  # clip negative values to zero
ssn_splines <- z_splines