knitr::opts_chunk$set(echo = TRUE)
devtools::load_all()
library(mlr)
library(tidyr)
library(dplyr)
library(stringr)

Introduction

The tool we use to assess which spatialization method is the most appropriate is a batch of benchmark experiments performed on multiple sets of records from a historical dataset of observed weather data recorded by the stations of interest, where the predictive performance of various learners is assessed on each set of records by an iterated leave-one-out cross-validation.

This approach relies heavily on the mlr package [@bischl_mlr:_2016], which provides a unified interface for performing machine learning analyses in R. We strongly recommend reading the mlr documentation in order to understand its terminology and principles. Doing so will also give you valuable knowledge about machine learning theory.

In this package we refer extensively to the mlr terminology. Terms borrowed from mlr (e.g. learner, benchmark experiment) must be understood according to their mlr definitions.

We have decided to use data from 01 Jan 2016 to 31 Dec 2017, as these two years cover two very distinct situations (2016 = wet, 2017 = dry). To conduct these experiments we will use the makeBenchmark function integrated with this package. Many parameters can influence the quality of the spatial predictions, and these will be tested in multiple benchmark experiments. This is why the exploration field must be restricted and the investigated parameters must be prioritized. This article presents our investigation roadmap. Our global philosophy is to start with a simple approach and gradually add complexity to it.
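To make this concrete, here is a minimal, hedged sketch of a single benchmark experiment built directly with mlr; the makeBenchmark function of this package wraps a similar workflow. The records data frame, its columns and the learner classes below are purely hypothetical ("regr.km" requires the DiceKriging package):

library(mlr)

# hypothetical set of records for one hour: station coordinates, elevation
# and the observed air temperature (tsa) to spatialize
set.seed(42)
records = data.frame(
  x = runif(30),
  y = runif(30),
  elevation = runif(30, 0, 600),
  tsa = rnorm(30, mean = 10)
)

# one regression task per set of records, the target being the observation
task = mlr::makeRegrTask(id = "tsa_example", data = records, target = "tsa")

# a few candidate learners (classes shown here are examples only)
lrns = list(
  mlr::makeLearner("regr.lm", id = "multiLinearReg", predict.type = "se"),
  mlr::makeLearner("regr.km", id = "kriging", predict.type = "se")
)

# leave-one-out cross-validation, as described above
rdesc = mlr::makeResampleDesc("LOO")

# run the benchmark experiment and compare the learners' performances
bmr = mlr::benchmark(learners = lrns, tasks = task, resamplings = rdesc,
  measures = list(mlr::rmse, mlr::mae))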

Goals

We must assess which is the best spatialization technique for:

We don't need to assess the best spatialization technique for:

Terminology

A specific terminology is important, as we need precise definitions to avoid confusion in our future interpretations and discussions. As already stated earlier, we will need to conduct:

a batch of benchmark experiments performed on multiple sets of records from a historical dataset of observed weather data recorded by the stations of interest, where the predictive performance of various learners is assessed on each set of records by an iterated leave-one-out cross-validation

Exploration parameters

Stations of interest

We will investigate whether combining both the Pameseb and IRM networks improves the quality of the predictions. We will also consider whether correcting the Pameseb TSA data using the correction model built from the Humain stations intercomparison (::TODO:: see article) increases the quality of the predictions. This leads to three possible situations:

::BEGIN DRAFT:: It is important to get a deep insight into and a comprehensive overview of our weather station network before interpolating its data, in order to avoid integrating undesired local or structural effects during the interpolation process.

Particular attention will be paid to the analysis of the quality of the data produced by each of our stations. We will need to carry out an analysis to detect possible structural or local effects, such as overheating in temperature shelters.

Local temperature effects will be detectable by pointing out abnormally high or low values revealed by a long-term analysis of each of the stations of our network. Again, a good knowledge of the station network (e.g. the situation and immediate environment of each station) is required. To remove local effects from the interpolation process, each station could first be weighted according to a quality parameter characterizing its local situation. Time series analysis (example map ::TODO:: find source code and create dedicated vignette) will help us for this purpose.

The Agromet project aims to spatialize weather data gathered both by the Pameseb network, owned by the CRA-W, and by stations owned by the national weather office (RMI).

Before integrating two different networks into the spatialization process, we need to assess their intercompatibility. To address this, our team and the RMI are working on an intercomparison of the networks, performed by means of a single location (Humain, Belgium) equipped with two stations, one belonging to each network. The first results of this comparative analysis are available on this repository.

::TODO:: maps of the 2 networks

::END DRAFT::

Explanatory variables (called features in terms of mlr)

These variables must already be available as spatialized data in order to be used as explanatory variables. We distinguish between static and dynamic explanatory variables: the static variables are constant over time, while the dynamic ones vary over time. A minimal sketch illustrating this distinction is given below.
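In the sketch (all station IDs, variable names and values are hypothetical), a static feature such as elevation has one value per station, while a dynamic feature has one value per station and per observation time:

library(dplyr)

# static feature: one value per station, constant over time
static_features = data.frame(
  sid = c("61", "1000"),
  elevation = c(298, 558)
)

# dynamic feature: one value per station and per observation time
dynamic_features = data.frame(
  sid = rep(c("61", "1000"), each = 2),
  mtime = rep(as.POSIXct(c("2016-01-01 00:00", "2016-01-01 01:00")), 2),
  irradiance = c(0, 12, 0, 9)
)

# joining both yields the full set of explanatory variables (features)
features = dplyr::left_join(dynamic_features, static_features, by = "sid")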

The static explanatory variables to investigate are:

The dynamic explanatory variables to investigate are:

Algorithms

Numerous regression algorithms exist. We have decided to restrict the field of investigation to those that have already proven their efficiency in other studies:

The source code used to construct the learners from these algorithms is stored in the data-raw/makeLearners.R file of the present package. These learners are constructed on the basis of the mlr package and are available once you have called library(agrometeoR). The algorithms and their respective mlr classes that are actually used (or that will be used) to construct our learners are:

A deep learning approach will be based on the TensorFlow library but will not be considered before 2020. A good introduction to deep learning with R is available in the Machine Learning with R and TensorFlow video.
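To illustrate how such learners are constructed, here is a hedged sketch in the spirit of data-raw/makeLearners.R; the actual file may differ, and the learner classes and hyperparameter values shown are examples only:

library(mlr)

# examples of learner definitions: "regr.lm" is multiple linear regression,
# "regr.km" (DiceKriging) can serve as a kriging learner
learners = list(
  multiReg = mlr::makeLearner("regr.lm", id = "multiReg",
    predict.type = "se"),
  okriging = mlr::makeLearner("regr.km", id = "okriging",
    predict.type = "se", par.vals = list(covtype = "gauss"))
)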

Assessment of the best spatialization technique

Each constructed combination of the described exploration parameters will be referred to as an explorative construction. We have decided to identify each of these explorative constructions with an ID, as this provides a handy shortcut to describe a complex association of an algorithm, its potential multiple hyperparameter values, the investigated features, and the considered stations.

The respective performances of multiple explorative constructions are assessed by performing a batch of benchmark experiments. A new milestone is reached each time a new explorative construction has been integrated into a batch of benchmark experiments. A best construction reference is defined by the list of the IDs of the explorative constructions injected into the batch of benchmark experiments, together with the ID of the best explorative construction among this list.

When spatializing data in a production environment, the best construction reference must be stored as metadata of the spatialized data in order to keep track of both the chosen method and the investigated ones.
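A hedged sketch of what storing this metadata could look like (the object and field names below are hypothetical):

# placeholder standing in for the output of a spatialization run
spatialized_data = data.frame(px = 1:3, tsa_pred = c(4.2, 4.5, 4.1))

# the best construction reference: the candidate IDs injected into the
# batch of benchmark experiments and the ID of the winning construction
best_construction_reference = list(
  candidates = c(1, 7, 13),
  best = 13
)

# attach it as metadata so both the chosen and the investigated methods
# remain traceable
attr(spatialized_data, "best_construction_reference") = best_construction_reference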

Note that some explorative constructions might not be relevant to a specific target variable. Hence there is no need to systematically include them in a batch of benchmark experiments for this target variable.

Explorative constructions

We suggest storing the explorative constructions you want to test in a database. This package comes with an example database of explorative constructions to test, stored in the explorative_constructions.csv file. Below we present a table constructed from this file.

# read the example database of explorative constructions shipped with the package
explorative_constructions = read.csv2("./explorative_constructions.csv", sep = ";") %>%
  dplyr::mutate_all(.funs = as.factor)

# display it as an interactive table (rownames is an argument of datatable
# itself, not of its options list)
DT::datatable(explorative_constructions, filter = 'top', rownames = FALSE,
  options = list(pageLength = 50, autoWidth = TRUE))
ec = explorative_constructions %>%
  dplyr::mutate_all(as.character)

# example: extract the explorative construction with ID 13
ec13 = ec %>%
  dplyr::filter(id == 13)

# parsing approach inspired by:
# https://stackoverflow.com/questions/44639307/store-comma-separated-key-value-pair-in-a-string-to-key-value-variable-in-she
# https://stat.ethz.ch/pipermail/r-help/2002-May/021823.html

makeLearnerFromEC = function(ec) {

  # split the "key1 = value1, key2 = value2" hyperparameters string into
  # its individual "key = value" pairs
  par.vals = as.list(strsplit(ec$hyperparameters, ", ")[[1]])

  # name each element with the hyperparameter name (left of " = ")
  names(par.vals) = lapply(par.vals, FUN = function(x){
    unlist(stringi::stri_split_fixed(str = x, pattern = " = ", n = 2))[[1]]
  })

  # keep only the hyperparameter value (right of " = "); note that values
  # stay character strings here, so numeric hyperparameters would need an
  # explicit conversion (e.g. with utils::type.convert)
  par.vals = lapply(par.vals, FUN = function(x){
    unlist(stringi::stri_split_fixed(str = x, pattern = " = ", n = 2))[[2]]
  })

  # build the learner from the mlr class stored in the construction
  learnerForEC = mlr::makeLearner(
    cl = ec$mlr, id = ec$id, predict.type = "se", par.vals = par.vals
  )

  return(learnerForEC)
}

ec13L = makeLearnerFromEC(ec13)

Future milestones

As shown in the previous tables, the hyper-parameter values for the kriging learners KED and OK have been preset: no hyper-parameter tuning is conducted to assess their values. This choice was made with the project steering committee in order to reduce computing time and to keep things simple at the beginning. The combinations of hyper-parameter values are based on what the RMI uses for its own maps (personal communication by M. Journée).

In the near future, it might be interesting to use a hyperparameter tuning loop in order to find the best hyperparameter values to use. Note that this is a very computationally intensive step.
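A hedged sketch of what such a tuning loop could look like with mlr; the task, the parameter set and its values are illustrative only ("regr.km" requires the DiceKriging package):

library(mlr)

# toy task standing in for one set of records
task = mlr::makeRegrTask(
  data = data.frame(x = runif(30), y = runif(30), tsa = rnorm(30, 10)),
  target = "tsa"
)

lrn = mlr::makeLearner("regr.km", predict.type = "se")

# hyperparameters to explore for the kriging learner
ps = makeParamSet(
  makeDiscreteParam("covtype", values = c("gauss", "matern3_2")),
  makeLogicalParam("nugget.estim")
)

# exhaustive grid search, evaluated by leave-one-out cross-validation
ctrl = makeTuneControlGrid()
rdesc = mlr::makeResampleDesc("LOO")

res = mlr::tuneParams(lrn, task = task, resampling = rdesc,
  par.set = ps, control = ctrl)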

Metadata of makeSpatialization to store in the database

References


