knitr::opts_chunk$set(echo = TRUE)
devtools::load_all()
library(mlr)
library(tidyr)
library(dplyr)
library(stringr)

Introduction

The tool we use to assess which spatialization method is the most appropriate is a batch of benchmark experiments performed on multiple sets of records from a historical dataset of observed weather data recorded by the stations of interest, where the predictive performance of various learners is assessed on each set of records by an iterated leave-one-out cross-validation.

This approach relies heavily on the mlr package [@bischl_mlr:_2016], which provides a unified interface for performing machine learning analyses in R. We strongly recommend reading the mlr documentation in order to understand its terminology and principles. Doing so will also give you valuable knowledge about machine learning theory.

In this package we refer extensively to the mlr terminology. Terms borrowed from mlr (e.g. learner, benchmark experiment) must be understood according to their mlr definitions.

We have decided to use data from 01 Jan 2016 to 31 Dec 2017, as these two years cover two very distinct situations (2016 = wet, 2017 = dry). To conduct these experiments we will use the makeBenchmark function integrated with this package. Many parameters can influence the quality of the spatial predictions, and these will be tested in multiple benchmark experiments. This is why the exploration field must be restricted and the investigated parameters must be prioritized. This article presents our investigation roadmap. Our global philosophy is to start with a simple approach and gradually add complexity to it.
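To make this concrete, here is a minimal, hedged sketch of a single benchmark experiment built directly with mlr; the makeBenchmark function of this package wraps a similar workflow. The records data frame, its columns and the learner classes below are purely hypothetical ("regr.km" requires the DiceKriging package):

library(mlr)

# hypothetical set of records for one hour: station coordinates, elevation
# and the observed air temperature (tsa) to spatialize
set.seed(42)
records = data.frame(
  x = runif(30),
  y = runif(30),
  elevation = runif(30, 0, 600),
  tsa = rnorm(30, mean = 10)
)

# one regression task per set of records, the target being the observation
task = mlr::makeRegrTask(id = "tsa_example", data = records, target = "tsa")

# a few candidate learners (classes shown here are examples only)
lrns = list(
  mlr::makeLearner("regr.lm", id = "multiLinearReg", predict.type = "se"),
  mlr::makeLearner("regr.km", id = "kriging", predict.type = "se")
)

# leave-one-out cross-validation, as described above
rdesc = mlr::makeResampleDesc("LOO")

# run the benchmark experiment and compare the learners' performances
bmr = mlr::benchmark(learners = lrns, tasks = task, resamplings = rdesc,
  measures = list(mlr::rmse, mlr::mae))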

Goals

We must assess which is the best spatialization technique for:

We don't need to assess the best spatialization technique for:

Terminology

A specific terminology is important, as we need precise definitions to avoid confusion in our future interpretations and discussions. As already stated earlier, we will need to conduct:

a batch of benchmark experiments performed on multiple sets of records from a historical dataset of observed weather data recorded by the stations of interest, where the predictive performance of various learners is assessed on each set of records by an iterated leave-one-out cross-validation

Exploration parameters

Stations of interest

We will investigate whether combining both the Pameseb and IRM networks improves the quality of the predictions. We will also consider whether correcting the Pameseb TSA data using the correction model built from the Humain stations intercomparison (::TODO:: see article) increases the quality of the predictions. This leads to three possible situations:

::BEGIN DRAFT:: It is important to get a deep insight into and a comprehensive overview of our weather station network before interpolating its data, in order to avoid integrating undesired local or structural effects during the interpolation process.

Particular attention will be paid to the analysis of the quality of the data produced by each of our stations. We will need to carry out an analysis to detect possible structural or local effects, such as overheating in temperature shelters.

Local temperature effects will be detectable by pointing out abnormally high or low values revealed by a long-term analysis of each of the stations of our network. Again, a good knowledge of the station network (e.g. the situation and immediate environment of each station) is required. To remove local effects from the interpolation process, each station could first be weighted according to a quality parameter characterizing its local situation. Time series analysis (example map ::TODO:: find source code and create dedicated vignette) will help us for this purpose.

The Agromet project aims to spatialize weather data gathered both by the Pameseb network, owned by the CRA-W, and by stations owned by the national weather office (RMI).

Before integrating two different networks into the spatialization process, we need to assess their intercompatibility. To address this, our team and the RMI are working on an intercomparison of the networks, performed by means of a single location (Humain, Belgium) equipped with two stations, one belonging to each network. The first results of this comparative analysis are available on this repository.

::TODO:: maps of the 2 networks

::END DRAFT::

Explanatory variables (called features in terms of mlr)

These variables must already be available as spatialized data in order to be used as explanatory variables. We distinguish between static and dynamic explanatory variables: the static variables are constant over time, while the dynamic ones vary over time. A minimal sketch illustrating this distinction is given below.
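In the sketch (all station IDs, variable names and values are hypothetical), a static feature such as elevation has one value per station, while a dynamic feature has one value per station and per observation time:

library(dplyr)

# static feature: one value per station, constant over time
static_features = data.frame(
  sid = c("61", "1000"),
  elevation = c(298, 558)
)

# dynamic feature: one value per station and per observation time
dynamic_features = data.frame(
  sid = rep(c("61", "1000"), each = 2),
  mtime = rep(as.POSIXct(c("2016-01-01 00:00", "2016-01-01 01:00")), 2),
  irradiance = c(0, 12, 0, 9)
)

# joining both yields the full set of explanatory variables (features)
features = dplyr::left_join(dynamic_features, static_features, by = "sid")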

The static explanatory variables to investigate are:

The dynamic explanatory variables to investigate are:

Algorithms

Numerous regression algorithms exist. We have decided to restrict the field of investigation to those that have already proven their efficiency in other studies:

The source code used to construct the learners from these algorithms is stored in the data-raw/makeLearners.R file of the present package. These learners are constructed on the basis of the mlr package and are available once you have called library(agrometeoR). The algorithms and their respective mlr classes that are actually used (or that will be used) to construct our learners are:

A deep learning approach will be based on the TensorFlow library but will not be considered before 2020. A good introduction to deep learning with R is available in the Machine Learning with R and TensorFlow video.
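To illustrate how such learners are constructed, here is a hedged sketch in the spirit of data-raw/makeLearners.R; the actual file may differ, and the learner classes and hyperparameter values shown are examples only:

library(mlr)

# examples of learner definitions: "regr.lm" is multiple linear regression,
# "regr.km" (DiceKriging) can serve as a kriging learner
learners = list(
  multiReg = mlr::makeLearner("regr.lm", id = "multiReg",
    predict.type = "se"),
  okriging = mlr::makeLearner("regr.km", id = "okriging",
    predict.type = "se", par.vals = list(covtype = "gauss"))
)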

Assessment of the best spatialization technique

Each constructed combination of the described exploration parameters will be referred to as an explorative construction. We have decided to identify each of these explorative constructions with an ID, as this provides a handy shortcut to describe a complex association of an algorithm, its potential multiple hyperparameter values, the investigated features, and the considered stations.

The respective performances of multiple explorative constructions are assessed by performing a batch of benchmark experiments. A new milestone is reached each time a new explorative construction has been integrated into a batch of benchmark experiments. A best construction reference is defined by the list of the IDs of the explorative constructions injected into the batch of benchmark experiments, together with the ID of the best explorative construction among this list.

When spatializing data in a production environment, the best construction reference must be stored as metadata of the spatialized data in order to keep track of both the chosen method and the investigated ones.
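A hedged sketch of what storing this metadata could look like (the object and field names below are hypothetical):

# placeholder standing in for the output of a spatialization run
spatialized_data = data.frame(px = 1:3, tsa_pred = c(4.2, 4.5, 4.1))

# the best construction reference: the candidate IDs injected into the
# batch of benchmark experiments and the ID of the winning construction
best_construction_reference = list(
  candidates = c(1, 7, 13),
  best = 13
)

# attach it as metadata so both the chosen and the investigated methods
# remain traceable
attr(spatialized_data, "best_construction_reference") = best_construction_reference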

Note that some explorative constructions might not be relevant to a specific target variable. Hence there is no need to systematically include them in a batch of benchmark experiments for this target variable.

Explorative constructions

We suggest storing the explorative constructions you want to test in a database. This package comes with an example database of explorative constructions to test, stored in the explorative_constructions.csv file. Below we present a table constructed from this file.

# read the example database of explorative constructions shipped with the package
explorative_constructions = read.csv2("./explorative_constructions.csv", sep = ";") %>%
  dplyr::mutate_all(.funs = as.factor)

# display it as an interactive table (rownames is an argument of datatable
# itself, not of its options list)
DT::datatable(explorative_constructions, filter = 'top', rownames = FALSE,
  options = list(pageLength = 50, autoWidth = TRUE))
ec = explorative_constructions %>%
  dplyr::mutate_all(as.character)

# example: extract the explorative construction with ID 13
ec13 = ec %>%
  dplyr::filter(id == 13)

# parsing approach inspired by:
# https://stackoverflow.com/questions/44639307/store-comma-separated-key-value-pair-in-a-string-to-key-value-variable-in-she
# https://stat.ethz.ch/pipermail/r-help/2002-May/021823.html

makeLearnerFromEC = function(ec) {

  # split the "key1 = value1, key2 = value2" hyperparameters string into
  # its individual "key = value" pairs
  par.vals = as.list(strsplit(ec$hyperparameters, ", ")[[1]])

  # name each element with the hyperparameter name (left of " = ")
  names(par.vals) = lapply(par.vals, FUN = function(x){
    unlist(stringi::stri_split_fixed(str = x, pattern = " = ", n = 2))[[1]]
  })

  # keep only the hyperparameter value (right of " = "); note that values
  # stay character strings here, so numeric hyperparameters would need an
  # explicit conversion (e.g. with utils::type.convert)
  par.vals = lapply(par.vals, FUN = function(x){
    unlist(stringi::stri_split_fixed(str = x, pattern = " = ", n = 2))[[2]]
  })

  # build the learner from the mlr class stored in the construction
  learnerForEC = mlr::makeLearner(
    cl = ec$mlr, id = ec$id, predict.type = "se", par.vals = par.vals
  )

  return(learnerForEC)
}

ec13L = makeLearnerFromEC(ec13)

Future milestones

As shown in the previous tables, the hyper-parameter values for the kriging learners KED and OK have been preset: no hyper-parameter tuning is conducted to assess their values. This choice was made with the project steering committee in order to reduce computing time and to keep things simple at the beginning. The combinations of hyper-parameter values are based on what the RMI uses for its own maps (personal communication by M. Journée).

In the near future, it might be interesting to use a hyperparameter tuning loop in order to find the best hyperparameter values to use. Note that this is a very computationally intensive step.
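A hedged sketch of what such a tuning loop could look like with mlr; the task, the parameter set and its values are illustrative only ("regr.km" requires the DiceKriging package):

library(mlr)

# toy task standing in for one set of records
task = mlr::makeRegrTask(
  data = data.frame(x = runif(30), y = runif(30), tsa = rnorm(30, 10)),
  target = "tsa"
)

lrn = mlr::makeLearner("regr.km", predict.type = "se")

# hyperparameters to explore for the kriging learner
ps = makeParamSet(
  makeDiscreteParam("covtype", values = c("gauss", "matern3_2")),
  makeLogicalParam("nugget.estim")
)

# exhaustive grid search, evaluated by leave-one-out cross-validation
ctrl = makeTuneControlGrid()
rdesc = mlr::makeResampleDesc("LOO")

res = mlr::tuneParams(lrn, task = task, resampling = rdesc,
  par.set = ps, control = ctrl)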

Metadata of makeSpatialization to store in the database

References


