knitr::opts_chunk$set(echo = TRUE, message = FALSE)
devtools::load_all()

Intro

As the documentation about the use of gstat to make a knn interpolation is really scarce (the only relevant information we have found is in the chapter interpolation of the Spatial Data Analysis and Modeling with R website), we have conducted a quick investigation to make sure that what gstat performs when using our nearest neighbours learners (e.g. lrn.gstat.5nn) correponds to what we actually expect from the algorithm.

Our expectation was that for a specific location where we want to predict a value, the algorithm computes the mean of the values of its 5 spatially closest neighbours when we implicitly filter the features to latitude and longitude only.

As it will be demonstrated here below, the gstat nearest neighbour (knn) interpolation works as expected.

Load the libraries

Here are all the libraries required for our investigation

# loading mlr & agrometeoR
library(FNN)
library(mlr)
library(agrometeoR)
library(dplyr)
library(sf)
library(mapview)

Example dataset

We create an example dataset from our AGROMET weather database. We extract an hourly observation of temperature for the 29 stations of the Pameseb AWS network.

user_token = Sys.getenv("AGROMET_API_V1_KEY")
dfrom = ex_dfrom
dto = ex_dto
stations = paste0(as.character(stations.df$sid), collapse = ",")
sensor = "tsa"
staticExpl = "elevation"

myDataset = makeDataset(
  user_token = user_token,
  stations = stations,
  dfrom = dfrom,
  dto = dto,
  sensor = sensor,
  staticExpl = staticExpl
)

To simplify our future calculations, we add a new custom id column where each id correspond to the row number

dataset = myDataset$output$value[[1]]
id = as.data.frame(1:nrow(dataset))
colnames(id) = "id"
dataset = dataset %>%
  dplyr::bind_cols(id)

Getting the 5 nearest neighbours id's of each station

Based on this stackoverflow question, for each station, we get its 5 closest neigbours id's and distances.

# Calculate the nearest ID and distance
near_data <- get.knn(dataset[, 4:5], k = 5)

# Extract the nearest ID
nn_index <- as.data.frame(near_data$nn.index)

# Extract the nearest Distance
nn_dist <- as.data.frame(near_data$nn.dist)

# combining again in dataset
dataset = dataset %>%
  dplyr::bind_cols(nn_index) 

# Excerpt of the data
head(dataset)

Training and testing dataset

We will conduct our test on a single station. For this purpose, we exclude the first row from our dataset to create the training set. The test set will therefore correspond to this excluded station.

train = dataset[2:29,]
test = dataset[1:1,]

train.task = makeTask(
  dataset = train,
  target = "tsa"
)
train.task = train.task$output$value
train.data = getTaskData(train.task)

test.task =  makeTask(
  dataset = test,
  target = "tsa"
)
test.task = test.task$output$value
test.data = getTaskData(test.task)

Building the model & predicting with gstat

We can now "train" our learner on the train set and use it to predict the value of the test set. The learner we invoke here is a learner that comes prebuild with the agrometeoR package. It's source code can be found in the makeLearners.R file on github

# training
train.model.5nn = mlr::train(
  learner = learners$otherLearners$lrn.gstat.5nn,
  task = train.task)

# predicting
test.pred.5nn = predict(train.model.5nn, test.task)

# adding predicted info to test set
test.data = test.data %>%
  dplyr::bind_cols(test.pred.5nn$data) %>%
  dplyr::mutate(residuals = truth - response)

# excerpt of the data
head(test.data)

We see that the gstat 5nn predicted value corresponds to r test.data$response Let's make a manual computation to check if we get the same value !

Manual computation

The manual computation is simple. We take the tsa value of the 5 stations which are closest to our excluded station and compute their mean. These stations id can be found using by filtering our dataset by the id that matche the id stored the columns V1, V2, V3, V4, V5.

manual_5NN = dataset %>%
  dplyr::filter(id %in% as.numeric(dataset[1,8:12])) %>%
  dplyr::summarise_at(.vars = "tsa", .funs = mean)

manual_5NN

We see that this manual computation provides the exact same value of the one provided by gstat !

Conclusion

We can assert that the behaviour of the gstat learner correspond to our expectations. Also, we must note that we have define our gstat.5nn learner such as the values of the neighbours are not weighted according to their distance to the station for which we want to predict the value. Here is the code corresponding to the definition of our gstat.5nn learner :

  lrn.gstat.5nn = makeFilterWrapper(
    learner = makeLearner(
      cl = "regr.gstat",
      id = "nn5",
      par.vals = list(
        set = list(idp = 0),
        nmax = 5,
        debug.level = 0),
      predict.type = "se"),
    fw.method = "linear.correlation",
    fw.mandatory.feat = c("y", "x"),
    fw.abs = 2)

The line set = list(idp = 0) actually tells that values of the neighbours must not be weighted according to their distance to the station for which we want to predict the value.

Here below is a map of the used station and their values

dataset.sf = st_as_sf(dataset, coords = c("x","y"))
sf::st_crs(dataset.sf) = 3812

test.data.sf = st_as_sf(test.data, coords = c("x","y"))
sf::st_crs(test.data.sf) = 3812

train.data.sf = st_as_sf(train.data, coords = c("x","y"))
sf::st_crs(train.data.sf) = 3812

map = mapview::mapview(test.data.sf["response"]) + mapview::mapview(train.data.sf["tsa"]) + 
mapview::mapview(dataset.sf["id"])

map


pokyah/agrometeoR.extras documentation built on May 27, 2019, 2:07 p.m.