knitr::opts_chunk$set(echo = TRUE, message = FALSE) devtools::load_all()
As the documentation about the use of gstat to make a knn interpolation is really scarce (the only relevant information we have found is in the chapter interpolation of the Spatial Data Analysis and Modeling with R website), we have conducted a quick investigation to make sure that what gstat performs when using our nearest neighbours learners (e.g. lrn.gstat.5nn
) correponds to what we actually expect from the algorithm.
Our expectation was that for a specific location where we want to predict a value, the algorithm computes the mean of the values of its 5 spatially closest neighbours when we implicitly filter the features to latitude and longitude only.
As it will be demonstrated here below, the gstat nearest neighbour (knn) interpolation works as expected.
Here are all the libraries required for our investigation
# loading mlr & agrometeoR library(FNN) library(mlr) library(agrometeoR) library(dplyr) library(sf) library(mapview)
We create an example dataset from our AGROMET weather database. We extract an hourly observation of temperature for the 29 stations of the Pameseb AWS network.
user_token = Sys.getenv("AGROMET_API_V1_KEY") dfrom = ex_dfrom dto = ex_dto stations = paste0(as.character(stations.df$sid), collapse = ",") sensor = "tsa" staticExpl = "elevation" myDataset = makeDataset( user_token = user_token, stations = stations, dfrom = dfrom, dto = dto, sensor = sensor, staticExpl = staticExpl )
To simplify our future calculations, we add a new custom id column where each id correspond to the row number
dataset = myDataset$output$value[[1]] id = as.data.frame(1:nrow(dataset)) colnames(id) = "id" dataset = dataset %>% dplyr::bind_cols(id)
Based on this stackoverflow question, for each station, we get its 5 closest neigbours id's and distances.
# Calculate the nearest ID and distance near_data <- get.knn(dataset[, 4:5], k = 5) # Extract the nearest ID nn_index <- as.data.frame(near_data$nn.index) # Extract the nearest Distance nn_dist <- as.data.frame(near_data$nn.dist) # combining again in dataset dataset = dataset %>% dplyr::bind_cols(nn_index) # Excerpt of the data head(dataset)
We will conduct our test on a single station. For this purpose, we exclude the first row from our dataset to create the training set. The test set will therefore correspond to this excluded station.
train = dataset[2:29,] test = dataset[1:1,] train.task = makeTask( dataset = train, target = "tsa" ) train.task = train.task$output$value train.data = getTaskData(train.task) test.task = makeTask( dataset = test, target = "tsa" ) test.task = test.task$output$value test.data = getTaskData(test.task)
We can now "train" our learner on the train set and use it to predict the value of the test set. The learner we invoke here is a learner that comes prebuild with the agrometeoR package. It's source code can be found in the makeLearners.R file on github
# training train.model.5nn = mlr::train( learner = learners$otherLearners$lrn.gstat.5nn, task = train.task) # predicting test.pred.5nn = predict(train.model.5nn, test.task) # adding predicted info to test set test.data = test.data %>% dplyr::bind_cols(test.pred.5nn$data) %>% dplyr::mutate(residuals = truth - response) # excerpt of the data head(test.data)
We see that the gstat 5nn predicted value corresponds to r test.data$response
Let's make a manual computation to check if we get the same value !
The manual computation is simple. We take the tsa
value of the 5 stations which are closest to our excluded station and compute their mean. These stations id can be found using by filtering our dataset by the id that matche the id stored the columns V1, V2, V3, V4, V5.
manual_5NN = dataset %>% dplyr::filter(id %in% as.numeric(dataset[1,8:12])) %>% dplyr::summarise_at(.vars = "tsa", .funs = mean) manual_5NN
We see that this manual computation provides the exact same value of the one provided by gstat !
We can assert that the behaviour of the gstat learner correspond to our expectations. Also, we must note that we have define our gstat.5nn learner such as the values of the neighbours are not weighted according to their distance to the station for which we want to predict the value. Here is the code corresponding to the definition of our gstat.5nn learner :
lrn.gstat.5nn = makeFilterWrapper( learner = makeLearner( cl = "regr.gstat", id = "nn5", par.vals = list( set = list(idp = 0), nmax = 5, debug.level = 0), predict.type = "se"), fw.method = "linear.correlation", fw.mandatory.feat = c("y", "x"), fw.abs = 2)
The line set = list(idp = 0)
actually tells that values of the neighbours must not be weighted according to their distance to the station for which we want to predict the value.
Here below is a map of the used station and their values
dataset.sf = st_as_sf(dataset, coords = c("x","y")) sf::st_crs(dataset.sf) = 3812 test.data.sf = st_as_sf(test.data, coords = c("x","y")) sf::st_crs(test.data.sf) = 3812 train.data.sf = st_as_sf(train.data, coords = c("x","y")) sf::st_crs(train.data.sf) = 3812 map = mapview::mapview(test.data.sf["response"]) + mapview::mapview(train.data.sf["tsa"]) + mapview::mapview(dataset.sf["id"]) map
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.