In RETURN-project/BenchmarkRecovery: Benchmark recovery indicators derived from remote sensing time series

Problem description

Currently, our vignette sensitivityAnalysis.Rmd is using evalParam for creating outputs (in the form of saved files) corresponding to each one of the 96 permutations of the following parameters:

par1, whose possible values are (distMag, distRec, distT, missVal, remSd, seasAmp)
par2, whose possible values are (R80p, RRI, simTS, YrYr)
par3, whose possible values are (MAPE, nTS, R2, RMSE)

The loop corresponding to par1 happens at vignette level via an lapply function. The other two happen inside evalParam via nested for loops.

For the sake of code clarity, speed and parallelization, it could be a good idea to refactor evalParam in a more atomic way. By atomic I mean accepting one or more parameters identifying which one of the 96 possible cases we want to calculate, and calculating that and only that one. Depending on the amount of identification parameters, the whole process can be efficiently looped via vectorization, lapply (for a single id per row) or mapply (for multiple ids per row).

The evalParam function is already quite complicated, so we'll assume vectorization is not feasible. The two remaining possibilities are thus:

Use a single identifier (i.e.: evalParam(..., id = "R80p_nTS_distT"), to later be called via lapply).
Use a multiple identifier (i.e: evalParam(..., par1 = "R80p", par2 = "nTS", par3 = "distT"), to later be called via mapply).

Graphical summary

Currently evalParam is doing too much:

It will be advisable to split like:

In the present vignette I show a simple case study of using apply functions in this kind of problems.

knitr::opts_chunk$set(echo = TRUE)

Generate the input data frame

We will generate a simple data set with all the 12 permutations corresponding to these identifiers:

type, whose possible values are ("A", "B", "C").
gender, whose possible values are (1, 2).
country, whose possible values are ("NL", "BE").

plus a single column containing some measurement (just a random number in this case).

# Identifiers
v1 <- c("A", "B", "C")
v2 <- c(1, 2)
v3 <- c("NL", "BE")

# Measurements
v4 <- runif(n = length(v1) * length(v2) * length(v3))

idf <- expand.grid(type = v1, gender = v2, country = v3, stringsAsFactors = FALSE)
idf <- cbind(idf, measurement = v4)

print(idf)

It is usually a good idea to assign meaningful names to the rows.

# Auxiliary function.
# Creates a single row identifier by combining type, gender and country
# For instance: "B_2_NL"
create_id <- function(type, gender, country) {
  id = paste(type, gender, country, sep = "_")
}

rownames(idf) <- create_id(idf$type, idf$gender, idf$country) # Rownames checks that the names are not duplicated

print(idf)

If the name is well chosen (and ours is) it can even be redundant with the other three id columns. But for this tutorial we'll keep all of them, in order to investigate different ways of applying functions to a given row.

Analyze the data

We want to perform the following analysis:

for each row
multiply the measurement by 2 if the country is Belgium.
multiply the measurement by -2 if the country is the Netherlands.

The three functions below do the same, and only differ in the way the input row is specified:

# Analyze (brute-force)
# This method is expected to be called row by row (for instance, inside a loop). It will crash otherwise
analyze <- function(row) {

  # Auxiliary function. 
  # Returns 2 for Belgium and -2 for The Netherlands
  country_to_number <- function(country) {
    if(country == "BE") {
      return(2)
    } else {
      return(-2)
    }
  }

  number <- country_to_number(row$country)
  return(number * row$measurement)
}

# Analyze (prepared for lapply)
# Same as analyze, but a single id (the row name) has to be provided
lanalyze <- function(id, data) {
  row <- data[id, ]
  analyze(row)
}

# Analyze (prepared for mapply)
# Same as analyze, but type, gender and country have to be provided
manalyze <- function(type, gender, country, data) {
  id <- create_id(type, gender, country)
  lanalyze(id, data)
}

The analysis itself is deliberately silly. It is just an example of an action to be:

Performed on an input data set.
Controlled by an input data set (in this case, the same one).

Apply to a desired subset

lresults <- lapply(c("A_1_NL", "B_2_BE"), lanalyze, data = idf) # Using lapply (single row identifier)
mresults <- mapply(manalyze, c("A", "B"), c(1, 2), c("NL", "BE"), MoreArgs = list(data = idf)) # Using mapply (multiple row identifiers)

print(lresults)
print(mresults)

Apply to the whole dataset

lresults <- lapply(create_id(idf$type, idf$gender, idf$country), lanalyze, data = idf) # Using lapply (single row identifier)
mresults <- mapply(manalyze, idf$type, idf$gender, idf$country, MoreArgs = list(data = idf)) # Using mapply (multiple row identifiers)

print(lresults)
print(mresults)

RETURN-project/BenchmarkRecovery documentation built on July 13, 2021, 5:47 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

RETURN-project/BenchmarkRecovery
Benchmark recovery indicators derived from remote sensing time series

In RETURN-project/BenchmarkRecovery: Benchmark recovery indicators derived from remote sensing time series

Problem description

Graphical summary

Generate the input data frame

Analyze the data

Apply to a desired subset

Apply to the whole dataset

R Package Documentation

Browse R Packages

We want your feedback!

RETURN-project/BenchmarkRecovery Benchmark recovery indicators derived from remote sensing time series

In RETURN-project/BenchmarkRecovery: Benchmark recovery indicators derived from remote sensing time series

Problem description

Graphical summary

Generate the input data frame

Analyze the data

Apply to a desired subset

Apply to the whole dataset

R Package Documentation

Browse R Packages

We want your feedback!

RETURN-project/BenchmarkRecovery
Benchmark recovery indicators derived from remote sensing time series