validate: Validate FVCOM predictions with observations

View source: R/validate.R


Validate FVCOM predictions with observations

Description

For specified location(s), layer(s) and timestamp(s), this function pairs observations with corresponding FVCOM predictions. Locations and observations can be derived from different datasets. This means that animal movement datasets, in which locations and observations may be derived from different electronic tags, can be used for validation. The function uses nearest neighbour interpolation to extract FVCOM predictions for the nearest hour, at the specified layer and at the nearest mesh cell for which predictions are available, relative to the inputted timestamps, layers and coordinates. Paired observations and predictions can be used to examine model skill (i.e., prediction accuracy).

Usage

validate(
  dat_obs1,
  dat_obs2 = NULL,
  threshold_match_gap = NULL,
  mesh,
  match_hour = data.frame(hour = 0:23, index = 1:24),
  match_layer = NULL,
  match_mesh = NULL,
  extension = ".mat",
  corrupt = NULL,
  read_fvcom = function(con) R.matlab::readMat(con)$data,
  dir2load,
  cl = NULL,
  pass2varlist = NULL,
  verbose = TRUE
)

Arguments

dat_obs1

A dataframe which contains the location(s), layer(s) and timestamp(s) at which environmental observations have been made. These should be defined in columns named 'lat' and 'long', 'layer' and 'timestamp' respectively. A column 'key' may also need to be included (see Details). Location(s) are assumed to be in World Geodetic System format (i.e. WGS 84). Observations can either be located in this dataframe, in a column called 'obs', or in a separate dataframe (dat_obs2; see below).

dat_obs2

A dataframe which contains the environmental observations which will be used to validate model predictions. This must contain timestamps ('timestamp') and observations ('obs'). A column 'key' may also need to be included (see Details). Note that observations can be provided in dat_obs1 but, for some validation datasets, observations and locations are held in separate datasets (see Details); dat_obs2 allows for this flexibility.

threshold_match_gap

A numeric input which defines the maximum permitted duration (in seconds) between the timestamp of a known location and that of the nearest corresponding observation; rows in dat_obs1 that exceed this gap are removed. This is useful if dat_obs2 is provided and observations are only available for a sample of the timestamps at which locations are known. In this scenario, matching observations via the nearest timestamp may be inappropriate because there may be long gaps between the times of known locations and observations. In this case, all of the 'nearest' observation(s) that are more than threshold_match_gap seconds away (either before or after) from the required timestamp(s) are removed.

mesh

A SpatialPolygonsDataFrame-class object which defines the WeStCOMS mesh, created by build_mesh. The coordinate reference system of the mesh should match that of the inputted coordinates (i.e., 'lat' and 'long'; see dat_obs1).

match_hour

A dataframe with two integer columns named 'hour' and 'index' which defines the index in FVCOM arrays (i.e. the row) which corresponds to each hour (see extract).

match_layer

A dataframe with two integer columns named 'layer' and 'index' which defines the index in FVCOM arrays (i.e. the column) which corresponds to each layer (see extract).

match_mesh

(optional) A dataframe with two columns named 'mesh' and 'index' which defines the index in FVCOM arrays (columns or sheets for 2d and 3d arrays respectively) which corresponds to each mesh cell (see extract).

extension

A string which defines the extension of the FVCOM arrays (see extract).

corrupt

A vector of numbers, representing WeStCOMS date names, which define corrupt files (see extract).

read_fvcom

A function which is used to load files (see extract).

dir2load

A string which defines the directory from which to load FVCOM arrays containing predictions (see extract).

cl

(optional) A cluster object created by the parallel package (see extract).

pass2varlist

A list containing the names of exported objects. This may be required if cl is supplied. This is passed to the varlist argument of clusterExport. Exported objects must be located in the global environment (see extract).

verbose

A logical input which defines whether or not to display messages to the console detailing function progress.

Details

To use this function, the user must supply a dataframe (dat_obs1) which contains the location(s), layer(s) and time(s) at which observations were made, in columns named 'long' and 'lat', 'layer' and 'timestamp' respectively. A column, 'key', may also need to be included (see below). Location(s) are assumed to be in World Geodetic System format (i.e. WGS 84). Observations can either be located in this dataframe, in a column called 'obs', or in a separate dataframe (dat_obs2) with columns 'timestamp', 'obs' and 'key'.

One situation where this latter option is useful is for animal movement data, when the location of the animal is known from one tag type (e.g. passive acoustic telemetry) and observations are recorded by another tag type (e.g. an archival tag), possibly at a different resolution. In this scenario, both dataframes should contain a column, 'key', which connects the factor level(s) (e.g. individuals) for which locations have been observed in the first dataframe with the factor level(s) for which environmental observations (which will be used to validate the model) have been made in the second dataframe. A nearest neighbour matching approach is then used to add observations from the second dataframe (dat_obs2) into the first dataframe (dat_obs1) by selecting those that occurred closest in time to the timestamps stipulated in the first dataframe (this is why it is important to have a 'key' to distinguish among factor levels); see the sketch below.

The function uses a mesh supplied by the user to determine the FVCOM mesh nodes/elements within which each location lies (i.e. nearest neighbour interpolation). With this information, the function loads each FVCOM file from a user-defined directory, extracts the relevant model predictions from that file, and then returns the original dataframe with predictions added. FVCOM files can be loaded in parallel via the cl and pass2varlist arguments.
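
For illustration, the nearest-timestamp matching logic can be sketched as follows. This is an outline of the idea only, not the function's implementation; dat_obs1 and dat_obs2 stand for hypothetical dataframes with the columns described above.

# For each row of dat_obs1, pick the observation in dat_obs2 (with the same 'key')
# ... that was made closest in time to the known location:
dat_obs1$obs <- sapply(seq_len(nrow(dat_obs1)), function(i) {
  candidates <- dat_obs2[dat_obs2$key == dat_obs1$key[i], ]
  gaps <- abs(as.numeric(difftime(candidates$timestamp,
                                  dat_obs1$timestamp[i], units = "secs")))
  candidates$obs[which.min(gaps)]
})
# If threshold_match_gap is supplied, rows whose smallest gap exceeds this
# ... threshold are dropped.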

Value

The function returns a dataframe containing the columns in dat_obs1 with some additions. If dat_obs2 is used to add observations to this dataframe, the dataframe returned also contains the environmental observations ('obs') and the timestamps at which these were made ('timestamp_obs'), both extracted from dat_obs2, as well as the difference (in seconds) between the timestamps of known locations and the timestamps at which observations were made ('difftime'). 'mesh_ID' defines the node (or element) within which each observation occurred and 'index_row', 'index_mesh' (and, if applicable, 'index_layer') provide a reference to the cells in the FVCOM arrays from which model predictions were extracted. 'wc' contains the predicted conditions from WeStCOMS and 'diff' contains the difference between observed and predicted conditions (i.e. 'obs' - 'wc').

Author(s)

Edward Lavender

Examples


###########################################
#### Example (1): Temperature observations at specified locations and times

#### Hypothetical scenario
# Imagine we have recorded surface temperature observations
# ... with standard probes in the following locations and at the specified times.

#### Define dataframe:
# First, define imaginary timestamps and locations at which we made observations
set.seed(1)
timestamp <- as.POSIXct(c("2016-03-01", "2016-03-02"), tz = "UTC")
xy_node <- sp::coordinates(dat_mesh_around_nodes)
xy_sel <- sample(1:nrow(xy_node), size = length(timestamp))
long <- as.numeric(xy_node[xy_sel, 1])
lat <- as.numeric(xy_node[xy_sel, 2])
dat_obs1 <- data.frame(timestamp = timestamp, long = long, lat = lat, layer = 1)
# Second, add imaginary observations to the dataframe:
dat_obs1$obs <- stats::runif(nrow(dat_obs1), 7, 10)

#### Define match dataframes to provide the link between dat_obs1 and the WeStCOMS arrays:
match_hour <- data.frame(hour = 0:1, index = 1:2)
match_layer <- data.frame(layer = 1:2, index = 1:2)
match_mesh <- data.frame(mesh = dat_nodexy$node_id, index = 1:length(dat_nodexy$node_id))

#### Define mesh CRS:
proj <- sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
raster::crs(dat_mesh_around_nodes) <- proj

#### Define path from which to load predictions:
path <- system.file("WeStCOMS_files/temp",
                    package = "fvcom.tbx", mustWork = TRUE)
path <- paste0(path, "/")

#### Implement validation:
validation <-
  validate(dat_obs1 = dat_obs1,
           dat_obs2 = NULL,
           mesh = dat_mesh_around_nodes,
           match_hour = match_hour,
           match_layer = match_layer,
           match_mesh = match_mesh,
           dir2load = path,
           extension = ".mat",
           corrupt = NULL,
           cl = NULL,
           pass2varlist = NULL,
           verbose = TRUE
  )
# Examine output:
validation

#### Check that each node ID was correctly identified given the supplied coordinates:
# Compare selected nodes from which coordinates were inputted
# ... with the nodes identified by the function:
as.character(dat_mesh_around_nodes$ID)[xy_sel]; validation$mesh_ID
# We can see we have identified the correct position in the array corresponding to each node:
cbind(validation[, c("index_mesh", "mesh_ID")], match_mesh[validation$index_mesh, ])

#### We can see that the data for each node has been extracted correctly:
mapply(list.files(path, full.names = TRUE),
       split(validation, f = validation$date_name),
       FUN = function(con, df){
         # Load in file
         sample <- R.matlab::readMat(con)
         sample <- sample$data
         # Extract data (include df$layer here since 3d field):
         sample <- sample[df$index_hour, df$index_layer,  df$index_mesh]
         # Check identical:
         print(identical(sample, df$wc))
       })
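
#### Examine model skill
# Paired observations ('obs') and predictions ('wc') can be used to examine model skill.
# The summaries below are a simple sketch computed from the returned columns
# ... (they are not outputs of validate() itself):
error <- validation$obs - validation$wc    # equivalent to the 'diff' column
c(bias = mean(error), rmse = sqrt(mean(error^2)))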


###########################################
#### Example (2): A validation dataset from animal movement data

#### Hypothetical scenario
# Imagine that our validation dataset comes from animal movement data
# ... in which locations are known from one dataset and observations have been made
# ... and stored in another dataset.

#### Define dat_obs1
# Define some timestamps (on 00:00 hours to match sample data in package):
timestamp <- as.POSIXct(c("2016-03-01 00:00:32", "2016-03-02 00:00:30"), tz = "UTC")
# Define locations
long <- as.numeric(xy_node[xy_sel, 1])
lat <- as.numeric(xy_node[xy_sel, 2])
# Define dataframe
dat_obs1 <- data.frame(timestamp = timestamp, long = long, lat = lat, layer = 1, key = 1)

#### Define dat_obs2
# Imagine we have collected observations every 2 minutes
timestamp_obs <-
  seq(as.POSIXct("2016-03-01", tz = "UTC"), as.POSIXct("2016-03-03", tz = "UTC"), by = "2 mins")
dat_obs2 <- data.frame(timestamp = timestamp_obs,
                       key = 1,
                       obs = stats::runif(length(timestamp_obs), 7, 10))

#### Implement validation:
validation <-
  validate(dat_obs1 = dat_obs1,
           dat_obs2 = dat_obs2,
           mesh = dat_mesh_around_nodes,
           match_hour = match_hour,
           match_layer = NULL,
           match_mesh = match_mesh,
           dir2load = path,
           extension = ".mat",
           corrupt = NULL,
           cl = NULL,
           pass2varlist = NULL,
           verbose = TRUE
  )
#### Examine output:
validation

#### We can see that the algorithm has identified observations
# ... at the nearest point in time to known locations
validation[, c("timestamp", "timestamp_obs", "obs")]
cbind(validation$obs, dat_obs2[dat_obs2$timestamp %in% validation$timestamp_obs, "obs"])
# The algorithm then uses this information to extract outputs from WeStCOMS.
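
#### We can also restrict matches in time via threshold_match_gap.
# The call below is an illustrative sketch using the same inputs as above: any row in
# ... dat_obs1 for which the nearest observation in dat_obs2 is more than 60 s away
# ... (before or after) is dropped before predictions are extracted:
validation_strict <-
  validate(dat_obs1 = dat_obs1,
           dat_obs2 = dat_obs2,
           threshold_match_gap = 60,
           mesh = dat_mesh_around_nodes,
           match_hour = match_hour,
           match_mesh = match_mesh,
           dir2load = path)
validation_strict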

