library(EML)
library(emld)
library(lubridate)
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
library(EFIstandards)

Visualizing the simple, example forecast of population growth of two interacting species

========================================================

Parsing metadata

First load the data using EML.

eml_filename <- system.file("extdata", "forecast-eml.xml", package="EFIstandards")
md <- read_eml(eml_filename)  #you could replace the variable here with a filename form a downloaded data package from a repository

Then to examine the details, use EML::eml_get. First, the title and abstract and who made it.

eml_get(md, "title")
eml_get(md, "abstract")
eml_get(md, "creator")

Then diving in, the coverage and keywords

eml_get(md, "coverage")
eml_get(md, "keywordSet")

Looks like it's a population level timeseries forecast of r eml_get(md, "geographicDescription")[[1]] for two species:

# these seem a bit of a pain to pull out maybe there's an easier way
spps <- eml_get(md, "taxonRankValue")
sp1 <- paste(spps[1:2], collapse=" ")
sp2 <- paste(spps[3:4], collapse=" ")
sp1
sp2

Reading structured methods

Load the additional metadata (see documentation of EFI forecasting standard):

methods_md <- eml_get(md, "additionalMetadata")
methods_md

This metadata contains details about the system forecast including number of state variables and parameters, and the methods used, including propagation of uncertainty. For example, details of the treatment of process error in this model are given by

methods_md %>% eml_get("process_error") # note could do this directly on \`md` object
n_ensembles <- methods_md %>% eml_get("process_error") %>% eml_get("size")
n_ensembles <- n_ensembles[[1]]

From this metadata, the output of this forecast should have error propagation via an ensemble of size r n_ensembles, process error for two state varible dimensions with no covariance between these. Further, one can check if the forecast used data assimilation (no):

eml_get(md, "assimilation")

Output metadata

Finally, I'll dive into the metadata for the output data itself. Wait, which output data? All we've seen so far is metadata!

First remember that EML specifies a tag "physical" for the location of an actual data file. There's plenty of metadata on the file itself including authentication format, etc

dt_md <- eml_get(md, "dataset")
eml_get(dt_md, "physical")

To get the data parse the filename and load it!

datafilename <- eml_get(dt_md, "objectName")[[1]]
# this steps are necessary because this is in an R package if you have a downloaded data repository
# you coudl just pass 'datafilename' into read.csv here
datafile <- system.file("extdata", datafilename, package="EFIstandards") 
dt <- as_tibble(read.csv(datafile, row.names=1))
dt

OK! This has the column names and actual data. To understand these column names, recall the output dataset metadata, which was loaded earlier. In particular the attributeList contains fine details like units and definitions for each column:

dt_md_cols <- dt_md %>% eml_get("attributeList") %>% get_attributes()
dt_md_cols$attributes

Cross-referencing this documentation of EFI forecasting standards, note that data_assimilation column is zeroed out so this performed no data assimilation and forecast column is zeroed out so this was a hindcast. As expected, there are r length(unique(dt$ensemble)) ensemble members in the forecast. As consumer of this type of forecast this is the only way to understand the uncertainty in the forecast.

Visualizing the data

The output includes densities (number per meter squared) for both species 1 and species 2. I assume these are in the same order as the entries in taxononmicCoverage: r sp1 and r sp2 at several depths measured in meters. Glancing at column depth there are only three depths represented. While reshaping the data, I will rename columns with useful information about their units and construct a categorical variable with the actual species names:

dtl <- pivot_longer(dt, c("species_1", "species_2"), values_to="density (m^-2)") %>%
  mutate(time = as.POSIXct(time),
         name=recode(name, species_1=sp1, species_2=sp2),
         `depth (m)` = depth)

Given the ensembles, order statitics or quantiles seem like a natural choice to summarize uncertainty, so compute the median and upper and lower quartiles:

dtl <- dtl %>%
  group_by(name, depth, time) %>%
  mutate(density_50 = median(`density (m^-2)`),
         density_25 = quantile(`density (m^-2)`, .25),
         density_75 = quantile(`density (m^-2)`, .75)
         )

Putting these pieces together to visualize and describe the forecast and its uncertainty:

ggplot(dtl) +
  geom_ribbon(aes(time, ymin=density_25, ymax=density_75, fill=name), alpha=0.3) +
  geom_line(aes(time, `density (m^-2)`, color=name, group=interaction(name, ensemble)), alpha=0.5) + 
  #geom_smooth(aes(group=name), method='gam', formula= y ~ s(x, bs = "cs", k=30)) + 
  geom_line(aes(time, density_50, color=name), size=1.25) +
  facet_grid(`depth (m)` ~ ., labeller=label_both) + 
  guides(color=guide_legend(title='Median density'),
         fill=guide_legend(title='25th-75th quantile density')) +
  labs(title=eml_get(md, 'title')[[1]], 
       subtitle=paste("Forecast from", n_ensembles, "ensembles (thin lines), with uncertainty from quantiles."),
       caption=paste("Demo visualization from EFIstandards package for an \nimaginary forecast at", eml_get(md, "geographicDescription")[[1]]))


cboettig/forecast-standards documentation built on Jan. 1, 2023, 8:21 a.m.