knitr::opts_chunk$set(
  collapse = TRUE,
  comment = ""
)

This document is meant as a primer to organize the development of the Norwegian Polar Institute's Marine Database.

Motivation {#motivation}

Consider the following scenarios:

  1. Dr A needs marine biological data for an article about zooplankton time series in Kongsfjorden. A time series means that Dr A wants to extract all available data ever collected from a specific area within a set of geographic coordinates. Dr A can program in Matlab but has never used any other programming language.

  2. Dr B needs chlorophyll data from the last cruise Dr B attended and wants to compare those results to chlorophyll values collected in 1996. Dr B is a longtime MS Excel user and has very little time to learn to program. Dr B also has very little patience and wants the data as soon as possible so that he/she can work with them in Excel after the kids have gone to bed at 10 pm. Dr B has a very important meeting tomorrow at 8 am and is tired, but needs to get this data work done tonight because there is no time during the day.

  3. Dr C has collected data from an instrument that has never been used before. These data must be deposited in the Marine Database for later use. Dr C knows how to program but does not have a good grip on database management. Dr C is busy and would like to get the data deposited as soon as possible with as little effort as possible. However, Dr C understands that depositing new data requires creating new meta-data entries and might take some effort.

  4. Data manager (Dm) D gets messy and unclear data from researchers. Dm D is tired of the situation and wants to make his/her job easier. Dm D is also concerned about the performance and standardization of the data, because that is what he/she has learned in the numerous workshops and meetings Dm D has been involved in. Dm D also feels a certain kind of pride: any data downloaded from a public database should follow international standard formats, Dm D thinks.

  5. Dr E has received all data from the last cruise as Excel sheets. These data have to be deposited in the Marine Database. Dr E is a longtime Excel user and has no time to learn programming. Dr E is inclined to send the spreadsheets to Data manager D, hoping that she/he will do the job for Dr E.

  6. Dr F is constructing an ecosystem model for the Arctic and needs data to validate her/his model. Dr F is used to working with large datasets and often reads data in netCDF format. Dr F wants to download all abundance data from the database in order to extract the variables Dr F needs for his/her model.

The question is: how do we accommodate all these scenarios in one database?

The vision

The answer to that question is called "the vision" and can be expressed in five words: comprehensive, simple, flexible, structured, and exportable.

Comprehensive means that the database should be able to hold all of the marine biological data collected, regardless of data type or instrument.

Simple means that the data structure should be kept as simple as possible, minimizing coding errors and storage space.

Flexible means that the data structure should accommodate new data types without redesigning the database.

Structured means that the structure should be rigid enough to be indexable and searchable, with standardized element names.

Exportable means that data should be exportable in the formats users need (tables, netCDF, CoverageJSON, etc.).

Longer explanation

Note that exportability and simplicity do not collide, because the format in which data are stored in the database does not need to be the same as the exported formats. Therefore we should first focus on the most efficient structure for storing the data, and think later about how to export them in the desired formats. This means that data do not need to be stored in, for example, the CoverageJSON format, but this does not exclude exporting data in such a format.

Simplicity minimizes possible coding errors when importing and exporting data and also allows the data to be stored in less space. Yet simplicity has limits, because the database also has to be comprehensive and structured.

Flexibility means that the data structure cannot be too rigid, which essentially leaves us with lists. Yet the structure of the list has to be rigid enough to be indexable and searchable, meaning that we do need to standardize all names of list elements.

Structure

Here is a suggested structure for a list containing data from a single station. Data from multiple stations can be chained together in a list, where Station ID is the highest level in the hierarchy.

Note that the names in the data structure are not standardized here; making the structure comprehensible has been prioritized instead. See Standardized variable names for suggested standardized names.

Station ID and searchable meta-data

All data for a single station should be bound under a unique station ID:

library(MarineDatabase)
library(data.tree)

stn <- Node$new("Station ID")
  meta <- stn$AddChild("Meta-data")
    meta$AddChild("Longitude")
    meta$AddChild("Latitude")
    meta$AddChild("Date-time")
    meta$AddChild("Bottom depth")
    meta$AddChild("Data types")

print(stn)

The station ID is proposed in the following format:

YYYYMMDD_<Expedition name>_<Station name>_<Replicate number>

Words between < > are Unicode strings without whitespace. The expedition and station names are included in the station ID and do not need to be repeated in the meta-data (although they can be repeated for convenience). The replicate number is intended for cases where the same station was sampled several times during a day on an expedition.
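As a sketch, a station ID following this format could be composed in R as follows (the expedition and station names below are hypothetical):

# Compose a station ID in the proposed format (hypothetical names):
station_id <- paste("20190801", "ExampleCruise", "StationA", "1", sep = "_")
station_id
# "20190801_ExampleCruise_StationA_1"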

Structuring meta-data directly under the station ID allows reading only the meta-data when building search functions for the data structure. This makes any search function faster, as the computer does not need to open and process all data while searching for the entries the user wants.

Note that geographic coordinates, date-time, and bottom depth vary for each instrument in the dataset. The information in the station meta-data can be taken from the main CTD cast or averaged over the meta-data of all instruments.

The Data types entry in the meta-data contains a character vector of all data available in the Data node (see Data types). This is intended for searching without having to load all data.
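As an illustration, a search could then touch only the meta-data elements. The sketch below assumes that stations are chained together in a plain R list and uses a hypothetical helper name; it is not part of the MarineDatabase package:

# A minimal sketch: keep only stations whose meta-data fall inside a bounding
# box and that contain a given data type, without opening the Data nodes.
find_stations <- function(stations, lon = c(-180, 180), lat = c(-90, 90), type = NULL) {
  keep <- vapply(stations, function(stn) {
    m <- stn[["Meta-data"]]
    inside <- m$Longitude >= lon[1] & m$Longitude <= lon[2] &
      m$Latitude >= lat[1] & m$Latitude <= lat[2]
    has_type <- is.null(type) || type %in% m[["Data types"]]
    isTRUE(inside) && isTRUE(has_type)
  }, logical(1))
  stations[keep]
}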

Data

Data will be organized under its own level:

stn <- Node$new("Station ID")
  meta <- stn$AddChild("Meta-data")
  dat <- stn$AddChild("Data")
    CTD <- dat$AddChild("CTDs")
      CTD1 <- CTD$AddChild("CTD 1")
        CTD1.meta <- CTD1$AddChild("Meta-data")
        CTD1.data <- CTD1$AddChild("Data")
      CTD2 <- CTD$AddChild("CTD 2")
        CTD2.meta <- CTD2$AddChild("Meta-data")
        CTD2.data <- CTD2$AddChild("Data")
    type1 <- dat$AddChild("Type 1")
      type1.meta <- type1$AddChild("Meta-data")
      type1.data <- type1$AddChild("Data")
    type2 <- dat$AddChild("Type 2")
      type2.meta <- type2$AddChild("Meta-data")
      type2.data <- type2$AddChild("Data")

print(stn)

CTD data are further organized under their own hierarchy because there are typically several CTD casts per station. The CTD meta-data (typically output by the instrument) are organized under their own level for each CTD cast, and the data (salinity, temperature, pressure, etc.) under another. See Data types for further information about data organization.

Other marine biological data, named "Type" here (see Data types), are on the same level as the CTDs and also contain their own meta-data and data hierarchies.

The order of data types in the list structure should be flexible, and any index/extract calls should be targeted at the standardized data type names. Whether a full word or the code in Table \@ref(tab:types) is used as the data type name can be decided based on convenience, but once one form is chosen, it cannot be changed easily.
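With data.tree, for example, a data type can be addressed by its name rather than by its position in the list (the names below follow the toy structure above):

# Index a data type by its standardized name, not by its position:
ctd_casts <- stn$Data$CTDs
# If the short codes in the sample type table were used as names instead,
# the call would target that code, e.g. stn$Data$CTD.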

Data types {#datatypes}

Data in NPI's marine biological sampling can be organized hierarchically: a sample is taken using a gear, and the samples from a gear are further allocated to sample types, which act as the overarching data categories for practical reasons: a sample type reflects where the samples are sent for analysis and who is responsible for the work. Data from an individual sample type often come back as an Excel sheet once analyzed.

Further, each sample type can contain several measured variables, and each variable is associated with a value and a unit. Units can be considered meta-data for each variable, as they are often standardized and one expedition should not use several units for the same data (if it does, the unit should be standardized before entering the values into the database). As an example, let us look at how Pigments (CHL in Table \@ref(tab:types)) data could be organized:

stn <- Node$new("Station ID")
  meta <- stn$AddChild("Meta-data")
  dat <- stn$AddChild("Data")
    chl <- dat$AddChild("Pigments")
      meta <- chl$AddChild("Meta-data")
        meta$AddChild("Longitude")
        meta$AddChild("Latitude")
        meta$AddChild("Sampling gear")
        meta$AddChild("Date-time")
        meta$AddChild("Bottom depth")
        vars <- meta$AddChild("Variables")
          vars$AddChild("Chlorophyll-a, Phaeopigment")
        uns <- meta$AddChild("Units")
          uns$AddChild("mg/m3, mg/m3")
      data <- chl$AddChild("Data")
        id <- data$AddChild("Sample ID")
          id$AddChild("CHL-001, CHL-002, CHL-003, ...")
        repl <- data$AddChild("Replicate")
          repl$AddChild("1, 1, 1, ...")
        dept <- data$AddChild("Depth")
          dept$AddChild("5, 10, 25, ...")
        varb <- data$AddChild("Variable")
          varb$AddChild("Chlorophyll-a, Chlorophyll-a, Chlorophyll-a, ...")
        val <- data$AddChild("Value")
          val$AddChild("3.567, 5.235, 0.422, ...")

print(stn)

The data in the Pigments -> Data node are organized in long format to make indexing easier. Each sublevel in the Data node should be a vector, and all vectors should be of equal length to make tabularizing possible (in other words, the Data node is just a vectorized table). NA should be used for missing values instead of an empty string.
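A sketch of the same Data node held as equal-length vectors and tabularized into a data frame (the values are the hypothetical ones used above):

# The long-format Data node as a vectorized table:
chl_data <- data.frame(
  `Sample ID` = c("CHL-001", "CHL-002", "CHL-003"),
  Replicate   = c(1, 1, 1),
  Depth       = c(5, 10, 25),
  Variable    = rep("Chlorophyll-a", 3),
  Value       = c(3.567, 5.235, 0.422),
  check.names = FALSE
)
chl_data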

The unit field has to support non-alphabetical characters, such as µ. Whether values should be given in completely standardized units (g/m3 in the example above) can be discussed. Milligrams in the example make the values easier to read, as the leading zeros are omitted; researchers therefore prefer to work with mg, and any new data will likely arrive in that format for the "Pigments" data type. The same applies to bacterial data, where weights may be expressed as nanograms or even picograms. Such numbers begin to approach the floating point limitations of computers, and therefore it might be best to define a standard unit for each type -> variable combination, but not to require that these units be SI base units (i.e. grams in the example above).

library(knitr)
library(kableExtra)

data("sample_types")

kable(sample_types, "html", caption = "List of sample types used in the marine biological work so far.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Importing data

Importing data into the database will be the major challenge once a rigid and working database structure has been devised. The import scripts should ideally be written as functions to enforce a standardized data structure. The functions could be compiled into a library/package to further enforce standardization and documentation of the process. The MarineDatabase package for R attempts to serve as an example of how such a package could be constructed. Spending time writing smart functions from the beginning of the process might be a reasonable way to minimize the future workload of Data manager D in Scenario 4. If these functions are robust and easy to use, the data section could share the library with researchers and ask them to run the data import functions themselves. In this way, the data section would receive data that are more or less ready for the database.
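A hedged skeleton of what such an import function could look like (the function name, arguments, and returned element names are hypothetical and not part of the current MarineDatabase package):

# Skeleton of an import function returning the Meta-data/Data structure
# sketched above (all names are hypothetical):
import_sample_type <- function(x, units) {
  # x: a data frame read from a standardized template for one sample type,
  #    already in long format (Sample ID, Replicate, Depth, Variable, Value)
  list(
    `Meta-data` = list(Variables = unique(x$Variable), Units = units),
    Data        = as.list(x)
  )
}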

In order to enable researchers to prepare data in an importable format themselves, it could be a good idea to maintain a frequently updated package on GitHub. Alternatively, one could build an online data import interface that runs the scripts on a server.

As mentioned in Scenario 5, a large proportion of marine biological data is received as Excel tables. There are R packages that read Excel tables directly, and such packages must exist for other languages too. Therefore it might be worthwhile to write an import routine that also accepts Excel files. The structure of the Excel files, however, should be standardized. Writing a guide on how to record data and giving standardized Excel templates to data suppliers might be a good investment for the future.
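For instance, the readxl package reads Excel sheets directly into R. A minimal sketch, assuming a standardized template with one sheet per sample type (the file name and sheet name are hypothetical):

library(readxl)

# Read the Pigments sheet from a standardized Excel template:
chl_sheet <- read_excel("example_cruise_samples.xlsx", sheet = "Pigments")
head(chl_sheet)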

CTD files are received in the standard SeaBird .cnv format. The MarineDatabase package uses the read.ctd function from the oce package to import these files. The resulting ctd objects can then be added to the data structure almost directly.
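A minimal sketch of that step, assuming a SeaBird file named "sta001.cnv" on disk:

library(oce)

# Read one CTD cast and inspect it:
ctd <- read.ctd("sta001.cnv")
summary(ctd)
# The ctd object can then be placed under the station's Data -> CTDs node
# together with its own meta-data.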

Binding meta-data and data

Standardized sample IDs are used for each expedition (see the structure in Data types). These IDs are further passed to any lab where samples are analyzed. Therefore the sample IDs, together with a standardized sample log in an Excel sheet (produced by Dr Wold), can be used as a starting point for binding data together.
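A sketch of that binding step, merging hypothetical lab results into a hypothetical sample log by the standardized sample ID:

# Join lab results onto the sample log using the shared sample ID:
sample_log <- data.frame(`Sample ID` = c("CHL-001", "CHL-002", "CHL-003"),
                         Depth = c(5, 10, 25),
                         check.names = FALSE)
lab_results <- data.frame(`Sample ID` = c("CHL-001", "CHL-002", "CHL-003"),
                          `Chlorophyll-a` = c(3.567, 5.235, 0.422),
                          check.names = FALSE)
merge(sample_log, lab_results, by = "Sample ID")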

Export formats of data

Exporting data from the database should not be a major challenge as long as the data structure is consistent. For example, writing a script that exports the relevant data for Dr B in tabular format is just a matter of searching the meta-data, picking the relevant data fields, and merging them into a table. This table can then be exported as a .csv file, which in turn can (hopefully) be read by Dr B.
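A sketch of the end of that pipeline, once the relevant fields have been picked and merged (the station ID, values, and file name are hypothetical):

# Write the merged table as a .csv file that opens directly in Excel:
out <- data.frame(Station  = "20190801_ExampleCruise_StationA_1",
                  Depth    = c(5, 10, 25),
                  Variable = "Chlorophyll-a",
                  Value    = c(3.567, 5.235, 0.422))
write.csv(out, "chlorophyll_export.csv", row.names = FALSE)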

Writing wrappers that take structured data from the database and add the fields required by standardized data formats (netCDF, CoverageJSON, you name it) is likewise only a matter of manipulating the data in a script.
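As an illustration, such a wrapper could use the ncdf4 package to write the same chlorophyll profile as a netCDF file. This is only a minimal sketch; the dimension and variable names are hypothetical and the result is not a full CF-compliant export:

library(ncdf4)

# Define a depth dimension and a chlorophyll variable, then write the values:
depth_dim <- ncdim_def("depth", units = "m", vals = c(5, 10, 25))
chl_var   <- ncvar_def("chlorophyll_a", units = "mg m-3", dim = depth_dim,
                       missval = -9999, prec = "double")
nc <- nc_create("chlorophyll_export.nc", chl_var)
ncvar_put(nc, chl_var, c(3.567, 5.235, 0.422))
nc_close(nc)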

Sharing, open-source and future perspectives

If this database format proves to work, sharing the database structure and the GitHub packages with the other institutions (IMR, UiT, UNIS, etc.) using the new research vessel could allow constructing a standardized database format for marine biological data in Norway.

Creating open-source scripts from the beginning in a language accepted by these institutions could also reduce the NPI data section's workload in the future, as data managers from other institutions could participate in the development of a common marine biological database.

Standardized variable names {#standards}

Variable names should be standardized in a practical manner, whereas units should be standardized following accepted norms. Exported variable names, on the other hand, should be standardized to CF standard names when possible.

A table of standardized variable names comes here.

Known variable levels for data types

Tables of known levels for variables for each data type and standardized units come here.


