# Managing DATRAS data {#mgm}

THIS IS STILL VERY, VERY VERY MUCH A TESTING GROUND

Load libraries and scripts:

```r
library(tidyverse)
library(fs)
library(tidyices)
```

## Data storage system

One of the greatest hindrances when analyzing DATRAS data is that a direct sql-connection to the database is not (yet) available for users outside the ICES Secretariat. In order to work with the data one hence first has to obtain a local copy. Given that the database is frequently updated, both with newly collected survey data and with corrections or modifications of older data, keeping the local copy in sync with the latest DATRAS version is a challenge in its own right. What makes it even more challenging is that it is impossible for an outside user to know what data have changed and when.

There are plenty of routes available for storing and maintaining the downloaded raw DATRAS data on your computer, so what is adopted here is just one approach. Local copies of the raw DATRAS data are stored in a single directory (here we use data-raw/datras) with file names of the format "survey_year_quarter_xx.rds", where xx stands for HH (haul data), HL (length frequency data) or CA (detail data). I.e. each table for a specific survey, year and quarter is stored as a separate file. An example of the directory content is shown below, in this case a binary copy of all the tables from the ROCKALL survey:

```
├── data-raw
│   ├── datras
│   │   ├── ROCKALL_1999_3_CA.rds
│   │   ├── ROCKALL_1999_3_HH.rds
│   │   ├── ROCKALL_1999_3_HL.rds
│   │   ├── ROCKALL_2001_3_CA.rds
│   │   ├── ROCKALL_2001_3_HH.rds
│   │   ├── ROCKALL_2001_3_HL.rds
│   │   ├── ROCKALL_2002_3_CA.rds
│   │   ├── ROCKALL_2002_3_HH.rds
│   │   ├── ROCKALL_2002_3_HL.rds
│   │   ├── ROCKALL_2003_3_CA.rds
│   │   ├── ROCKALL_2003_3_HH.rds
│   │   ├── ROCKALL_2003_3_HL.rds
│   │   ├── ROCKALL_2005_3_CA.rds
│   │   ├── ROCKALL_2005_3_HH.rds
│   │   ├── ROCKALL_2005_3_HL.rds
│   │   ├── ROCKALL_2006_3_CA.rds
│   │   ├── ROCKALL_2006_3_HH.rds
│   │   ├── ROCKALL_2006_3_HL.rds
│   │   ├── ROCKALL_2007_3_CA.rds
│   │   ├── ROCKALL_2007_3_HH.rds
│   │   ├── ROCKALL_2007_3_HL.rds
│   │   ├── ROCKALL_2008_3_CA.rds
│   │   ├── ROCKALL_2008_3_HH.rds
│   │   ├── ROCKALL_2008_3_HL.rds
│   │   ├── ROCKALL_2009_3_CA.rds
│   │   ├── ROCKALL_2009_3_HH.rds
│   │   └── ROCKALL_2009_3_HL.rds
```

There were two main reasons for selecting this storage structure:

  1. Updating the local copy is easier, since each survey, year and quarter can be refreshed individually (see the section on updating below).
  2. One can choose to work with only a subset of the data, e.g. just the haul (HH) data, as illustrated in the sketch below.
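As an illustration of the second point, only the haul files can be picked up with a simple glob on the file names. A minimal sketch, assuming the directory and naming convention shown above:

```r
# list only the haul (HH) files and read them into a list of raw tables
dir_ls("data-raw/datras", glob = "*_HH.rds") %>% 
  map(read_rds)
```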

## DATRAS overview

We can get an overview of the available years and quarters for each survey in the DATRAS database by using the dtrs_overview-function. E.g. if one were interested in the NS-IBTS survey one would get the available years and quarters by:

```r
surveys <- dtrs_overview(survey = c("NS-IBTS"))
surveys %>% knitr::kable()
```

## Downloading

The above overview dataframe ("surveys") can be passed to the dtrs_download-function. By doing so we access the DATRAS webservice for each survey, year and quarter and download the data onto the local computer. The data are then saved as "raw" data in the directory specified by the path-argument:

```r
surveys %>% 
  dtrs_download(path = "data-raw/datras")
```
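Since the overview is just a regular dataframe it can also be filtered before it is passed on, e.g. if one only wants to download the most recent years. A minimal sketch, assuming the overview holds a year column as used further below:

```r
# download only recent years of the selected survey(s)
surveys %>% 
  filter(year >= 2015) %>% 
  dtrs_download(path = "data-raw/datras")
```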

## Tidying

Let's first see what binary files we have downloaded:

```r
surveys <- 
  dir_info("data-raw/datras") %>% 
  mutate(survey = basename(path)) %>% 
  select(survey, size, time = birth_time) %>% 
  mutate(survey = str_remove(survey, ".rds")) %>% 
  separate(survey, c("survey", "year", "quarter", "type"),
           sep = "_", convert = TRUE) %>% 
  drop_na()
surveys %>% 
  select(-c(time)) %>%
  spread(type, size) %>% 
  knitr::kable()
```

Reading raw data into an R session and tidying it: The dtrs_read-function can be used to read any of the raw binary files into an R-session (TODO: add arguments for years and quarters). We can in addition pass the raw data directly into the dtrs_tidy-function to obtain tidy dataframes (see earlier chapters):

```r
hh <- 
  dtrs_read(surveys = "NS-IBTS", type = "HH") %>% 
  dtrs_tidy()
glimpse(hh)
```

```r
#species <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")
hl <- 
  dtrs_read(surveys = "NS-IBTS", type = "HL") %>%
  dtrs_tidy(hh = hh)
glimpse(hl)
```

```r
ca <- 
  dtrs_read(surveys = "NS-IBTS", type = "CA") %>% 
  dtrs_tidy()
glimpse(ca)
```

... should one save the tidy data or should one just always access the raw data and tidy within each R-session?
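If one opts for saving the tidy data, one simple possibility is to write each tidy table to its own binary file. A minimal sketch, the directory and file names here being just a suggestion:

```r
# store the tidy tables as binary files (hypothetical paths)
dir_create("data/tidy")
hh %>% write_rds("data/tidy/ns-ibts_hh.rds")
hl %>% write_rds("data/tidy/ns-ibts_hl.rds")
ca %>% write_rds("data/tidy/ns-ibts_ca.rds")
```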

## Updating the local copies

... something on practice for updating the local copy of the DATRAS data ...
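In the meantime, one possible practice, sketched below, is to compare the year-quarter combinations reported by dtrs_overview with what is already on disk and only download the missing ones. This is only a suggestion (it assumes the overview contains the columns survey, year and quarter) and it does not catch revisions of data that have already been downloaded:

```r
# what is available on the DATRAS webservice
available <- dtrs_overview(survey = c("NS-IBTS"))

# what is already on disk, parsed from the file names
local <- 
  dir_info("data-raw/datras") %>% 
  mutate(file = basename(path) %>% str_remove(".rds")) %>% 
  separate(file, c("survey", "year", "quarter", "type"),
           sep = "_", convert = TRUE) %>% 
  drop_na() %>% 
  distinct(survey, year, quarter)

# download only the year-quarter combinations not yet on disk
available %>% 
  anti_join(local, by = c("survey", "year", "quarter")) %>% 
  dtrs_download(path = "data-raw/datras")
```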

## A sidestep - an approach using list-columns in a dataframe

Here we download and tidy all the data using the map approach:

```r
x <- dtrs_overview()
x <-
  x %>% 
  # Only this year's surveys
  filter(year == 2018) %>% 
  mutate(id = 1:n()) %>% 
  group_by(id) %>%
  nest() %>%
  # operating on the datras webservice
  mutate(hh = map(data, tidyices:::get_hh),
         hl = map(data, tidyices:::get_hl),
         ca = map(data, tidyices:::get_ca)) %>% 
  # tidying the data
  mutate(hh =  map(hh,     tidy_hh),
         hl = map2(hl, hh, tidy_hl),
         ca =  map(ca,     tidy_ca))
```

What we got:

```r
x
```

So we have a data frame with list columns. Peek at the haul data:

```r
x %>% unnest(hh)
```
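The length and the age data can be peeked at in the same manner:

```r
# peek at the tidy length and age list columns
x %>% unnest(hl)
x %>% unnest(ca)
```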

