Importing DATRAS {#import}

see (see Figure \@ref(fig:allhauls))

Preamble

library(tidyverse)
library(lubridate)
library(sf)
library(icesDatras)
library(tidyices)
# devtools::install_github("einarhjorleifsson/gisland", dependencies = FALSE)
library(gisland)
# devtools::install_github("ropensci/worrms")
library(worrms)

Data download

With the advent of the DATRAS webserver one has for some time been able to access the the DATRAS data programmatically rather than through the point-and-click DATRAS download facilities. The icesDatras-package allows one to download and access the data directly in R via the getDATRAS function. The function is (currently) limited to only accessing one survey at the time, but multiple years and quarters can however be specified. In addition in its current form the haul dataframe (HH), the length data (HL) and the age data (CA) have to be called separately (TODO: Write a convenience function process this with only one function call?). The following code shows how one can access all the quarter 1 and 3 NS-IBTS data, here stored as a list in an R-binary file for later processing.

# not run
yrs <- 1965:2018
qts <- c(1, 3)
hh_raw <- 
  getDATRAS(record = "HH", survey = "NS-IBTS", years = yrs, quarters = qts)
hl_raw <- 
  getDATRAS(record = "HL", survey = "NS-IBTS", years = yrs, quarters = qts)
ca_raw <- 
  getDATRAS(record = "CA", survey = "NS-IBTS", years = yrs, quarters = qts)

raw <- list(hh = hh_raw, hl = hl_raw, ca = ca_raw)
write_rds(raw, file = "data-raw/datras/ns-ibts_raw.rds")

Data tidying

The "exchange" data do not strictly fall under the umbrella of tidy dataframes (see: Tidy Data) - @wickham2014tidy. To make all subsequent coding more streamlined an a priori re-coding of the exchange dataframes is hence warranted.

To load the data downloaded earlier one can read in the data by:

raw <- read_rds("data-raw/datras/ns-ibts_raw.rds")

As stated above, the data is stored as a list. Lets check the names of the object:

names(raw)

The names refer to the separate dataframes, the haul data (hh), the length data (hl) and the age data (ca). Each of the dataframes can be viewed by running the following code (not run):

raw$hh %>% glimpse()
raw$hl %>% glimpse()
raw$ca %>% glimpse()
cn <- c(colnames(raw$hh), colnames(raw$hl), colnames(raw$ca))
cn.unique <- cn %>% unique()

The three dataframes have in total r length(cn) columns but thankfully a lot of them are a repeat and hence redundant. Additional processing that is done here is explained in details below. At this moment it is not necessary for the novice user of DATRAS data to dig too much into the details of the tidying process, this could be revisited at a later stage. For those that have used the "exchange" data before, some readings into the details may be kosher.

The haul data

The haul dataframe contains one record per haul and is in a relatively tidy format. The total number of variables (61) may though be a bit overwhelming for routine abundance estimates. The function tidy_hh does the following:

hh <-
  raw$hh %>% 
  tidy_hh()

Auxillary variable

We are often interested in analyzing the data by ICES areas, hence one can add the a variable (here called faoarea) to the haul data, based on the shooting coordinates (link to some details of the geo_inside-function):

# Read in the FAO area from the web:
fao <- 
  read_sf_ftp("fao-areas_nocoastline") %>% 
  as("Spatial")
# Find the faoarea attribute of each shooting location
hh <-
  hh %>% 
  mutate(faoarea = geo_inside(shootlong, shootlat, fao, "name"))

TODO - MOVE THIS INTO A LATER SECTION: In this booklet the example code is largely based on the NS-IBTS data. In some of the steps the "[Roundfish area]{#nsrf}" is used. A new numerical variable is created that contains the numerical code of the "Roundfish area" that the haul belongs to by:

ns_area <- 
  read_sf_ftp("NS_IBTS_RF") %>% 
  as("Spatial")
hh <-
  hh %>% 
  mutate(nsarea = 
           geo_inside(shootlong, shootlat, ns_area, "AreaName") %>% 
           as.integer())

The length data

The exchange format of the length related measurements is a bit messy. E.g.:

A convenient functions, tidy_hl basically takes care of the above, it specifically doing:

In order for this to complete its job, we need in addition to the raw hl-dataframe to pass the tidy haul-dataframe (because that is where the variable DataType is stored). And to convert the coded species information to Latin name we need to supply the proper "lookup" table (TODO: provide link the the auxiliary chapter explaining how one can obtain this from scratch rather via the temporary csv file):

species <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")
hl <-
  raw$hl %>% 
  tidy_hl(hh, species)

So starting with the raw length dataframe containing 27 variables what is returned is a dataframe that contains only 5 variables, the haul id, the species name (latin), sex, length (in centimeters) and the number of fish measured (n).

hl <- read_rds("data/ns-ibts_hl.rds")
glimpse(hl)

NOTE: Sometimes only counts are in the hl-data (length NA), sometimes total weight etc. Need to cover that if possible and if not drop those records in the tidying.

The age data

.. draft to be written

ca <-
  nsibts_raw$ca %>% 
  tidyices::tidy_ca(species)

Save stuff for later use

hh %>% write_rds("data/ns-ibts_hh.rds")
hl %>% write_rds("data/ns-ibts_hl.rds")
ca %>% write_rds("data/ns-ibts_ca.rds")

Get the whole mess

The above code shows how to obtain the haul, length and age data for one survey. Since we may be interested in looking at more than one survey, the code below describes how to get all the survey data stored in the DATRAS database via a loop-script:

Downloading

# Get an overview of all the surveys
dtrs <- icesDatras::getDatrasDataOverview()

# Loop through each survey, download and save ----------------------------------
for(i in 1:length(dtrs)) { 

  sur <- names(dtrs[i])
  print(sur)
  yrs <- rownames(dtrs[[i]]) %>% as.integer()
  qts <- c(1:4)
  # A error occurs in the NS-IBTS if all quarters are requested
  if(sur == "NS-IBTS") qts <- c(1, 3)

  hh_raw <- 
    icesDatras::getDATRAS(record = "HH", survey = sur, years = yrs, quarters = qts)
  hl_raw <- 
    icesDatras::getDATRAS(record = "HL", survey = sur, years = yrs, quarters = qts)
  ca_raw <- 
    icesDatras::getDATRAS(record = "CA", survey = sur, years = yrs, quarters = qts)
  list(hh = hh_raw, hl = hl_raw, ca = ca_raw) %>% 
    write_rds(path = paste0("data-raw/datras/", tolower(sur), "_raw.rds"))
}

Tidying

# Make sure the needed objects are available
fao <- gisland::read_sf_ftp("fao-areas_nocoastline") %>% as("Spatial")
ns_area <- gisland::read_sf_ftp("NS_IBTS_RF") %>% as("Spatial")
species <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")

fil <- dir("data-raw/datras", full.names = TRUE)
# Setup list objects to temporarily store the results
res_hh <- res_hl <- res_ca <- list()

# Loop through each survey
for(i in 1:length(fil)) {

  raw <- read_rds(fil[i])

  sur <- raw$hh$Survey[1] %>% tolower()

  hh <-
    raw$hh %>% 
    tidy_hh(all_variables = TRUE) %>% 
    mutate(nsarea = gisland::geo_inside(shootlong, shootlat, ns_area, "AreaName") %>% as.integer(),
           faoarea = gisland::geo_inside(shootlong, shootlat, fao, "name"),
           # TODO: this should be part of the tidy-function
           rigging = as.character(rigging),
           stratum = as.character(stratum),
           stno = as.character(stno),
           hydrostno = as.character(hydrostno))

  if(!is.null(raw$hl)) {
    hl <-
      raw$hl %>%
      tidy_hl(hh, species)
  }
  if(!is.null(raw$ca)) {
    ca <-
      raw$ca %>%
      tidy_ca(species)
  }
  hh %>% write_rds(paste0("data/", sur, "_hh.rds"))
  if(!is.null(raw$hl)) hl %>% write_rds(paste0("data/", sur, "_hl.rds"))
  if(!is.null(raw$ca)) ca %>% write_rds(paste0("data/", sur, "_ca.rds"))

  # temporary storage
  res_hh[[i]] <- hh
  res_hl[[i]] <- hl
  res_ca[[i]] <- ca
}


# Bind all the DATRAS data and save for later retrieval
res_hh %>% bind_rows() %>% write_rds("data/hh_datras.rds")
res_hl %>% bind_rows() %>% write_rds("data/hl_datras.rds")
res_ca %>% bind_rows() %>% write_rds("data/ca_datras.rds")

Recap

In all its simplicity and with no extra frills the code to download, tidy and save one survey is as follows:

spe <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")
sur <- "NS-IBTS"
yrs <- 1965:2018
qts <- c(1, 3)

hh <- 
  getDATRAS(record = "HH", survey = sur, years = yrs, quarters = qts) %>% 
  tidy_hh()
getDATRAS(record = "HL", survey = sur, years = yrs, quarters = qts) %>% 
  tidy_hl(hh, spe) %>% 
  write_rds("data/ns-ibts_hl.rds")
getDATRAS(record = "CA", survey = sur, years = yrs, quarters = qts) %>% 
  tidy_ca(spe) %>% 
  write_rds("data/ns-ibts_ca.rds")
hh %>% write_rds("data/ns-ibts_hh.rds")


einarhjorleifsson/datrasdoodle documentation built on May 27, 2019, 2:09 p.m.