In einarhjorleifsson/datrasdoodle: Does not matter.

Importing DATRAS {#import}

see (see Figure \@ref(fig:allhauls))

Preamble

library(tidyverse)
library(lubridate)
library(sf)
library(icesDatras)
library(tidyices)
# devtools::install_github("einarhjorleifsson/gisland", dependencies = FALSE)
library(gisland)
# devtools::install_github("ropensci/worrms")
library(worrms)

Data download

With the advent of the DATRAS webserver one has for some time been able to access the the DATRAS data programmatically rather than through the point-and-click DATRAS download facilities. The icesDatras-package allows one to download and access the data directly in R via the getDATRAS function. The function is (currently) limited to only accessing one survey at the time, but multiple years and quarters can however be specified. In addition in its current form the haul dataframe (HH), the length data (HL) and the age data (CA) have to be called separately (TODO: Write a convenience function process this with only one function call?). The following code shows how one can access all the quarter 1 and 3 NS-IBTS data, here stored as a list in an R-binary file for later processing.

# not run
yrs <- 1965:2018
qts <- c(1, 3)
hh_raw <- 
  getDATRAS(record = "HH", survey = "NS-IBTS", years = yrs, quarters = qts)
hl_raw <- 
  getDATRAS(record = "HL", survey = "NS-IBTS", years = yrs, quarters = qts)
ca_raw <- 
  getDATRAS(record = "CA", survey = "NS-IBTS", years = yrs, quarters = qts)

raw <- list(hh = hh_raw, hl = hl_raw, ca = ca_raw)
write_rds(raw, file = "data-raw/datras/ns-ibts_raw.rds")

Data tidying

The "exchange" data do not strictly fall under the umbrella of tidy dataframes (see: Tidy Data) - @wickham2014tidy. To make all subsequent coding more streamlined an a priori re-coding of the exchange dataframes is hence warranted.

To load the data downloaded earlier one can read in the data by:

raw <- read_rds("data-raw/datras/ns-ibts_raw.rds")

As stated above, the data is stored as a list. Lets check the names of the object:

names(raw)

The names refer to the separate dataframes, the haul data (hh), the length data (hl) and the age data (ca). Each of the dataframes can be viewed by running the following code (not run):

raw$hh %>% glimpse()
raw$hl %>% glimpse()
raw$ca %>% glimpse()

cn <- c(colnames(raw$hh), colnames(raw$hl), colnames(raw$ca))
cn.unique <- cn %>% unique()

The three dataframes have in total r length(cn) columns but thankfully a lot of them are a repeat and hence redundant. Additional processing that is done here is explained in details below. At this moment it is not necessary for the novice user of DATRAS data to dig too much into the details of the tidying process, this could be revisited at a later stage. For those that have used the "exchange" data before, some readings into the details may be kosher.

The haul data

The haul dataframe contains one record per haul and is in a relatively tidy format. The total number of variables (61) may though be a bit overwhelming for routine abundance estimates. The function tidy_hh does the following:

Creates a unique key: The linkage of the haul data to other tidied dataframes (length and age dataframes) will be through the combined key of year, quarter, ship, gear and haulno. This combination of variable is unique within the haul-dataframe. In order to facilitate linkage the key variables are combined into a single variable named id. In the haul table the original variables are retained.
A datetime variable: Information of the date and time of haul shot in the raw exchange data are stored in four variables: year, month, day and timeshot. While this may be understandable if one is concerned with cross-platform deliveries, within R these variables can be combined into a single variable, here named datetime.
By default, only selected variables are returned, these being: id, year, quarter, survey, ship, gear, haulno, date, country, depth, haulval, hauldur, shootlat, shootlong, haullat, haullong, statrec, daynight, datatype, stdspecreccode, bycspecreccode). If all variable are wished for in further processing, the argument all_variable in the tidy_hh-function can be set to TRUE (the default setting is FALSE).

hh <-
  raw$hh %>% 
  tidy_hh()

Auxillary variable

We are often interested in analyzing the data by ICES areas, hence one can add the a variable (here called faoarea) to the haul data, based on the shooting coordinates (link to some details of the geo_inside-function):

# Read in the FAO area from the web:
fao <- 
  read_sf_ftp("fao-areas_nocoastline") %>% 
  as("Spatial")
# Find the faoarea attribute of each shooting location
hh <-
  hh %>% 
  mutate(faoarea = geo_inside(shootlong, shootlat, fao, "name"))

TODO - MOVE THIS INTO A LATER SECTION: In this booklet the example code is largely based on the NS-IBTS data. In some of the steps the "[Roundfish area]{#nsrf}" is used. A new numerical variable is created that contains the numerical code of the "Roundfish area" that the haul belongs to by:

ns_area <- 
  read_sf_ftp("NS_IBTS_RF") %>% 
  as("Spatial")
hh <-
  hh %>% 
  mutate(nsarea = 
           geo_inside(shootlong, shootlat, ns_area, "AreaName") %>% 
           as.integer())

The length data

The exchange format of the length related measurements is a bit messy. E.g.:

The species depends on a numerical code (SpecCode) whose meaning then depends on another variable (SpecCodeType). Then there is a third reference to species code (Valid_Aphia) which linkage to the other two variables is as of this writing unclear. The Valid_Aphia variable or if missing (NA) the SpecCode if SpecCodeType is "W" is used to derive the Latin name from WoRMS.
The unit of the variable length class (LngtClass) is different it being dependent on another variable (lngtcode).
The most confusing part is that the number of fish "measured" (HLNoAtLngt) is different, its meaning dependent on a variable (DataType) in the haul table.
Lastly, there are a lot of variables (27) in the raw length table, many of them really associated with the haul table and hence redundant.

A convenient functions, tidy_hl basically takes care of the above, it specifically doing:

Only distinct records (they are all distinct)
Filter out data were species code, length-code and length-class are undefined (TODO: Check code)
Set lengths to millimeter. In the raw records:
If lngtcode is "1" then the lngtclass is in centimeters
If lngtcode is "." or "0" then the lngclass is in millimeters
Standardize the haul numbers at length. In the raw data:
If datatype in the station table is "C" then hlnoatlngt has been standardized to 60 minutes haul
If datatype in the station table is "R" then hlnoatlngt has not been standardized to 60 minutes haul
Get the Latin species name from a species-table
Return only variable that are needed in further processing

In order for this to complete its job, we need in addition to the raw hl-dataframe to pass the tidy haul-dataframe (because that is where the variable DataType is stored). And to convert the coded species information to Latin name we need to supply the proper "lookup" table (TODO: provide link the the auxiliary chapter explaining how one can obtain this from scratch rather via the temporary csv file):

species <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")
hl <-
  raw$hl %>% 
  tidy_hl(hh, species)

So starting with the raw length dataframe containing 27 variables what is returned is a dataframe that contains only 5 variables, the haul id, the species name (latin), sex, length (in centimeters) and the number of fish measured (n).

hl <- read_rds("data/ns-ibts_hl.rds")

glimpse(hl)

NOTE: Sometimes only counts are in the hl-data (length NA), sometimes total weight etc. Need to cover that if possible and if not drop those records in the tidying.

The age data

.. draft to be written

ca <-
  nsibts_raw$ca %>% 
  tidyices::tidy_ca(species)

Save stuff for later use

hh %>% write_rds("data/ns-ibts_hh.rds")
hl %>% write_rds("data/ns-ibts_hl.rds")
ca %>% write_rds("data/ns-ibts_ca.rds")

Get the whole mess

The above code shows how to obtain the haul, length and age data for one survey. Since we may be interested in looking at more than one survey, the code below describes how to get all the survey data stored in the DATRAS database via a loop-script:

Downloading

# Get an overview of all the surveys
dtrs <- icesDatras::getDatrasDataOverview()

# Loop through each survey, download and save ----------------------------------
for(i in 1:length(dtrs)) { 

  sur <- names(dtrs[i])
  print(sur)
  yrs <- rownames(dtrs[[i]]) %>% as.integer()
  qts <- c(1:4)
  # A error occurs in the NS-IBTS if all quarters are requested
  if(sur == "NS-IBTS") qts <- c(1, 3)

  hh_raw <- 
    icesDatras::getDATRAS(record = "HH", survey = sur, years = yrs, quarters = qts)
  hl_raw <- 
    icesDatras::getDATRAS(record = "HL", survey = sur, years = yrs, quarters = qts)
  ca_raw <- 
    icesDatras::getDATRAS(record = "CA", survey = sur, years = yrs, quarters = qts)
  list(hh = hh_raw, hl = hl_raw, ca = ca_raw) %>% 
    write_rds(path = paste0("data-raw/datras/", tolower(sur), "_raw.rds"))
}

Tidying

# Make sure the needed objects are available
fao <- gisland::read_sf_ftp("fao-areas_nocoastline") %>% as("Spatial")
ns_area <- gisland::read_sf_ftp("NS_IBTS_RF") %>% as("Spatial")
species <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")

fil <- dir("data-raw/datras", full.names = TRUE)
# Setup list objects to temporarily store the results
res_hh <- res_hl <- res_ca <- list()

# Loop through each survey
for(i in 1:length(fil)) {

  raw <- read_rds(fil[i])

  sur <- raw$hh$Survey[1] %>% tolower()

  hh <-
    raw$hh %>% 
    tidy_hh(all_variables = TRUE) %>% 
    mutate(nsarea = gisland::geo_inside(shootlong, shootlat, ns_area, "AreaName") %>% as.integer(),
           faoarea = gisland::geo_inside(shootlong, shootlat, fao, "name"),
           # TODO: this should be part of the tidy-function
           rigging = as.character(rigging),
           stratum = as.character(stratum),
           stno = as.character(stno),
           hydrostno = as.character(hydrostno))

  if(!is.null(raw$hl)) {
    hl <-
      raw$hl %>%
      tidy_hl(hh, species)
  }
  if(!is.null(raw$ca)) {
    ca <-
      raw$ca %>%
      tidy_ca(species)
  }
  hh %>% write_rds(paste0("data/", sur, "_hh.rds"))
  if(!is.null(raw$hl)) hl %>% write_rds(paste0("data/", sur, "_hl.rds"))
  if(!is.null(raw$ca)) ca %>% write_rds(paste0("data/", sur, "_ca.rds"))

  # temporary storage
  res_hh[[i]] <- hh
  res_hl[[i]] <- hl
  res_ca[[i]] <- ca
}


# Bind all the DATRAS data and save for later retrieval
res_hh %>% bind_rows() %>% write_rds("data/hh_datras.rds")
res_hl %>% bind_rows() %>% write_rds("data/hl_datras.rds")
res_ca %>% bind_rows() %>% write_rds("data/ca_datras.rds")

Recap

In all its simplicity and with no extra frills the code to download, tidy and save one survey is as follows:

spe <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")
sur <- "NS-IBTS"
yrs <- 1965:2018
qts <- c(1, 3)

hh <- 
  getDATRAS(record = "HH", survey = sur, years = yrs, quarters = qts) %>% 
  tidy_hh()
getDATRAS(record = "HL", survey = sur, years = yrs, quarters = qts) %>% 
  tidy_hl(hh, spe) %>% 
  write_rds("data/ns-ibts_hl.rds")
getDATRAS(record = "CA", survey = sur, years = yrs, quarters = qts) %>% 
  tidy_ca(spe) %>% 
  write_rds("data/ns-ibts_ca.rds")
hh %>% write_rds("data/ns-ibts_hh.rds")

einarhjorleifsson/datrasdoodle documentation built on May 27, 2019, 2:09 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

einarhjorleifsson/datrasdoodle
Does not matter.

In einarhjorleifsson/datrasdoodle: Does not matter.

Importing DATRAS {#import}

Preamble

Data download

Data tidying

The haul data

Auxillary variable

The length data

The age data

Save stuff for later use

Get the whole mess

Downloading

Tidying

Recap

R Package Documentation

Browse R Packages

We want your feedback!

einarhjorleifsson/datrasdoodle Does not matter.

In einarhjorleifsson/datrasdoodle: Does not matter.

Importing DATRAS {#import}

Preamble

Data download

Data tidying

The haul data

Auxillary variable

The length data

The age data

Save stuff for later use

Get the whole mess

Downloading

Tidying

Recap

R Package Documentation

Browse R Packages

We want your feedback!

einarhjorleifsson/datrasdoodle
Does not matter.