library(tidyverse)
library(gisland)
One is often interested in establishing some georeference for coordinate locations, such as whether a tow falls inside or outside a particular EEZ, or which ICES area a tow "belongs" to. Geospatial data such as EEZ or FAO areas are normally stored in a specific spatial format and hence ordinary dplyr functions are not directly applicable.
Let's first import the FAO areas (conveniently located at ftp://ftp.hafro.is/pub/reiknid/einar/shapes):
fao <- read_sf_ftp("fao-areas_nocoastline")
glimpse(fao)
The object here is of a special format (an sf-dataframe) with two variables, where name is the name of the FAO area and geometry is a special spatial format that contains the geometric polygon of each area.
The object contains the complete global FAO areas, as can be seen when we plot it:
plot(fao)
Let's filter the object such that it only contains the ICES areas (FAO area 27):
fao <- fao %>% filter(str_sub(name, 1, 2) == "27")
Let's generate 500 random point coordinates:
d <- tibble(lon = runif(500, -44, 68.5),
            lat = runif(500, 36, 89.9))
In order to establish the ICES area each of the 500 observations "belongs to" we use the convenient function geo_inside that resides in the gisland package:
d <- d %>% mutate(area = geo_inside(lon, lat, fao %>% as("Spatial"), "name"))
d %>% glimpse()
Here we have added a variable (area) that specifies the name of the FAO area each coordinate "belongs to". Let's plot it to get a better picture of what we have done:
d %>%
  ggplot() +
  geom_sf(data = fao) +
  geom_point(aes(lon, lat, colour = area))
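As a quick sanity check one can tally how many points fall within each area. Points landing outside every ICES polygon presumably come back as NA; the exact counts of course depend on the random draw above:

# Number of random points per ICES area; NA means outside all areas
d %>% count(area, sort = TRUE)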
Let's take a simple case of haul data and length-frequency measurements in different hauls. We limit ourselves to three hauls, identified only by an id, and the length measurements taken in them:

hh <- tibble(id = 1:3)
hl <- tibble(id = c(1, 1, 1, 3, 3, 3, 3),
             length = runif(7, 20, 50) %>% round(),
             n = runif(7, 1, 4) %>% round())
hl

So effectively we have one haul (the haul with id = 2) where no fish was recorded.
Now let's say we are only interested in calculating the abundance at each station. We would start with some code like the following:
d <- hl %>% group_by(id) %>% summarise(n = sum(n))
d
Now if we wanted to obtain the mean abundance per haul we can not simply do:
d %>% summarise(m = mean(n))
because we would not be taking into account the zero fish caught in the haul with id = 2.
Hence we have to merge the counts from the length dataframe with the station dataframe. Something like this:
hh %>% left_join(d, by = "id")
Here haul 2 has the abundance variable assigned as NA. To get the mean abundance one could try:
hh %>% left_join(d, by = "id") %>% summarise(m = mean(n))
This however generates an NA. One could try:
hh %>% left_join(d, by = "id") %>% summarise(m = mean(n, na.rm = TRUE))
But this would give us the mean of the positive stations (as we already obtained above). So effectively we have to assign a value of zero to the abundance when it is NA. The full pipeline would hence be:
hh %>%
  left_join(hl, by = "id") %>%
  mutate(n = replace_na(n, 0)) %>%
  group_by(id) %>%
  summarise(n = sum(n)) %>%
  summarise(m = mean(n))
Another approach to obtain the mean could be to sum the counts in the length data and divide it by the number of hauls obtained from the haul data:
hl %>% summarise(n = sum(n)) / nrow(hh)
This would work for the mean. However, if one were interested in other types of statistics, e.g. the variance, this approach would create some coding headaches.
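With the zero-filled pipeline, by contrast, any additional per-haul statistic is just another expression inside summarise. A minimal sketch, reusing the objects defined above:

# Mean and variance of abundance per haul, zeros included
hh %>%
  left_join(hl, by = "id") %>%
  mutate(n = replace_na(n, 0)) %>%
  group_by(id) %>%
  summarise(n = sum(n)) %>%
  summarise(m = mean(n), v = var(n))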
Let's take a little detour and look at a simplified example of why the approach above will not work once more than one species is involved. Say we have the following:
hh2 <- tibble(id = c(1, 2, 3))
hl2 <- tibble(
  id = c(1:2, 1),
  species = c("cod", "plaice", "plaice"),
  length = c(20, 25, 30),
  n = c(3, 4, 5)
)
hh2
hl2
So we have 3 hauls: one haul where both species are recorded, one haul where only one of the species is recorded, and one haul with no catch of either species.
by.haul.positive2 <- hl2 %>% group_by(id, species) %>% summarise(n = sum(n))
by.haul.positive2
The simplest way to include implicitly missing species is to first generate a dataframe that contains all haul and species combinations before doing the left join with the count data. Here we use the function crossing (similar to the base function expand.grid):
all <- hh2 %>% crossing(species = by.haul.positive2$species)
all
Here we have a combination of all hauls and all species (3 hauls x 2 species, since crossing de-duplicates its inputs). So when we do a left join with the count data we join the tables by both the id and the species:
all %>%
  left_join(by.haul.positive2, by = c("id", "species")) %>%
  mutate(n = replace_na(n, 0))
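With the zeros in place, species-level summaries now account for every haul. A small sketch, continuing with the objects above:

# Mean abundance per species across all three hauls (zeros included)
all %>%
  left_join(by.haul.positive2, by = c("id", "species")) %>%
  mutate(n = replace_na(n, 0)) %>%
  group_by(species) %>%
  summarise(m = mean(n))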
read_sf_ftp("NS_IBTS_RF") %>%
  mutate(AreaName = as.integer(AreaName)) %>%
  ggplot() +
  theme_bw() +
  geom_sf(aes(fill = factor(AreaName))) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "North Sea roundfish area",
       subtitle = "Used for generation of age-length-keys",
       fill = "ALK areas")
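To assign a roundfish area to haul positions one can reuse the geo_inside pattern from above. A sketch, assuming a hypothetical tibble st holding haul coordinates in columns lon and lat:

# Hypothetical: attach the roundfish area name to each haul position
rf <- read_sf_ftp("NS_IBTS_RF")
st <- st %>%
  mutate(rf_area = geo_inside(lon, lat, rf %>% as("Spatial"), "AreaName"))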
In the DATRAS data the species are encoded. There are three columns associated with this identity: SpecCodeType, SpecCode and Valid_Aphia.
It is not clear at the time of this writing what process is followed when Latin names are assigned to the above species-code variables. The approach taken here is to use the World Register of Marine Species, usually known by its acronym WoRMS. Fortunately an R library, worrms, allows us to obtain the Latin names programmatically.
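For a single code the lookup is one call. A sketch (an internet connection is required; 126436 is, to the best of our knowledge, the AphiaID for Atlantic cod):

# Look up the Latin name for a single AphiaID
worrms::wm_id2name(id = 126436)
# Should return "Gadus morhua"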
Here the variable valid_aphia is the primary source, but in cases where it is not specified (is NA) the speccode is used if the speccodetype is "W". We refer to this variable as aphia. Each numerical aphia value should have a unique Latin name. For convenience a complete lookup table is generated based on all the DATRAS data (as available in July 2018). The following code was used:
# List the available raw survey data files
fil <- dir("data-raw/datras", full.names = TRUE)

# Collate the aphia species code list, compiled from all available surveys
SP <- list()
for (i in 1:length(fil)) {
  # loop through surveys
  x <- read_rds(fil[i])$hl
  if (!is.null(x)) {
    SP[[i]] <-
      x %>%
      select(speccodetype = SpecCodeType,
             speccode = SpecCode,
             valid_aphia = Valid_Aphia) %>%
      distinct() %>%
      as_tibble()
  }
}

# Generate a vector of aphia codes
APHIA <-
  bind_rows(SP) %>%
  distinct() %>%
  mutate(id = 1:n()) %>%
  mutate(aphia = valid_aphia) %>%
  # if valid_aphia is NA then use speccode if speccodetype is W
  mutate(aphia = ifelse(is.na(aphia) & speccodetype == "W",
                        speccode,
                        aphia)) %>%
  select(aphia) %>%
  drop_na() %>%
  # return a vector
  pull(aphia) %>%
  unique() %>%
  sort()

# Obtain a dataframe matching aphia with Latin name
from_worms <-
  # You need an internet connection for this to work
  worrms::wm_id2name_(id = APHIA) %>%
  bind_rows() %>%
  gather(aphia, latin, convert = TRUE)

from_worms %>% write_rds("data/datras_worms.rds")
from_worms %>% write_csv("/net/ftp/export/home/ftp/pub/reiknid/einar/datras_worms.csv")
system("chmod a+rx /net/ftp/export/home/ftp/pub/reiknid/einar/datras_worms.csv")
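Once generated, the lookup table can be joined onto any DATRAS length table after deriving the same aphia variable. A minimal sketch, assuming an hl table whose columns have already been renamed as in the loop above:

# Hypothetical usage: attach Latin names to a length dataframe
worms <- read_rds("data/datras_worms.rds")
hl %>%
  mutate(aphia = ifelse(is.na(valid_aphia) & speccodetype == "W",
                        speccode, valid_aphia)) %>%
  left_join(worms, by = "aphia")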
library(mapdata)
library(sf)
library(lwgeom)

mapBase <- map("worldHires", fill = TRUE, plot = FALSE)
# now we need to coerce it to an "sf" object, and fix any invalid geometries
mapBase <- st_as_sf(mapBase)
# now let's try cropping it to a region of Europe
cropMap <- st_crop(mapBase, xmin = -15, xmax = 30, ymin = 30, ymax = 60)
# note we get an error message about Self-intersection...
# we can fix this using the lwgeom library...
mapBase <- st_make_valid(mapBase)
cropMap <- st_crop(mapBase, xmin = -15, xmax = 30, ymin = 30, ymax = 60)
ggplot(cropMap) + geom_sf()