library(tidyverse)
library(gisland)
One is often interested in establishing some georeference for coordinate locations, such as whether a tow falls inside or outside a particular EEZ, or which ICES area a tow "belongs" to. Geospatial data such as EEZ or FAO areas are normally stored in a specific spatial format and hence ordinary dplyr functions are not directly applicable.
Let's first import the FAO areas (conveniently located at ftp://ftp.hafro.is/pub/reiknid/einar/shapes):
fao <- read_sf_ftp("fao-areas_nocoastline")
glimpse(fao)
The object here is of a special format (an sf-dataframe) with two variables, where name is the name of the FAO area and geometry is a special spatial format that contains the geometric polygon of each area.
The object contains the complete global FAO areas, as can be seen when we plot it:
plot(fao)
Let's filter the object such that it only contains the ICES areas (FAO area 27):
fao <- fao %>% filter(str_sub(name, 1, 2) == "27")
Let's generate 500 random point coordinates:
d <- tibble(lon = runif(500, -44, 68.5),
            lat = runif(500, 36, 89.9))
In order to establish the ICES area each of the 500 observations "belongs to" we use the convenient function geo_inside that resides in the gisland package:
d <- d %>% mutate(area = geo_inside(lon, lat, fao %>% as("Spatial"), "name"))
d %>% glimpse()
Here we have added a variable (area) that specifies the name of the FAO area each coordinate "belongs to". Let's plot it to get a better picture of what we have done:
d %>%
  ggplot() +
  geom_sf(data = fao) +
  geom_point(aes(lon, lat, colour = area))
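As a quick sanity check one can tally how many points fall within each area. Points landing outside every ICES polygon presumably come back as NA; the exact counts of course depend on the random draw above:

# Number of random points per ICES area; NA means outside all areas
d %>% count(area, sort = TRUE)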
Let's take a simple case of haul data and length-frequency measurements in different hauls. We limit ourselves to three hauls, identified only by an id, and the length measurements taken in them:

hh <- tibble(id = 1:3)
hl <- tibble(id = c(1, 1, 1, 3, 3, 3, 3),
             length = runif(7, 20, 50) %>% round(),
             n = runif(7, 1, 4) %>% round())
hl

So effectively we have one haul (the haul with id = 2) where no fish was recorded.
Now let's say we are only interested in calculating the abundance at each station. We would start with some code like the following:
d <- hl %>% group_by(id) %>% summarise(n = sum(n))
d
Now if we wanted to obtain the mean abundance per haul we can not simply do:
d %>% summarise(m = mean(n))
because we would not be taking into account the zero fish caught in the haul with id = 2.
Hence we have to merge the counts from the length dataframe with the station dataframe. Something like this:
hh %>% left_join(d, by = "id")
Here haul 2 has the abundance variable assigned as NA. To get the mean abundance one could try:
hh %>% left_join(d, by = "id") %>% summarise(m = mean(n))
This however generates an NA. One could try:
hh %>% left_join(d, by = "id") %>% summarise(m = mean(n, na.rm = TRUE))
But this would give us the mean of the positive stations (as we already obtained above). So effectively we have to assign a value of zero to the abundance when it is NA. The full pipeline would hence be:
hh %>%
  left_join(hl, by = "id") %>%
  mutate(n = replace_na(n, 0)) %>%
  group_by(id) %>%
  summarise(n = sum(n)) %>%
  summarise(m = mean(n))
Another approach to obtain the mean could be to sum the counts in the length data and divide it by the number of hauls obtained from the haul data:
hl %>% summarise(n = sum(n)) / nrow(hh)
This would work for the mean. However, if one were interested in other types of statistics, e.g. the variance, this approach would create some coding headaches.
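With the zero-filled pipeline, by contrast, any additional per-haul statistic is just another expression inside summarise. A minimal sketch, reusing the objects defined above:

# Mean and variance of abundance per haul, zeros included
hh %>%
  left_join(hl, by = "id") %>%
  mutate(n = replace_na(n, 0)) %>%
  group_by(id) %>%
  summarise(n = sum(n)) %>%
  summarise(m = mean(n), v = var(n))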
Let's take a little detour and look at a simplified example of why the approach above will not work once more than one species is involved. Say we have the following:
hh2 <- tibble(id = c(1, 2, 3))
hl2 <- tibble(
  id = c(1:2, 1),
  species = c("cod", "plaice", "plaice"),
  length = c(20, 25, 30),
  n = c(3, 4, 5)
)
hh2
hl2
So we have 3 hauls: one haul where both species are recorded, one haul where only one of the species is recorded, and one haul with no catch of either species.
by.haul.positive2 <- hl2 %>% group_by(id, species) %>% summarise(n = sum(n))
by.haul.positive2
The simplest way to include implicitly missing species is to first generate a dataframe that contains all haul and species combinations before doing the left join with the count data. Here we use the function crossing (similar to the base function expand.grid):
all <- hh2 %>% crossing(species = by.haul.positive2$species)
all
Here we have a combination of all hauls and all species (3 hauls x 2 species, since crossing de-duplicates its inputs). So when we do a left join with the count data we join the tables by both the id and the species:
all %>%
  left_join(by.haul.positive2, by = c("id", "species")) %>%
  mutate(n = replace_na(n, 0))
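With the zeros in place, species-level summaries now account for every haul. A small sketch, continuing with the objects above:

# Mean abundance per species across all three hauls (zeros included)
all %>%
  left_join(by.haul.positive2, by = c("id", "species")) %>%
  mutate(n = replace_na(n, 0)) %>%
  group_by(species) %>%
  summarise(m = mean(n))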
read_sf_ftp("NS_IBTS_RF") %>%
  mutate(AreaName = as.integer(AreaName)) %>%
  ggplot() +
  theme_bw() +
  geom_sf(aes(fill = factor(AreaName))) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "North Sea roundfish area",
       subtitle = "Used for generation of age-length-keys",
       fill = "ALK areas")
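To assign a roundfish area to haul positions one can reuse the geo_inside pattern from above. A sketch, assuming a hypothetical tibble st holding haul coordinates in columns lon and lat:

# Hypothetical: attach the roundfish area name to each haul position
rf <- read_sf_ftp("NS_IBTS_RF")
st <- st %>%
  mutate(rf_area = geo_inside(lon, lat, rf %>% as("Spatial"), "AreaName"))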
In the DATRAS data the species are encoded. There are three columns associated with this identity: SpecCodeType, SpecCode and Valid_Aphia.
It is not clear at the time of this writing what process is followed when Latin names are assigned to the above species-code variables. The approach taken here is to use the World Register of Marine Species, usually known by its acronym WoRMS. Fortunately an R library, worrms, allows us to obtain the Latin names programmatically.
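For a single code the lookup is one call. A sketch (an internet connection is required; 126436 is, to the best of our knowledge, the AphiaID for Atlantic cod):

# Look up the Latin name for a single AphiaID
worrms::wm_id2name(id = 126436)
# Should return "Gadus morhua"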
Here the variable valid_aphia is the primary source, but in cases where it is not specified (is NA) the speccode is used if the speccodetype is "W". We refer to this variable as aphia. Each numerical aphia value should have a unique Latin name. For convenience a complete lookup table is generated based on all the DATRAS data (as available in July 2018). The following code was used:
# List the available raw survey data files
fil <- dir("data-raw/datras", full.names = TRUE)

# Collate the aphia species code list, compiled from all available surveys
SP <- list()
for (i in 1:length(fil)) {
  # loop through surveys
  x <- read_rds(fil[i])$hl
  if (!is.null(x)) {
    SP[[i]] <-
      x %>%
      select(speccodetype = SpecCodeType,
             speccode = SpecCode,
             valid_aphia = Valid_Aphia) %>%
      distinct() %>%
      as_tibble()
  }
}

# Generate a vector of aphia codes
APHIA <-
  bind_rows(SP) %>%
  distinct() %>%
  mutate(id = 1:n()) %>%
  mutate(aphia = valid_aphia) %>%
  # if valid_aphia is NA then use speccode if speccodetype is W
  mutate(aphia = ifelse(is.na(aphia) & speccodetype == "W",
                        speccode,
                        aphia)) %>%
  select(aphia) %>%
  drop_na() %>%
  # return a vector
  pull(aphia) %>%
  unique() %>%
  sort()

# Obtain a dataframe matching aphia with Latin name
from_worms <-
  # You need an internet connection for this to work
  worrms::wm_id2name_(id = APHIA) %>%
  bind_rows() %>%
  gather(aphia, latin, convert = TRUE)

from_worms %>% write_rds("data/datras_worms.rds")
from_worms %>% write_csv("/net/ftp/export/home/ftp/pub/reiknid/einar/datras_worms.csv")
system("chmod a+rx /net/ftp/export/home/ftp/pub/reiknid/einar/datras_worms.csv")
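Once generated, the lookup table can be joined onto any DATRAS length table after deriving the same aphia variable. A minimal sketch, assuming an hl table whose columns have already been renamed as in the loop above:

# Hypothetical usage: attach Latin names to a length dataframe
worms <- read_rds("data/datras_worms.rds")
hl %>%
  mutate(aphia = ifelse(is.na(valid_aphia) & speccodetype == "W",
                        speccode, valid_aphia)) %>%
  left_join(worms, by = "aphia")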
library(mapdata)
library(sf)
library(lwgeom)

mapBase <- map("worldHires", fill = TRUE, plot = FALSE)
# now we need to coerce it to an "sf" object, and fix any invalid geometries
mapBase <- st_as_sf(mapBase)
# now let's try cropping it to a region of Europe
cropMap <- st_crop(mapBase, xmin = -15, xmax = 30, ymin = 30, ymax = 60)
# note we get an error message about Self-intersection...
# we can fix this using the lwgeom library...
mapBase <- st_make_valid(mapBase)
cropMap <- st_crop(mapBase, xmin = -15, xmax = 30, ymin = 30, ymax = 60)
ggplot(cropMap) + geom_sf()