Let's first look at the structure of the age dataframe:
glimpse(ca)
Lets generate a code for an age-length-key for a particular species and survey from the age data, ignoring sex and maturity:
Year <- 2014 Quarter <- 3 Latin <- "Gadus morhua" hh2 <- hh %>% filter(year == Year, quarter == Quarter) ca2 <- hh2 %>% select(id) %>% left_join(ca, by = "id") %>% filter(latin == Latin) %>% mutate(length = floor(length), age = ifelse(age > 9, 9, age)) %>% group_by(age, length) %>% summarise(n = sum(n)) %>% ungroup() %>% drop_na() alk.empirical <- ca2 %>% group_by(length) %>% mutate(p = n / sum(n, na.rm = TRUE)) %>% select(-n) %>% drop_na() alk.empirical %>% ggplot(aes(length, p, colour = factor(age))) + geom_point() + geom_line(lwd = 0.3, linetype = 5) + scale_color_brewer(palette = "Set3") + labs(x = "Length", y = "Probability", colour = "Age", title = "Probability of age at length")
So for each length class we have observations (points) we have calculated the probability that a fish belongs to a particular age class. The sum of the probabilities within a length group is by definition always one. Lets take the 70 cm length class as an example:
alk.empirical %>% filter(length == 70) %>% knitr::kable()
Here we have 5 age-classes within the length class, the most numerous ones being age 3 (50%), then 4 (22%) and 5 (17%), while around 6% of the fish are of age 2 and 6.
In the above empirical approach there is a bit of a problem - missingness :-) The reader may have noticed the warning message given when plotting the "Length frequency by age":
## Warning: Removed 1 rows containing missing values (position_stack).
This message was given because varible n.age has one record with NA. It so happens that it is the first record so we can see it by:
glimpse(d)
This means that length class 5 cm was recorded in the length measurements (hl), but no fish of that length class was aged (ca) in the year (2014) and quarter (3) for the species (Gadus morhua) in question. Since in this particular case it was the smallest length class one could obviously assign it "manually" to the youngest age group (age 0) in the dataset. But "missingness" of age samples in other length classes are bound to occur, particularly if we start to use some kind of spatial stratification of the age-length-key. So we need some generic approach. In this section we will take take some statistical approach.
....
Reference to @henningsen2011maxlik
Here, at least for the time being, we hide things in a little function:
alk.mlogit <- ca2 %>% fit_alk(lengths = c(min(lfs$length):max(lfs$length)), model = "mlogit")
In the above step we start with the same dataframe (ca2) as used when generating the alk.empirical above and we end with an alk-dataframe that has the same structure as the alk.empirical generated above. The thing in the middle is a bit of a black box. Lets take a peek:
alk.mlogit %>% filter(length == min(length)) %>% glimpse()
So even in the smallest length class (5 cm) we have some small probabilities that the fish will belong to some older age classes. Although biologically not really possible, that is statistics :-) The beauty is though that know we have solved the problem of "missingness" in our data - all length classes will be assigned to having some probabilities of belonging to all of the age classes in the data. Before we proceed, lets take a visual peek of what we have just done:
ggplot() + geom_point(data = alk.empirical, aes(length, p, colour = factor(age))) + geom_line(data = alk.empirical, aes(length, p, colour = factor(age)), lwd = 0.3, linetype = 5) + geom_line(data = alk.mlogit, aes(length, p, colour = factor(age))) + scale_colour_brewer(palette = "Set3") + labs(x = "Length [cm]", y = "Probability", colour = "Age", title = "Probability of age by length", subtitle = "Empirical (points and dashed line) and mlogit fit (line)")
We can now use the modeled alk's to calculate the abundance in each haul by age:
by.haul.age <- by.haul %>% left_join(alk.mlogit, by = "length") %>% # NOTE: check why this is "needed" drop_na() %>% group_by(id, age) %>% summarise(n = sum(p * n)) %>% ungroup() %>% left_join(hh2 %>% select(id, shootlong, shootlat), by = "id")
But what if we wanted to generate an age-length-key for multiple years?
hh3 <- hh %>% filter(year %in% 2014:2017, quarter == 3) ca3 <- hh3 %>% select(id, year) %>% left_join(ca) %>% filter(latin == Latin) %>% mutate(length = floor(length), age = ifelse(age > 9, 9, age)) %>% group_by(year, age, length) %>% summarise(n = sum(n)) %>% ungroup() %>% drop_na() %>% group_by(year) %>% nest() %>% # Have no idea why is should use the ".$data" rather than just "data" mutate(alk = purrr::map(.$data, fit_alk)) ca3 # testing identical(ca3$alk[[1]], ca2 %>% fit_alk())
Or for that matter multiple years and multiple strata within a year??
hh4 <- hh %>% filter(year %in% 2014:2017, quarter == 3, nsarea %in% 1:10) %>% mutate(nsarea = ifelse(nsarea <= 5, 1, 2)) ca4 <- hh4 %>% select(id, year, nsarea) %>% left_join(ca) %>% filter(latin == Latin) %>% mutate(length = floor(length), age = ifelse(age > 9, 9, age)) %>% group_by(year, nsarea, age, length) %>% summarise(n = sum(n)) %>% ungroup() %>% drop_na() %>% group_by(year, nsarea) %>% nest() ca4 %>% mutate(alk = purrr::map(.$data, fit_alk)) ca4 %>% filter(year %in% c(2015, 2017)) %>% mutate(alk = purrr::map(.$data, fit_alk))
... is the error related to the data & model specifications or to the nesting approach?? Check with:
ca4$data[[1]] %>% fit_alk()
i.e. it is within the data & model specifications.
... check @berg2012spatial
library(mgcv) hh <- read_rds("data/ns-ibts_hh.rds") hl <- read_rds("data/ns-ibts_hl.rds") ca <- read_rds("data/ns-ibts_ca.rds") Year <- 2014 Quarter <- 3 Latin <- "Gadus morhua" hh2 <- hh %>% filter(year == Year, quarter == Quarter) ca2 <- hh2 %>% select(id) %>% left_join(ca, by = "id") %>% filter(latin == Latin) %>% mutate(length = floor(length), age = ifelse(age > 9, 9, age)) %>% group_by(id, age, length) %>% summarise(n = sum(n)) %>% ungroup() %>% drop_na() alk <- fitALKXX(ca2, minAge = 1, maxAge = 4) %>% fit_alk10() glimpse(alk)
https://colinfay.me/tidyeval-1
...
NOTE: Need to check interpretation of the n-variable (original name CANoAtLngt) in ca-data. The DATRAS pages refer to this variable as meaning: "Amount of fish at the given category (per haul, species, length class, sex, maturity, age)."
NOTE: In some cases the haul numbers are "NA". Example:
ca %>% filter(id == "2004_1_THA2_GOV_NA") %>% glimpse() # but hh %>% filter(id == "2004_1_THA2_GOV_NA") %>% glimpse()
So we have some sort of an orphan in the ca data
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.