In robschick/tangled: Creates data objects and plots for Knowlton, Schick et al. manuscript on entanglement

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)

Introduction

Here I examine the intervals between pregnancies to see if reduced health and/or increased entanglement burden lengthen the interval. I'm following Section 6.3 from Clark (2007) where he conducted an MLE analysis of the "time since fire" with an exponential and a Weibull, concluding there is support for a Weibull formulation. n.b. this is actually from Clark (1990).

Pregnancy Interval Data

We start with the pregnancy data, which have all the pregnancies by year for each individual.

pregnant <- read_csv(here::here('data-raw', '2021-02-25_years_since_pregnancy_1980-2013.csv'))%>% 
  dplyr::select(!starts_with('X')) %>% 
  filter(Year >=1980 & Year <= 2013) %>% 
  drop_na(Pregnant) %>% 
  drop_na(elapsed) %>%
  filter(elapsed >= 1) %>%
  filter(Pregnant > 0)

pregnant$Pregnant[pregnant$Pregnant == 'F'] <- '1'
pregnant$Pregnant <- as.numeric(pregnant$Pregnant)
head(pregnant)

We then summarize them along the lines of how Clark does in the book (Table 6.2), which shows a peak interval of 3 years along with a long tail:

pregdata <- pregnant %>% 
  group_by(elapsed) %>% 
  tally()
colnames(pregdata) <- c('years', 'npreg')
allyears <- seq(0, max(pregdata$years), by = 1)
npregyr <- allyears * 0
for(i in 1:length(pregdata$years)){
  npregyr[allyears[pregdata$years[i] == allyears]] <- pregdata$npreg[i]
}
plot(allyears, npregyr, type = 's', xlab = 'Pregnancy Interval (Years)', ylab = 'Number Observed')

Next we write down a likelihood function for the Weibull, which is parameterized as follows:

[ f(a) = c\lambda a^{c-1} exp[-\lambda a^c],]

where $c$ is the shape parameter, and $\lambda$ is the scale parameter.

wlik <- function(param){
  lp <- param[1]
  cp <- param[2]
  lik <- log(cp * lp^cp * pregnant$elapsed^(cp - 1)) - (lp * pregnant$elapsed)^cp
  return(-sum(lik))
}

Then we call optim():

param0 <- c(0.1, 1)
out <- optim(param0, wlik, lower = c(0.01, 0.01), upper = c(10, 10), method = "L-BFGS-B", hessian = TRUE)
sigma <- solve(out$hessian)
se <- sqrt(diag(sigma))

b1_up <- out$par[1] + 1.96*se[1]
b1_low <- out$par[1] - 1.96*se[1]

b2_up <- out$par[2] + 1.96*se[2]
b2_low <- out$par[2] - 1.96*se[2]

plot the results:

lw <- out$par[1]
cw <- out$par[2]
time <- seq(0, 40, length = 100)
hazard <- cw*lw^cw*time^(cw-1)
fweib <- hazard*exp(-(lw*time)^cw)
plot(allyears, npregyr / sum(npregyr), type = 's')
lines(time, fweib)
lines(time, hazard, lty = 2)

...and summarize and evaluate them:

lmle <- 1 / mean(pregnant$elapsed)
like <- -sum(log(lmle) - lmle*pregnant$elapsed)

likw <- out$value
dev <- 2 * (like - likw)
pr <- 1 - pchisq(dev, 1)

This yields a deviance of: r dev, which indicates strong support for the Weibull over the exponential.

Covariates

Fair enough, but how do we incorporate covariates? We add a link function to the likelihood function over which we optimize. So covariates are entangled (no/yes), health (scaled months below the fecundity threshold from Rolland et al. 2016) and they are acting through the shape parameter $c$:

wlik <- function(param){
  lp <- param[1]
  b0 <- param[2]
  b1 <- param[3]
  b2 <- param[4]
  cp <- exp(b0 + b1 *  xdat$sev_fac + b2 * xdat$health_scl) 
  lik <- log(cp * lp^cp * xdat$elapsed^(cp - 1)) - (lp * xdat$elapsed)^cp

  return(-sum(lik))
}

Now that we have the formulation correct, let's give it a whirl. We do some data prep:

pregnant <- readRDS("/Users/rob/Documents/research/projects/right-whales_PCOD/rightWhaleEntanglement/data/2021-06-15_pregnancy-with-covariates.rds")
xdat <- pregnant[, c("Pregnant", "health_scl", "health_min", "mon_below_scl", "elapsed", 
                     "severity", "num_events", "sev_num", "std_year", "decade")]
xdat <- xdat[xdat$Pregnant > 0, ]
xdat <- xdat[xdat$elapsed > 0, ]
xdat$sev_fac <- xdat$sev_num
xdat$sev_fac[xdat$sev_num > 0] <- 1
xdat$decade_scl <- scale(xdat$decade)
xdat$health_min_scl <- scale(xdat$health_min)

And then call the function. Note that our starting values for the covariates are negative, but are not constrained in the optimizer. The assumption then is that as either a) if the whale becomes entanglement (i.e. the covariate increases), or if the animal spends more months below a threshold (i.e. the covariate increases), then we expect the shape parameter ($c$) to decrease. A decrease in this parameter results in a longer tail for the distribution as well as a higher mean.

param0 <- c(0.1, 1, -1, -1) 
out <- optim(param0, wlik, 
             method = "L-BFGS-B", hessian = TRUE)
out$par

Store the values along with the confidence intervals for the parameters:

sigma <- solve(out$hessian)
se <- sqrt(diag(sigma))
b1_up <- out$par[3] + 1.96*se[3]
b1_low <- out$par[3] - 1.96*se[3]

b2_up <- out$par[4] + 1.96*se[4]
b2_low <- out$par[4] - 1.96*se[4]

my_pars <- data.frame(cov = c('entanglement', 'health'), mle = c(out$par[3], out$par[4]),
                      low = c(b1_low, b2_low),
                      high = c(b1_up, b2_up))
knitr::kable(my_pars)

Interpretation

We see a significant relationship between both entanglement (negative) and good health (negative). How do we interpret these - especially health?

Let's do a quick EDA to see the patterns:

library(ggplot2)
ggplot(data = xdat, aes(elapsed, group = factor(sev_fac)))+
  geom_boxplot()

While significant, the effect is not huge:

xdat %>% group_by(sev_fac) %>% summarize(mean(elapsed))

Now about the health:

ggplot(data = xdat, aes(health_scl, elapsed ))+
  geom_point()+
  geom_smooth()

This was not what I'd expected to see, and I'm wondering about the animals with 9 year + gaps. Like is there something fundamentally different about them, and/or if there's a time factor in here too. I'm just going to pull those out to investigate:

ggplot(data = subset(xdat, elapsed < 9), aes(health_scl, elapsed ))+
  geom_point()+
  geom_smooth()

Let's facet by decade just for exploration

ggplot(data = xdat, aes(health_scl, elapsed ))+
  geom_point()+
  geom_smooth()+
  facet_grid('decade')

So we have a number of whales in poor health with longer intervals in the 1990's, but also a lot of whales in better health still with long intervals in the 2000's. Though there's also a mean shift by decade as well:

xdat %>% 
  group_by(decade) %>% 
  summarize(median(health_scl))

Updated Formulation

From above, I've talked myself into including decade as a covariate. Let's look at that again:

# xdat <- xdat %>% 
#   dplyr::filter(elapsed < 9)

# new likelihood function
wlik <- function(param){
  lp <- param[1]
  b0 <- param[2]
  b1 <- param[3]
  b2 <- param[4]
  b3 <- param[5]
  cp <- exp(b0 + b1 *  xdat$sev_fac + b2 * xdat$health_min_scl + b3 * xdat$decade_scl)
  # cp <- exp(b0 + b1 *  xdat$sev_fac + b2 * xdat$health_min_scl) 
  lik <- log(cp * lp^cp * xdat$elapsed^(cp - 1)) - (lp * xdat$elapsed)^cp

  return(-sum(lik))
}

# call optim
param0 <- c(0.1, 1, -1, -1, -1) 
out <- optim(param0, wlik, 
             method = "L-BFGS-B", hessian = TRUE)

# recover params
sigma <- solve(out$hessian)
se <- sqrt(diag(sigma))
b1_up <- out$par[3] + 1.96*se[3]
b1_low <- out$par[3] - 1.96*se[3]

b2_up <- out$par[4] + 1.96*se[4]
b2_low <- out$par[4] - 1.96*se[4]

b3_up <- out$par[5] + 1.96*se[5]
b3_low <- out$par[5] - 1.96*se[5]

my_pars <- data.frame(cov = c('entanglement', 'health', 'decade'), mle = c(out$par[3], out$par[4], out$par[5]),
                      low = c(b1_low, b2_low, b3_low),
                      high = c(b1_up, b2_up, b3_up))

# my_pars <- data.frame(cov = c('entanglement', 'health'), mle = c(out$par[3], out$par[4]),
#                       low = c(b1_low, b2_low),
#                       high = c(b1_up, b2_up))
knitr::kable(my_pars)

Update - 2021-06-18

After meeting with Josh, we decided to take health out for this, because he suggested that we have a data censoring issue here to an extent. By that we mean we're only looking at whales that have gone on to give birth after an entanglement; we're not looking at whales that did not give birth. In that sense we're sort of artificially censoring the data. We'll want to change that in the next iteration of PCOMS. For now, I think this is sufficient.

# only look at animals with a delayed interval
xdat <- xdat %>% 
  dplyr::filter(elapsed >= 5)
# new likelihood function
wlik <- function(param){
  lp <- param[1]
  b0 <- param[2]
  b1 <- param[3]
  b2 <- param[4]
  cp <- exp(b0 + b1 *  xdat$sev_num + b2 * xdat$health_min_scl) 
  lik <- log(cp * lp^cp * xdat$elapsed^(cp - 1)) - (lp * xdat$elapsed)^cp

  return(-sum(lik))
}

# call optim
param0 <- c(0.1, 1, -1, -1) 
out <- optim(param0, wlik, 
             method = "L-BFGS-B", hessian = TRUE)

# recover params
sigma <- solve(out$hessian)
se <- sqrt(diag(sigma))
b1_up <- out$par[3] + 1.96*se[3]
b1_low <- out$par[3] - 1.96*se[3]

b2_up <- out$par[4] + 1.96*se[4]
b2_low <- out$par[4] - 1.96*se[4]

my_pars <- data.frame(cov = c('entanglement', 'health_min_scl'), mle = c(out$par[3], out$par[4]),
                      low = c(b1_low, b2_low),
                      high = c(b1_up, b2_up))

likw_2param <- out$value

knitr::kable(my_pars)

Glimpse of the Data

Here's the relationship between entanglement and interval for those animals with a delayed interval:

ggplot(data = subset(xdat, elapsed >= 5), aes(elapsed, group = factor(sev_fac)))+
  geom_boxplot()

and for minimum health across an interval (for delayed animals) and the length of the interval again:

ggplot(data = subset(xdat, elapsed >= 5), aes(health_min_scl, elapsed ))+
  geom_point()+
  geom_smooth()+
  facet_grid('sev_fac')

xdat %>% 
  group_by(sev_num) %>% 
  summarize(median(health_scl))

By decade and entanglement status

ggplot(data = subset(xdat, elapsed >= 5), aes(health_min_scl, elapsed ))+
  geom_point()+
  geom_smooth()+
  facet_grid(decade ~ sev_fac)

My interpretation is that the value is negative for health, i

And let's run a model with entanglement alone:

# new likelihood function
wlik <- function(param){
  lp <- param[1]
  b0 <- param[2]
  b1 <- param[3]
  cp <- exp(b0 + b1 *  xdat$sev_fac) 
  lik <- log(cp * lp^cp * xdat$elapsed^(cp - 1)) - (lp * xdat$elapsed)^cp

  return(-sum(lik))
}

# call optim
param0 <- c(0.1, 1, -1) 
out <- optim(param0, wlik, 
             method = "L-BFGS-B", hessian = TRUE)

# recover params
sigma <- solve(out$hessian)
se <- sqrt(diag(sigma))
b1_up <- out$par[3] + 1.96*se[3]
b1_low <- out$par[3] - 1.96*se[3]

my_pars <- data.frame(cov = c('entanglement'), mle = c(out$par[3]),
                      low = c(b1_low),
                      high = c(b1_up))
likw_1param <- out$value
knitr::kable(my_pars)

And conduct a likelihood ratio test:

dev <- 2 * (likw_1param - likw_2param)
pr <- 1 - pchisq(dev, 1)

Final Summary

The LRT yields a p-value of r pr, which indicates support for the extra parameter. This means we have a model for the intervals between pregnancies for animals in relation to entanglement status during that interval and decade. Both these relationships are negative, though entanglement doesn't seem to be significant. There's still support to keep the parameter, so it has some explanatory power, but it's clearly weak. And as noted above, we're only examining whales that have given birth, not those that ceased to give birth. To do that, we'd need to do a different survival type analysis, where we have censoring when the "study" ends where the animals are still alive and we're tracking their health, but they have yet to give birth again.

Sort of leaning towards doing that in a separate analysis/manuscript.

References

Clark, J. S. 1990. Fire and climate change during the last 750 yr in northwestern Minnesota. Ecological Monographs 60:135–159.

Clark, J. S. 2007. Models for Ecological Data: An Introduction. Princeton University Press.

Rolland, R. M., R. S. Schick, H. M. Pettis, A. R. Knowlton, P. K. Hamilton, J. S. Clark, and S. D. Kraus. 2016. Health of North Atlantic right whales Eubalaena glacialis over three decades: from individual health to demographic and population health trends. Marine ecology progress series 542:265–282.

robschick/tangled documentation built on May 9, 2022, 4:07 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com