knitr::opts_chunk$set(echo = TRUE)
library(lubridate)
library(dplyr)
library(tangled)
library(tidyr)
library(purrr)

The purpose of this document is to provide an initial overview of the suggested changes for analyzing the survival output from the right whale model. Our goal is to summarize survival by injury type:

  1. Minor
  2. Moderate
  3. Severe

...and to do this for both males and females.

The algorithm for making these comparisons would look something like this.

  1. We start by extracting the last entanglement event from an individual animal
  2. Calculate the survival probability from the end of entanglement window until presumed death ($\theta_{ind}$)
  3. Using that same time window define a set of candidate reference animals
  4. Find mean survival probability for those animals ($\theta_{ref}$)
  5. Sample for individual survival of the entangled animal from $Y_{ind} \sim Binom(\theta_{ind})$; find the mean
  6. Repeat step 5 for the reference individuals, i.e. sample for survival of the reference animals from $Y_{ref} \sim Binom(\theta_{ref})$; find the mean
  7. Find $Z$ which is the distribution of differences ($E[Y_{ind}] - E[Y_{ref}]$)
  8. Plot $Z$ for each injury type and gender

Below the steps are written out in detail with R code

We start by assembling and viewing a data frame of all of the last entanglement event for each animal:

tangEvents <- tangleOut %>% 
  group_by(EGNo) %>% 
  filter(EventNo == max(EventNo)) %>% 
  arrange(EGNo)

For the paper, we also excluded animals that were ship-struck, or had severe prop scars. Here I also remove the known dead animals. (Whether or not to include/exlcude these is a point of discussion.)

exclude <- c(1004, 1006, 1014, 1045, 1128, 1223, 1267, 1308, 1504, 1623, 
             1907, 1909, 2143, 2150, 2220, 2250, 2404, 2425, 2450, 2617, 
             2820, 3508, 3522, 3853)
knownDead <- deadTable$SightingEGNo

tangSub <- tangEvents %>% 
  filter(!(EGNo %in% exclude)) %>% 
  filter(!(EGNo %in% knownDead))

tangleOutSub <- tangleOut %>% 
  filter(!(EGNo %in% exclude)) %>% 
  filter(!(EGNo %in% knownDead))

Entangled Animal

Here's one animal's data

event1 <- tangSub[1, ]
event1

This particular animal had one event with a very long possible window. We shortened that to be approximately 4 months:

We need to find its survival probability over that window:

idx <- which(ID == event1$EGNo)
svec <- 1 - cumsum(deathyr[idx, ] / ng) 
svec[1:firstSight[idx]] <- NA

Now we want to define the window and view the survival probability over that window.

w <- which(deathyr[idx, ] > 0)
w1 <- min(w)
w2 <- max(w)
theta <- svec[(w1 - 1):w2]
testVar <- theta * (1 - theta)
plot(theta, type = 'l', ylab = 'Survival Probability', 
     xlab = 'Months Following End of Entanglement',
     main = paste0("Sighting EGNo: ", event1$EGNo))
lines(theta + testVar, lty = 2)
lines(theta - testVar, lty = 2)

Survival Dataframe

Let's iterate over animal-specific events and compile a data frame with rows for each entangled individual. Since the time windows will change for each event, I'll keep the survival probabilities in a list column.

getSurv <- function(EGNo, tangSub, i){
  idx <- which(ID == EGNo)
  svec <- 1 - cumsum(deathyr[idx, ] / ng) 
  svec[1:firstSight[idx]] <- NA
  w1 <- which(myName == as.character(tangSub[i, 'ewindmonyr']))
  if(w1 > lastSight[idx]){w1 <- lastSight[idx] - 1}
  w2 <- max(which(deathyr[idx, ] > 0))
  theta <- svec[w1:w2]
  list(theta = theta, swind = w1, ewind = w2)
}

getSurvvar <- function(theta){
  testVar <- theta * (1 - theta)
  testVar
}

tangSub$sidx <- NA
tangSub$eidx <- NA
tangSub$svals <- NA
tangSub$svar <- NA

for(i in 1:nrow(tangSub)){
  tmp <- getSurv(tangSub$EGNo[i], tangSub, i)

  tangSub$sidx[i]  <- tmp$swind
  tangSub$eidx[i]  <- tmp$ewind
  tangSub$svals[i] <- list(tmp$theta)
  tangSub$svar[i]  <- list(getSurvvar(unlist(tangSub$svals[i])))
}

So what does that yield? Four new columns: Two that have the indices that define the start and end of the entanglment/survival window; one that has the monthly survival probability following the end of entanglement, and one that has the variance around that estimate.

Survival Plots Summarized by Entanglement Injury

The idea here is to remake the the KM curves, but directly from the posterior.

tangSev <- subset(tangSub, Severity == 'severe')
tangSev$gender <- gender[match(tangSev$EGNo, ID)]
nmonths <- 36
survDat <- matrix(0, nrow = nrow(tangSev), ncol = nmonths)
for(i in 1:nrow(tangSev)){
  svec <- tangSev$svals[[i]]
  if(length(svec) > nmonths){svec <- svec[1:nmonths]}
  survDat[i, 1:length(svec)] <- svec
}
sevSurv <- colMeans(survDat)

And then we can plot it:

sevVar <- sevSurv * (1 - sevSurv)
plot(sevSurv, type = 'l', ylab = 'Survival Probability', 
     xlab = 'Months Following End of Entanglement',
     main = 'Severely Entangled Whales', ylim = c(0, 1))
lines(sevSurv + sevVar, lty = 2)
lines(sevSurv - sevVar, lty = 2)

Reference Animal

Now we want to sample the reference animals against which we'll compare survival. We define an animal as belonging to the 'reference' set if they are:

  1. alive for the whole duration of the window we defined for each individual entanglement window and
  2. who are themselves not entangled during that same interval.

We'll then summarize the mean survival of all those animals for each of the months during the entanglement survival window. Finally to build up the distribution of differences $Z$ we'll sample from two binomials and take the differences. Mathematically,

$$ Y_1 \sim Bin(\theta_{ind})\ Y_2 \sim Bin(\theta_{ref})\ Z = ln\left( \frac{E[Y_1]}{E[Y_2]}\right) $$

For example, let's say we have a an 8-month entanglement window when the survival probability declines from 1 to 0. During each of these 8 months, we'll calculate the difference in survival between the entangled animal ($\theta_{ind}$) and the reference class ($\theta_{ref}$). Each of the entangled animals will have metadata that indicates injury and gender. By building up the distribution of differences we will be able to quantify the impact of entanglement on survival by category, e.g ($Z_{Female, Severe}$). Note that we will only compare females to females, and males to males; that is for an entangled female, the reference class will be comprised solely of females.

We define our reference class by first figuring out which animals were entangled during the same time frame as our entangled animal. We will exclude this animal as a possible reference. (Note that because each individual's entanglement window is different, the reference class is going to be unique to each animal.) To find the entire entanglement reference window for each animal we start by defining ranges during which the animals were entangled. Also note that we're using a conservative start date for the possible range. This is following discussion in Boston in February 2018 about when the window should start. Here we use the first possible date, as opposed to the windows we agreed upon for the paper.

The logic in the next block is that we define indices for the start and stop of the window. Then we loop over rows, and store the range between those start/stop times as a list column.

load("E:/rob/Documents/research/projects/rightWhaleEntanglement/src/data-raw/eg_203_ng_50000_BIG_25000_BIG_25000.rdata")
# Define Ranges of Entanglement for Each Animal
tangleOutSub$sidx <- match(tangleOutSub$smonyr, myName)
tangleOutSub$eidx <- match(tangleOutSub$ewindmonyr, myName) 
tangleOutSub$range <- NA
for(i in seq_along(tangleOutSub$EGNo)){
  tangleOutSub$range[i] <- list(as.integer(tangleOutSub[i, 'sidx']):as.integer(tangleOutSub[i, 'eidx']))
}

With those established, then we need to remove any of the reference animals from the candidate list if there is an intersection between the entangled animal and the reference animals. To do this, we have a nested loop that compares each individual event to all possible, and only stores the IDs of the candidate animals that do not intersect in time. The outer loop (i) loops over each of the entangled animals' final entanglement; the inner (j, k) loops compare the range from the ith animal to all possible reference animals.

cand <- vector('list', max(seq_along(tangSub$EGNo)))
possAn <- ID[!(ID %in% unique(tangleOutSub$EGNo))]
possAn <- possAn[!(possAn %in% exclude)]
possAn <- possAn[!(possAn %in% knownDead)]
possAnF <- possAn[gender[match(possAn, ID)] == 'F']
possAnM <- possAn[gender[match(possAn, ID)] == 'M']

for(i in seq_along(tangSub$EGNo)) {

  # to find range of entangled animal's survival window:
  if (as.integer(tangSub[i, 'eidx']) > nt){
    entRng <- as.integer(tangSub[i, 'sidx']):nt
  } else {
    entRng <- as.integer(tangSub[i, 'sidx']):as.integer(tangSub[i, 'eidx'])
  }

  candGen <- gender[which(ID == as.integer(tangSub[i, "EGNo"]))]
  candSub <- numeric(0)

  # So we can find possible reference animals that are not currently entangled
  for (j in seq_along(tangleOutSub)) {

    refGen <- gender[which(ID == as.integer(tangleOutSub[j, "EGNo"]))]
    if(candGen != refGen) next() # cuz I only want the same gender
    range1 <- unlist(tangleOutSub[j, 'range'])

    if (!any(range1 %in% entRng)) {
      candSub <- c(candSub, tangleOutSub$EGNo[j])
    }

  }

  # So we can find possible reference animals 
  # who have never been entangled &
  # alive during the relevant time
  psub <- numeric(0)

  # to make the matching smaller in the for loop, i.e. avoid matching 
  # males if we only want females
  if (candGen == 'F') {
    pvec = possAnF 
  } else {
    pvec = possAnM
  }

  for (k in seq_along(pvec)) {
    idx <- which(ID == pvec[k])
    prange <- firstSight[idx]:lastSight[idx]

    if (any(prange %in% entRng)) {
      psub <- c(psub, pvec[k])
    }
  }

  cand[[i]] <- unique(c(candSub, psub))

}

To recap, each element of the cand list object is a set of possible candidate reference animals. The animals in each element are of the same gender and satisfy one of two conditions:

  1. they have been entangled, but are not currently entangled (current refers to the reference animal)
  2. the have never been entangled, but are alive at the same time as the reference animal

Now we have the data indices we need, and can proceed to building up $Z$,

Summary of Survival

For each animal's last entanglement, we can now build up a data structure containing the $Y$ distribution for the reference animals.

library(matrixStats)
library(ggplot2)
library(reshape2)

tangSub <- data.frame(tangSub)
yrefMat <- vector('list', length(cand))
for (i in seq_along(tangSub$EGNo)){

  if (as.integer(tangSub[i, 'eidx']) > nt){
    range1 <- tangSub[i, 'sidx']:nt
  } else {
    range1 <- tangSub[i, 'sidx']:tangSub[i, 'eidx']
  }

  refAns <- cand[[i]]
  refparams <- matrix(NA, nrow = length(refAns), ncol = length(range1))

  for (j in seq_along(refAns)){

    idx <- which(ID == refAns[j])
    svec <- 1 - cumsum(deathyr[idx, ] / ng) 
    svec[1:firstSight[idx]] <- NA
    theta <- svec[range1]
    refparams[j, ] <- theta

  }

  if(i == 1) {
    plotdat <- refparams
    plotrefan <- refAns
    plotrange <- range1
    plotyref <- colMeans(refparams, na.rm = TRUE)
  }

  dat <- colMeans(refparams, na.rm = TRUE) # check if this the right dimension # 
  dat[is.na(dat)] <- mean(dat, na.rm = TRUE)
  yrefMat[[i]] <- dat

}

Before we proceed with assembling the distribution(s) of differences, let's look at the survival one candidate animal.

pdat <- data.frame(t(plotdat))
colnames(pdat) <- plotrefan
pdat$time <- seq_along(plotrange)


meltdf <- melt(pdat, id = "time")
ggplot(meltdf, aes(x = time, y = value)) +
    geom_line()+
    facet_wrap(~ variable)+
    labs(x = 'Months Since End of Last Entanglement', 
         y = 'Probability of Survival')+
    ggtitle('Survival of Reference Animals', 
            subtitle = paste0('For Entangled Whale: ', tangSub$EGNo[i]))

And we can summarize the survival for this reference group as well:

plot(plotyref, type = 'l', ylim = c(0, 1), 
     xlab = 'Months Since Entanglement', 
     ylab = 'P(Surv) Among Reference Animals')
theta <- plotyref * (1 - plotyref)
lines(plotyref + theta, lty = 2)
lines(plotyref - theta, lty = 2)

Distribution of Differences

Now we've summarized and extracted individual survival probability for each last entanglement. We've also determined the reference animals and summarized their survival. With these data structures assembled, we can now determine $Z$.

We'll demonstrate the algorithm for one individual. Here we sample from $Y_{ind}$ and $Y_{ref}$ where the $ind$ and $ref$ subscripts refer to the entangled animals and the candidate, or reference, animals. In the model from Schick et al. 2013, survival was distributed as a Binomial. To account for the uncertainty in survival, we sample from the posterior using the probabilities for the entangled and the reference animal:

entprobs <- tangSub$svals[[1]] # survival probs of the individual
refprobs <- yrefMat[[1]] # survival probs of the reference class

survMat <- matrix(NA, nrow = 1000, ncol = length(entprobs))
cansurvMat <- matrix(NA, nrow = 1000, ncol = length(entprobs))

for(i in 1:nrow(survMat)) {
  survMat[i, ]    <- rbinom(prob = entprobs, n = length(entprobs), size = 1)
  cansurvMat[i, ] <- rbinom(prob = refprobs, n = length(refprobs), size = 1)
}

Now we examine the differences ($Z$) for this animal:

sdiff <- data.frame(refAn = tangSub$EGNo[1], 
                    meanDiff = rowMeans(survMat) - rowMeans(cansurvMat)) # is this dimensioning right

hist(sdiff$meanDiff, breaks = 10, col = 'cornsilk', border = 'lightgrey', 
     xlab = "Difference in Survival Probability", main = '')

That's for one animal, but now we need to summarize all of the events by gender and by entanglement severity.

diffList <- vector('list', nrow(tangSub))

for(i in seq_along(tangSub$EGNo)){
  entprobs <- tangSub$svals[[i]]
  refprobs <- yrefMat[[i]]
  if(any(is.na(refprobs))) break()

  survMat <- matrix(NA, nrow = 1000, ncol = length(entprobs))
  cansurvMat <- matrix(NA, nrow = 1000, ncol = length(refprobs))

  for(j in 1:nrow(survMat)) {
    survMat[j, ]    <- rbinom(prob = entprobs, n = length(entprobs), size = 1)
    cansurvMat[j, ] <- rbinom(prob = refprobs, n = length(refprobs), size = 1)
  }

  gidx <- gender[which(ID == tangSub$EGNo[i])]
  sdiff <- data.frame(refAn = tangSub$EGNo[i], 
                      meanDiff = rowMeans(survMat) - rowMeans(cansurvMat), 
                      gender = gidx, gearStatus = tangSub$gear[i], 
                      injuryClass = tangSub$gearInj[i],
                      severity = tangSub$Severity[i])

  diffList[[i]] <- sdiff

}

survDf <- do.call(rbind.data.frame, diffList)

Ok, let's take a look at it.

p <- ggplot(survDf, aes(meanDiff, color = factor(refAn)))+
  geom_histogram()+
  facet_grid(severity ~ gender, scales = 'free_y')
p

Maybe it would be interesting to group by individual to see the means. I'm trying to get a handle on who the 0 values are.

library(dplyr)
survDf %>% group_by(severity, gender) %>% 
  summarise(mean = mean(meanDiff, na.rm = TRUE), var = var(meanDiff, na.rm = TRUE))


robschick/tangled documentation built on May 9, 2022, 4:07 p.m.