knitr::opts_chunk$set(echo = TRUE) library(lubridate) library(dplyr) library(tangled) library(tidyr) library(purrr)
The purpose of this document is to provide an initial overview of the suggested changes for analyzing the survival output from the right whale model. Our goal is to summarize survival by injury type:
...and to do this for both males and females.
The algorithm for making these comparisons would look something like this.
Below the steps are written out in detail with R code
We start by assembling and viewing a data frame of all of the last entanglement event for each animal:
tangEvents <- tangleOut %>% group_by(EGNo) %>% filter(EventNo == max(EventNo)) %>% arrange(EGNo)
For the paper, we also excluded animals that were ship-struck, or had severe prop scars. Here I also remove the known dead animals. (Whether or not to include/exlcude these is a point of discussion.)
exclude <- c(1004, 1006, 1014, 1045, 1128, 1223, 1267, 1308, 1504, 1623, 1907, 1909, 2143, 2150, 2220, 2250, 2404, 2425, 2450, 2617, 2820, 3508, 3522, 3853) knownDead <- deadTable$SightingEGNo tangSub <- tangEvents %>% filter(!(EGNo %in% exclude)) %>% filter(!(EGNo %in% knownDead)) tangleOutSub <- tangleOut %>% filter(!(EGNo %in% exclude)) %>% filter(!(EGNo %in% knownDead))
Here's one animal's data
event1 <- tangSub[1, ]
event1
This particular animal had one event with a very long possible window. We shortened that to be approximately 4 months:
r event1$StartDateWindow
r event1$EndDateWindow
We need to find its survival probability over that window:
idx <- which(ID == event1$EGNo) svec <- 1 - cumsum(deathyr[idx, ] / ng) svec[1:firstSight[idx]] <- NA
Now we want to define the window and view the survival probability over that window.
w <- which(deathyr[idx, ] > 0) w1 <- min(w) w2 <- max(w) theta <- svec[(w1 - 1):w2] testVar <- theta * (1 - theta) plot(theta, type = 'l', ylab = 'Survival Probability', xlab = 'Months Following End of Entanglement', main = paste0("Sighting EGNo: ", event1$EGNo)) lines(theta + testVar, lty = 2) lines(theta - testVar, lty = 2)
Let's iterate over animal-specific events and compile a data frame with rows for each entangled individual. Since the time windows will change for each event, I'll keep the survival probabilities in a list column.
getSurv <- function(EGNo, tangSub, i){ idx <- which(ID == EGNo) svec <- 1 - cumsum(deathyr[idx, ] / ng) svec[1:firstSight[idx]] <- NA w1 <- which(myName == as.character(tangSub[i, 'ewindmonyr'])) if(w1 > lastSight[idx]){w1 <- lastSight[idx] - 1} w2 <- max(which(deathyr[idx, ] > 0)) theta <- svec[w1:w2] list(theta = theta, swind = w1, ewind = w2) } getSurvvar <- function(theta){ testVar <- theta * (1 - theta) testVar } tangSub$sidx <- NA tangSub$eidx <- NA tangSub$svals <- NA tangSub$svar <- NA for(i in 1:nrow(tangSub)){ tmp <- getSurv(tangSub$EGNo[i], tangSub, i) tangSub$sidx[i] <- tmp$swind tangSub$eidx[i] <- tmp$ewind tangSub$svals[i] <- list(tmp$theta) tangSub$svar[i] <- list(getSurvvar(unlist(tangSub$svals[i]))) }
So what does that yield? Four new columns: Two that have the indices that define the start and end of the entanglment/survival window; one that has the monthly survival probability following the end of entanglement, and one that has the variance around that estimate.
The idea here is to remake the the KM curves, but directly from the posterior.
tangSev <- subset(tangSub, Severity == 'severe') tangSev$gender <- gender[match(tangSev$EGNo, ID)] nmonths <- 36 survDat <- matrix(0, nrow = nrow(tangSev), ncol = nmonths) for(i in 1:nrow(tangSev)){ svec <- tangSev$svals[[i]] if(length(svec) > nmonths){svec <- svec[1:nmonths]} survDat[i, 1:length(svec)] <- svec } sevSurv <- colMeans(survDat)
And then we can plot it:
sevVar <- sevSurv * (1 - sevSurv) plot(sevSurv, type = 'l', ylab = 'Survival Probability', xlab = 'Months Following End of Entanglement', main = 'Severely Entangled Whales', ylim = c(0, 1)) lines(sevSurv + sevVar, lty = 2) lines(sevSurv - sevVar, lty = 2)
Now we want to sample the reference animals against which we'll compare survival. We define an animal as belonging to the 'reference' set if they are:
We'll then summarize the mean survival of all those animals for each of the months during the entanglement survival window. Finally to build up the distribution of differences $Z$ we'll sample from two binomials and take the differences. Mathematically,
$$ Y_1 \sim Bin(\theta_{ind})\ Y_2 \sim Bin(\theta_{ref})\ Z = ln\left( \frac{E[Y_1]}{E[Y_2]}\right) $$
For example, let's say we have a an 8-month entanglement window when the survival probability declines from 1 to 0. During each of these 8 months, we'll calculate the difference in survival between the entangled animal ($\theta_{ind}$) and the reference class ($\theta_{ref}$). Each of the entangled animals will have metadata that indicates injury and gender. By building up the distribution of differences we will be able to quantify the impact of entanglement on survival by category, e.g ($Z_{Female, Severe}$). Note that we will only compare females to females, and males to males; that is for an entangled female, the reference class will be comprised solely of females.
We define our reference class by first figuring out which animals were entangled during the same time frame as our entangled animal. We will exclude this animal as a possible reference. (Note that because each individual's entanglement window is different, the reference class is going to be unique to each animal.) To find the entire entanglement reference window for each animal we start by defining ranges during which the animals were entangled. Also note that we're using a conservative start date for the possible range. This is following discussion in Boston in February 2018 about when the window should start. Here we use the first possible date, as opposed to the windows we agreed upon for the paper.
The logic in the next block is that we define indices for the start and stop of the window. Then we loop over rows, and store the range between those start/stop times as a list column.
load("E:/rob/Documents/research/projects/rightWhaleEntanglement/src/data-raw/eg_203_ng_50000_BIG_25000_BIG_25000.rdata") # Define Ranges of Entanglement for Each Animal tangleOutSub$sidx <- match(tangleOutSub$smonyr, myName) tangleOutSub$eidx <- match(tangleOutSub$ewindmonyr, myName) tangleOutSub$range <- NA for(i in seq_along(tangleOutSub$EGNo)){ tangleOutSub$range[i] <- list(as.integer(tangleOutSub[i, 'sidx']):as.integer(tangleOutSub[i, 'eidx'])) }
With those established, then we need to remove any of the reference animals from the candidate list if there is an intersection between the entangled animal and the reference animals. To do this, we have a nested loop that compares each individual event to all possible, and only stores the IDs of the candidate animals that do not intersect in time. The outer loop (i) loops over each of the entangled animals' final entanglement; the inner (j, k) loops compare the range from the ith animal to all possible reference animals.
cand <- vector('list', max(seq_along(tangSub$EGNo))) possAn <- ID[!(ID %in% unique(tangleOutSub$EGNo))] possAn <- possAn[!(possAn %in% exclude)] possAn <- possAn[!(possAn %in% knownDead)] possAnF <- possAn[gender[match(possAn, ID)] == 'F'] possAnM <- possAn[gender[match(possAn, ID)] == 'M'] for(i in seq_along(tangSub$EGNo)) { # to find range of entangled animal's survival window: if (as.integer(tangSub[i, 'eidx']) > nt){ entRng <- as.integer(tangSub[i, 'sidx']):nt } else { entRng <- as.integer(tangSub[i, 'sidx']):as.integer(tangSub[i, 'eidx']) } candGen <- gender[which(ID == as.integer(tangSub[i, "EGNo"]))] candSub <- numeric(0) # So we can find possible reference animals that are not currently entangled for (j in seq_along(tangleOutSub)) { refGen <- gender[which(ID == as.integer(tangleOutSub[j, "EGNo"]))] if(candGen != refGen) next() # cuz I only want the same gender range1 <- unlist(tangleOutSub[j, 'range']) if (!any(range1 %in% entRng)) { candSub <- c(candSub, tangleOutSub$EGNo[j]) } } # So we can find possible reference animals # who have never been entangled & # alive during the relevant time psub <- numeric(0) # to make the matching smaller in the for loop, i.e. avoid matching # males if we only want females if (candGen == 'F') { pvec = possAnF } else { pvec = possAnM } for (k in seq_along(pvec)) { idx <- which(ID == pvec[k]) prange <- firstSight[idx]:lastSight[idx] if (any(prange %in% entRng)) { psub <- c(psub, pvec[k]) } } cand[[i]] <- unique(c(candSub, psub)) }
To recap, each element of the cand
list object is a set of possible candidate reference animals. The animals in each element are of the same gender and satisfy one of two conditions:
Now we have the data indices we need, and can proceed to building up $Z$,
For each animal's last entanglement, we can now build up a data structure containing the $Y$ distribution for the reference animals.
library(matrixStats) library(ggplot2) library(reshape2) tangSub <- data.frame(tangSub) yrefMat <- vector('list', length(cand)) for (i in seq_along(tangSub$EGNo)){ if (as.integer(tangSub[i, 'eidx']) > nt){ range1 <- tangSub[i, 'sidx']:nt } else { range1 <- tangSub[i, 'sidx']:tangSub[i, 'eidx'] } refAns <- cand[[i]] refparams <- matrix(NA, nrow = length(refAns), ncol = length(range1)) for (j in seq_along(refAns)){ idx <- which(ID == refAns[j]) svec <- 1 - cumsum(deathyr[idx, ] / ng) svec[1:firstSight[idx]] <- NA theta <- svec[range1] refparams[j, ] <- theta } if(i == 1) { plotdat <- refparams plotrefan <- refAns plotrange <- range1 plotyref <- colMeans(refparams, na.rm = TRUE) } dat <- colMeans(refparams, na.rm = TRUE) # check if this the right dimension # dat[is.na(dat)] <- mean(dat, na.rm = TRUE) yrefMat[[i]] <- dat }
Before we proceed with assembling the distribution(s) of differences, let's look at the survival one candidate animal.
pdat <- data.frame(t(plotdat)) colnames(pdat) <- plotrefan pdat$time <- seq_along(plotrange) meltdf <- melt(pdat, id = "time") ggplot(meltdf, aes(x = time, y = value)) + geom_line()+ facet_wrap(~ variable)+ labs(x = 'Months Since End of Last Entanglement', y = 'Probability of Survival')+ ggtitle('Survival of Reference Animals', subtitle = paste0('For Entangled Whale: ', tangSub$EGNo[i]))
And we can summarize the survival for this reference group as well:
plot(plotyref, type = 'l', ylim = c(0, 1), xlab = 'Months Since Entanglement', ylab = 'P(Surv) Among Reference Animals') theta <- plotyref * (1 - plotyref) lines(plotyref + theta, lty = 2) lines(plotyref - theta, lty = 2)
Now we've summarized and extracted individual survival probability for each last entanglement. We've also determined the reference animals and summarized their survival. With these data structures assembled, we can now determine $Z$.
We'll demonstrate the algorithm for one individual. Here we sample from $Y_{ind}$ and $Y_{ref}$ where the $ind$ and $ref$ subscripts refer to the entangled animals and the candidate, or reference, animals. In the model from Schick et al. 2013, survival was distributed as a Binomial. To account for the uncertainty in survival, we sample from the posterior using the probabilities for the entangled and the reference animal:
entprobs <- tangSub$svals[[1]] # survival probs of the individual refprobs <- yrefMat[[1]] # survival probs of the reference class survMat <- matrix(NA, nrow = 1000, ncol = length(entprobs)) cansurvMat <- matrix(NA, nrow = 1000, ncol = length(entprobs)) for(i in 1:nrow(survMat)) { survMat[i, ] <- rbinom(prob = entprobs, n = length(entprobs), size = 1) cansurvMat[i, ] <- rbinom(prob = refprobs, n = length(refprobs), size = 1) }
Now we examine the differences ($Z$) for this animal:
sdiff <- data.frame(refAn = tangSub$EGNo[1], meanDiff = rowMeans(survMat) - rowMeans(cansurvMat)) # is this dimensioning right hist(sdiff$meanDiff, breaks = 10, col = 'cornsilk', border = 'lightgrey', xlab = "Difference in Survival Probability", main = '')
That's for one animal, but now we need to summarize all of the events by gender and by entanglement severity.
diffList <- vector('list', nrow(tangSub)) for(i in seq_along(tangSub$EGNo)){ entprobs <- tangSub$svals[[i]] refprobs <- yrefMat[[i]] if(any(is.na(refprobs))) break() survMat <- matrix(NA, nrow = 1000, ncol = length(entprobs)) cansurvMat <- matrix(NA, nrow = 1000, ncol = length(refprobs)) for(j in 1:nrow(survMat)) { survMat[j, ] <- rbinom(prob = entprobs, n = length(entprobs), size = 1) cansurvMat[j, ] <- rbinom(prob = refprobs, n = length(refprobs), size = 1) } gidx <- gender[which(ID == tangSub$EGNo[i])] sdiff <- data.frame(refAn = tangSub$EGNo[i], meanDiff = rowMeans(survMat) - rowMeans(cansurvMat), gender = gidx, gearStatus = tangSub$gear[i], injuryClass = tangSub$gearInj[i], severity = tangSub$Severity[i]) diffList[[i]] <- sdiff } survDf <- do.call(rbind.data.frame, diffList)
Ok, let's take a look at it.
p <- ggplot(survDf, aes(meanDiff, color = factor(refAn)))+ geom_histogram()+ facet_grid(severity ~ gender, scales = 'free_y') p
Maybe it would be interesting to group by individual to see the means. I'm trying to get a handle on who the 0 values are.
library(dplyr) survDf %>% group_by(severity, gender) %>% summarise(mean = mean(meanDiff, na.rm = TRUE), var = var(meanDiff, na.rm = TRUE))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.