In robschick/tangled: Creates data objects and plots for Knowlton, Schick et al. manuscript on entanglement

The goal of this vignette is to describe how we calculated and plotted survivorship for whales that had experienced entanglement events. I'll first describe a bit of how the baseline data are organised and used to create the actual survivorship. Then I'll detail the two plots that we've made - one of which is likely to be a plot for the manuscript.

Background Data

Entanglement Event Data

There are a few different data sources that we need to bring together to conduct the analysis, which in this case is to create and plot survivorship using Kaplan-Meier survival curves. In the analysis, we'll document survival for entangled whales, following their last entanglement, and then we'll break that down further to show what survivorship looks like for both sexes, and for each of three entanglement injury classes:

Minor
Moderate
Severe

Definitions of these three are given in Knowlton et al. (XXXX).

We need the entanglement data themselves, which are stored in the tangled package. This is a data frame containing the pertinent information on individual entanglement events, e.g. the individual identifier, the temporal extent, the severity, etc. Here's a data snippet for one animal who's experienced multiple entanglements. (Note that for the purposes of this analysis, we'll mostly be focusing on the last entanglement.)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

library(tangled)
subset(tangleAll, EGNo == 1032)

To get and store just that last event, we call a function called makeEvents(), which will prepare a data frame of the last entanglement event of each animal and add some identifying information about its death status (see next section for details).

library(magrittr)
library(dplyr)
events <- makeEvents()
events <- events[events$EGNo != 1045,]
dplyr::select(events, dtime:presA)

So that shows a snippet of the data, including the 4 main columns that we calculate inside the makeEvents() function. For example, the first line shows that for EGNo 1004, it was a known dead animal, and the month it died was r myName[410]. This can be checked against the input data about known deaths in deadTable:

subset(deadTable, SightingEGNo == 1004)

That gives us the information we need to start examining the death data in order to estimate survivorship.

First we want to be able to label the survival curves to get a sense of the size of each risk group.

events$gender <- gender[match(events$EGNo, ID)]
labs <- events %>% group_by(Severity, gender) %>% summarise(total = n())
labs
legendLabs <- c(paste('Minor (F = ', labs[1, 3], ', M = ', labs[2, 3], ')', sep = ''),
                paste('Moderate (F = ', labs[3, 3], ', M = ', labs[4, 3], ')', sep = ''),
                paste('Severe (F = ', labs[5, 3], ', M = ', labs[6, 3], ')', sep = ''))

Death Data

To estimate survivorship, we need information about the timing of deaths for each individual whale. As seen in the events data frame, there are three categories for death status:

Known dead, i.e. the carcass of the animal was seen and identified
Presumed dead, i.e. model estimates indicate the animal is likely to have died
Presumed alive, i.e. model estimates indicate the animal is likely still alive

For categories 2 and 3, we use a date threshold to signify the cutoff date. As of this writing (April, 2016), we are using December, 2013 as the cutoff in the modelling. While the temporal domain of the model extends beyond 12/2013, this is the date when the sightings data from the photo-Identification efforts are assumed to be complete. We presume an animal is dead if their death that is estimated prior to 12/2013. Similarly, we presume an animal ss alive if the model estimates they are alive after 12/2013.

All of these data come from the deathyr matrix, which is an n by nt matrix with estimates of when the animal died. (For a known dead animal, the death is fixed at the date in deadTable.) For a presumed dead animal, there is typically a range of possible death months. We can use this uncertainty to calculate survivorship.

For reference, here's what the data in deathyr look like for one animal - EGNo 1001, which I've paired with some date information:

data.frame(deathyr[1, 270:290], myName[270:290])

Note that I've snipped out most of the entries because they are all 0. What this data chunk says is that it's probable that the animal died sometime between September 1992 and December 1993, and that the most probable month of death is December, 1992.

There are some other helper data that we need, but it's not worth specifying them at present. Armed with these two bits of information we can start calculating survivorship.

Censoring and Death Data - All Entangled Animals

Before we calculate survivorship, we need to determine when the animals died and when there were censored. We start by using the data from events and feeding it to calckdpaSurvdat(). The output from this will be a data frame containing information for animals that are in the known dead or presumed alive category. These categories are calculated first, because there is no uncertainty around their deaths, i.e. we either know when they died from deadTable or they are alive. We'll come on to the presumed dead animals shortly.

kdpasurvldf <- calckdpaSurvdat(events)
head(kdpasurvldf)

These data are what we'll need to calculate survivorship. The help files for the calckdpaSurvdat() function describe the individual variables, but essentially what the function does is to take the end of the last entanglement event, and determine the length of time between that event and either 1) its death date if known dead, or 2) its censoring date if presumed alive.

We then have to do the same thing for the presumed dead animals. The code is very similar with the only major difference that we pass it the output from the previous function, and append it to the presumed dead animals. The output from that is one big data frame that has the data for all three classes of death types.

survdf <- calcpresdSurvdat(events, kdpasurvldf)

Survivorship - All Entangled Animals

With the data assembled we can calculate survivorship. This is a simple call to calcKMCurves()

tmp <- calcKMCurves(survdf, kdpasurvldf, nboot = 1, dcut, increment = 12, medProb = TRUE)

This function call bears some explaining. Why would we pass both kdpasurvldf and survdf given that calcpresdSurvdat() assembles both? The reason is that we want to allow for flexibility in plotting should we want to include uncertainty around the survivorship. Because the possible deaths of the presumed dead animals have a range, we can sample from that posterior (deathyr) to include that uncertainty.

To do that, then, we specify a boot strap, and sample from deathyr to get different estimates of death month from the posterior. In that case, we want to update the above data within calcKMCurves() within a bootstrap loop. By passing the kdpasurvldf we can store the values from the known dead and presumed alive animals, which will not change within a bootstrap.

The outputs from the calcKMCurves() function are two fold, and will be used to plot the curve. In the tmp list, we have the KM curve and the ticks when animals get censored. These can then be passed to the plotting code:

library(ggplot2)
plotSurv(tmp$kmlines, tmp$censTicks, 7, increment = 12)

If we wanted to see that with uncertainty, we'd have to recalculate some of the data:

incVal <- 12
tmp <- calcKMCurves(survdf, kdpasurvldf, nboot = 10, dcut, increment = incVal, medProb = FALSE)
p <- plotSurv(tmp$kmlines, tmp$censTicks, 7, increment = incVal)

Doesn't look that different! If we zoom in a bit, it's easier to see

p + lims(x = c(2, 4), y = c(0.75, 1))

The takeaway, though, is that at the yearly scale, there's relatively little uncertainty. We can try again calculating survivorship at the monthly scale:

incVal <- 1
tmp <- calcKMCurves(survdf, kdpasurvldf, nboot = 10, dcut, increment = incVal, medProb = FALSE)
pmon <- plotSurv(tmp$kmlines, tmp$censTicks, 7, increment = incVal)
pmon + lims(x = c(1, 36), y = c(0.75, 1))

I prefer at the yearly level for now and without uncertainty, so will stick with that.

incVal <- 12
yearsOut <- 6
tmp <- calcKMCurves(survdf, kdpasurvldf, nboot = 1, dcut, increment = incVal, medProb = TRUE)
p <- plotSurv(tmp$kmlines, tmp$censTicks, yearEnd = yearsOut, increment = incVal)
ggsave(plot = p, filename = 'survivalAll.pdf', path = '~/Dropbox/Papers/KnowltonEtAl_Entanglement/images', device = 'pdf')
ggsave(plot = p, filename = 'survivalAll.png', path = '~/Dropbox/Papers/KnowltonEtAl_Entanglement/images', device = 'png', dpi = 300, width = 9, height = 6, units = 'in', scale = 0.8)

Survival of Entangled Whales be Sex and Severity

Ok, the processing logic is identical to that above, but the functions differ in the sense that they split out the data by gender and severity. This will likely be the figure we use for the manuscript. I'm going to plot it without uncertainty here, and I'll stop the plotting at 6 years following entanglement.

incVal <- 12
yearsOut <- 6
tmp <- calcKMCurvesSevGen(survdf, kdpasurvldf, nboot = 10, dcut, increment = incVal, medProb = TRUE)
p <- plotSurvGenderSeverity(tmp$kmlines, tmp$censTicks, yearEnd = yearsOut, increment = incVal, legendLabs)
p
ggsave(plot = p, filename = 'survivalGenderSeverity.pdf', 
       path = '~/Dropbox/Papers/KnowltonEtAl_Entanglement/images', device = 'pdf', width = 9, height = 6, units = 'in')
ggsave(plot = p, filename = 'survivalGenderSeverity.png', 
       path = '~/Dropbox/Papers/KnowltonEtAl_Entanglement/images', device = 'png', dpi = 300, width = 9, height = 6, units = 'in', scale = 0.8)

Statistical Tests

Ok, with the data now assembled, we need to look at performing some tests. We'll turn to the survival package, in which we can "implement the G-rho family" of tests. To conduct the analyses, we need the data in a particular format. I wrote a function to do this (returnSurvdf()).

In the dfSurv data frame that is returned from this function, the new data columns are futime and event, where futime is the time (in months) from the end of the last entanglement to the event - either the animal dies, or is censored. For the purposes of this analysis, we define censoring using the dcut variable, which is December of the year through which matching of the photo-ID data is done. As of now (April 2017) the date we are using is December 2013.

In terms of the event variable, it is comprised of 0s and 1s, where 0 = alive (but really it means right-censored), and 1 = dead. These are combined to make a Surv object, so that when an animal is censored it's that time plus because it was still alive. Note how they are represented differently before and after the object is created with Surv():

library(survival)
dfSurv <- returnSurvdf(survdf)
head(dfSurv)
head(Surv(dfSurv$futime, dfSurv$event))

Ok, the data are together, and we have the accompanying information about gender, the three class injury category, and the type of injury on the ordinal 1-6 scale. As a reminder, that is:

Minor - no gear
Moderate - no gear
Minor - gear
Severe - no gear
Moderate - gear
Severe - gear

(n.b. this is stored in the package as entvec)

Ok, let's examine a few different formulations. First we'll test if the curves for males and females are significantly different. From initial examination of the curves above, it looks like male survival is higher, so we'll use males as the reference group. It's generally easier to interpret a Hazard Ratio that exceeds 1 (Kleinbaum & Klein, 2012). To do this, we'll just relevel this factor, and then calculate the test:

dfSurv$gender <- relevel(factor(dfSurv$gender), ref = 'M')
survdiff(Surv(futime, event) ~ gender, data = dfSurv)

And they are, with females having significantly worse survival. This is shown in the Observed column of the summary. What this says is that more females died than were expected.

We can next text for the difference between entanglement severity, i.e. the three category designation:

survdiff(Surv(futime, event) ~ Severity, data = dfSurv)

Again, these are significantly different, though, it appears from the table that this difference is driven by the severe category (note the low Chi^2 values for minor and moderate. We can test this by peeling off the severe category, and just testing for the differences between minor and moderate (note that we'll see these differences picked up in the proportional hazards test as well). Indeed, these are not significant:

survdiff(Surv(futime, event) ~ Severity, data = subset(dfSurv, Severity != 'severe'))

Ok, let's put the gender and severity together

survdiff(Surv(futime, event) ~ Severity + gender, data = dfSurv)

This shows that the curves are significantly different, and these differences, again, are driven by the severe entanglements. We see more deaths for the severely entangled whales than expected for both males and females. We also see that male survival for animals with minor entanglements is higher than expected.

Cox Proportional Hazards

We can also look at the proportional hazards of the different entanglement categories. As with the tests of difference, we'll start simple, and build in complexity.

cfit0 <- coxph(Surv(futime, event) ~ Severity, data = dfSurv)
summary(cfit0)

We can think of cfit0 as the base model that just tests for the differential effects of injury type on survival, whereas below, we might want to examine this in terms of the effect accounting for covariates of interest, e.g. sex of the animal. We take that up with the next model.

cfit1 <- coxph(Surv(futime, event) ~ Severity + gender, data = dfSurv)
summary(cfit1)

What the cfit1 model shows is that moderate entanglements are better for survival than the reference case (minor). However, these differences are not significant. In contrast, cfit1 does show that severe entanglements are much worse than minor with a hazard ration of 5.394, and this is highly significant. Finally, males are significantly better off in terms of survival than females.

We can investigate if an interaction exists between gender and severity with the following model:

cfit2 <- coxph(Surv(futime, event) ~ interaction(Severity, gender), data = dfSurv)
summary(cfit2)

cfit2 suggests this interaction term is significant, and therefore worth keeping in the model.

Let's look at the relative death rates for each gender/severity combination, i.e. the hazard ratios, in more detail. These are the second part of that summary object Recall that males with minor entanglements are the reference case, so their rate is 1.0.

cmean <- matrix(c(0, coef(cfit2)), nrow=2, byrow = T)
cmean <- rbind(cmean, cmean[2, ] - cmean[1,])
dimnames(cmean) <- list(c("M", "F", "F/M ratio"), c('Minor', 'Moderate', 'Severe'))
signif(exp(cmean),3)

This shows a few interesting findings:

Females with severe entanglements are 8 times as likely to die as males with a minor entanglement.
Females with severe entanglements are 4 times as likely to die as females with a minor entanglement.
For all categories, females are more likely to die, than males, though the difference is strongest in the Minor category.
Females and Males in the Moderate category have reduced rates as compared to the Minor category, though when factoring in the 95% CI, these differences are not significant, i.e. the estimates straddle 1.

Summary

What the plots of the survival data and proportional hazards analysis confirm, is that severe entanglements are very bad for right whales of both sexes. It also confirms that in general survival is lower for females than it is for males. The exceptions lie in the moderate category. While females with moderate entanglements do have a higher hazard than males, this result is not significant at p = 0.079.

Though it does not come as a surprise that severe entanglements are much worse, one surprising finding is that minor entanglements are significantly worse for females. Since we know from previous work that adult female survival is of paramount importance (Fujiwara and Caswell, 2001), this stresses that further work needs to be done to reduce the effects of entanglement on survival.

Further investigation into the effect of covariates on survival could include tests of age, though there are many assumptions/issues that need to be addressed herein. For example, is it age at last entanglement, versus time on study (i.e. time between entry and event [death or censoring]), population level health, and possibly environmental covariates like the NAO. These investigations are beyond the scope of this paper, but bear further inspection.

robschick/tangled documentation built on May 9, 2022, 4:07 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com