In JimCarretta/SeriousInjury: Classify Whale Injuries as Serious or Non-Serious with Classification Trees

Background

The R-Package SeriousInjury uses Random Forest classification trees to assess injury severity of large whale entanglements and vessel strikes. Models are built using the R-Package rfPermute, which employs the R-package randomForest. Installing SeriousInjury will also install the rfPermute and randomForest packages. Methods are based on the publication:

Carretta, J.V. and A. Henry 2022. Risk Assessment of Whale Entanglement and Vessel Strike Injuries from Case Narratives and Classification Trees. Frontiers In Marine Science, although SeriousInjury includes data for several more species not included in the publication.

To install the latest SeriousInjury version from GitHub:

# make sure you have devtools installed
if (!require('devtools')) install.packages('devtools')

# install from GitHub
devtools::install_github('JimCarretta/SeriousInjury')

# if this install fails, use this alternative method, using package 'pak':
pak::pkg_install("JimCarretta/SeriousInjury")

library(SeriousInjury)

# see SeriousInjuryTutorial() for a guide to the package

NOAA assesses anthropogenic injuries and deaths under the Marine Mammal Protection Act through a published Serious Injury Process. Injuries are defined as ‘Non-Serious’ or ‘Serious’, where the latter is defined as “any injury that is more likely than not to result in mortality, or any injury that presents a greater than 50 percent chance of death to a marine mammal”.

NOAA is reviewing the Serious Injury policy and procedures for large whales with new data and assessing a Random Forest (RF) method to estimate individual probabilities of a health decline, death, or recovery for entanglements and vessel strikes. Current serious injury procedures require biologists to manually assess a series of conditions such as: “Was entangling gear constricting or loose?” or “Was there a deep laceration?” Injury severity (non-serious vs serious) is determined based on threshold responses to such questions (a constricting entanglement is typically considered a serious injury, while a loose wrap is a non-serious injury). RF models automate that process. RF model covariates are derived from key words and phrases in injury narratives known to be good predictors of non-serious vs serious injuries. Methods, R functions, and application examples for large whale data are summarized in the R-Package SeriousInjury and in this tutorial.

Data

The SeriousInjury package includes five data frames: WhaleData, data.entangle, data.vessel, data.test.entangle, and data.test.vessel.

WhaleData is raw data for whale injury cases. Three column fields of type='character' are required to generate covariates and build injury assessment models.

‘Narrative’ contains injury descriptions from which model covariates are generated.

'Health.status' includes one of three assessed conditions: a known death or health decline ('DEAD.DECLINE'), a whale that survives and recovers from its injuries >= 1 year post-interaction ('RECOVERED'), and data-poor cases for which the final health status is undetermined ('UNKNOWN').

'CAUSE' attributes the main cause of injury to the following sources: entanglement = "EN", entrapment = "ET", and vessel strike = "VS"). Entanglements and entrapments (relatively rare) are both treated as forms of entanglements for purposes of injury modeling.

Injury-Specific data frames

data.entangle and data.vessel are known-outcome (“DEAD.DECLINE” or “RECOVERED”) entanglements and vessel strike cases used to build RF models. Model covariates are generated with the function InjuryCovariates(). These data exclude cases where human intervention to remove entanglements occurred.

data.test.entangle and data.test.vessel include cases where ‘Health.status’ = “UNKNOWN” and are used with the predict() function and the RF objects ‘ModelEntangle’ and ‘ModelVessel’ to assign cases to classes of “DEAD.DECLINE” or “RECOVERED”. The fraction of RF trees that assign a case to each class represent the respective probabilities of a death, health decline, or recovery.

Narrative

Example ‘Narrative’ from which model covariates are derived:

Entanglement injuries at fluke insertions and peduncle with associated cyamids at injured areas and on head. Grey skin and overall poor appearance.

Key words and phrases in the example narrative coded as presence / absence covariates include ‘cyamids’, ‘fluke’, ‘grey skin’, 'head', ‘peduncle’, ‘poor’

Injury covariates

Covariates defined below are derived from injury narratives with the function ‘InjuryCovariates()’. Code to define, maintain, and extract covariates from narratives is found in the R-script InjuryCovariates.R.

mono.hook.line - Indicates a monofilament hook and line fishery entanglement, excluding monofilament gillnets. Uses function ʻMonofilament_Hook_Lineʻ.

mobility.limited - evidence a whale was anchored or immobilized by entangling material or gear. Narrative mentions inability to dive or swim, may refer to a heavily-weighted whale with multiple pots/traps impeding normal movement or an injury that may impede feeding ability.

calf.juv - narrative includes reference to an injured calf or juvenile or that the injury involves the mother of a dependent calf.

constricting - evidence of a constricting entanglement, including line cutting into whale, wrapped tightly around body or flippers.

decline - narrative includes evidence of a health decline, such as the presence of cyamids, emaciation, lesions, discolored skin, deformities caused by a chronic entanglement or severe vessel strike incident.

extensive.severe - a severe injury that can include amputation or necrosis of body parts due to a chronic entanglement or acute vessel strike injury.

fluke.peduncle - includes reference of entanglement or vessel strike injury that involves the tail, flukes, or peduncle.

gear.free - evidence the whale freed itself from entangling material. Typically involves a whale resighted at a later date than the initial entanglement observation.

head - Narrative indicates that the head, mouth, or blowhole was involved in the entanglement or vessel strike injury.

healing - Narrative refers to healing or healed wounds.

laceration.deep - Narrative includes reference to deep laceration resulting from vessel strike or entanglement. May include reference to blubber or muscle layers.

laceration.shallow - Narrative includes reference to shallow or superficial lacerations.

pectoral - Narrative includes involvement of pectoral flipper or ‘fins’ in entanglement or vessel strike.

swim.dive - Evidence that the whale is swimming, feeding, or diving normally.

trailing - Was the whale trailing gear or other entangling material?

VessSpd - Vessel Speed, coded as a factor with 3 possible states: unknown = VSpdUnk, slow = VSpdSlow, fast = VSpdFast. Based on speed references or inferences from ‘Narrative’. Speeds <=10 kts are considered slow, >10 kts are fast.

VessSz - Vessel Size, coded as a factor with 3 possible states: unknown = VSzUnk, small = VSzSmall, large = VSzLarge. Based on size references or inferences from ‘Narrative’. Sizes <=65 ft are considered ‘small’, unless the vessel is much larger than whale. Sizes > 65 ft are considered ‘large’.

wraps.present - Narrative includes reference to whale with multiple wraps of line or gear around body or appendage.

wraps.absent - Narrative includes reference to a whale without any wraps of line or gear around body or appendage.

library(SeriousInjury)
# Generate covariates from case narratives. Append to WhaleData.
WhaleDataCovs <- InjuryCovariates(WhaleData)
# View a portion of WhaleDataCovs with appended covariates
partial.WhaleDataCovs <- cbind.data.frame(WhaleDataCovs$Narrative, WhaleDataCovs$mobility.limited, WhaleDataCovs$calf.juv, WhaleDataCovs$constricting, WhaleDataCovs$decline, WhaleDataCovs$VessSpd, WhaleDataCovs$VessSz)
head(partial.WhaleDataCovs)

# Examine the distribution of injury covariates by health status ('DEAD.DECLINE' and 'RECOVERED') for cases used to build injury models.
barplotCovariates(WhaleDataCovs)

Random Forest (RF) Models

Two models are used in the package SeriousInjury, an entanglement and a vessel strike model. Each is based on 1000 classification trees. Model concepts are shown in Figures 1 and 2.

Figure 1. Example tree used to classify whale injuries as serious (*Dead.Decline*) or non-serious (*Recovered*). Data are based on known-outcome entanglement and vessel strike cases, where a known-outcome is a documented death, health decline or recovery. Health declines are considered serious injuries and recoveries are considered non-serious.

$Figure 2. Models consist of multiple bootstrap trees (a random forest) used to classify ‘out-of-bag’ (OOB) or novel cases. Samples not used in individual tree construction are considered OOB and are used to assess model accuracy through cross-validation. Novel cases represent new data or cases not included in models, for which health status is unknown. The fraction of trees ‘voting’ for a particular class represents the probability of that case belonging to the class Dead.Decline or Recovered.$

The entanglement (ModelEntangle) and vessel strike (ModelVessel) models are objects of class rfPermute. They include known-outcome entanglement and vessel strike injury cases, included as the data frames data.entangle and data.vessel in the R-Package SeriousInjury.

Entanglement Model

# Create randomForest model (using R-Package rfPermute) using known-outcome entanglement data
# set.seed for reproducibility
set.seed(1234)

# how many RF trees to build
size.RF = 1000

# covariates included in ModelEntangle
entangle.covariates = which(names(data.entangle)%in%c("mono.hook.line", "mobility.limited", "calf.juv", "constricting", "decline", 
"extensive.severe", "fluke.peduncle", "gear.free", "head", "healing", "laceration.deep", "laceration.shallow", 
"pectoral", "swim.dive", "trailing", "wraps.present", "wraps.absent"))

# balance sample size for each class; we are equally-interested in correctly      
#  predicting non-serious and serious injuries
sampsize = balancedSampsize(data.entangle$Health.status)

# RF Entanglement model
ModelEntangle = rfPermute(data.entangle$Health.status ~ .,
 data.entangle[,entangle.covariates], sampsize=sampsize, ntree=size.RF, 
  replace=FALSE, importance=TRUE, proximity=TRUE)

# RF Entanglement model Confusion Matrix
ModelEntangle

Vessel Model

# Create randomForest model (using R-Package rfPermute) using known-outcome vessel strike data
# set.seed for reproducibility
set.seed(1234)

# how many RF trees to build
size.RF = 1000
# covariates included in ModelVessel
vessel.covariates = which(names(data.vessel)%in%c("calf.juv", "decline", "extensive.severe", "fluke.peduncle",
"head", "healing", "laceration.deep","laceration.shallow", "pectoral", "VessSpd", "VessSz"))

# balance sample size for each health class; we are equally-interested in correctly      
#  predicting non-serious and serious injuries

sampsize = balancedSampsize(data.vessel$Health.status)

# RF Vessel Strike model

ModelVessel = rfPermute(data.vessel$Health.status ~ .,
 data.vessel[,c(vessel.covariates)], sampsize=sampsize, ntree=size.RF, 
  replace=FALSE, importance=TRUE, proximity=TRUE)

# RF Vessel Strike model Confusion Matrix
ModelVessel

Predictions

Use existing RF models to predict probability of a death, health decline or recovery for cases where the outcome is unknown. Deaths and health declines are considered serious injuries and recoveries are non-serious. Both Dead.Decline and Recovered probabilities are estimated, based on the fraction of RF tree assignments to each class. A binary prediction (either Dead.Decline or Recovered) is also returned, based on the majority class assignment (>50% of trees). In case of ties, which are rare, the model randomly assigns a class.

# Estimate injury classes for entanglements with unknown outcomes, using data frame 'data.test.entangle'.

# Apply RF model ('ModelEntangle') to data.entangle to generate binary and probabilistic model predictions

majority.prediction <- predict(ModelEntangle, data.test.entangle, type='response')

prob.prediction <- predict(ModelEntangle, data.test.entangle, type='prob')

predictions.df <- cbind.data.frame(majority.prediction, prob.prediction, data.test.entangle)

head(predictions.df)

Show condensed version, with covariate states summarized alongside predictions. If the covariate name appears this means it is represented in the field 'Narrative'. Covariates for vessel speed ('VessSpd') and size ('VessSz') include the following states:

table(data.test.vessel$VessSpd)
table(data.test.vessel$VessSz)

# VessSpd and VessSz are ignored in entanglement models.

CovList <- CovariatePresence(data.test.entangle)
EN.probs <- predict(ModelEntangle, data.test.entangle, type="prob")
EN.df.probs <- cbind.data.frame(CovList, EN.probs)
head(EN.df.probs)

Your Own Data

You may have injury narratives not included in the SeriousInjury package for which you want to predict health status and estimate the "DEAD.DECLINE" vs "RECOVERED" probability. This example uses a single entanglement narrative and the existing entanglement model to do that. Narratives must be part of a data frame, where the field name = 'Narrative'. The data frame must also include a field for health status, named 'Health.status', in this case 'UNKNOWN'. The 'Health.status' field value may contain any character string, e.g. 'UNK' can be replaced for 'UNKNOWN'.

Start with a single injury narrative for an unknown health status outcome stored in a data frame:

Narrative <- c("Free swimming with trailing line and small (6 inch) red buoy. Partial disentanglement - gear recovered 700' of 1/2 inch lead line w/ balloon buoy + 100' of 1/2 inch line & small 7 inch buoy; Believe all gear removed, but unable to confirm. Animal looked thin and was having trouble getting to the surface, but no photos to show health status.")

Health.status <- "UNKNOWN"

example.df <- cbind.data.frame(Narrative, Health.status)

# Data-mine Narrative text and append injury covariates

example.df <- InjuryCovariates(example.df)
example.df

Apply existing RF model ('ModelEntangle') to example.df to generate majority and probabilistic predictions of health status. The majority prediction ("DEAD.DECLINE" vs "RECOVERED") is determined by which class is predicted in >50% of 1,000 RF trees. The number of trees assigning the narrative to each respective health class is reflected in the "DEAD.DECLINE" and "RECOVERED" fractions returned. If you have a vessel strike narrative, simply apply the function 'ModelVessel' instead of 'ModelEntangle'.

majority.prediction = predict(ModelEntangle, example.df, type="response")
prob.prediction = predict(ModelEntangle, example.df, type="prob")
predictions <- cbind.data.frame(majority.prediction, prob.prediction)
predictions

Note: the above example narrative is an entanglement case and the appended data frame example.df includes variables for vessel speed and size. This is due to applying a text filter to assign all covariates to all records initially, regardless of injury type (entanglement or vessel strike), and prior to parsing records into separate data frames (data.entangle and data.vessel). In this narrative, the phrases 700' and 100' are interpreted as large vessel (> 65 ft) indicators, although they refer to rope lengths. Vessel covariates are ignored in the application of the entanglement model.

Finally, injury narratives should include words and phrases related to the subject matter. Random snippets of text, such as "the quick brown fox jumps over the lazy dog", will return a health status prediction, but it is meaningless due to a lack of relevant variables.

Shiny app

As an alternative to predicting injury severity from narratives included in a data frame, the user may interactively assign injury characteristics from narratives through a series of checkboxes, via the function 'run_shiny()'. This opens a window that includes the option to include / exclude all injury covariates generated from text narratives with the function InjuryCovariates(). The user should select the appropriate injury type (Entanglement or Vessel Strike) to apply the correct model. The 'Submit' button will update the predicted probability of a death|health decline or recovery.

Example Shiny app user interface and data entry