README.md

Habitat Model Abundance Indices estimated from RACE Bottom Trawl Survey using GLM, GAM and Random Forest

Chris Rooper June 25, 2018

Purpose

The purpose of this code is to make GLM, GAM and Random Forest models based on habitat variables. These models are then used to compute model based estimates of abundance for fishes in the Aleutian Islands and Gulf of Alaska. The code takes bottom trawl survey data and habitat variables from RacebaseExtract.R code. The package produces annual abundance estimates with errors (either by the Delta method or bootstrapping).

Setup data

Define species of interest and habitat variables

This code is designed to be run with output from RACEBASE (or other data source) where rows represent hauls to be used in the calculation of the survey index and columns indicate CPUE for species (zero-filled data), habitat and other variables. An example is the data included for juvenile Pacific Ocean perch included (the same data as in the MEHRSI package).

data("Juvenile_POP_Data")
pandoc.table(head(Juvenile_POP_Data),caption="Example RACEBASE extract for juvenile POP catches in the Gulf of Alaska",split.table=Inf)
## 
## ---------------------------------------------------------------------------------------------------------------------------------
##  hauljoin    lat     long    year   tdepth   btemp   bdepth   slope     inverts    ttemp   shrimp    juvenile_POP_CPUE   STRATUM 
## ---------- ------- -------- ------ -------- ------- -------- -------- ----------- ------- --------- ------------------- ---------
##   881077    56.14   -135.2   1996    9.4      4.4     291     1.882        0       11.8    0.01286           0             250   
## 
##   881078    55.92   -135.4   1996    25.4     4.9     275     6.926     0.6885     11.56    1.955          5.518           251   
## 
##   881079    55.93   -135.4   1996    36.4     4.2     369     8.288      0.18      10.77      0              0             351   
## 
##   881080    55.93   -134.9   1996    10.4     5.4     180     0.7038    0.8072     10.95   0.05597         8.795           151   
## 
##   881081    55.89   -134.6   1996    14.4     8.1      85       0      0.0004213   9.916      0              0             50    
## 
##   881082    55.66   -134.6   1996    20.4     5.5     206     0.4036       0       12.05   0.02139        0.3564           251   
## ---------------------------------------------------------------------------------------------------------------------------------
## 
## Table: Example RACEBASE extract for juvenile POP catches in the Gulf of Alaska

In this data you have a column for the haul identifier, the midpoint positions of the tow, a year variable indicating the survey year, the depth of the thermocline, the bottom temperature and bottom depth and the seafloor slope. The catch of structure forming invertebrates (coral and sponge), the termperature at the thermocline and the total catch of shrimp are also included as potential habitat variables. The final column is the CPUE of juvenile Pacific Ocean perch (in kg/ha). For this analysis, POP were considered juveniles if they were < 250 mm in length.

For both the GLM and GAM approaches, we will use a compound model that predicts a binary process (presence or absence) and a log-normal process with habitat covariates. The models both assume a Delta Log-Normal distribution for the data.

First we do a little data manipulation to ID the data set to be used for presence-absence modeling and the subset of data with positive catches only for use in the log-normal model. The species names are set up for easy looping if you have multiple species in the same data set. In this case the "species.name" variable can include a vector of species names corresponding to column names in the data set. Then you could just loop around the entire code below, with some additonal code to capture the model outputs.

PA.data<-Juvenile_POP_Data
species.name<-c("juvenile_POP_CPUE")
CPUE.data<-subset(PA.data,PA.data[species.name]>0)

GENERAL LINEAR MODEL

Presence-absence model

For the presence and absence model we specify the x and y variables and set up the formula. In this case we have chosen 4 x variables and converted the CPUE data to presence (1) or absence (0). It is important to include year as a factor in the formula so that the annual abundance index can be predicted.

glm.pa.xvars<-c("inverts","slope","btemp","bdepth")
glm.pa.yvar<-ifelse(PA.data[species.name]>0,1,0)
glm.pa.form <- as.formula(paste("glm.pa.yvar ~", paste(glm.pa.xvars,collapse="+"),"+as.factor(year)",sep=""))

Next, we fit the initial glm model using a binomial distribution. We use backwards term selection based on the model AIC to remove insignificant terms in the equation. The best-fitting model is determined when the removal of additional terms results in no improvement to the AIC.

for(i in 1:length(glm.pa.xvars)){
pa.glm <- glm(glm.pa.form, family = binomial, data = PA.data)
gcv_glm<-pa.glm$aic
    pvals<-summary(pa.glm)$coefficients[2:(length(glm.pa.xvars)+1),4]
    least_sig<-which.max(pvals)

glm.pa.xvars1<-glm.pa.xvars[-least_sig]
glm.pa.form1 <- as.formula(paste("glm.pa.yvar ~", paste(glm.pa.xvars1,collapse="+"),"+as.factor(year)",sep=""))
pa.glm1 <- glm(glm.pa.form1, family = binomial, data =PA.data)
gcv_glm1<-pa.glm1$aic

if(gcv_glm1<=gcv_glm){
    glm.pa.form<-glm.pa.form1
#   gcv_glm<-gcv_glm1
    glm.pa.xvars<-glm.pa.xvars1
}
if(gcv_glm1>gcv_glm)break}
pander(summary(pa.glm),split.table=Inf, caption="Best-fitting model of Juvenile POP presence or absence")
  Estimate Std. Error z value Pr(>|z|) (Intercept) -1.125 0.1154 -9.747 1.896e-22 inverts 0.6468 0.05808 11.14 8.386e-29 slope 0.08109 0.01702 4.765 1.885e-06 bdepth -0.002824 0.0003955 -7.141 9.265e-13 as.factor(year)1999 0.0658 0.1405 0.4683 0.6396 as.factor(year)2001 0.147 0.1592 0.9234 0.3558 as.factor(year)2003 0.2812 0.1358 2.071 0.03835 as.factor(year)2005 0.6573 0.1309 5.023 5.091e-07 as.factor(year)2007 0.1486 0.1381 1.076 0.2819 as.factor(year)2009 0.4808 0.1282 3.752 0.0001757

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 5155 on 4474 degrees of freedom Residual deviance: 4907 on 4465 degrees of freedom

Using the best-fitting model we plot the typical data checks, residuals, and the observed and predicted values.

par(mfrow=c(2,2))
plot(pa.glm)

par(mfrow=c(1,1))
pa.glm.data<-data.frame(pa.glm$y,pa.glm$fitted.values,pa.glm$data$year)
ggplot(pa.glm.data,aes(x=as.factor(pa.glm.data.year),y=pa.glm.fitted.values,fill=as.factor(pa.glm.y)))+geom_violin()+geom_boxplot(width=.1,position=position_dodge(.9))+xlab("Absence or Presence")+ylab("Predicted probability of presence")+scale_fill_brewer(palette="Blues")+theme(legend.position="none")

ABUNDANCE MODEL

For the abundance model we specify the x and y variables and set up the formula. In this case we have chosen the same 4 x variables as the presence-absence model (this doesn't have to be true) and log-transformed the CPUE data. Again, it is important to include year as a factor in the formula so that the annual abundance index can be predicted.

glm.cpue.yvar<-unlist(log(CPUE.data[species.name]))
glm.cpue.xvars<-c("inverts","slope","btemp","bdepth")
glm.cpue.form <- as.formula(paste("glm.cpue.yvar ~", paste(glm.cpue.xvars,collapse="+"),"+as.factor(year)",sep=""))

Next, we fit the initial glm model using a guassian distribution. We use backwards term selection based on the model AIC to remove insignificant terms in the equation. The best-fitting model is determined when the removal of additional terms results in no improvement to the AIC.

for(i in 1:length(glm.cpue.xvars)){
cpue.glm <- glm(glm.cpue.form, family = gaussian, data = CPUE.data)
gcv_glm<-cpue.glm$aic
    pvals<-summary(cpue.glm)$coefficients[2:(length(glm.cpue.xvars)+1),4]
    least_sig<-which.max(pvals)

glm.cpue.xvars1<-glm.cpue.xvars[-least_sig]
glm.cpue.form1 <- as.formula(paste("glm.cpue.yvar ~", paste(glm.cpue.xvars1,collapse="+"),"+as.factor(year)",sep=""))
cpue.glm1 <- glm(glm.cpue.form1, family = gaussian, data =CPUE.data)
gcv_glm1<-cpue.glm1$aic

if(gcv_glm1<=gcv_glm){
    glm.cpue.form<-glm.cpue.form1
#   gcv_glm<-gcv_glm1
    glm.cpue.xvars<-glm.cpue.xvars1
}
if(gcv_glm1>gcv_glm)break}
print(summary(cpue.glm))
## 
## Call:
## glm(formula = glm.cpue.form, family = gaussian, data = CPUE.data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.374  -1.256  -0.166   1.049   5.953  
## 
## Coefficients:
##                     Estimate Std. Error t value
## (Intercept)          -0.8093     0.3731   -2.17
## inverts               0.3729     0.0570    6.54
## slope                 0.0905     0.0176    5.13
## btemp                 0.2612     0.0630    4.14
## as.factor(year)1999   0.1196     0.1894    0.63
## as.factor(year)2001   0.1529     0.2126    0.72
## as.factor(year)2003   0.2530     0.1818    1.39
## as.factor(year)2005   0.3935     0.1714    2.30
## as.factor(year)2007   0.4218     0.1875    2.25
## as.factor(year)2009   0.3966     0.1714    2.31
##                     Pr(>|t|)    
## (Intercept)            0.030 *  
## inverts              9.2e-11 ***
## slope                3.4e-07 ***
## btemp                3.7e-05 ***
## as.factor(year)1999    0.528    
## as.factor(year)2001    0.472    
## as.factor(year)2003    0.164    
## as.factor(year)2005    0.022 *  
## as.factor(year)2007    0.025 *  
## as.factor(year)2009    0.021 *  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.44)
## 
##     Null deviance: 3087.6  on 1175  degrees of freedom
## Residual deviance: 2841.5  on 1166  degrees of freedom
## AIC: 4397
## 
## Number of Fisher Scoring iterations: 2

Using the best-fitting model we plot the typical data checks, residuals, and the observed and predicted values.

par(mfrow=c(2,2))
plot(cpue.glm)

par(mfrow=c(1,1))
cpue.glm.data<-data.frame(cpue.glm$y,cpue.glm$fitted.values,cpue.glm$data$year)

ggplot(cpue.glm.data,aes(x=cpue.glm.y,y=cpue.glm.fitted.values,color=as.factor(cpue.glm.data.year)))+geom_point()+xlab("Observed CPUE")+ylab("Predicted CPUE")+theme(legend.position="right",legend.title=element_blank(),panel.background=element_blank())+geom_smooth(method="lm",formula=y~x,se=FALSE)

calc_RMSE(cpue.glm$y,cpue.glm$fitted.values)
## [1] 1.55

COMBINING GLMs INTO ESTIMATE

The next step in computing the annual index of abundance is to predict the value of CPUE in each year, holding the other variables constant at their median values. For each year, it is important to generate a variance estimate. In this package you can do this either by bootstrapping or by the delta method.

For the delta method use the "predict.glm.index" function. This will take the two models (for presence-absence and abundance), combine them to create the index of abundance (pa.glm*cpue.glm with x variables held at their median values). It will output a table of year, estimate and 95% confidence intervals (based on the delta-method) for the survey data.

glm.index<-predict.glm.index(pa.glm,cpue.glm)

pander::pandoc.table(glm.index,row.names=FALSE,digits=4, caption="Survey abundance index using Delta-lognormal GLM and delta-method for estimating confidence intervals")
## 
## -----------------------------------------
##  years   glm.index   lower_CI   upper_CI 
## ------- ----------- ---------- ----------
##  1996      1.109      0.9617     1.278   
## 
##  1999      1.141      0.9919     1.311   
## 
##  2001      1.158      0.9921     1.353   
## 
##  2003      1.204      1.042      1.392   
## 
##  2005      1.333      1.145      1.552   
## 
##  2007      1.223       1.06      1.412   
## 
##  2009      1.289      1.116      1.489   
## -----------------------------------------
## 
## Table: Survey abundance index using Delta-lognormal GLM and delta-method for estimating confidence intervals
p<-ggplot(glm.index,aes(x=years,y=glm.index))+geom_line()+geom_point()+
  geom_ribbon(aes(ymin=lower_CI, ymax=upper_CI),
              alpha=0.2)+xlab("Year")+ylab("Index of abundance")+scale_x_continuous(breaks=glm.index$years)+ggtitle(paste("Survey index for ",species.name, " (GLM and Delta method CI's)",sep=""))
p

For the bootstrapping method use the "predict.glm.bindex" function. This will take the two models (for presence-absence and abundance), combine them to create the index of abundance (pa.glm*cpue.glm with x variables held at their median values). It will output a table of year, estimate and 95% confidence intervals (based on the bootstrap) for the survey data. You can specify the number of bootstraps in the predict.glm.bindex function by setting boot_reps=..., otherwise the default is 500 replicates.

glm.indexb<-predict.glm.bindex(pa.glm,cpue.glm)

pander::pandoc.table(glm.indexb,row.names=FALSE,digits=4,caption="Survey abundance index using Delta-lognormal GLM and bootstrapping method for estimating confidence intervals")
## 
## -------------------------------------------------
##  years   glm.index   lower_bootCI   upper_bootCI 
## ------- ----------- -------------- --------------
##  1996      1.109        0.991          1.226     
## 
##  1999      1.141        1.036          1.245     
## 
##  2001      1.158        1.026          1.291     
## 
##  2003      1.204        1.107          1.302     
## 
##  2005      1.333        1.245          1.421     
## 
##  2007      1.223        1.116          1.331     
## 
##  2009      1.289        1.205          1.373     
## -------------------------------------------------
## 
## Table: Survey abundance index using Delta-lognormal GLM and bootstrapping method for estimating confidence intervals
p<-ggplot(glm.indexb,aes(x=years,y=glm.index))+geom_line()+geom_point()+
  geom_ribbon(aes(ymin=lower_bootCI, ymax=upper_bootCI),
              alpha=0.2)+xlab("Year")+ylab("Index of abundance")+scale_x_continuous(breaks=glm.index$years)+ggtitle(paste("Survey index for ",species.name, " (GLM and bootstrapped CI's)",sep=""))
p

At this point we have generated a glm estimate of the abundance of our species using the delta-lognormal method.

GENERALIZED ADDITIVE MODEL

The next step is to generate the survey index using a delta-GAM method. It follows a parallel process to the glm, with some minor tweaks to adjust inputs and outputs for the various functions. The GAM models are generated using the mgcv package (Wood 2006)

Presence-absence model

For the presence and absence model we specify the x and y variables and set up the formula. In this case we have chosen 4 x variables and converted the CPUE data to presence (1) or absence (0). It is important to include year as a factor in the formula so that the annual abundance index can be predicted. Habitat variables are included as part of the formula with an s() indicating fitting a smooth spline to the data, k=4 is used here to reduce the number of inflection points in the smooth fit. Since it is presence absence data, a binomial distribution and link=logit is used.

gam.pa.xvars<-c("s(inverts,k=4)","s(slope,k=4)","s(btemp,k=4)","s(bdepth,k=4)")
gam.pa.yvar<-ifelse(PA.data[species.name]>0,1,0)
gam.pa.form <-as.formula(paste("gam.pa.yvar ~", paste(gam.pa.xvars,collapse="+"),"+as.factor(year)",sep=""))

Next we fit the GAM to the presence-absence data and then iteratively reduce the model to its most parsimonious form by removing insignificant habitat variables. See Weinberg and Kotwicki 2009 for the details on variable removal.

for(i in 1:length(gam.pa.xvars)){
pa.gam <- gam(gam.pa.form, family = binomial, data = PA.data,control=list(keepData=TRUE))
gcv_gam<-pa.gam$gcv.ubre
    pvals<-summary(pa.gam)$s.pv
    least_sig<-which.max(pvals)

gam.pa.xvars1<-gam.pa.xvars[-least_sig]
gam.pa.form1 <- as.formula(paste("gam.pa.yvar ~", paste(gam.pa.xvars1,collapse="+"),"+as.factor(year)",sep=""))
pa.gam1 <- gam(gam.pa.form1, family = binomial, data =PA.data)
gcv_gam1<-pa.gam1$gcv.ubre

pts1<-ifelse(gcv_gam>gcv_gam1,1,0)
pts2<-ifelse(round(summary(pa.gam)$edf[least_sig],1)==1,1,0)
pts3<-ifelse(summary(pa.gam)$s.pv[least_sig]>.05,1,0)
if(pts1+pts2+pts3>1){
    gam.pa.form<-gam.pa.form1
    gam.pa.xvars<-gam.pa.xvars1}
if(pts1+pts2+pts3<2)break}
print(summary(pa.gam))
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## gam.pa.yvar ~ s(inverts, k = 4) + s(slope, k = 4) + s(btemp, 
##     k = 4) + s(bdepth, k = 4) + as.factor(year)
## 
## Parametric coefficients:
##                     Estimate Std. Error z value
## (Intercept)           -2.914      0.450   -6.47
## as.factor(year)1999    0.350      0.154    2.27
## as.factor(year)2001    0.365      0.172    2.12
## as.factor(year)2003    0.484      0.149    3.25
## as.factor(year)2005    0.897      0.145    6.20
## as.factor(year)2007    0.490      0.154    3.19
## as.factor(year)2009    1.003      0.144    6.98
##                     Pr(>|z|)    
## (Intercept)          9.8e-11 ***
## as.factor(year)1999   0.0229 *  
## as.factor(year)2001   0.0336 *  
## as.factor(year)2003   0.0011 ** 
## as.factor(year)2005  5.5e-10 ***
## as.factor(year)2007   0.0014 ** 
## as.factor(year)2009  3.1e-12 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(inverts) 2.93   3.00  178.9 < 2e-16 ***
## s(slope)   2.74   2.95   27.5 3.8e-06 ***
## s(btemp)   2.94   3.00   39.0 1.6e-08 ***
## s(bdepth)  2.09   2.20  127.8 1.6e-08 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.23   Deviance explained = 21.1%
## UBRE = -0.083632  Scale est. = 1         n = 4475

Using the best-fitting model we plot the typical data checks, residuals, and the observed and predicted values.

par(mfrow=c(2,2))
gam.check(pa.gam)

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-4.54e-08,1.33e-07]
## (score -0.0836 & scale 1).
## Hessian positive definite, eigenvalue range [2.47e-05,0.0222].
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value    
## s(inverts) 3.00 2.93    0.91  <2e-16 ***
## s(slope)   3.00 2.74    0.90  <2e-16 ***
## s(btemp)   3.00 2.94    0.91  <2e-16 ***
## s(bdepth)  3.00 2.09    0.98     0.2    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pa.gam,scale=0,shade=TRUE,all.terms=TRUE,pages=1)

par(mfrow=c(1,1))
pa.gam.data<-data.frame(pa.gam$y,pa.gam$fitted.values,pa.gam$data$year)
ggplot(pa.gam.data,aes(x=as.factor(pa.gam.data.year),y=pa.gam.fitted.values,fill=as.factor(pa.gam.data[,1])))+geom_violin()+geom_boxplot(width=.1,position=position_dodge(.9))+xlab("Absence or Presence")+ylab("Predicted probability of presence")+scale_fill_brewer(palette="Blues")+theme(legend.position="none")

ABUNDANCE MODEL

For the abundance model we specify the x and y variables and set up the formula. In this case we have chosen the same 4 x variables as the presence-absence model (this doesn't have to be true) and log-transformed the CPUE data. Again, it is important to include year as a factor in the formula so that the annual abundance index can be predicted.

gam.cpue.yvar<-unlist(log(CPUE.data[species.name]))
gam.cpue.xvars<-c("s(inverts,k=4)","s(slope,k=4)","s(btemp,k=4)","s(bdepth,k=4)")
gam.cpue.form <- as.formula(paste("gam.cpue.yvar ~", paste(gam.cpue.xvars,collapse="+"),"+as.factor(year)",sep=""))

Next, we fit the initial gam model using a guassian distribution. We use backwards term selection based on the model fit (see Weinberg and Kotwicki 2009) to remove insignificant terms in the equation. The best-fitting model is determined when the removal of additional terms results in no improvement to the model.

for(i in 1:length(gam.cpue.xvars)){
cpue.gam <- gam(gam.cpue.form, family = gaussian, data = CPUE.data,control=list(keepData=TRUE))
gcv_gam<-cpue.gam$gcv.ubre
    pvals<-summary(cpue.gam)$s.pv
    least_sig<-which.max(pvals)

gam.cpue.xvars1<-gam.cpue.xvars[-least_sig]
gam.cpue.form1 <- as.formula(paste("gam.cpue.yvar ~", paste(gam.cpue.xvars1,collapse="+"),"+as.factor(year)",sep=""))
cpue.gam1 <- gam(gam.cpue.form1, family = gaussian, data =CPUE.data)
gcv_gam1<-cpue.gam1$gcv.ubre

pts1<-ifelse(gcv_gam>gcv_gam1,1,0)
pts2<-ifelse(round(summary(cpue.gam)$edf[least_sig],1)==1,1,0)
pts3<-ifelse(summary(cpue.gam)$s.pv[least_sig]>.05,1,0)
if(pts1+pts2+pts3>1){
    gam.cpue.form<-gam.cpue.form1
    gam.cpue.xvars<-gam.cpue.xvars1
}
if(pts1+pts2+pts3<2)break}
print(summary(cpue.gam))
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## gam.cpue.yvar ~ s(inverts, k = 4) + s(slope, k = 4) + s(btemp, 
##     k = 4) + s(bdepth, k = 4) + as.factor(year)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value
## (Intercept)           0.8536     0.1348    6.33
## as.factor(year)1999   0.0785     0.1894    0.41
## as.factor(year)2001   0.2248     0.2107    1.07
## as.factor(year)2003   0.2969     0.1797    1.65
## as.factor(year)2005   0.4208     0.1700    2.48
## as.factor(year)2007   0.4265     0.1860    2.29
## as.factor(year)2009   0.4339     0.1695    2.56
##                     Pr(>|t|)    
## (Intercept)          3.5e-10 ***
## as.factor(year)1999    0.679    
## as.factor(year)2001    0.286    
## as.factor(year)2003    0.099 .  
## as.factor(year)2005    0.013 *  
## as.factor(year)2007    0.022 *  
## as.factor(year)2009    0.011 *  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(inverts) 2.83   2.98 18.09 2.6e-11 ***
## s(slope)   1.25   1.45 12.36 0.00036 ***
## s(btemp)   2.69   2.93  4.31 0.00758 ** 
## s(bdepth)  2.88   2.99  7.45 4.5e-05 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0999   Deviance explained = 11.2%
## GCV = 2.3993  Scale est. = 2.3653    n = 1176

Using the best-fitting model we plot the typical data checks, residuals, and the observed and predicted values.

par(mfrow=c(2,2))
gam.check(cpue.gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 18 iterations.
## The RMS GCV score gradient at convergence was 5.43e-06 .
## The Hessian was positive definite.
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value    
## s(inverts) 3.00 2.83    0.95   0.055 .  
## s(slope)   3.00 1.25    0.91  <2e-16 ***
## s(btemp)   3.00 2.69    0.98   0.255    
## s(bdepth)  3.00 2.88    0.99   0.400    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(cpue.gam,scale=0,shade=TRUE,all.terms=TRUE,pages=1)

par(mfrow=c(1,1))
cpue.gam.data<-data.frame(cpue.gam$y,cpue.gam$fitted.values,cpue.gam$data$year)

ggplot(cpue.gam.data,aes(x=cpue.gam.data[,1],y=cpue.gam.fitted.values,color=as.factor(cpue.gam.data.year)))+geom_point()+xlab("Observed CPUE")+ylab("Predicted CPUE")+theme(legend.position="right",legend.title=element_blank(),panel.background=element_blank())+geom_smooth(method="lm",formula=y~x,se=FALSE)

calc_RMSE(cpue.gam$y,cpue.gam$fitted.values)
## [1] 1.53

COMBINING GAMs INTO ESTIMATE

The next step in computing the annual index of abundance is to predict the value of CPUE in each year, holding the other variables constant at their median values. For each year, it is important to generate a variance estimate. In this package you can do this either by bootstrapping or by the delta method.

For the delta method use the "predict.gam.index" function. This will take the two models (for presence-absence and abundance), combine them to create the index of abundance (pa.gam*cpue.gam with x variables held at their median values). It will output a table of year, estimate and 95% confidence intervals (based on the delta-method) for the survey data.

gam.index<-predict.gam.index(pa.gam,cpue.gam)

pander::pandoc.table(gam.index,row.names=FALSE,digits=4, caption="Survey abundance index using Delta-lognormal GAM and delta-method for estimating confidence intervals")
## 
## -----------------------------------------
##  years   gam.index   lower_CI   upper_CI 
## ------- ----------- ---------- ----------
##  1996      1.099      0.8955     1.349   
## 
##  1999      1.156      0.9413     1.419   
## 
##  2001      1.205       0.97      1.498   
## 
##  2003      1.252      1.015      1.545   
## 
##  2005      1.41       1.132      1.755   
## 
##  2007      1.303      1.057      1.607   
## 
##  2009      1.449      1.175      1.789   
## -----------------------------------------
## 
## Table: Survey abundance index using Delta-lognormal GAM and delta-method for estimating confidence intervals
p<-ggplot(gam.index,aes(x=years,y=gam.index))+geom_line()+geom_point()+
  geom_ribbon(aes(ymin=lower_CI, ymax=upper_CI),
              alpha=0.2)+xlab("Year")+ylab("Index of abundance")+scale_x_continuous(breaks=gam.index$years)+ggtitle(paste("Survey index for ",species.name, " (GAM and Delta method CI's)",sep=""))
p

For the bootstrapping method use the "predict.gam.bindex" function. This will take the two models (for presence-absence and abundance), combine them to create the index of abundance (pa.gam*cpue.gam with x variables held at their median values). It will output a table of year, estimate and 95% confidence intervals (based on the bootstrap) for the survey data. You can specify the number of bootstraps in the predict.gam.bindex function by setting boot_reps=..., otherwise the default is 500 replicates.

gam.indexb<-predict.gam.bindex(pa.gam,cpue.gam)

pander::pandoc.table(gam.indexb,row.names=FALSE,digits=4,caption="Survey abundance index using Delta-lognormal GAM and bootstrapping method for estimating confidence intervals")
## 
## -------------------------------------------------
##  years   gam.index   lower_bootCI   upper_bootCI 
## ------- ----------- -------------- --------------
##  1996      1.099        0.9804         1.218     
## 
##  1999      1.156        1.042          1.269     
## 
##  2001      1.205        1.067          1.343     
## 
##  2003      1.252         1.14          1.364     
## 
##  2005      1.41         1.313          1.506     
## 
##  2007      1.303        1.193          1.414     
## 
##  2009      1.449        1.367          1.532     
## -------------------------------------------------
## 
## Table: Survey abundance index using Delta-lognormal GAM and bootstrapping method for estimating confidence intervals
p<-ggplot(gam.indexb,aes(x=years,y=gam.index))+geom_line()+geom_point()+
  geom_ribbon(aes(ymin=lower_bootCI, ymax=upper_bootCI),
              alpha=0.2)+xlab("Year")+ylab("Index of abundance")+scale_x_continuous(breaks=gam.indexb$years)+ggtitle(paste("Survey index for ",species.name, " (GAM and bootstrapped CI's)",sep=""))
p

At this point we have generated a gam estimate of the abundance of our species using the delta-lognormal method.

RANDOM FOREST

The next model version is the random forest model. The first setp is to set up the data again. Here we will do only a single model for CPUE. Random forest does not necessarily require some of the assumptions of the statistical models (like an assumed error distribution or absence of collinearity among independent variables). So the setup and implimentation of the model is a bit simpler.

rf.xvars<-c("inverts","slope","btemp","bdepth")
rf.yvar<-unlist(log(PA.data[species.name]+.5*min(subset(PA.data[species.name],PA.data[species.name]>0))))
rf.form <-as.formula(paste("rf.yvar ~", paste(rf.xvars,collapse="+"),"+year",sep=""))

Now we fit the random forest model using the PA.data dataset and the formula above.

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  250 |    2.103    68.54 |
##  500 |    2.099    68.41 |
##  750 |    2.097    68.32 |
## 1000 |    2.096    68.30 |

## 
## Call:
##  randomForest(formula = rf.form, data = PA.data, mtry = 3, ntree = 1000,      importance = TRUE, do.trace = 250, keep.forest = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 2.1
##                     % Var explained: 31.7

Now plot diagnostics for the random forest model (the typical data checks, residuals, and the observed and predicted values).

par(mfrow=c(2,2))
plot(rf.cpue)
rf.lm<-lm(rf.cpue$predicted~rf.yvar)
plot(rf.lm)

summary(rf.lm)
## 
## Call:
## lm(formula = rf.cpue$predicted ~ rf.yvar)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.246 -0.594 -0.256  0.424  4.987 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89288    0.01656   -53.9   <2e-16
## rf.yvar      0.33892    0.00741    45.7   <2e-16
##                
## (Intercept) ***
## rf.yvar     ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.869 on 4473 degrees of freedom
## Multiple R-squared:  0.319,  Adjusted R-squared:  0.318 
## F-statistic: 2.09e+03 on 1 and 4473 DF,  p-value: <2e-16
varImpPlot(rf.cpue)

calc_RMSE(rf.cpue$predicted,rf.yvar)
## [1] 1.45
par(mfrow=c(length(rf.xvars)/2,2))
t2<-t(data.frame(cbind(unlist(apply(PA.data[,rf.xvars],2,FUN="median",na.rm=TRUE)))))
year<-rep(PA.data$y[1],1001)
#t2<-data.frame(t2,year=t4)
#t2<-data.frame(t2,year=t4)
for(i in 1:length(rf.xvars)){
    mindata<-min(PA.data[,rf.xvars[i]],na.rm=TRUE)
    maxdata<-max(PA.data[,rf.xvars[i]],na.rm=TRUE)
    xdata<-seq(mindata,maxdata,by=(maxdata-mindata)/1000)
  t3<-rep.row(t2,1001)
  colnames(t3)<-colnames(t2)
    t3[,i]<-xdata
    p1<-predict(rf.cpue,cbind(t3),type="response")
    plot(p1~xdata,xlab=rf.xvars[i],ylab="Effect on CPUE")
}

par(mfrow=c(1,1))
rf.data<-data.frame(rf.yvar,rf.cpue$predicted,PA.data$year)
ggplot(rf.data,aes(x=rf.yvar,y=rf.cpue.predicted,color=as.factor(PA.data.year)))+geom_point()+xlab("Observed CPUE")+ylab("Predicted CPUE")+theme(legend.position="right",legend.title=element_blank(),panel.background=element_blank())+geom_smooth(method="lm",formula=y~x,se=FALSE)

rf.index<-predict.rf.index(PA.data,rf.yvar,rf.xvars,rf.form,rf.cpue,100)
pander::pandoc.table(rf.index,row.names=FALSE,digits=4, caption="Survey abundance index using Random Forest model")
## 
## ------------------------------------------------
##  years   rf.index   lower_bootCI   upper_bootCI 
## ------- ---------- -------------- --------------
##  1996     0.232        0.2065         0.2576    
## 
##  1999     0.1453       0.1323         0.1582    
## 
##  2001     0.1251       0.1161         0.1342    
## 
##  2003     0.2643       0.2393         0.2893    
## 
##  2005     0.1596       0.1475         0.1716    
## 
##  2007     0.1741       0.162          0.1861    
## 
##  2009     0.4007       0.3579         0.4435    
## ------------------------------------------------
## 
## Table: Survey abundance index using Random Forest model
p<-ggplot(rf.index,aes(x=as.numeric(levels(years)),y=rf.index))+geom_line()+geom_point()+
  geom_ribbon(aes(ymin=lower_bootCI, ymax=upper_bootCI),
              alpha=0.2)+xlab("Year")+ylab("Index of abundance")+scale_x_continuous(breaks=as.numeric(levels(rf.index$years)))+ggtitle(paste("Survey index for ",species.name, " (Random Forest method CI's)",sep=""))
p

DESIGN BASED ESTIMATE

The design based estimate calls a function "Stratified_CPUE" that takes the CPUE values, the stratum names, stratum areas and year and spits out a time series of biomass estimates and variances (both total and by strata). The function also takes an estimate of the proportion of the strata that is "trawlable". This defaults to 1.

The stratum designation and stratum area for each haul is needed for this estimate, so here we read in this data and attach to the Juvnenile_POP_Data.

Next we use the stratum_area function to generate stratum areas for each haul in the data set. We need to specify the data set, the column name that has the stratum designations and the region of interest for the function.

Design.data<-get_strata_area(Juvenile_POP_Data,"STRATUM","GOA")
Design.data<-subset(Design.data,Design.data$STRATUM>0)

Then we use the Stratified_CPUE function to generate design based estimates. For this function we need to specify columns for the calculations: the CPUE column, the year column, the strata designator, the strata area, the region, and the proportion trawlable (default is 1). Note, for the Juvenile_POP_Data, there are only 4 years where data on the strata membership is known (due to the time when these data were extracted from RACEBASE), therefore, the time series is not complete.

design.index<-Stratified_CPUE(Design.data$juvenile_POP_CPUE,Design.data$year,Design.data$STRATUM,Design.data$AREA_KM2,"GOA",1, "CPUE")
pander::pandoc.table(design.index,row.names=FALSE,digits=4,caption="Survey abundance index design-based estimators")
## 
## ------------------------------------------------------
##  Year   CPUE     Var       SE     Lower_CI   Upper_CI 
## ------ ------- -------- -------- ---------- ----------
##  2003   3.081   0.9932   0.9966    1.086      5.076   
## 
##  2009   4.731   1.038    1.019     2.691      6.771   
## 
##  1999   2.567   0.4166   0.6455    1.275      3.859   
## 
##  2001   3.538   0.7133   0.8446    1.847      5.229   
## 
##  1996   2.376   0.5926   0.7698    0.8353     3.917   
## 
##  2007   2.533   0.3014   0.549     1.434      3.632   
## 
##  2005   3.723   0.2818   0.5308    2.661      4.786   
## ------------------------------------------------------
## 
## Table: Survey abundance index design-based estimators
p<-ggplot(design.index,aes(x=Year,y=CPUE))+geom_line()+geom_point()+
  geom_ribbon(aes(ymin=Lower_CI, ymax=Upper_CI),
              alpha=0.2)+xlab("Year")+ylab("Index of abundance")+scale_x_continuous(breaks=design.index$Year)+ggtitle(paste("Survey index for ",species.name, " (design based with CI's)",sep=""))
p

***

SPATIAL PATTERNS IN RESIDUALS

Look for any significant spatial patterns in the residuals, by fitting a kriging model and adding back into the predicted CPUE. This is done with the spatial_resids function from the MEHRSI package. First we do it for the GLM model.

glm.cpue.predict<-predict(cpue.glm,pa.glm$data,type="response")*pa.glm$fitted.values
spatial_resids(PA.data$long,PA.data$lat,(glm.cpue.predict-unlist(PA.data[species.name])),glm.cpue.predict,unlist(PA.data[species.name]),region="GOA")
knitr::include_graphics("spatial_resids.png")

Here's the code for the GAM model.

gam.cpue.predict<-predict(cpue.gam,pa.gam$data,type="response")*pa.gam$fitted.values
spatial_resids(PA.data$long,PA.data$lat,(gam.cpue.predict-unlist(PA.data[species.name])),gam.cpue.predict,unlist(PA.data[species.name]),region="GOA")
knitr::include_graphics("spatial_resids.png")


rooperc4/GLMGAMRF documentation built on May 17, 2019, 1:30 p.m.