R/iec_function.R

# This script defines the function "est_iec" which calculates Index of
# Ecological Condition (IEC) scores. The function "est_iec"
# ("iec_site_score" at the time) was originally used for Erin E. G. Giese's
# thesis (2012).  Giese's thesis used the script
# "IEC_Application_NoForLoopExp20120328.r" which this current function is based
# on.  The purpose of this script is to estimate IEC scores based on Biotic
# Response (BR) functions of species or other taxa for many sites at once.
# BR functions can be calculated using function "est_brc".
#
# File used for E. Giese's thesis: "IEC_Application_NoForLoopExp20120328.r"
# File "IEC_Builder20120328.r" was based on March Excel files but most closely
# resembles April files (right?  based on BRC script comments).
#
# Input:
# "method" is used to set the criteria to use for estimating IEC.  "method"
# can be set to "pa" (default), or "q".  Method "pa" is used
# where present/absence data are available, and method "q" is used if
# probability data are available.
# Note that method "pa" corresponds to the Excel tab "CalcPA" (formerly
# "CalcLik"), and method "q" corresponds to Excel tab "CalcQ" (formerly
# "CalcLSq").
#
# Note: Future versions of "est_iec" may be written so that only species in
# "brc" are used for scoring sites.
#
# Structure of function "est_iec":
# * "Error checking" ensures that input variables "sp" and "brc"
#   are both data frames, and that they contain the same species.  If any of
#   these criteria are not met, an error is raised and the function exits with
#   a message describing the issue.
# * "Declarations" contains variables that will be used later in the
#   function.  In general, these are the only variables you are likely to
#   modify in this script.
# * "Main Function" contains function "f" which returns the likelihood
#   function which is minimized by "nlminb".  Note that Dr. Howe's MS Excel
#   based versions of this function maximize this function. Because "nlminb"
#   can only minimize functions, the function here has the sign reversed
#   during analysis, but the sign is corrected before adding it to the output
#   so the numbers generated are comparable to those generated by Solver in
#   Excel.
# * "for loop over each site" contains a for loop that processes
#   each taxa in turn.  Each taxa has function "nlminb" called on it the
#   number of times specified by input variable "n_reps".  "nlminb" tries to
#   minimize the likelihood function returned by function "f".
#
# Authors: Nicholas G. Walton and Robert W. Howe
#
# Created: Mar 2013(?)
# Last modified: 25 Sept 2015

#' Estimate IEC site scores.
#'
#' \code{est_iec} estimates the Index of Ecological Condition (IEC) for each
#' site in \code{sp} based on the biotic response curves (BRC) in \code{brc}.
#'
#' An Index of Ecological Condition (IEC) score is estimated for each site (row)
#' in data frame \code{sp}. This score is based on the Biotic Response functions
#' (curves) in \code{brc} (output from \code{\link{est_brc}}). The column names
#' in \code{sp} and column 1 of \code{brc} must contain the same taxa in the
#' same order.
#'
#' The default method treats observations as presence/absence (\code{method =
#' "pa"}). When using \code{pa}, it is essential to use BR functions based on
#' probability or probability-like (e.g., scaled to 0-1) observation data
#' (\code{\link{est_brc}}). Note that input records do not need to be coded as
#' presence/absence as the method simply checks for 0 or > 0.
#'
#' When quantitative observations are available and were used to generate BR
#' functions, set \code{method} to \code{"q"}. As with \code{"pa"} it is
#' important to match the type of data used to score the sites with the type of
#' data used to estimate the BR functions. For example, if the data used to
#' construct the BR functions in \code{est_brc} were probabilities, then the
#' data used to score the sites (\code{sp}) should also be probabilities If BR
#' functions were constructed using log transformed (e.g.
#' \code{\link[base]{log1p}}) count data, then the observations used to score
#' the sites in \code{est_iec} should be transformed in the same way.
#'
#' \code{n_reps} determines the number of times the
#' optimization function (\code{\link[stats]{nlminb}}) is run with different
#' starting values. \code{n_reps} can be set to any positive integer, but the
#' authors recommend against using values of less than 10 except for testing
#' purposes.  Larger values will take longer to process.  The default value is
#' 30.  If \code{keep_zeros = TRUE} (default), all taxa in \code{brc} are used
#' for scoring the sites. But if \code{keep_zeros = FALSE}, only those taxa
#' detected at a given site are used to score that site.  With the latter
#' option, the number of taxa used to score each site may vary.
#'
#' @param brc BRC data frame generated by \code{\link{est_brc}}.
#' @param method one of \code{"pa"} (default; CalcPA) or \code{"q"} (CalcQ).
#' @param n_reps scalar integer setting the number of random starts for
#'   optimization (default is 30).
#' @param keep_zeros logical indicating if all taxa are used in scoring
#'   (\code{TRUE}; default) or only detected taxa.
#' @inheritParams est_brc
#' @return A data frame containing the IEC scores for each site in \code{sp} and
#'   the corresponding likelihood of the score.
#' @references Gnass Giese, E.E., R.W. Howe, A.T. Wolf, N.A. Miller, and N.G.
#'   Walton. 2015. Sensitivity of breeding birds to the "human footprint" in
#'   western Great Lakes forest landscapes. Ecosphere 6(6):90.
#'   http://dx.doi.org/10.1890/ES14-00414.1
#'
#'   Howe, R.W., R. R. Regal, J.M. Hanowski, G.J. Niemi, N.P. Danz, and C.R.
#'   Smith.  2007a.  An index of ecological condition based on bird assemblages
#'   in Great Lakes coastal wetlands.  Journal of Great Lakes Research 33
#'   (Special Issue 3): 93-105.
#'
#'   Howe, R.W., R. R. Regal, G.J. Niemi, N.P. Danz, J.M. Hanowski. 2007b.  A
#'   probability-based indicator of ecological condition. Ecological Indicators
#'   7:793-806.
#' @seealso \code{\link{est_brc}} to generate biotic response curves.
est_iec <- function(sp, brc, method = "pa", n_reps = 30, keep_zeros = TRUE) {

  # The input "sp" is a data frame containing the observations for each
  # species or other taxa (abundance or presence/absence) at each site.
  # The column names of "sp" must be the species or other taxa.
  # All rows must contain the observations of each species at each site
  # (one column per species).  Row names are site names.

  # Example of "sp":
  # row.names  Sp(1)   Sp(2)   Sp(?)   Sp(n)
  # Site(1)       0       0       0       0
  # Site(2)       1       0       1       0
  # Site(?)       0       0       0       0
  # Site(n)       0       3       0       0


  ## Error checking ----

  # Check that the input "sp" and "brc" are both data frames.
  if (!is.data.frame(sp) | !is.data.frame(brc)) {
    stop("Inputs 'sp' and 'brc' must be data frames.")
  }

  # Check that "sp" and "brc" contain the same number of taxa.
  if (ncol(sp) != nrow(brc)) {
    stop("Species observation and BRC data frames differ in number of taxa.")
  }

  # Check that "sp" and "brc" contain the same taxa.
  if (!identical(as.character(names(sp)), as.character(brc[, 1]))) {
    stop(paste("Data frames 'sp' and 'brc' do not contain the same taxa \n",
               "or are not ordered the same."))
  }

  # Check that "method" has been set correctly.
  if (!method %in% c("q", "pa")) {
    stop("Method must be 'pa' (default) or 'q'.")
  }


  ## Declarations ----

  # Generate a data frame to hold the output IEC site score data;
  # an empty data frame with 3 columns - data type defined.
  iec_scores <- data.frame(character(0), rep(list(numeric(0)), 2),
                              stringsAsFactors = FALSE)
  names(iec_scores) <- c("Site", "IEC", "LogLik")

  # Make an original copy of "brc" if zeros are to be dropped from each site
  if (!keep_zeros) brc_raw <- brc

  # Function to generate stratified random selection of starting IEC values.
  get_strat <- function(n) {
    # Generalized to allow for other numbers to be set for "n" ("n_reps").
    # "n" does not have to be a multiple of 10.
    # Note that it's ok to set "n_reps" as low as 10,
    # and it can be set even lower if you're just running tests.

    if (n < 10) {
      int <- 10/n
      llimit <- sapply(0:(n - 1), function(x) x * int)
      strat <- sapply(llimit, function(x) runif(1, min = x, max = x + int))
    } else {
      # Lower limit for each stratum (upper will be llimit + 1).
      llimit <- 0:9

      # This is the number of random values that will be selected from
      # each stratum.
      pick <- n %/% length(llimit)

      # Select stratified random values.
      strat <- sapply(llimit, function(x) runif(pick, min = x, max = x + 1))

      # If n has been set to a value that is not a multiple of 10,
      # the remainder will be filled using random values in [0, 10].
      # Note that only this part will be used to assign starting IEC values
      # if n is set to less then length(llimit) (AKA 10).
      if (length(strat) < n) {
        remain <- n - length(strat)
        strat[(length(strat) + 1):n] <- runif(remain, min = 0, max = 10)
      }
    }
    strat   # Return the vector of starting values for "nlminb".
  }

  # Set method/criteria for estimating IEC
  # Return function for "q" or "pa" based on "method"
  # Method must be set to "q" or "pa"
  if (method == "q") {
    criteria <- function(pc, observed) {
      # Least-squares method (AKA "CalcQ").
      # Note that currently the output from this will have the
      # opposite sign of that from the Excel spread sheet.
      # This needs to be fixed, but will also require modifications to "pa".
      numerator <- (observed - pc) ^ 2
      sum(numerator / pc)   # (Obs-Exp)^2 / Exp
    }
  } else {
    criteria <- function(pc, observed) {
      # Likelihood method (AKA "pa").
      # Calculate lack-of-fit expression ("lof") for each species.
      lof <- rep(NA, length(pc))  # set lof to length of pc.

      # Set pc > 1 to 1
      pc[pc > 1] <- 1

      # If species is present at the current site, log set to Log P(C).
      lof[observed > 0] <- log10(pc)[observed > 0]

      # If species is absent, set to Log (1-P(C)).
      # Per the Excel file, it's really Log(1.0001-P(C)) to avoid log of 0.
      lof[is.na(lof)] <- log10(1.0001 - pc)[is.na(lof)]

      # Return the negative sum of the lof.
      # I'm using the negative because the original function in Excel was
      # set to maximize, but "nlminb" can only minimize a function.
      # The result is the same.
      -sum(lof)
    }
  }


  ## Main Function ----
  f <- function(x) {
    # This function returns the function which
    # will be minimized with function "nlminb".
    # "x" is the current IEC score being tried by "nlminb".
    # Note that "observed" is called from outside the function, but there
    # doesn't seem to be a better way to do this with "nlminb".

    # Calculate P(C) for each species.
    # In the original version, P(C) was constrained to > 0 and <= 1.
    # P(C) is the probability or value of each species at given IEC.
    # Input values "Mean", "SD", and "H" are part of data frame "brc".
    pc <- with(brc, {dnorm(x, mean = Mean, sd = SD) * H})
    pc[pc < .001] <- .001        # Set 0 probabilities to .001.

    # criteria is set in the Declarations section of this script.
    # It is set to either "pa" (default) or "q".
    criteria(pc, observed)
  }


  ## for loop over each site ----
  # This for loop iteratively processes each site in the data frame "sp".

  for (site in 1:nrow(sp)) {
    # A list to contain the best solution found by "nlminb".
    best <- NULL

    # Extract the "observed" responses for the current site.
    # If "keep_zeros" is TRUE (default), all taxa are used for scoring.
    # Else, only those taxa that were detected are used for scoring.
    if (keep_zeros) {
      observed <- sp[site, ]
    } else {
      # "keep" contains the indices for detected species.
      keep <- which(sp[site, ] > 0)

      # If there are no species detected, set iec attributes to NA and
      # go to the next site
      if (length(keep) == 0) {
        iec_scores[site, ] <- list(row.names(sp)[site], NA_real_, NA_real_)
        next
      }

      # subset "observed" and "brc" to the detected species
      observed <- sp[site, keep]
      brc <- brc_raw[keep, ]
    }


    # Function "get_strat" is defined in the "Declarations" section.
    iec_start <- get_strat(n_reps)

    # Estimate IEC for current site using "n_reps" different starting values.
    for (i in 1:n_reps) {
      # "n_reps" is set in the call to function "siteScore" and
      # defaults to 30.

      # For help with function "nlminb" type "?nlminb" in R.
      # This is the optimization function.
      # The result is constrained to be between 0 and 10 inclusive.
      result <- nlminb(start = iec_start[i], objective = f,
                       lower = 0, upper = 10)

      # Check if the current random starting value resulted in an
      # improved estimate of IEC.
      if (i == 1) {
        # On the 1st iteration set "best" to "result".
        best <- result
      } else if (result$objective < best$objective) {
        # Otherwise check if newest "result" is better than the current
        # best result ("best").
        # If the new "result" is better than the current best,
        # update "best" to "result".
        best <- result
      }
    }

    # After fitting a best solution, add the solution to the
    # data frame "iec_scores".
    # best$objective is multiplied by -1 to put it on the same scale as
    # the original Excel based versions.
    iec_scores[site, ] <- list(row.names(sp)[site], best$par, -best$objective)
  }

  # Return "iec_scores"
  iec_scores
}
ngwalton/iec documentation built on May 23, 2019, 4:43 p.m.