zooper: Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary

Documented in Zoopsynther

#' Integrates zooplankton datasets collected by the Interagency Ecological Program from the Sacramento-San Joaquin Delta
#'
#' This function returns an integrated zooplankton dataset with taxonomic issues resolved, according to user-specifications, along with important caveats about the data. It requires the output of the \code{\link{Zoopdownloader}} function to run. This can be provided either as a list or paths to saved .Rds files generated by the \code{\link{Zoopdownloader}} function. The function defaults to loading pre-packaged combined datasets (which may be outdated).
#' @param Data_type What type of data are you looking for? This option allows you to to choose a final output dataset for either community (\code{Data_type = "Community"}; the default) or Taxa-specific (\code{Data_type = "Taxa"}) analyses. NOTE: If you set \code{Data_type="Community"} we do not recommend utilizing the \code{Taxa} argument. See below for more explanation of this argument.
#' @param Zoop Zooplankton data. You must provide the "Zooplankton" element from the list returned from \code{\link{Zoopdownloader}(Save_object = FALSE, Return_object = TRUE, Return_object_type="List")}. The default argument provides the built-in (and possibly outdated) version of this combined dataset. If you instead wish to provide paths to saved datasets from the \code{\link{Zoopdownloader}} function, set \code{Data_list = NULL} and provide \code{Zoop_path}.
#' @param ZoopEnv Accessory environmental data. You must provide the "Environment" element from the list returned from \code{\link{Zoopdownloader}(Save_object = FALSE, Return_object = TRUE, Return_object_type="List")}. The default argument provides the built-in (and possibly outdated) version of this combined dataset. If you instead wish to provide paths to saved datasets from the \code{\link{Zoopdownloader}} function, set \code{Data_list = NULL} and provide \code{Env_path}.
#' @param Zoop_path If you wish to save time by saving the combined zooplankton datasets returned from the \code{zoopdatadownloader} to disk, provider here the path to the combined zooplankton dataset on disk. You must also set \code{Data_list = NULL}.
#' @param Env_path If you wish to save time by saving the combined zooplankton datasets returned from the \code{zoopdatadownloader} to disk, provider here the path to the combined accessory environmental data on disk. You must also set \code{Data_list = NULL}.
#' @param Sources Source datasets to be included. Choices include "EMP" (Environmental Monitoring Program), "FRP" (Fish Restoration Program), "FMWT" (Fall Midwater Trawl), "STN" (Townet Survey), "DOP" (Directed Outflow Project), and "20mm" (20mm survey). The YBFMP datasets cannot be used in this function due to taxonomic and life stage issues with that dataset. Defaults to \code{Sources = c("EMP", "FRP", "FMWT", "STN", "20mm", "DOP")}.
#' @param Size_class Zooplankton size classes (as defined by net mesh sizes) to be included in the integrated dataset. Choices include "Micro" (43 \eqn{\mu}m), "Meso" (150 - 160 \eqn{\mu}m), and "Macro" (500-505 \eqn{\mu}m). Defaults to \code{Size_class = c("Micro", "Meso", "Macro")}.
#' @param Time_consistency Would you like to apply a fix to enforce consistent taxonomic resolution over time? Only available for the Community option.
#' @param Intro_lag Only applicable if \code{Time_consistency = TRUE}. How many years after a species is introduced should we expect surveys to start counting them? Defaults to 2.
#' @param Response Which response variable(s) would you like for the zooplankton data? Choices are "CPUE" (catch per unit effort) and "BPUE" (carbon biomass per unit effort (\eqn{\mu}g/ \ifelse{html}{\out{m<sup>3</sup>}}{\eqn{m^{3}}})). Defaults to \code{Response = "CPUE"}.
#' @param Taxa If you only wish to include a subset of taxa, provide a character vector of the taxa you wish included. This can include taxa of any taxonomic level (e.g., \code{Taxa = "Calanoida"}) to include only calanoids. NOTE: we do not recommend you use this feature AND set \code{Data_type="Community"}. This is better suited to selecting higher-level taxa. If you wish to just include one or a few species, it would be faster to just filter the output of \code{\link{Zoopdownloader}} to include those taxa. Defaults to \code{NULL}, which includes all taxa.
#' @param Date_range Range of dates to include in the final dataset. To filter within a range of dates, include a character vector of 2  dates formatted in the yyyy-mm-dd format exactly, specifying the upper and lower bounds. To specify an infinite upper or lower bound (to include all values above or below a limit) input \code{NA} for that infinite bound. Defaults to \code{Date_range = c(NA, NA)}, which includes all dates.
#' @param Months Months (as integers) to be included in the integrated dataset. If you wish to only include data from a subset of months, input a vector of integers corresponding to the months you wish to be included. Defaults to \code{Months = NA}, which includes all months.
#' @param Years Years to be included in the integrated dataset. If you wish to only include data from a subset of years, input a vector of years you wish to be included. Defaults to \code{Years = NA}, which includes all months.
#' @param Sal_bott_range Filter the data by bottom salinity values. Include a vector of length 2 specifying the minimum and maximum values you wish to include. To include all values above or below a limit, utilize Inf or -Inf for the upper or lower bound respectively. Defaults to \code{Sal_bott_range = NA}, which includes all bottom salinities.
#' @param Sal_surf_range Same as previous, but for surface salinity.
#' @param Temp_range Same as \code{Sal_bott_range} but for surface temperature.
#' @param Lat_range Latitude range to include in the final dataset. Include a vector of length 2 specifying the minimum and maximum values you wish to include, in decimal degree format. Defaults to \code{Lat_range = NA}, which includes all latitudes.
#' @param Long_range Same as previous, but for longitude. Don't forget that Longitudes should be negative in the Delta!
#' @param Reload_data If set to \code{Reload_data = T} runs the \code{\link{Zoopdownloader}} function to re-combine source datasets. To include local versions of the datasets without redownloading them from online, set \code{Reload_data = TRUE} and \code{Redownload_data = FALSE}. Defaults to \code{Reload_data= FALSE}
#' @param Redownload_data Should data be re-downloaded from the internet? If set to \code{Redownload_data = TRUE}, runs \code{\link{Zoopdownloader}(Redownload_data=Redownload_data, Zoop_path=Zoop_path, Env_path=Env_path, ...)}. Defaults to \code{Redownload_data = FALSE}.
#' @param All_env Should all environmental parameters be included? Defaults to \code{All_env = TRUE}.
#' @param Shiny Is this function being used within the shiny app? If set to \code{Shiny = TRUE}, outputs a list with the integrated dataset as one component and the caveats as the other component. Defaults to \code{Shiny = FALSE}.
#' @inheritParams Zoopdownloader
#' @param Undersampled A table listing the taxonomic names and life stages of plankton undersampled by each net mesh size (i.e. size class). See \code{\link{undersampled}} (the default) for an example.
#' @param CompleteTaxaList Character vector of all taxonomic names in source datasets. Defaults to \code{\link{completeTaxaList}}.
#' @param StartDates Tibble with the starting dates of each source dataset. Defaults to \code{\link{startDates}}.
#' @param ... Arguments passed to \code{\link{Zoopdownloader}}.
#' @keywords integration synthesis zooplankton.
#' @import data.table
#' @importFrom magrittr %>%
#' @importFrom rlang .data
#' @return An integrated zooplankton dataset.
#' @details This function combines any combination of the zooplankton datasets (included as parameters)
#' and calculates least common denominator taxa to facilitate comparisons across datasets with differing
#' levels of taxonomic resolution. For more information on the source datasets see \code{\link{zooper}}.
#' @section Data type:
#' The \code{Data_type} parameter toggles between two approaches to resolving differences in taxonomic resolution.
#' If you want all available data on given Taxa, use \code{Data_type="Taxa"} but if you want to conduct a community
#' analysis, use \code{Data_type = "Community"}.\cr\cr
#' Briefly, \code{Data_type = "Community"} optimizes for community-level analyses by taking all taxa x life stage
#' combinations that are not measured in every input dataset, and summing them up taxonomic levels to the lowest
#' taxonomic level they belong to that is covered by all datasets. Remaining Taxa x life stage combos that are not
#' covered in all datasets up to the phylum level (usually something like Annelida or Nematoda or Insect Pupae) are
#' removed from the final dataset. However, some taxa x life stage combos are retained if they are taxonomic levels
#' higher than species that are counted in some surveys, and a lower taxonomic level within this group is counted in all surveys.
#' For example, if we had 3 surveys where surveys A and B count \emph{Pseudodiaptomus forbesi}, \emph{Pseudodiaptomus marinus},
#' and \emph{Pseudodiaptomus} spp. (UnID) but survey C only counts \emph{P. forbesi} and \emph{P. marinus} then the
#' \emph{Pseudodiaptomus} spp. (UnID) category would be retained after applying the community approach.\cr\cr
#' \code{Data_type = "Taxa"} optimizes for the Taxa-level user by maintaining all data at the original taxonomic level
#' (but it outputs warnings for taxa not measured in all datasets, which we call "orphans").
#' To facilitate comparisons across datasets, this option also sums data into general categories that are comparable
#' across all datasets and years: "summed groups." The new variable "Taxatype" identifies which taxa are summed groups
#' (\code{Taxatype = "Summed group"}), which are measured to the species level (\code{Taxatype = "Species"}), and which
#' are higher taxonomic groupings with the species designation unknown: (\code{Taxatype = "UnID species"}).
#' @author Sam Bashevkin
#' @examples
#' MyZoops <- Zoopsynther(Data_type = "Community",
#' Sources = c("EMP", "FRP", "FMWT"),
#' Size_class = "Meso",
#' Date_range = c("1990-10-01", "2000-09-30"))
#' @seealso \code{\link{Zoopdownloader}}, \code{\link{Taxnamefinder}}, \code{\link{SourceTaxaKeyer}}, \code{\link{crosswalk}}, \code{\link{undersampled}}, \code{\link{zoopComb}}, \code{\link{zoopEnvComb}}, \code{\link{zooper}}
#' @export

Zoopsynther<-function(
  Data_type = NULL,
  Zoop = zooper::zoopComb,
  ZoopEnv = zooper::zoopEnvComb,
  Zoop_path = NULL,
  Env_path = NULL,
  Sources = c("EMP", "FRP", "FMWT", "STN", "20mm", "DOP"),
  Size_class = c("Micro", "Meso", "Macro"),
  Time_consistency = FALSE,
  Intro_lag = 2,
  Response="CPUE",
  Taxa = NULL,
  Date_range = c(NA, NA),
  Months = NA,
  Years = NA,
  Sal_bott_range = NA,
  Sal_surf_range = NA,
  Temp_range = NA,
  Lat_range = NA,
  Long_range = NA,
  Reload_data = F,
  Redownload_data = F,
  All_env = T,
  Shiny = F,
  Crosswalk = zooper::crosswalk,
  Undersampled = zooper::undersampled,
  CompleteTaxaList = zooper::completeTaxaList,
  StartDates = zooper::startDates,
  ...){


  # Setup -------------------------------------------------------------------

  if(utils::packageVersion("dplyr") < "1.0.4") {
    stop("dplyr version >= 1.0.4 is required, please update dplyr.")
  }

  if(utils::packageVersion("dtplyr") < "1.1.0") {
    stop("dplyr version >= 1.1.0 is required, please update dtplyr.")
  }

  #Warnings for improper arguments
  if("YBFMP" %in% Sources){
    stop("YBFMP cannot be synthesized with this function due to taxonomic and life stage issues with that dataset. Please acccess the YBFMP data with the zoop_downloader function or the built-in datasets zoopComb and zoopEnvComb.")
  }

  if (!purrr::every(Sources, ~.%in%c("EMP", "FRP", "FMWT", "STN", "20mm", "DOP"))){
    stop("Sources must contain one or more of the following options: EMP, FRP, FMWT, STN, 20mm, DOP")
  }

  if (!purrr::every(Size_class, ~.%in%c("Micro", "Meso", "Macro"))){
    stop("Size_class must contain one or more of the following options: Micro, Meso, Macro")
  }


  if (is.null(Data_type)){
    stop("You must set Data_type to either 'Taxa' or 'Community'")
  }


  if (!(Data_type=="Taxa" | Data_type=="Community")){
    stop("Data_type must be either 'Taxa' or 'Community'")
  }
  if (!purrr::every(Response, ~.%in%c("CPUE", "BPUE"))){
    stop("Response must contain one or more of the following options: CPUE, BPUE")
  }

  if(!purrr::every(list(Shiny, Reload_data, Redownload_data, All_env, Time_consistency), is.logical)){
    stop("Shiny, All_env, Reload_data, Redownload_data, and Time_consistency must all have logical arguments.")
  }

  if(Data_type=="Taxa" & Time_consistency){
    stop("Time_consistency can only applied if Data_type=='Community'. It is only designed to apply consistent taxonomic resolution across time for community-level analyses.")
  }

  if(!dplyr::near(Intro_lag, round(Intro_lag))){
    stop("Intro_lag must be a whole number.")
  }

  if(!is.character(Taxa) & !is.null(Taxa)){
    stop("Taxa must be either a character vector of taxa to include or NA to include all taxa.")
  }

  if(!is.null(Taxa) & !purrr::every(Taxa, ~.%in%CompleteTaxaList)){
    warning("Check spelling of Taxa, some are not present in the completeTaxaList and are likely misspelled.")
  }

  #Make it possible to re-download data if desired
  if(Reload_data | Redownload_data){
    Zoopdownloader(Redownload_data = Redownload_data, Zoop_path = Zoop_path, Env_path = Env_path, ...)
  }

  #Recode Source
  Sources <- dplyr::recode(Sources, "20mm"= "twentymm")

  # Read in data if needed
  if(!is.null(Zoop_path) & !is.null(Env_path) & is.null(Zoop) & is.null(ZoopEnv)){
    Zoop<-readRDS(Zoop_path)
    ZoopEnv<-readRDS(Env_path)
  } else{
    if(!(is.null(Zoop_path) & is.null(Env_path) & !is.null(Zoop) & !is.null(ZoopEnv))){
      stop("You must either set Zoop_path=NULL, Env_path=NULL, and provide a list of datasets to Data_list OR set Data_list=NULL and paths to saved combined datasets to Zoop_path and Env_path. See ?zooper for more info.")
    }
  }


  # Filter data -------------------------------------------------------------

  #Filter to desired data sources. To save memory, environmental data is filtered
  #then at the end the remaining samples are selected from the zooplankton dataset

  #Filter by taxa

  if(!is.null(Taxa)){
    Zoop<-Zoop%>%
      dplyr::filter(.data$Taxname%in%Taxnamefinder(Crosswalk, Taxa))
  }

  # Recode 20mm to twentymm to keep this script running correctly

  ZoopEnv<-dplyr::mutate(ZoopEnv, Source=dplyr::recode(.data$Source, "20mm"= "twentymm"))
  Zoop<-dplyr::mutate(Zoop, Source=dplyr::recode(.data$Source, "20mm"= "twentymm"))

  #Filter by data sources and environment

  ZoopEnv<-dplyr::filter(ZoopEnv, .data$Source%in%Sources)

  #Filter data by specified variables if users provide appropriate ranges

  if(!purrr::every(Sal_bott_range, is.na)) {
    if(purrr::some(Sal_bott_range, is.na)) {
      stop("One element of Sal_bott_range cannot be NA, use Inf or -Inf to set limitless bounds")
    }
    ZoopEnv<-dplyr::filter(ZoopEnv, dplyr::between(.data$SalBott, min(Sal_bott_range), max(Sal_bott_range)))
  }

  if(!purrr::every(Sal_surf_range, is.na)) {
    if(purrr::some(Sal_surf_range, is.na)) {
      stop("One element of Sal_surf_range cannot be NA, use Inf or -Inf to set limitless bounds")
    }
    ZoopEnv<-dplyr::filter(ZoopEnv, dplyr::between(.data$SalSurf, min(Sal_surf_range), max(Sal_surf_range)))
  }

  if(!purrr::every(Temp_range, is.na)) {
    if(purrr::some(Temp_range, is.na)) {
      stop("One element of Temp_range cannot be NA, use Inf or -Inf to set limitless bounds")
    }
    ZoopEnv<-dplyr::filter(ZoopEnv, dplyr::between(.data$Temperature, min(Temp_range), max(Temp_range)))
  }

  if(!is.na(Date_range[1])){
    Datemin<-lubridate::as_date(Date_range[1])
    if(is.na(Datemin)) {stop("Date_range[1] in incorrect format. Reformat to 'yyyy-mm-dd'")}
    ZoopEnv<-dplyr::filter(ZoopEnv, .data$Date > Datemin)
  }

  if(!is.na(Date_range[2])){
    Datemax<-lubridate::as_date(Date_range[2])
    if(is.na(Datemax)) {stop("Date_range[2] in incorrect format. Reformat to 'yyyy-mm-dd'")}
    ZoopEnv<-dplyr::filter(ZoopEnv, .data$Date < Datemax)
  }

  if(!purrr::every(Months, is.na)) {
    ZoopEnv<-dplyr::filter(ZoopEnv, lubridate::month(.data$Date)%in%Months)
  }

  if(!purrr::every(Years, is.na)) {
    ZoopEnv<-dplyr::filter(ZoopEnv, .data$Year%in%Years)
  }

  if(!purrr::every(Lat_range, is.na)) {
    if(purrr::some(Lat_range, is.na)) {
      stop("One element of Lat_range cannot be NA, use Inf or -Inf to set limitless bounds")
    }
    ZoopEnv<-dplyr::filter(ZoopEnv, dplyr::between(.data$Latitude, min(Lat_range), max(Lat_range)))
  }

  if(!purrr::every(Long_range, is.na)) {
    if(purrr::some(Long_range, is.na)) {
      stop("One element of Long_range cannot be NA, use Inf or -Inf to set limitless bounds")
    }
    if(purrr::some(Long_range>0, isTRUE)) {warning("Longitudes should be negative for the Delta")}
    ZoopEnv<-dplyr::filter(ZoopEnv, dplyr::between(.data$Longitude, min(Long_range), max(Long_range)))
  }

  #Find remaining samples to retain
  Samples<-ZoopEnv%>%
    dplyr::pull("SampleID")%>%
    unique()

  #Retain remaining samples and filter to preferred size classes
  Zoop<-dplyr::filter(Zoop, .data$SampleID%in%Samples & .data$SizeClass%in%Size_class)%>%
    dplyr::select(-tidyselect::any_of(setdiff(c("CPUE", "BPUE"), Response)))

  #Also filter zoopEnv to only contain samples in Zoop
  ZoopEnv<-dplyr::filter(ZoopEnv, .data$SampleID%in%unique(Zoop$SampleID))


  if(!nrow(Zoop)>0){
    stop("Your filters were too strong and no data were returned.")
  }

  # Apply LCD approach ------------------------------------------------------

  #Make vector of taxonomic categories that we will use later
  Taxcats<-c("Genus", "Family", "Order", "Class", "Phylum")

  # Make list of taxa x life stage combinations present in each source dataset
  SourceTaxaKey<-SourceTaxaKeyer(Zoop, Crosswalk)

  #Create variable for size classes present in dataset
  Size_classes<-Zoop%>%
    dplyr::pull("SizeClass")%>%
    unique()%>%
    rlang::set_names()

  #Find all common Taxname x life stage combinations, turn into vector of Taxlifestages
  Commontax<-purrr::map(Size_classes, function(x)
    Commontaxer(SourceTaxaKey, "Taxname", x)%>%
      dplyr::mutate(Taxlifestage = paste(.data$Taxname, .data$Lifestage))%>%
      dplyr::pull("Taxlifestage"))

  #Apply commontaxer to every taxa level
  Commontaxkey<-purrr::map2_dfr(rep(Size_classes, each=length(Taxcats)),
                                rep(Taxcats, length(Size_classes)),
                                ~Commontaxer(SourceTaxaKey, .y, .x),
                                .id = "SizeClass")

  # Make list of taxalifestages that do not appear in all datasets
  Lumper<-function(Sizeclass){
    Taxa<-Zoop%>%
      dplyr::filter(.data$SizeClass==Sizeclass)%>%
      dplyr::pull("Taxlifestage")%>%
      unique()
    setdiff(Taxa, Commontax[[Sizeclass]])
  }

  Lumped<-purrr::map(Size_classes, Lumper)

  #Create reduced versions of crosswalk
  Crosswalk_reduced_stage<-Crosswalk%>%
    dplyr::select(c("Taxname", tidyselect::all_of(Taxcats), "Lifestage"))%>%
    dplyr::mutate(Taxlifestage=paste(.data$Taxname, .data$Lifestage))%>%
    dplyr::distinct()

  Crosswalk_reduced<-Crosswalk_reduced_stage%>%
    dplyr::select(-"Lifestage", -"Taxlifestage")%>%
    dplyr::distinct()

  #Create dataframe of undersampled taxa
  Undersampled<-Undersampled%>%
    dplyr::left_join(Crosswalk_reduced, by="Taxname")%>%
    tidyr::pivot_longer(cols=c("Phylum", "Class", "Order", "Family", "Genus", "Taxname"), names_to = "Level", values_to = "Taxa")%>%
    tidyr::drop_na()%>%
    dplyr::mutate(Taxlifestage = paste(.data$Taxa, .data$Lifestage),
                  Undersampled = TRUE)%>%
    dplyr::select("SizeClass", "Taxlifestage", "Undersampled")%>%
    dplyr::distinct()

  #Create vector of all unique taxa in dataset
  UniqueTaxa<-Zoop%>%
    dplyr::select("Taxname")%>%
    dplyr::distinct()%>%
    dplyr::left_join(dplyr::select(Crosswalk, "Taxname", "Level")%>%
                       dplyr::distinct(),
                     by="Taxname",
                     relationship = "many-to-one")%>%
    dplyr::filter(.data$Level!="Species")%>%
    dplyr::pull("Taxname")

  # Apply LCD approach for taxa-level data user ----------------------

  if(Data_type=="Taxa"){

    #Make vector of _g Taxonomic variables
    Taxcats_g<-paste0(Taxcats, "_g")

    # Select higher level groupings that correspond to Taxnames, i.e., are a
    # classification category in one of the original datasets, and rename
    # them as "Taxonomiclevel_g"

    #Create vector of groups sampled all datasets
    All_groups<-purrr::map(Size_classes, function(x) Datareducer(Data=dplyr::filter(Commontaxkey, .data$SizeClass%in%x), Reduced_vars = Taxcats))

    #Only retain groups if that taxa is present in all datasets (Must be present in the genus, family, order, class, or phylum column of all datasets)
    #### !!! THIS ENSURES THAT SUMMED GROUPS ARE REPRESENTATIVE ACROSS ALL DATASETS !!! ###
    Zoop<-Zoop%>%
      dplyr::mutate(dplyr::across(tidyselect::all_of(Taxcats), list(g=~dplyr::if_else(.%in%UniqueTaxa, ., NA_character_))))%>%
      dplyr::group_by(.data$SizeClass)%>%
      dplyr::mutate(dplyr::across(tidyselect::all_of(Taxcats_g), ~dplyr::if_else(.%in%All_groups[[unique(SizeClass)]], ., NA_character_)))%>%
      dplyr::ungroup()

    #Extract vector of grouping taxa (i.e. all unique taxa retained in the above step)
    Groups<-purrr::map(Size_classes, function(x) Datareducer(Data=dplyr::filter(Zoop, .data$SizeClass%in%x), Reduced_vars = Taxcats_g))

    #Only retain groups if that taxa is present in all datasets (Must be present in the genus, family, order, class, or phylum column of all datasets)
    Groups<-purrr::map(Size_classes, function(x) Groups[[x]][which(Groups[[x]]%in%All_groups[[x]])])

    # Output list of taxa that were not measured in all datasets, and are
    # not higher taxa that can be calculated by summing lower taxa, i.e.
    # "orphan taxa"

    Orphans<-purrr::map(Size_classes, Taxaremover, Taxlifestage_list=Lumped, Remove_taxa=Groups)

    rm(Lumped)

    #Turn Orphans into dataframe so it can be added to Zoop data
    Orphansdf<-Orphans[which(nchar(Orphans)>0)]
    Orphansdf<-purrr::map(rlang::set_names(names(Orphansdf)), ~strsplit(Orphansdf[[.x]], ", ")[[1]])%>%
      tibble::enframe(name = "SizeClass", value = "Taxlifestage")%>%
      tidyr::unnest("Taxlifestage")%>%
      dplyr::mutate(Orphan=TRUE)


    # Calculate summed groups and create a final dataset
    Zoop<-purrr::map_dfr(Taxcats_g, .f=LCD_Taxa, Data=Zoop%>%
                           dplyr::select(-"Phylum", -"Class", -"Order", -"Family", -"Genus", -"Species", -"Taxlifestage"),
                         Response=Response)%>% #Taxonomic level by level, summarise for each of these grouping categories and bind them all together
      dplyr::left_join(Crosswalk_reduced, by="Taxname")%>%
      dplyr::mutate(Taxlifestage=paste(.data$Taxname, .data$Lifestage))%>%
      dplyr::left_join(Undersampled, by=c("Taxlifestage", "SizeClass"))%>% #Add warnings for taxa undersampled by different methods
      dplyr::mutate(Undersampled=tidyr::replace_na(.data$Undersampled, FALSE))%>%
      dplyr::select(-"Taxlifestage")%>%
      dplyr::mutate(Taxname = paste(.data$Taxname, "all", .data$SizeClass, sep="_"), #Differentiate grouped Taxnames from others
                    Taxatype = "Summed group")%>% #Add a label to these summed groups so they can be removed later if users wish)
      dplyr::bind_rows(Zoop%>% #Bind these summarized groupings to the original taxonomic categories in the original dataset
                         dplyr::mutate(Taxatype=dplyr::if_else(.data$Taxname%in%unique(unlist(Groups, use.names = FALSE)), "UnID species", "Species"))%>%
                         dplyr::left_join(Undersampled, by=c("Taxlifestage", "SizeClass"))%>% #Add warnings for taxa undersampled by different methods
                         dplyr::mutate(Undersampled=tidyr::replace_na(.data$Undersampled, FALSE)))%>%
      dplyr::ungroup()

    rm(Groups)
    gc()

    Zoop<-Zoop%>%
      dplyr::mutate(Taxlifestage = paste(.data$Taxname, .data$Lifestage))%>% #add back in the Taxlifestage variable (removed by the LCD_Taxa function)
      {if(nrow(Orphansdf)>0){
        dplyr::left_join(., Orphansdf, by=c("Taxlifestage", "SizeClass"))%>%
          dplyr::mutate(Orphan = tidyr::replace_na(.data$Orphan, FALSE)) #add an identifier for orphan taxa (species not counted in all data sources)
      } else{
        dplyr::mutate(., Orphan=FALSE)
      }}%>%
      dplyr::mutate(Taxname = dplyr::if_else(.data$Taxatype=="UnID species", paste0(.data$Taxname, "_UnID"), .data$Taxname))%>%
      dplyr::select(-tidyselect::all_of(Taxcats_g))%>%
      dplyr::left_join(ZoopEnv%>%
                         {if(!All_env){
                           dplyr::select(., "Year", "Date", "SalSurf", "Latitude", "Longitude", "SampleID")
                         } else{
                           dplyr::select(., -"Source")
                         }}, by="SampleID")

    #Output caveats specific to these data
    caveats<-c(dplyr::if_else(!purrr::some(purrr::map_dbl(Orphans, nchar), function(x) x>0), "No orphaned taxa here!", "Some taxa were not measured in all datasets and are indicated by the flag Orphan = TRUE"), "NOTE: Do not use this data to make additional higher-level taxonomic summaries or any other operations to add together taxa above the species level unless you first filter out all rows with Taxatype=='Summed group' and, depending on your purpose, Orphan==TRUE. Orphan status varies with size class. Do not compare UnID categories across data sources.")
    rm(Orphans)
  }

  # Apply LCD approach for community data user ------------------------------

  if(Data_type=="Community"){

    #Create vector of species level taxa
    UniqueSpecies<-Crosswalk%>%
      dplyr::select("Taxname", "Level")%>%
      dplyr::filter(.data$Level=="Species")%>%
      dplyr::pull("Taxname")%>%
      unique()

    if(Time_consistency){
      datasets<-Zoop%>%
        dplyr::select("Source", "SizeClass")%>%
        dplyr::filter(.data$Source%in%c("EMP", "STN", "FMWT", "twentymm", "DOP", "FRP"))%>%
        dplyr::mutate(Source=dplyr::recode(.data$Source, STN="FMWT"))%>% #STN currently represented as FMWT for start/end years
        dplyr::distinct()

      AllYears<-ZoopEnv%>%
        dplyr::select("Date")%>%
        dplyr::distinct()%>%
        dplyr::mutate(Year=as.integer(lubridate::year(.data$Date)))%>%
        dplyr::pull("Year")%>%
        unique()

      StartDates<-StartDates%>%
        dplyr::filter(.data$Source!="YBFMP")%>%
        dplyr::mutate(Source=dplyr::recode(.data$Source, "20mm"= "twentymm"))

      BadYears<-purrr::map2_dfr(datasets$Source, datasets$SizeClass, ~ Uncountedyears(Source = .x,
                                                                                      Size_class = .y,
                                                                                      Crosswalk=Crosswalk,
                                                                                      Start_year = StartDates%>%
                                                                                        dplyr::filter(.data$Source==.x & .data$SizeClass==.y)%>%
                                                                                        dplyr::pull("Startdate")%>%
                                                                                        lubridate::year(),
                                                                                      Intro_lag = Intro_lag))%>%
        dplyr::filter(purrr::map_lgl(.data$Years, function(x) any(x%in%AllYears)))%>%
        dplyr::mutate(Taxlifesize = paste(.data$Taxlifestage, .data$SizeClass))

      if(nrow(BadYears)>0){
        Lumped<-purrr::map(Size_classes, ~ unique(c(Lumped[[.]], dplyr::filter(BadYears, .data$SizeClass%in%.)%>%dplyr::pull("Taxlifestage")%>%unique())))
      }
    }

    #Exit this process early if all studies are taxonomically consistent
    if(!purrr::some(purrr::map_dbl(Lumped, length), function(x) x>0)){

      Zoop<-Zoop%>%
        dplyr::left_join(Undersampled, by=c("Taxlifestage", "SizeClass"))%>%
        dplyr::mutate(Undersampled=tidyr::replace_na(.data$Undersampled, FALSE))%>%
        dplyr::mutate(Source = dplyr::recode(.data$Source, twentymm = "20mm"))%>%
        dplyr::mutate(Taxname=dplyr::if_else(.data$Taxname%in%UniqueSpecies, .data$Taxname, paste0(.data$Taxname, "_UnID")))%>%
        dplyr::mutate(Taxlifestage = paste(.data$Taxname, .data$Lifestage))%>%
        dplyr::left_join(ZoopEnv%>%
                           {if(!All_env){
                             dplyr::select(., "Year", "Date", "SalSurf", "Latitude", "Longitude", "SampleID")
                           } else{
                             dplyr::select(., -"Source")
                           }}, by="SampleID")

      out<-list(Data=Zoop, Caveats="No disclaimers here! Enjoy the clean data!")

      if(Shiny){
        return(out)
      } else{
        print(out$Caveats)
        return(out$Data)
      }
    }

    #Create vector of all unique taxlifestages
    UniqueTaxlifesize<-Zoop%>%
      dplyr::mutate(Taxlifesize = paste(.data$Taxlifestage, .data$SizeClass))%>%
      {if(Time_consistency){ #To correct for time, remove all members of this list that were not counted in all years. This will ensure they do not appear in the Commontaxkey
        if(nrow(BadYears)>0){
          dplyr::select(., "Taxlifesize")%>%
            dplyr::filter(!(.data$Taxlifesize%in%unique(BadYears$Taxlifesize)))
        } else{
          .
        }
      } else{
        .
      }
      }%>%
      dplyr::pull("Taxlifesize")%>%
      unique()

    #Create taxonomy table for all taxonomic levels present (and measured) in all datasets. If the taxonomic level is not present as a taxname (i.e. there is no spp. category for that taxonomic level) it will be removed.
    Commontaxkey<-Commontaxkey%>%
      dplyr::mutate(dplyr::across(tidyselect::all_of(Taxcats), list(lifestage=~dplyr::if_else(is.na(.), NA_character_, paste(., .data$Lifestage)))))%>% #Create taxa x life stage variable for each taxonomic level
      dplyr::mutate(dplyr::across(tidyselect::all_of(paste0(Taxcats, "_lifestage")), ~dplyr::if_else(paste(., .data$SizeClass)%in%UniqueTaxlifesize, ., NA_character_)))%>%
      dplyr::select("Genus_lifestage", "Family_lifestage", "Order_lifestage", "Class_lifestage", "Phylum_lifestage", "SizeClass")%>% #only retain columns we need
      dplyr::filter(dplyr::if_any(-c("SizeClass"), ~!is.na(.x)))

    #Create taxonomy table for taxa not present in all datasets, then select their new names corresponding to taxa x life stage combinations that are measured in all datasets
    LCD_Com<-function(Lumped, crosswalk, Commontaxkey){
      tibble::tibble(Taxlifestage = Lumped)%>%
        dplyr::left_join(Crosswalk_reduced_stage, by="Taxlifestage")%>%
        dplyr::mutate(dplyr::across(tidyselect::all_of(c("Genus", "Family", "Order", "Class", "Phylum")), list(lifestage=~dplyr::if_else(is.na(.), NA_character_, paste(., .data$Lifestage)))))%>% #Create taxa x life stage variable for each taxonomic level
        dplyr::mutate(Taxname_new = dplyr::case_when( #This will go level by level, look for matches with the Commontaxkey, and assign the taxonomic level that matches. The "TRUE" at end specifies what to do if no conditions are met.
          !is.na(.data$Genus_lifestage) & .data$Genus_lifestage%in%Commontaxkey$Genus_lifestage ~ paste0(.data$Genus, "_Genus"),
          !is.na(.data$Family_lifestage) & .data$Family_lifestage%in%Commontaxkey$Family_lifestage ~ paste0(.data$Family, "_Family"),
          !is.na(.data$Order_lifestage) & .data$Order_lifestage%in%Commontaxkey$Order_lifestage ~ paste0(.data$Order, "_Order"),
          !is.na(.data$Class_lifestage) & .data$Class_lifestage%in%Commontaxkey$Class_lifestage ~ paste0(.data$Class, "_Class"),
          !is.na(.data$Phylum_lifestage) & .data$Phylum_lifestage%in%Commontaxkey$Phylum_lifestage ~ paste0(.data$Phylum, "_Phylum"),
          TRUE ~ "REMOVE_" #If no match is found, change to the Taxname REMOVE so we know to later remove these data from the final database
        ))%>%
        dplyr::mutate(Genus=dplyr::if_else(stringr::str_detect(.data$Taxname_new, "\\_Genus"), .data$Genus, NA_character_),
                      Family=dplyr::if_else(stringr::str_detect(.data$Taxname_new, "\\_Genus|\\_Family"), .data$Family, NA_character_),
                      Order=dplyr::if_else(stringr::str_detect(.data$Taxname_new, "\\_Genus|\\_Family|\\_Order"), .data$Order, NA_character_),
                      Class=dplyr::if_else(stringr::str_detect(.data$Taxname_new, "\\_Genus|\\_Family|\\_Order|\\_Class"), .data$Class, NA_character_)
        )%>% # Addtaxonomoic levels into appropriate columns, making sure all are filled as appropriate.
        dplyr::mutate(Taxname_new=stringr::str_extract(.data$Taxname_new, "^[^_]+(?=_)")) # Remove the _Genus etc. labels from taxname
    }

    Lumpedkey<-purrr::map_dfr(Size_classes,
                              ~LCD_Com(Lumped[[.]], Crosswalk, dplyr::filter(Commontaxkey, .data$SizeClass==.)),
                              .id = "SizeClass")

    rm(Lumped)
    rm(Commontaxkey)

    #Remove taxa with no relatives across datasets
    if("REMOVE"%in%Lumpedkey$Taxname_new){

      Removed<-Lumpedkey%>%
        dplyr::filter(.data$Taxname_new=="REMOVE")%>%
        dplyr::mutate(Taxlifestage=paste0(.data$Taxlifestage, " (", .data$SizeClass, ")"))%>%
        dplyr::pull("Taxlifestage")%>%
        unique()

      Removed<-paste(Removed, collapse=", ")

      caveats<-paste0("These species have no relatives in their size class common to all datasets and have been removed from one or more size classes: ", Removed)
    } else{
      caveats<-"No disclaimers here! Enjoy the clean data!"
    }

    #Rename and sum taxa that are not measured in all datasets, add taxonomic info back to data
    Zoop<-Zoop%>%
      dplyr::left_join(Lumpedkey%>%
                         dplyr::select("Taxlifestage", "Taxname_new", "SizeClass"),
                       by=c("Taxlifestage", "SizeClass"))%>%
      dplyr::mutate(Taxname = dplyr::if_else(is.na(.data$Taxname_new), .data$Taxname, .data$Taxname_new))%>%
      dplyr::select(-"Taxname_new")%>%
      dplyr::filter(.data$Taxname!="REMOVE")%>%
      dplyr::mutate(Taxlifestage=paste(.data$Taxname, .data$Lifestage))%>%
      dplyr::select(-"Phylum", -"Class", -"Order", -"Family", -"Genus", -"Species")%>%
      dtplyr::lazy_dt()%>%
      dplyr::group_by(dplyr::across(-tidyselect::all_of(Response)))%>%
      dplyr::summarise(dplyr::across(tidyselect::all_of(Response), ~sum(.x)))%>%
      dplyr::ungroup()%>%
      tibble::as_tibble()%>%
      dplyr::left_join(Crosswalk%>%
                         dplyr::select("Taxname", "Lifestage", "Phylum", "Class", "Order", "Family", "Genus", "Species")%>%
                         dplyr::mutate(Taxlifestage = paste(.data$Taxname, .data$Lifestage))%>%
                         dplyr::select(-"Taxname", -"Lifestage")%>%
                         dplyr::bind_rows(Lumpedkey%>%
                                            dplyr::filter(.data$Taxname_new!="REMOVE")%>%
                                            dplyr::select("Taxname_new", "Lifestage", "Phylum", "Class", "Order", "Family", "Genus")%>%
                                            dplyr::mutate(Taxlifestage = paste(.data$Taxname_new, .data$Lifestage))%>%
                                            dplyr::select(-"Taxname_new", -"Lifestage"))%>%
                         dplyr::distinct(),
                       by="Taxlifestage"
      )%>%
      dplyr::left_join(Undersampled, by=c("Taxlifestage", "SizeClass"))%>% #Add warnings for taxa undersampled by different methods
      dplyr::mutate(Undersampled=tidyr::replace_na(.data$Undersampled, FALSE))%>%
      dplyr::mutate(Taxname=dplyr::if_else(.data$Taxname%in%UniqueSpecies, .data$Taxname, paste0(.data$Taxname, "_UnID")))%>%
      dplyr::mutate(Taxlifestage = paste(.data$Taxname, .data$Lifestage))%>%
      dplyr::left_join(ZoopEnv%>%
                         {if(!All_env){
                           dplyr::select(., "Year", "Date", "SalSurf", "Latitude", "Longitude", "SampleID")
                         } else{
                           dplyr::select(., -"Source")
                         }}, by="SampleID")
  }

  Zoop<-Zoop%>%
    dplyr::mutate(Source = dplyr::recode(.data$Source, twentymm = "20mm"))

  out<-list(Data=Zoop, Caveats=caveats)

  if(Shiny){
    return(out)
  } else{
    print(out$Caveats)
    return(out$Data)
  }
}

# Ideas for testing -------------------------------------------------------

# 1)  Make sure rows aren't repeated by looking for repeats of
#   paste(SurveyID, Taxlifestage)



# Issues ------------------------------------------------------------------



# TODO --------------------------------------------------------------------

# 1)  Test case_when statements that are assigning appropriate NAs and 0s
#   depending on intro, start, and end years to make sure they are working
#   correctly

# 2)  Do we need to make some of these case_when statements for FRP data or
#   are 0s and NAs already assigned appropriately?

# 4)  Try speeding up code more. Bottleneck now is probably read_excel()

# 5)  Do we want to append something like "_UnID" to the taxnames of Lumped
# (LCDed) taxa under option Data_type=="Community"???


# 6)  How do we apply the LCD approach for community data to issues arrising
#   from changing taxonomic resolution over time in individual datasets
InteragencyEcologicalProgram/zooper documentation built on Feb. 6, 2025, 9:01 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
InteragencyEcologicalProgram/zooper
Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary

R/zoop_synthesizer.R
In InteragencyEcologicalProgram/zooper: Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary

Defines functions Zoopsynther

Documented in Zoopsynther

R Package Documentation

Browse R Packages

We want your feedback!

InteragencyEcologicalProgram/zooper Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary

R/zoop_synthesizer.R In InteragencyEcologicalProgram/zooper: Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary

Defines functions Zoopsynther

Documented in Zoopsynther

R Package Documentation

Browse R Packages

We want your feedback!

InteragencyEcologicalProgram/zooper
Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary

R/zoop_synthesizer.R
In InteragencyEcologicalProgram/zooper: Download and Integrate Zooplankton Datasets from the Upper San Francisco Estuary