#' Bootstrap the baseline model, write to NetCDF4 file
#'
#' Run a subset of the baseline model a number of times while resampling data.
#'
#' @param outputData The output of the function from an earlier run.
#' @param outputMemoryFile The path to the output memory file to copy the \code{outputData} to in the case that \code{UseOutputData} is TRUE.
#' @param projectPath The path to the project to containing the baseline to bootstrap.
#' @param BootstrapMethodTable A table of the columns ProcessName, ResampleFunction and Seed, where each row defines the resample function to apply to the output of the given process, and the seed to use in the resampling. The seed is used to draw one seed per bootstrap run using \code{\link[RstoxBase]{getSeedVector}}. See Details for a list of resampling functions and the limitations and benefits of each function.
#' @param NumberOfBootstraps Integer: The number of bootstrap replicates.
#' @param OutputProcesses A vector of the processes to save from each bootstrap replicate.
#' @param OutputVariables An optional list of variables to keep in the output. A typical set of variables could be ["Survey", "Stratum", "SpeciesCategory", "IndividualTotalLength", "IndividualAge", "Abundance", "Biomass"], which should cover the most frequently used variables in reports. Any variable that is used in a report must be present in \code{OutputVariables}. Empty list (the default) implies to keep all variables. This parameter is included to facilitate smaller disc space for the bootstrap objects and faster writing/reading of that file. The \code{OutputVariables} are extracted for all processes listed in \code{OutputProcesses}. See \code{\link{getBootstrapOutputVariables}} for finding the variables used in ReportBootstrap processes of a project.
#' @param UseOutputData Logical: Bootstrapping can be time consuming, and by setting \code{UseOutputData} to TRUE the output file generated by a previous run of the process will be used instead of re-running the bootstrapping. Use this parameter with caution. Any changes made to the Baseline model or to the parameters of the Bootstrap itself will not be accounted for unless UseOutputData = FALSE. The option UseOutputData = TRUE is intended only for saving time when one needs to generate a report from an existing Bootstrap run."
#' @param NumberOfCores The number of cores to use for parallel processing. A copy of the project is created in tempdir() for each core, also when using only one core. Note that this will require disc space equivalent to the \code{NumberOfCores} time the size of the project folder (excluding the output/analysis/Bootstrap folder, which will be deleted before copies are made).
#' @param BaselineSeedTable A table of ProcessName and Seed, giving the seed to use for the Baseline processes that requires a Seed parameter. The seed is used to draw one seed per bootstrap run using \code{\link[RstoxBase]{getSeedVector}}.
#'
#' @details
#' \strong{Resampling of BioticAssignment}
#'
#' For acoustic-trawl survey estimates there are two possible resampling functions for the biotic assignment; \code{\link{ResampleBioticAssignmentByStratum}} and \code{\link{ResampleBioticAssignmentByAcousticPSU}}. Both of these have their limitations and effect on the mean and variance, and the choice of resampling function should be based on the survey design and how the Hauls are assigned to the AcousticPSUs. The effect on the mean and variance will also depend on the characteristics of the biotic data:
#'
#' When ResampleFunction is \code{\link{ResampleBioticAssignmentByStratum}} in the \code{BootstrapMethodTable} (the only option before StoX 4.0.0), if the AcousticPSUs of a Stratum have different assigned Hauls, there is a probability that none of the assigned Hauls of an AcousticPSU are re-sampled in a bootstrap replicate. Different assigned Hauls can be the result of using \code{DefinitionMethod} "Radius" or "EllipsoidalDistance", or manually assigning different Hauls to each AcousticPSU in \code{\link[RstoxBase]{DefineBioticAssignment}}. This will lead to missing acoustic density for that PSU for the target species, which will propagate throughout to the reports. The only option in order to avoid missing values in the reports in this case is to use RemoveMissingValues = TRUE, which introduces under-estimation of the mean compared to what the estimate would be if none of the AcousticPSUs came out with missing acoustic density. The degree of under-estimation depends on how many of the bootstrap replicates have the problem of AcousticPSUs with missing assignment length distribution.
#'
#' If the problem of under-estimation due to different assigned hauls per AcousticPSU is present in a StoX project, one alternative is to use the ResampleFunction \code{\link{ResampleBioticAssignmentByAcousticPSU}} instead. With this method, the assigned hauls are resampled for \emph{each individual} AcousticPSU. A potential consequence of this resampling function is however that the variance can be under-estimated compared to using the ResampleFunction \code{\link{ResampleBioticAssignmentByStratum}}. The reason for this is that while extreme outcomes of the resampling are equal for all AcousticPSUs of a stratum in the case that ResampleFunction is \code{\link{ResampleBioticAssignmentByStratum}}, extreme outcomes are likely to be counteracted by other outcomes from the rest of the AcousticPSUs in the case that ResampleFunction is \code{\link{ResampleBioticAssignmentByAcousticPSU}}. The under-estimation can be dependent on the number of AcousticPSUs for each Stratum.
#'
#' I addition, using ResampleFunction \code{\link{ResampleBioticAssignmentByAcousticPSU}} will require a sufficient number of Hauls assigned to each AcousticPSU to achieve meaningful bootstrapping. Surely, only one assigned Haul is not sufficient for any contribution to the variance, and will lead to a warning.
#'
#'
#' A different option when experiencing missing values in reports due to different assigned Hauls is to split a Stratum into smaller strata, each for which the \code{DefinitionMethod} "Stratum" can be used in \code{\link[RstoxBase]{DefineBioticAssignment}}.
#'
#' \strong{Resampling of MeanNASCData}
#'
#' The ResampleFunction \code{\link{ResampleMeanNASCData}} resamples, with replacement, the AcousticPSUs within Stratum in the \code{\link{MeanNASCData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The column NASC is scaled by the number of occurrences of each AcousticPSUs from the resampling.
#'
#' \strong{Resampling of ResampleMeanLengthDistributionData}
#'
#' The ResampleFunction \code{\link{ResampleMeanLengthDistributionData}} resamples, with replacement, the BioticPSUs within Stratum in the \code{\link{MeanLengthDistributionData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The column WeightedNumber is scaled by the number of occurrences of each BioticPSUs from the resampling.
#'
#' \strong{Resampling of ResampleMeanSpeciesCategoryCatchData}
#'
#' The ResampleFunction \code{\link{ResampleMeanSpeciesCategoryCatchData}} resamples, with replacement, the BioticPSUs within Stratum in the \code{\link{MeanSpeciesCategoryCatchData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The columns TotalCatchWeight and TotalCatchNumber are scaled by the number of occurrences of each BioticPSUs from the resampling.
#'
#' \strong{General}
#'
#' Run RstoxFramework::getRstoxFrameworkDefinitions("resampleFunctions") in R to get a list of the implemented resampling functions. Note that if a process is selected in \code{BootstrapMethodTable} that is not used in the model up to the \code{OutputProcesses}, the bootstrapping of that process will not be effective on the end result (e.g. select the correct process that returns BioticAssignment data type).
#'
#' A copy of the project is made for each core given by \code{NumberOfCores}.
#'
#' @return
#' A \code{\link{BootstrapData}} object, which is a list of the RstoxData \code{\link[RstoxData]{DataTypes}} and RstoxBase \code{\link[RstoxBase]{DataTypes}}.
#'
#' @export
#'
Bootstrap <- function(
outputData,
outputMemoryFile,
projectPath,
# Table with ProcessName, ResamplingFunction, ResampleWithin, Seed:
BootstrapMethodTable = data.table::data.table(),
NumberOfBootstraps = 1L,
OutputProcesses = character(),
OutputVariables = character(),
UseOutputData = FALSE,
NumberOfCores = 1L,
BaselineSeedTable = data.table::data.table()
) {
# Use previously generated output data if specified:
if(UseOutputData) {
warning("StoX: Using UseOutputData = TRUE in the function Bootstrap implies reading the BootstrapData from a previous run (stored in the output folder)). Any changes made to the Baseline model or to the parameters of the Bootstrap itself will not be accounted for unless UseOutputData = FALSE. The option UseOutputData = TRUE is intended only for saving time when one needs to generate a report from an existing Bootstrap run.")
# Read the outputData (from a former run of the proecss). Use functionArguments["outputData"] to add the data, since using functionArguments$outputData will delete this element if trying to give it the value NULL:
if(is.character(outputData)) {
if(length(outputData)) {
if(!file.exists(outputData)) {
oldRataFile <- file.path(dirname(outputData), "BootstrapData.RData")
if(file.exists(oldRataFile)) {
# We decided not to convert existing RData file and rather assume that the use re-runs the bootstrapping in StoX 4.0.0.
# Convert the old bootstrap RData file to a new NetCDF4 file:
#timeUsed <- system.time(bootstrapRDataToNetCDF4(oldRataFile, processName = processName, ow = T))
# ... with a message:
#warning(
# "An old bootstrap RData file was found and coverted to a NetCDF4 file fitted for StoX >= 4.0.0. This operation is usually time cosuming, and lasted for ", timeUsed[[3]], " seconds.")
stop("A bootstrap output file produced with StoX <= 3.6.2 (RData file) exists, but this file is not compatible with StoX >= 4.0.0 (nc file). Please re-run the Bootstrap process.")
}
else {
stop("The file ", outputData, " does not exist. Please re-run the Bootstrap process.")
}
}
#if(file.exists(outputData)) {
outputData <- tryCatch(
{
if(!file.exists(dirname(outputMemoryFile))) {
dir.create(dirname(outputMemoryFile), recursive = TRUE)
}
file.copy(outputData, outputMemoryFile)
# Restore the class:
outputMemoryFile <- createStoXNetCDF4FileDataType(outputMemoryFile)
return(outputMemoryFile)
},
error = function(err) list(NULL)
)
#}
#else {
# stop("Somethting went wrong when converting RData to NetCDF4...")
#}
}
else {
outputData <- list(NULL)
}
}
else {
stop("outputData must be the path to the output NetCDF4 file from a preivous Bootstrap() run.")
}
return(outputData)
}
# Prepare the bootstrapping:
bootstrapSpec <- prepareBootstrap(projectPath, BootstrapMethodTable, OutputProcesses, NumberOfBootstraps, NumberOfCores, BaselineSeedTable)
# Grap the outputs from the bootstrap preparation:
processesSansProcessData = bootstrapSpec$processesSansProcessData
NumberOfBootstrapsFile = bootstrapSpec$NumberOfBootstrapsFile
bootstrapProgressFile = bootstrapSpec$bootstrapProgressFile
projectPath_copies = bootstrapSpec$projectPath_copies
stopBootstrapFile = bootstrapSpec$stopBootstrapFile
replaceDataList = bootstrapSpec$replaceDataList
BaselineSeedList = bootstrapSpec$BaselineSeedList
processNames = bootstrapSpec$processNames
dataTypes = bootstrapSpec$dataType
# Delete files on exit:
on.exit(unlink(NumberOfBootstrapsFile, force = TRUE, recursive = TRUE))
on.exit(unlink(bootstrapProgressFile, force = TRUE, recursive = TRUE), add = TRUE)
on.exit(unlink(projectPath_copies, force = TRUE, recursive = TRUE), add = TRUE)
on.exit(if(file.exists(NumberOfBootstrapsFile)) unlink(NumberOfBootstrapsFile, force = TRUE, recursive = TRUE), add = TRUE)
on.exit(if(file.exists(bootstrapProgressFile)) unlink(bootstrapProgressFile, force = TRUE, recursive = TRUE), add = TRUE)
on.exit(if(file.exists(stopBootstrapFile)) unlink(stopBootstrapFile, force = TRUE, recursive = TRUE), add = TRUE)
# Run the bootstrap:
bootstrapIndex <- seq_len(NumberOfBootstraps)
temp <- RstoxData::mapplyOnCores(
FUN = runOneBootstrap,
NumberOfCores = NumberOfCores,
ind = bootstrapIndex,
replaceArgsList = BaselineSeedList,
replaceDataList = replaceDataList,
projectPath = projectPath_copies,
# Other inputs:
MoreArgs = list(
projectPath_original = projectPath,
startProcess = min(processesSansProcessData$processIndex),
endProcess = max(processesSansProcessData$processIndex),
outputProcessesIDs = OutputProcesses,
outputVariableNames = OutputVariables,
bootstrapProgressFile = bootstrapProgressFile,
stopBootstrapFile = stopBootstrapFile
),
SIMPLIFY = FALSE
)
# Store the dimensions:
dims <- lapply(temp, "[[", "dims")
nchars <- lapply(temp, "[[", "nchars")
# Treat messages, warnings and errors:
allMessages <- unlist(lapply(temp, "[[", "message"))
allWarnings <- unlist(lapply(temp, "[[", "warning"))
allErrors <- unlist(lapply(temp, "[[", "error"))
# Issue the (unique) messages, warnings and errors, so that these can be picked up:
allMessages <- uniqueText(allMessages, tag = "messages")
allWarnings <- uniqueText(allWarnings, tag = "warnings")
allErrors <- uniqueText(allErrors, tag = "errors")
if(length(allMessages)) {
# In order to avoid the warning "In lapply(c(allWarnings), warning) :longer object length is not a multiple of shorter object length" we do a for loop:
#lapply(allMessages, message)
for(ind in seq_along(allMessages)) message(allMessages[ind])
}
if(length(allWarnings)) {
# In order to avoid the warning "In lapply(c(allWarnings), warning) :longer object length is not a multiple of shorter object length" we do a for loop:
#lapply(allWarnings, warning)
for(ind in seq_along(allWarnings)) warning(allWarnings[ind])
}
if(length(allErrors)) {
# In order to avoid the warning "In lapply(c(allWarnings), warning) :longer object length is not a multiple of shorter object length" we do a for loop:
#lapply(allErrors, stop)
for(ind in seq_along(allErrors)) stop(allErrors[ind])
}
# List all saved memory data folders:
memoryDataFolders <- lapply(projectPath_copies, getProjectPaths, "dataModelsFolder")
memoryDataSubFolders <- unlist(lapply(memoryDataFolders, list.dirs, recursive = FALSE))
# Get the original folders:
originalFolders <- memoryDataSubFolders[basename(memoryDataSubFolders) == "baseline"]
# Start by deleting all original folders (for all cores):
unlink(originalFolders, recursive = TRUE)
# Get only those with suffix:
hasSuffix <- grepl("_", basename(memoryDataSubFolders))
memoryDataSubFolders <- memoryDataSubFolders[hasSuffix]
# Order the sub folders by suffix:
suffix <- as.numeric(sub("^[^_]*_", "", basename(memoryDataSubFolders)))
memoryDataSubFolders <- memoryDataSubFolders[order(suffix)]
# Create a tempfile to write the bootstrap data to, and return the path of this tempfile at the end. Then runProcess() will recognize the tempfile and rename it to a memory file and make a copy to the output file.
filePath <- tempfile()
nc <- NULL
# Run through the bootstrap runs and rename the saved baseline folders to the original name:
message("StoX: Writing netCDF4 file ...")
nc <- writeBootstrapOutputFromIndividualProjects(
memoryDataSubFolders = memoryDataSubFolders,
outputProcesses = OutputProcesses,
outputVariables = OutputVariables,
dataTypes = dataTypes,
filePath = filePath,
nc = nc,
dims = dims,
nchars = nchars,
bootstrapProgressFile = bootstrapProgressFile
)
# Close the nc object that was created in writeBootstrapOutputFromIndividualProjects():
ncdf4::nc_close(nc)
filePath <- createStoXNetCDF4FileDataType(filePath)
# Moved this message to writeProcessOutputTextFile():
#message("StoX: The bootstrap file can be read into R using bootstrapData <- RstoxFramework::readBootstrapData(\"", filePath, "\", selection = NA). Note that using selection = NA reads in the entire file. Use ?RstoxFramework::readBootstrapData in R for futher details.")
#names(filePath) <- "StoXNetCDF4File"
return(filePath)
}
#' List variables used in the reports of a project
#'
#' @inheritParams general_arguments
#'
#' @export
#'
getOutputVariablesUsedInReport <- function(projectPath) {
# The report parameters to get variable names from:
parameters <- c("GroupingVariables", "InformationVariables", "WeightingVariable", "TargetVariable")
pr <- getProcessesSansProcessData(projectPath, modelName = "report")
unique(unlist(lapply(pr$functionParameters, function(x) x[parameters])))
}
writeBootstrapOutputFromIndividualProjects <- function(
memoryDataSubFolders,
outputProcesses,
outputVariables,
dataTypes,
filePath,
nc,
dims,
nchars,
bootstrapProgressFile
) {
validOutputDataClasses = getRstoxFrameworkDefinitions("validOutputDataClasses")
for(ind in seq_along(memoryDataSubFolders)) {
# Get the project path:
thisProjectPath <- dirname(dirname(dirname(dirname(dirname(memoryDataSubFolders[ind])))))
# Rename the sub folder:
originalFolder <- file.path(dirname(memoryDataSubFolders[ind]), "baseline")
file.rename(memoryDataSubFolders[ind], originalFolder)
# Read the data:
processOutput <- getModelData(
projectPath = thisProjectPath,
modelName = "baseline",
processes = outputProcesses,
drop.datatype = FALSE # Keep list of lists
)
# If outputVariableNames is given keep only those variables in all tables:
if(length(outputProcesses)) {
processOutput <- lapply(processOutput, function(x) lapply(x, selectRobust, outputVariables))
}
# Rename back:
file.rename(originalFolder, memoryDataSubFolders[ind])
# Set the dataTypes and processNames as attributes, which are both needed to get the dataType from the netCDF4 file in getDataTypeFromBootstrap():
attr(processOutput, "dataType") <- unname(dataTypes[outputProcesses])
attr(processOutput, "processName") <- outputProcesses
# Write to NetCDF4:
###write_list_NetCDFF4(processOutput, filePath, step = ind, append = TRUE, ow = ind == 1, missval = -9, numberOfSteps = NumberOfBootstraps, verbose = FALSE)
# The following test was done to performance for different compression settings on the tobis test project with NumberOfBootstraps = 200:
# Compression 9:
# Old: (time used: 115.822 s)
# NetCDF4: (time used: 291.861 s)
# FileSize: 27.3 MB
# Compression 1:
# Old: (time used: 133.591 s)
# NetCDF4: (time used: 213.191 s)
# FileSize: 29.1 MB
# Compression NA:
# Old: (time used: 107.817 s)
# NetCDF4: (time used: 170.776 s)
# FileSize: 65.9 MB
#
# However, using compression does not work well with appending to the file. The file may actually become larger with compression than without, so we do not use compression at all:
nc <- write_list_as_tables_NetCDFF4(processOutput, filePath, nc = nc, index = ind, dims = dims, nchars = nchars, append = ind > 1, ow = FALSE, missval = -9, compression = NA, verbose = FALSE, validOutputDataClasses = validOutputDataClasses)
# Add a dot to the progess file:
cat(".", file = bootstrapProgressFile, append = TRUE)
}
return(nc)
}
getDimsAndNcharsFromOldBootstrapData <- function(oldBootstrapData) {
# Assume processName - tableName - variableName:
numberOfBootstraps <- oldBootstrapData[[1]][[1]][, max(BootstrapID)]
dims <- lapply(
seq_len(numberOfBootstraps), function(ind) lapply(
oldBootstrapData, function(x) {
if(data.table::is.data.table(x)) {
dim(x[BootstrapID == ind, ])
}
else {
lapply(x, function(y) dim(y[BootstrapID == ind, ]))
}
}
)
)
nchars <- lapply(
seq_len(numberOfBootstraps), function(ind) lapply(
oldBootstrapData, function(x) {
if(data.table::is.data.table(x)) {
lapply(x[BootstrapID == ind, ], getMaxNchar)
}
else {
lapply(x, function(y) lapply(y[BootstrapID == ind, ], getMaxNchar))
}
}
)
)
#dims <- lapply(processOutput, function(x) lapply(x, dim))
#nchars <- lapply(processOutput, function(x) lapply(x, function(y) lapply(y, getMaxNchar)))
return(
list(
dims = dims,
nchars = nchars
)
)
}
getBootstrapOneTable <- function(BootstrapDataOneTable, ind, outputVariables) {
if(data.table::is.data.table(BootstrapDataOneTable)) {
subset(
BootstrapDataOneTable,
subset = BootstrapID == ind,
select = if(length(outputVariables)) intersect(outputVariables, names(BootstrapDataOneTable)) else names(BootstrapDataOneTable)
)
#BootstrapDataOneTable[BootstrapID == ind, outputVariables]
}
else {
lapply(
BootstrapDataOneTable, function(table)
subset(
table,
subset = BootstrapID == ind,
select = if(length(outputVariables)) intersect(outputVariables, names(table)) else names(table)
)
)
}
}
#singleTableOutputInBootstrapData <- function(BootstrapData) {
# !any("_Data" %in% names(BootstrapData))
#}
#' Convert a bootstrap RData file to NetCDF4. THIS FUNCTION IS NOT FINISHED, AND SHOULD NOT BE!!!
#'
#' @param bootstrapRDataFile The path to file holding the BoostrapData from a StoX <= 3.6.2 project run.
#' @param bootstrapNetCDF4File The path to the new netCDF4 file.
#' @param outputVariables An optional list of variables to keep in the output. A typical set of variables could be ["Survey", "Stratum", "SpeciesCategory", "IndividualTotalLength", "IndividualAge", "Abundance", "Biomass"], which should cover the most frequently used variables in reports. Any variable that is used in a report must be present in \code{outputVariables}. Empty list (the default) implies to keep all variables. This parameter is included to facilitate smaller disc space for the bootstrap objects and faster writing/reading of that file.
#' @param ow Logical: If TRUE, overwrite the \code{bootstrapNetCDF4File}.
#'
bootstrapRDataToNetCDF4 <- function(
bootstrapRDataFile,
bootstrapNetCDF4File = file.path(dirname(bootstrapRDataFile), "BootstrapData.nc"),
outputVariables = NULL,
ow = FALSE
) {
startTime <- proc.time()[3]
# Load the RData:
message("StoX: Reading RData file ", bootstrapRDataFile)
BootstrapData <- loadRData(bootstrapRDataFile)
timeSpent <- proc.time()[3] - startTime
message("StoX: Time used: ", round(timeSpent, digits = 3), " s")
# Try to intepret the table name:
namesSplit <- strsplit(names(BootstrapData), "_")
processName <- sapply(namesSplit, "[", 1)
tableName <- lapply(namesSplit, "[", 2)
atMissingTableName <- which(is.na(tableName))
for(ind in atMissingTableName) {
tableName[[ind]] <- interpretTableNameFromVariables(BootstrapData[[ind]])
}
BootstrapData <- split(BootstrapData, processName)
tableNameSplit <- split(tableName, processName)
for(ind in seq_along(BootstrapData)) {
names(BootstrapData[[ind]]) <- tableNameSplit[[ind]]
}
# Get dims and nchars:
dimsNchars <- getDimsAndNcharsFromOldBootstrapData(BootstrapData)
midTime <- proc.time()[3]
message("StoX: Writing nc file ", bootstrapNetCDF4File)
nc <- writeBootstrapOutputFromOldBootstrapData(
oldBootstrapData = BootstrapData,
outputVariables = outputVariables,
filePath = bootstrapNetCDF4File,
dims = dimsNchars$dims,
nchars = dimsNchars$nchars,
ow = ow
)
timeSpent <- proc.time()[3] - midTime
message("StoX: Time used: ", round(timeSpent, digits = 3), " s")
# Close the nc object that was created in writeBootstrapOutputFromOldBootstrapData():
ncdf4::nc_close(nc)
timeSpent <- proc.time()[3] - midTime
message("StoX: Total time used: ", round(timeSpent, digits = 3), " s")
return(bootstrapNetCDF4File)
}
interpretTableNameFromVariables <- function(x) {
individualsVariableNames <- c("Stratum", "Layer", "Individual")
superIndividualsVariableNames <- c(individualsVariableNames, "Abundance")
if(all(superIndividualsVariableNames %in% names(x))) {
return("SuperIndividualsData")
}
else if(all(individualsVariableNames %in% names(x))) {
return("IndividualsData")
}
else {
stop("Unknown bootstrap output table.")
}
}
# Function to load RData to a variable (https://stackoverflow.com/questions/5577221/can-i-load-a-saved-r-object-into-a-new-object-name):
loadRData <- function(fileName){
#loads an RData file, and returns it
load(fileName)
get(ls()[ls() != "fileName"])
}
writeBootstrapOutputFromOldBootstrapData <- function(oldBootstrapData, outputVariables, filePath, dims, nchars, ow = FALSE) {
#dataTypeAttribute <- attr(oldBootstrapData, "dataType")
#processNameAttribute <- attr(oldBootstrapData, "processName")
dataTypes <- unlist(lapply(oldBootstrapData, attr, "dataType"))
processNames <- names(oldBootstrapData)
#if(!singleTableOutputInBootstrapData(oldBootstrapData)) {
# stop("StoX: This conversion only works on BootstrapData with single table outputs (like SuperIndividualsData but not AbundanceData which has tables Data and Resolution).")
#}
# Assume that there are no multi-table outputs:
numberOfBootstraps <- oldBootstrapData[[1]][[1]][, max(BootstrapID)]
# Get this only once to save time:
validOutputDataClasses = getRstoxFrameworkDefinitions("validOutputDataClasses")
nc <- NULL
# Write one bootstrap at the time:
for(ind in seq_len(numberOfBootstraps)) {
# Subset the data:
processOutput <- lapply(oldBootstrapData, getBootstrapOneTable, ind = ind, outputVariables = outputVariables)
# Set the dataType and processName attributes:
attr(processOutput, "dataType") <- dataTypes
attr(processOutput, "processName") <- processNames
nc <- write_list_as_tables_NetCDFF4(processOutput, filePath, nc = nc, index = ind, dims = dims, nchars = nchars, append = ind > 1, ow = ow, missval = -9, compression = NA, verbose = FALSE, validOutputDataClasses = validOutputDataClasses)
}
return(nc)
}
prepareBootstrap <- function(projectPath, BootstrapMethodTable, OutputProcesses, NumberOfBootstraps, NumberOfCores, BaselineSeedTable) {
# Identify the first process set for resampling by the BootstrapMethodTable (here BootstrapMethodTable$ProcessName can be a vector, in which case the table is returned from the first process of these and onwards):
processesSansProcessData <- getProcessesSansProcessData(projectPath, modelName = "baseline", startProcess = BootstrapMethodTable$ProcessName, endProcess = OutputProcesses, return.processIndex = TRUE, only.valid = TRUE)
# Check the output processes:
if(length(OutputProcesses) && !all(OutputProcesses %in% processesSansProcessData$processName)) {
stop("StoX: The following processes specified in OutputProcesses were not recognized: ", paste(setdiff(OutputProcesses, processesSansProcessData$processName), collapse = ", "))
OutputProcesses <- intersect(OutputProcesses, processesSansProcessData$processName)
}
if(!length(OutputProcesses)) {
stop("StoX: At least one valid process name must be given in OutputProcesses")
}
# The baseline must have been run until the last output process:
activeProcessIndex <- getProcessIndexFromProcessID(
projectPath = projectPath,
modelName = "baseline",
processID = getActiveProcess(
projectPath = projectPath,
modelName = "baseline"
)$processID
)
baselineMustHaveBeenRunTo <- max(processesSansProcessData$processIndex)
if(activeProcessIndex < baselineMustHaveBeenRunTo) {
stop("The Baseline model must be run at least until the last output process \"", getProcessNameFromProcessID(projectPath, modelName = "baseline", processID = processesSansProcessData[processIndex == baselineMustHaveBeenRunTo, processID]), "\"")
}
# Draw the seeds in a table, and then split into a list for each bootstrap run:
#SeedTable <- data.table::as.data.table(lapply(BootstrapMethodTable$Seed, RstoxBase::getSeedVector, size = NumberOfBootstraps))
#names(SeedTable) <- BootstrapMethodTable$ProcessName
#SeedList <- split(SeedTable, seq_len(nrow(SeedTable)))
# Create the replaceDataList input to runProcesses, which defines the seed for each bootstrap run:
replaceDataList <- createReplaceData(
BootstrapMethodTable = BootstrapMethodTable,
NumberOfBootstraps = NumberOfBootstraps
)
# Scan through the baseline processes to be run and look for processes with the parameter Seed:
hasSeed <- sapply(processesSansProcessData$functionParameters, function(x) "Seed" %in% names(x))
# Error if the BaselineSeedTable does not contain exactly the processes using Seed in the Baseline processes to be run:
if(any(hasSeed)) {
presentInBaselineButNotInBaselineSeedTable <- setdiff(
processesSansProcessData$processName[hasSeed],
BaselineSeedTable$ProcessName
)
if(length(presentInBaselineButNotInBaselineSeedTable)) {
stop("The BaselineSeedTable must contain Seed for the processes ", paste(processesSansProcessData$processName[hasSeed], collapse = ", "), ".")
}
}
# Construct a list of lists, where each list contains a list of Seed named by the processes using Seed in the Baseline:
BaselineSeedList <- drawNamedSeedList(
table = BaselineSeedTable,
NumberOfBootstraps = NumberOfBootstraps
)
# Get the number of cores to open:
NumberOfCores <- RstoxData::getNumberOfCores(NumberOfCores, n = NumberOfBootstraps)
# Copy the project to the tempdir for each core:
projecName <- basename(projectPath)
# As of 2021-05-27 (v3.0.23) make a copy even if running on only one core. This for safety:
#if(NumberOfCores > 1) {
projectPath_copies <- file.path(tempdir(), paste(projecName, seq_len(NumberOfCores), sep = "_"))
message("Copying the project ", projecName, " to the ", length(projectPath_copies), " projects \n\t", paste(projectPath_copies, collapse = "\n\t"))
temp <- RstoxData::mapplyOnCores(
copyProject,
projectPath = projectPath,
newProjectPath = projectPath_copies,
MoreArgs = list(
ow = TRUE,
empty.memory = c("analysis", "report"),
empty.output = TRUE
),
NumberOfCores = NumberOfCores
)
# Run the subset of the baseline model:
bootstrapProgressFile <- getProjectPaths(projectPath, "progressFile")$analysis
NumberOfBootstrapsFile <- getProjectPaths(projectPath, "NFile")$analysis
stopBootstrapFile <- getProjectPaths(projectPath, "stopFile")$analysis
if(file.exists(stopBootstrapFile)) {
unlink(stopBootstrapFile, force = TRUE, recursive = TRUE)
}
# Prepare for progress info. Multiply the number of bootstraps by 2 to include merging the individual nc files to one single file, which happens after the possibly parallelized bootstrapping:
writeLines(as.character(NumberOfBootstraps * 2), NumberOfBootstrapsFile)
# Get the dataTypes:
processNames <- processesSansProcessData$processName
processIDs <- unlist(
mapply(
getProcessIDFromProcessName,
processName = processNames,
MoreArgs = list(
projectPath = projectPath,
modelName = "baseline"
),
only.processID = TRUE,
SIMPLIFY = FALSE
)
)
dataTypes <- unlist(
mapply(
getDataType,
processID = processIDs,
MoreArgs = list(
projectPath = projectPath,
modelName = "baseline"
),
SIMPLIFY = FALSE
)
)
output <- list(
processesSansProcessData = processesSansProcessData,
NumberOfBootstrapsFile = NumberOfBootstrapsFile,
bootstrapProgressFile = bootstrapProgressFile,
projectPath_copies = projectPath_copies,
stopBootstrapFile = stopBootstrapFile,
replaceDataList = replaceDataList,
BaselineSeedList = BaselineSeedList,
processNames = processNames,
dataTypes = dataTypes
)
return(output)
}
#' Get bootstrap data saved in a NetCDF4 file
#'
#' @param filePath The path to the file.
#' @param nc A netCDF4 object, overiding the \code{filePath}.
#' @param selection Hierarchical list of names of the groups/variables. The last element must be a vector of the variables to return from the table specified by the other elements. E.g., list("ImputeSuperIndividuals", "SuperIndividualsData", c("Stratum", "IndividualAge", "Abundance")) will return a data.table of the three columns "Stratum", "IndividualAge" and "Abundance", added the BootstrapID specified in \code{BootstrapID}.
#' @param BootstrapID A sequence of bootstrap IDs, i.e., the indices of the bootstrap replicates. The default returns all bootstrap replicates.
#' @param unlistSingleTable Logical: For use when only single table process outputs are among the requested processes in \code{OutputProcesses} of \code{\link{Bootstrap}}. If FALSE (default) return a list named by the \code{selection} with a sub-list named by the datatype holding the output data (e.g. $ImputeSuperIndividuals$SuperIndividualsData). If TRUE return a list named by the \code{selection} holding the output data, replicating the output of \code{\link{Bootstrap}} in StoX <= 3.6.2.
#' @param close Logical: If TRUE (the default) close the netCDF4 file after reading the data.
#'
#' @export
#'
readBootstrapData <- function(filePath, nc, selection = list(), BootstrapID = NA, unlistSingleTable = FALSE, close = TRUE) {
# Use the open file if given:
if(!missing(filePath) && file.exists(filePath)) {
if(missing(nc) || !inherits(nc, "ncdf4")) {
nc <- ncdf4::nc_open(filePath)
}
else {
warning("The nc argument is given as a ncdf4 object. Ignoring filePath.")
}
}
else {
if(missing(nc) || !inherits(nc, "ncdf4")) {
stop("Either filePath or nc must be given.")
}
}
if(close) {
on.exit(ncdf4::nc_close(nc))
}
# Get the list of selections:
selection <- getSelection(selection, nc)
# Get each table separately:
output <- lapply(
selection,
readBootstrapDataOne,
nc = nc,
BootstrapID = BootstrapID
)
# Unlist to a list of the individual tables:
groupNames <- sub("/.*", "", names(output))
# The output of the lapply above must be unlisted:
output <- lapply(output, "[[", 1)
#names(output) <- groupNames
# Split by the groupNames, while keeping the order intact (by setting the levels of the factor):
output <- split(output, factor(groupNames, levels = unique(groupNames)))
# The output of readBootstrapDataOne() requires we unlist one level to get down to the table:
output <- lapply(output, function(x) structure(unlist(x, recursive = FALSE), names = sapply(x, names)))
# If there are only single table outputs, recreate the StoX <= 3.6.2 output, as a list named by the processes:
if(unlistSingleTable) {
# Find single table processes:
singleTable <- lengths(output) == 1
# Unlist those:
output[singleTable] <- lapply(output[singleTable], "[[", 1)
#names(output[singleTable]) <- groupNames[singleTable]
#processNames <- names(output)
#output <- unlistToDataType(output, nlevel = 2)
#names(output) <- processNames
}
#else {
# output <- lapply(
# unique(groupNames), function(name) {
# at <- which(names(output) == name)
# # If the current list element is a valid StoX class (e.g. data.table) return it unchcanged:
# if(length(at) == 1 && isValidOutputDataClass(output[[at]][[1]])) {
# output[[at]][[1]]
# }
# # Otherwise, unlist out the lapply using readBootstrapDataOne() above, and c tables of the same process:
# else {
# unlisted1 <- unlist(output[at], recursive = FALSE)
# tableNames <- sapply(unlisted1, names)
# structure(c(unlist(unlisted1, recursive = FALSE)), names = tableNames)
# }
# }
# )
# names(output) <- unique(groupNames)
#}
return(output)
}
getVarList <- function(nc, flatten = TRUE) {
# Get all the variable ids as slash separated strings:
varStrings <- sapply(nc$var, "[[", "name")
# Split by "/":
varStringsSplit <- strsplit(varStrings, "/")
# Rbind to a data.table:
dt <- data.table::rbindlist(lapply(varStringsSplit, as.list))
names(dt) <- c("processName", "tableName", "variableName")
# Split by the first two columns, which are processName and tableName
output <- split(dt, by = c("processName", "tableName"), flatten = flatten)
if(flatten) {
names(output) <- sapply(output, function(x) paste(x[1,1], x[1,2], sep = "/"))
}
return(output)
}
readBootstrapDataOne <- function(selectionOneTable, nc, BootstrapID = NA) {
# Get the variables:
var <- nc$var
ndims <- lapply(var, "[[", "ndims")
# Get the selection elements:
selection <- splitSelectionOneTable(selectionOneTable)
listNames <- selection$listNames
requestedVariables <- selection$requestedVariables
requestedVariablesFullName <- selection$requestedVariablesFullName
# Check against the stored variables:
varList <- getVarList(nc, flatten = FALSE)
varListFullName <- unname(unlist(lapply(varList, function(x) lapply(x, function(y) apply(y, 1, paste, collapse = "/")))))
notPresentInFile <- setdiff(requestedVariablesFullName, varListFullName)
if(length(notPresentInFile)) {
stop("The following requested variables are not present in the file ", nc$filename, " (These variables can be added using the parameter OutputVariables of the function Bootstrap()): \n", paste(notPresentInFile, collapse = "\n"))
}
# Get the nrow variable, which determines what to read, and which will be added as BootstrapID to the output table:
nrowsVariable <- paste(paste(listNames, collapse = "/"), "nrow", sep = "/")
nrows <- ncdf4::ncvar_get(nc, nrowsVariable)
# Check that BootstrapID is an unbroken sequence:
if(length(BootstrapID) == 1 && is.na(BootstrapID)) {
BootstrapID <- seq_len( length(nrows))
}
if(any(diff(BootstrapID) != 1)) {
warning("BootstrapID must be an unbroken sequence. Converted to the sequence from min(BootstrapID) to max(BootstrapID) (", min(BootstrapID), " - ", max((BootstrapID)), ").")
BootstrapID <- seq(min(BootstrapID), max(BootstrapID))
}
# Exclude BootstrapID larger than length(nrows):
if(any(BootstrapID < 1)) {
BootstrapID <- BootstrapID[BootstrapID > 1]
}
if(any(BootstrapID > length(nrows))) {
BootstrapID <- BootstrapID[BootstrapID <= length(nrows)]
}
start <- 1 + sum(nrows[seq_len(min(BootstrapID) - 1)])
count <- sum(nrows[BootstrapID])
# Read the variables:
list <- lapply(
requestedVariablesFullName,
getVarFromNC,
nc = nc,
ndims = ndims,
start = start,
count = count
)
# ncvar_get() may add a single dimension to the output, so we drop dimensions:
list <- lapply(list, c)
# Set names to the variables and combine to a data.table:
names(list) <- requestedVariables
table <- data.table::setDT(list)
# Set character "NA" to NA:
atCharacter <- which(sapply(table, is.character))
for(ind in atCharacter){
set(table, i = which(table[[ind]] == "NA"), j = ind, value = NA)
}
# Set time:
atTime <- which(names(table) == "DateTime")
if(length(atTime)) {
# Get the DateTime format used by StoX:
StoxDateTimeFormat <- RstoxData::getRstoxDataDefinitions("StoxDateTimeFormat")
StoxTimeZone <- RstoxData::getRstoxDataDefinitions("StoxTimeZone")
# Convert to POSIX (origin = "1970-01-01" used here to support R 4.2 on Windows and Linux):
table[, DateTime := as.POSIXct(DateTime, format = StoxDateTimeFormat, tz = StoxTimeZone, origin = "1970-01-01")]
}
# Add the BootstrapID:
BootstrapID_repeated <- rep(BootstrapID, nrows[BootstrapID])
data.table::set(table, j = "BootstrapID", value = BootstrapID_repeated)
#table[, BootstrapID := ..BootstrapID]
# Expand to a list like the output from a Bootstrap process:
for(sel in rev(listNames)) {
table <- structure(list(table), names = sel)
}
return(table)
}
getVarFromNC <- function(var, nc, ndims, start, count) {
tryCatch(
ncdf4::ncvar_get(
nc,
var,
start = if(ndims[[var]] == 2) c(1, start) else start,
count = if(ndims[[var]] == 2) c(-1, count) else count
),
error = function(e) {
NULL
}
)
}
getNumberOfBootstraps <- function(nc, selection = list()) {
# Get the variables:
var <- nc$var
ndims <- lapply(var, "[[", "ndims")
if(!length(selection)) {
warning("No selection of table/variables. Returning NULL.")
return(NULL)
}
selection <- splitSelectionOneTable(selection)
listNames <- selection$listNames
#requestedVariables <- selectionList$requestedVariables
#requestedVariablesFullName <- selectionList$requestedVariablesFullName
# Get the nrow variable,, which determine what to read, and which will be added as BootstrapID to the output table:
nrowsVariable <- paste(paste(listNames, collapse = "/"), "nrow", sep = "/")
nrows <- ncdf4::ncvar_get(nc, nrowsVariable)
NumberOfBootstraps <- length(nrows)
return(NumberOfBootstraps)
}
splitSelectionOneTable <- function(selectionOneTable) {
# Extract list elements:
listNames <- unlist(utils::head(strsplit(selectionOneTable[1], "/", fixed = TRUE)[[1]], -1))
requestedVariables <- unlist(lapply(strsplit(selectionOneTable, "/", fixed = TRUE), utils::tail, 1))
requestedVariablesFullName <- selectionOneTable
list(
listNames = listNames,
requestedVariables = requestedVariables,
requestedVariablesFullName = requestedVariablesFullName
)
}
getSelection <- function(selection, nc) {
if(!length(selection)) {
warning("No selection of tables/variables in the NetCDF4 file. Returning NULL. Use selection = NA to get all variables.")
return(NULL)
}
else if(length(selection) == 1 && (is.na(selection) || is.na(selection[[1]]))) {
# Get all variables in a list by table:
output <- getVarList(nc, flatten = TRUE)
}
else{
if(is.list(selection)) {
# Get the hierarchical structure of processName, tableName and variableName:
varList <- getVarList(nc, flatten = FALSE)
# Select the processes:
output <- varList[selection[[1]]]
# Select the tables:
if(length(selection) >= 2) {
output <- mapply(
function(table, element) if(length(element) == 1 && is.na(element)) table else table[element],
output,
selection[[2]],
SIMPLIFY = FALSE
)
# Unlist non-recursively and concatenate the names:
output <- unlist(output, recursive = FALSE)
names(output) <- sapply(output, function(x) paste(x[1,1], x[1,2], sep = "/"))
}
# Select the variables:
if(length(selection) == 3) {
# But only if all elements of selection have equal length (where lists are allwed as elements in selection):
lens <- lengths(selection)
if(! all(lens == lens[1])) {
stop("If selection has length 3, all elements must have equal length, implying to get the variables specified in the third element from the tables specified by the first two (process name and table name).")
}
processTableNames <- paste(selection[[1]], selection[[2]], sep = "/")
output <- mapply(
function(table, element) if(length(element) == 1 && is.na(element)) table else table[ intersect(table$variableName, element), on = "variableName"],
output,
selection[[3]],
SIMPLIFY = FALSE
)
}
}
else {
output <- selection
}
}
if(is.list(output)) {
# Paste:
output <- lapply(output, function(x) apply(subset(x, variableName != "nrow"), 1, paste, collapse = "/"))
}
else {
processTableNames <- sub('/[^/]*$', '', output)
output <- split(output, processTableNames)
}
return(output)
}
getProcessNameTableNameVariableName <- function(nc, add.class = FALSE) {
list <- getVariableNameElementsList(nc)
# Add NA table name for the single table data that are dropped of the data type (table name):
list <- lapply(list, function(x) if(length(x) == 2) c(x[1], NA, x[2]) else x)
table <- data.table::rbindlist(lapply(list, as.list))
data.table::setnames(table, c("processName", "tableName", "variableName"))
# Add the class stored in the file:
if(add.class) {
table[, class := sapply(nc$var, "[[", "prec")]
}
return(table)
}
getVariableNameElementsList <- function(nc) {
lapply(lapply(sapply(nc$var, "[[", "name"), strsplit, "/", fixed = TRUE), unlist)
}
getProcessNamesAndTableNames <- function(nc) {
# Get the variable name elements:
variableNameElementsList <- getVariableNameElementsList(nc)
# Discard the last element which is the column name:
variableNameElementsList <- unique(lapply(variableNameElementsList, utils::head, -1))
# Add names dataType and tableName:
variableNameElementsList <- lapply(variableNameElementsList, function(x) structure(as.list(x), names = c("processName", "tableName")[seq_along(x)]))
# Rbind into a table:
processNamesAndTableNames <- data.table::rbindlist(variableNameElementsList, fill = TRUE)
#names(processNamesAndTableNames) <- c("processName", "tableName")
#processNamesAndTableNames <- unique(processNamesAndTableNames)
return(processNamesAndTableNames)
}
# Get the datatype of a specific processName in a BootstrapNetCDF4 file:
getDataTypeFromBootstrap <- function(nc, processName) {
# Get the dataType and processName global attributes:
#processNames <- ncdf4::ncatt_get(nc, varid = 0, attname = "processName")$value
#dataTypes <- ncdf4::ncatt_get(nc, varid = 0, attname = "dataType")$value
processNames <- readStringVectorAttribute(nc, varid = 0, attname = "processName")
dataTypes <- readStringVectorAttribute(nc, varid = 0, attname = "dataType")
if(! processName %in% processNames) {
warning("The process ", processName, " is not present with a dataType in the file \"", nc$filename, "\". Present processes are ", paste(processNames, collapse = ", "), ".")
return(NULL)
}
atProcecssName <- which(processName == processNames)
if(length(atProcecssName) > 1) {
stop("The file \"", nc$filename, "\" specifies more than one dataType for the process ", processName, ".")
}
thisDataType <- dataTypes[atProcecssName]
return(thisDataType)
}
# Funcion to draw seeds for each process given in the table and each bootstrap run, and reshape the seeds into a list of either data.table with rows
drawNamedSeedList <- function(table, NumberOfBootstraps) {
if(!length(table)) {
SeedList <- vector("list", NumberOfBootstraps)
SeedList <- lapply(SeedList, as.data.table)
}
else {
SeedTable <- data.table::as.data.table(lapply(table$Seed, RstoxBase::getSeedVector, size = NumberOfBootstraps))
names(SeedTable) <- table$ProcessName
SeedList <- split(SeedTable, seq_len(nrow(SeedTable)))
# ERROR: This does not work, as the result is not a list with an element named Seed:
SeedList <- lapply(SeedList, function(x) structure(as.list(x), names = names(x)))
# The following will work:
#warning("StoX: NEW BASELINESEED METHOD")
#SeedList <- lapply(SeedList, function(x) lapply(x, function(y) list(Seed = y)))
}
return(SeedList)
}
drawSeedTable <- function(table, NumberOfBootstraps) {
if(!length(table)) {
SeedTable <- vector("list", NumberOfBootstraps)
}
else {
SeedTable <- data.table::as.data.table(lapply(table$Seed, RstoxBase::getSeedVector, size = NumberOfBootstraps))
names(SeedTable) <- table$ProcessName
SeedTable <- split(SeedTable, seq_len(nrow(SeedTable)))
}
return(SeedTable)
}
rbindlistByName <- function(name, x) {
subnames <- names(x[[1]][[name]])
out <- lapply(subnames, rbindSublistByName, name, x)
names(out) <- subnames
return(out)
}
rbindSublistByName <- function(subname, name, x) {
data.table::rbindlist(lapply(x, function(y) y[[name]][[subname]]))
}
# Function to create a list of the ReplaceData input to runProcesses(),
createReplaceDataSansFunctionName <- function(Seed) {
lapply(Seed, function(x) list(MoreArgs = list(Seed = x)))
}
addFunctionNameToReplaceData <- function(replaceData, BootstrapMethodTable) {
out <- lapply(names(replaceData), function(name)
c(
# Add the full name of the function:
list(FunctionName = paste("RstoxFramework", BootstrapMethodTable[ProcessName == name, ResampleFunction], sep = "::")),
replaceData[[name]]
))
# Add the process names to the list to enable replaceDataList in runProcesses():
names(out) <- names(replaceData)
return(out)
}
createReplaceData <- function(BootstrapMethodTable, NumberOfBootstraps) {
# Create a table of seed values:
SeedTable <- drawSeedTable(
table = BootstrapMethodTable,
NumberOfBootstraps = NumberOfBootstraps
)
replaceDataList <- lapply(SeedTable, createReplaceDataSansFunctionName)
replaceDataList <- lapply(replaceDataList, addFunctionNameToReplaceData, BootstrapMethodTable = BootstrapMethodTable)
return(replaceDataList)
}
# Define a function to run processes and save the output of the last process to the output folder:
runOneBootstrap <- function(ind, replaceArgsList, replaceDataList, projectPath, projectPath_original, startProcess, endProcess, outputProcessesIDs, outputVariableNames, bootstrapProgressFile, stopBootstrapFile) {
# Stop if the file stopBootstrap.txt exists:
if(file.exists(stopBootstrapFile)) {
stop("Bootstrap aborted by the user.")
}
# Run the part of the baseline that contains the processes to be modified to those to be returned:
#runProcesses(
# projectPath,
# modelName = "baseline",
# startProcess = startProcess,
# endProcess = endProcess,
# save = FALSE,
# # Be sure to not touch the process data and file output,
# # the latter mostly for speed if the bootstrap is run on a copy of the project
# # (see if this is the case in the code of Bootstrap()):
# saveProcessData = FALSE,
# # Do not save the output (text) files:
# fileOutput = FALSE,
# setUseProcessDataToTRUE = FALSE,
# replaceArgsList = replaceArgsList,
# replaceDataList = replaceDataList,
# msg = FALSE
#)
temp <- runFunction(
what = "runProcesses",
args = list(
projectPath = projectPath,
modelName = "baseline",
startProcess = startProcess,
endProcess = endProcess,
save = FALSE,
# Be sure to not touch the process data and file output,
# the latter mostly for speed if the bootstrap is run on a copy of the project
# (see if this is the case in the code of Bootstrap()):
saveProcessData = FALSE,
# Do not save the output (text) files:
fileOutput = FALSE,
setUseProcessDataToTRUE = FALSE,
replaceArgsList = replaceArgsList,
replaceDataList = replaceDataList,
try = FALSE,
msg = FALSE
),
onlyStoxMessages = FALSE,
onlyStoxWarnings = FALSE,
maxLength.Message = 10000,
maxLength.Warning = 10000,
maxLength.Error = 10000,
collapseMessages = FALSE,
collapseWarnings = FALSE,
collapseErrors = FALSE,
header = NULL
)
# Get the output dimensions for use when writing NetCDF4 later:
processOutput <- getModelData(
projectPath = projectPath,
modelName = "baseline",
processes = outputProcessesIDs,
drop.datatype = FALSE
)
# If outputVariableNames is given keep only those variables in all tables. This must also be done when writing the final NetCDF4 file:
if(length(outputVariableNames)) {
processOutput <- lapply(processOutput, function(x) lapply(x, selectRobust, outputVariableNames))
}
dims <- lapply(
processOutput, function(x) {
if(data.table::is.data.table(x)) {
dim(x)
}
else {
lapply(x, dim)
}
}
)
nchars <- lapply(
processOutput, function(x) {
if(data.table::is.data.table(x)) {
lapply(x, getMaxNchar)
}
else {
lapply(x, function(y) lapply(y, getMaxNchar))
}
}
)
# Rename the baseline output data memory folder. This effectively saves the output:
renameDataFolders(
projectPath = projectPath,
modelName = "baseline",
processes = outputProcessesIDs,
suffix = ind,
restore = TRUE
)
# Add a dot to the progess file:
cat(".", file = bootstrapProgressFile, append = TRUE)
#cat(paste(".", ind, basename(projectPath), "\n"), file = bootstrapProgressFile, append = TRUE)
# Clear memory to avoid memory leak when running parallel bootstrapping. NEVER remove this unless the performance has been thoroughly tested by running bootstrap on multiple large StoX projects sequentially, to give R a chance to pile up memory:
gc()
return(list(
dims = dims,
nchars = nchars,
message = temp$message,
warning = temp$warning,
error = temp$error
))
}
# Select variable of a table, ignoring variable names that are not present:
selectRobust <- function(x, var) {
var <- intersect(var, names(x))
if(length(var)) {
x <- subset(x, select = var)
}
return(x)
}
getMaxNchar <- function(x) {
if("POSIXct" %in% class(x)) {
24 # format = "%Y-%m-%dT%H:%M:%OS3Z")
}
else if(is.character(x)){
suppressWarnings(max(1, if(any(is.na(x))) 2, nchar(x, type = "bytes"), na.rm = TRUE))
}
else {
NA
}
}
renameDataFolders <- function(
projectPath,
modelName,
processes,
suffix,
sep = "_",
restore = TRUE
){
# Get the processes to get rename data folder for:
processTable <- readProcessIndexTable(
projectPath = projectPath,
modelName = modelName,
processes = processes
)
processID = processTable$processID
# Get the current and new dataModelsFolder:
outputMemoryDataFolder <- getProjectPaths(projectPath)[["dataModelsFolders"]][modelName]
outputMemoryDataFolders <- file.path(outputMemoryDataFolder, processID)
outputMemoryDataFolder_new <- paste(outputMemoryDataFolder, suffix, sep = sep)
outputMemoryDataFolders_new <- file.path(outputMemoryDataFolder_new, processID)
# Create the new folder:
dir.create(outputMemoryDataFolder_new)
# Rename:
file.rename(outputMemoryDataFolders, outputMemoryDataFolders_new)
return(outputMemoryDataFolders_new)
}
### #' Stop a bootstrap run.
### #'
### #' @inheritParams general_arguments
### #'
### #' @export
### #'
### stopBootstrap <- function(projectPath) {
### write("", file = getProjectPaths(projectPath, "stopBootstrapFile"))
### }
#' Stop runProcesses().
#'
#' @inheritParams general_arguments
#'
#' @export
#'
stopProcesses <- function(projectPath, modelName) {
stopFile <- getProjectPaths(projectPath, "stopFile")[[modelName]]
write("", file = stopFile)
}
#' Get process progress
#'
#' @inheritParams general_arguments
#' @param percent Logical: If TRUE return the progress in percent, otherwise in [0,1].
#'
#' @export
#'
progressOfProcesses <- function(projectPath, modelName, percent = FALSE) {
# Get the paths to the progress and N files:
ProgressFile <- getProjectPaths(projectPath, "progressFile")[[modelName]]
NFile <- getProjectPaths(projectPath, "NFile")[[modelName]]
if(!file.exists(ProgressFile)) {
return(NA)
}
# Read number of dots and compare the N:
N <- as.numeric(readLines(NFile, warn = FALSE))
Progress <- readLines(ProgressFile, warn = FALSE)
n <- lengths(regmatches(Progress, gregexpr(".", Progress, fixed = TRUE)))
Progress <- n / N
if(percent) {
Progress <- Progress * 100
}
return(Progress)
}
#' Get process progress
#'
#' @inheritParams general_arguments
#' @param string.out Logical: If TRUE return a string listing the time remaining and the total estimated time.
#'
#' @export
#'
estimateTimeOfProcesses <- function(projectPath, modelName, string.out = TRUE) {
# Get progress:
progress <- progressOfProcesses(projectPath, modelName)
# Read time of NFile:
NFile <- getProjectPaths(projectPath, "NFile")$analysis
if(!file.exists(NFile)) {
return(NA)
}
startTime <- file.info(NFile)$ctime
# And compare to the present time:
now <- Sys.time()
timeUsed <- now - startTime
# Predict the total time:
totalTimePredicted <- timeUsed / progress
# Predict time remaninig:
remainingTimePredicted <- totalTimePredicted - timeUsed
# Paste to a string message:
if(string.out) {
output <- paste0("Estimated ", round(remainingTimePredicted, 1), " ", attr(remainingTimePredicted, "units"), " remaining (", 100 * progress, " percent elapsed of total estimated time ", round(totalTimePredicted, 1), " ", attr(totalTimePredicted, "units"), ")")
}
else {
output <- remainingTimePredicted
}
return(output)
}
#readBootstrap <- function(projectPath, processName) {
# # Bootstrap lives in the model "analysis" only:
# modelName = "analysis"
# # Get the folder holding the output form the bootstrap:
# processID <- getProcessIDFromProcessName(projectPath = projectPath, modelName = modelName, processName = processName)
# folderPath <- getProcessOutputFolder(projectPath = projectPath, modelName = modelName, processID = processID, type = "output")
#
# # Return NULL if the folder does not exist:
# if(!file.exists(folderPath)) {
# warning("StoX: The process ", folderPath, " does not have an output folder. Has the proccess been run?")
# return(NULL)
# }
#
# # Get a nested list of all output files from the bootstrap:
# bootstrapOutputFiles <- getFilesRecursiveWithOrder(folderPath)
#
# # Read all files:
# processOutput <- rapply(
# bootstrapOutputFiles,
# readProcessOutputFile,
# how = "replace"
# )
#
#}
# # Function to get the name of the bootsrap run, which will be used as the subfolder name hoding the output from the bootstrap:
# getBootstrapRunName <- function(ind, NumberOfBootstraps, prefix = "Run") {
# paste0(
# prefix,
# formatC(ind, width = NumberOfBootstraps, format = "d", flag = "0")
# )
# }
#getReplaceData <- function(x, size) {
# # Get seeds:
# seeds <- RstoxBase::getSeedVector(x$Seed, size)
# x_expanded <- data.table::data.table(
# x[, !"Seed"],
# Seed = seeds
# )
#
# # Split rows into a list:
# x_split <- split(x_expanded, by = "Seed")
#
# x_split <- lapply(x_split, function(x) list(x$ResampleFunction, as.list(x[, !c("ResampleFunction")])))
#
# return(x_split)
#}
# This function resamples varToResample with replacement by altering the input data. Used in ResampleMeanLengthDistributionData(), ResampleBioticAssignment() and ResampleMeanNASCData():
resampleDataBy_new4.0.0_excludingPSUsThatAreNotResampled <- function(data, seed, varToScale, varToResample, resampleBy) {
#resampleDataBy <- function(data, seed, varToScale, varToResample, resampleBy) {
# Get the unique resampleBy, and sort in C-locale for consistensy across platforms:
#uniqueResampleBy <- unique(data[[resampleBy]])
#uniqueResampleBy <- stringi::stri_sort(unique(data[[resampleBy]]), locale = "C")
# Get the unique resampleBy, and sort in en_US_POSIX for consistency across platforms:
uniqueResampleBy <- stringi::stri_sort(unique(data[[resampleBy]]), locale = "en_US_POSIX", na_last = FALSE)
# Build a table of (usually) Stratum and Seed and merge with the MeanLengthDistributionData:
seedTable <- data.table::data.table(
resampleBy = uniqueResampleBy,
seed = RstoxBase::getSeedVector(seed, size = length(uniqueResampleBy))
)
data.table::setnames(seedTable, c(resampleBy, "seed"))
data <- merge(data, seedTable, by = resampleBy, all = TRUE)
# Resample the data:
data[, BootstrapSampleFactor := getResampledCount(.SD, seed = seed[1], varToResample = varToResample), by = resampleBy]
# Scale the varToScale
for(var in varToScale) {
data[, eval(var) := get(var) * BootstrapSampleFactor]
}
# Subset out those that have BootstrapSampleFactor = 0:
data <- subset(data, BootstrapSampleFactor > 0)
data[, seed := NULL]
return(data)
}
# Function to resample Hauls of one subset of the data:
getResampledCount <- function(subData, seed, varToResample) {
# Resample the unique:
if(length(varToResample) != 1) {
stop("StoX: varToResample must be a single string naming the variable to resample")
}
resampled <- RstoxBase::sampleSorted(unique(subData[[varToResample]]), seed = seed, replace = TRUE)
# Tabulate the resamples:
resampleTable <- data.table::as.data.table(table(resampled, useNA = "ifany"))
# Set an unmistakable name to the counts:
data.table::setnames(resampleTable, c("resampled","N"), c(varToResample, "BootstrapSampleFactor"))
# Merge the resampled counts into the data:
# The sort = FALSE is very important, as it retains the order of the data. This should probably be replaced by a more robust solution, e.g. by merging in resampleDataBy():
count <- merge(subData, resampleTable, by = varToResample, all.x = TRUE, sort = FALSE)
# Insert the new count into varToScale (with NAs replaced by 0):
count[, BootstrapSampleFactor := ifelse(
is.na(count$BootstrapSampleFactor),
0,
count$BootstrapSampleFactor
)]
return(count$BootstrapSampleFactor)
}
# This function resamples varToResample with replacement by altering the input data. Used in ResampleMeanLengthDistributionData(), ResampleBioticAssignment() and ResampleMeanNASCData():
resampleData_UsingScaling <- function(data, seed, varToResample, resampleBy) {
# Get the unique resampleBy, and sort in C-locale for consistensy across platforms:
#uniqueResampleBy <- unique(data[[resampleBy]])
#uniqueResampleBy <- stringi::stri_sort(unique(data[[resampleBy]]), locale = "C")
# Get the unique resampleBy, and sort in en_US_POSIX for consistency across platforms:
uniqueResampleBy <- stringi::stri_sort(unique(data[[resampleBy]]), locale = "en_US_POSIX")
# Build a table of resampleBy and Seed and merge with the data:
seedTable <- data.table::data.table(
resampleBy = uniqueResampleBy,
seed = RstoxBase::getSeedVector(seed, size = length(uniqueResampleBy))
)
data.table::setnames(seedTable, c(resampleBy, "seed"))
data <- merge(data, seedTable, by = resampleBy)
# Resample the data:
# Changed this to not scale but simply accumulate the resamplingFactor:
#for(var in varToScale) {
#data[, eval(var) := resampleOne_UsingScaling(.SD, seed = seed[1], varToScale = var, varToResample = varToResample), by = resampleBy]
#}
if(!"resamplingFactor" %in% names(data)) {
data[, resamplingFactor := 1]
}
data[, resamplingFactor := resamplingFactor * resampleOne_UsingScaling(.SD, seed = seed[1], varToResample = varToResample), by = resampleBy]
# Delete the seed:
data[, seed := NULL]
#return(MeanLengthDistributionData)
}
# Function to resample Hauls of one subset of the data:
resampleOne_UsingScaling <- function(subData, seed, varToResample) {
if(nrow(subData) == 1) {
return(1)
}
# Resample the unique:
if(length(varToResample) != 1) {
stop("StoX: varToResample must be a single string naming the variable to resample")
}
resampled <- RstoxBase::sampleSorted(unique(subData[[varToResample]]), seed = seed, replace = TRUE)
# Tabulate the resamples:
resampleTable <- data.table::as.data.table(table(resampled, useNA = "ifany"))
# Set an unmistakable name to the counts:
data.table::setnames(resampleTable, c("resampled","N"), c(varToResample, "resampledCountWithUniqueName"))
# Merge the resampled counts into the data:
# The sort = FALSE is very important, as it retains the order of the data. This should probably be replaced by a more robust solution, e.g. by merging in resampleDataBy():
count <- merge(subData, resampleTable, by = varToResample, all.x = TRUE, sort = FALSE)
# Insert the new count into varToScale (with NAs replaced by 0):
RstoxBase::replaceNAByReference(count, cols = "resampledCountWithUniqueName")
# as of StoX 4.0.1-9003 this was changed from scaling the varToScale here to rather scaling at the end of the parent function, allowing for resampling in multiple stages (heirarchical resampling):
# count[, eval(varToScale) := resampledCountWithUniqueName * get(varToScale)]
return(count$resampledCountWithUniqueName)
}
resampleData <- function(data, seed, varToResample, resampleBy, makeUniqueVars = FALSE) {
if(length(varToResample) > 1 || length(resampleBy) > 1) {
if(length(varToResample) == length(resampleBy)) {
if(isTRUE(makeUniqueVars)) {
# Create replacement variables which will be unquified by appending integers in resampleOneGroup():
varToResample_temp <- paste0("UniqueResampleVariable_", varToResample)
resampleBy_temp <- paste0("UniqueResampleVariable_", resampleBy)
for(ind in seq_along(varToResample_temp)) {
data[, c(varToResample_temp[ind]) := get(varToResample[ind])]
data[, c(resampleBy_temp[ind]) := get(resampleBy[ind])]
}
# Resample through the varToResample_temp:
nsteps <- length(varToResample_temp)
for(ind in seq_len(nsteps)) {
data <- resampleDataOne(
data = data,
seed = seed,
varToResample = varToResample_temp[ind],
# Append integers to all remaining resampleBy_temp:
nextResampleBy = if(ind < nsteps) resampleBy_temp[seq(ind + 1, eval(nsteps))] else NULL,
resampleBy = resampleBy_temp[ind]
)
}
# Delete the replacement variables:
data[, c(varToResample_temp) := NULL]
data[, c(resampleBy_temp) := NULL]
}
else {
#browser()
for(ind in seq_along(varToResample)) {
data <- resampleDataOne(
data = data,
seed = seed,
varToResample = varToResample[ind],
resampleBy = resampleBy[ind]
)
}
}
}
else {
stop("If varToResample or resampleBy has length > 1, they both need to be of the same length.")
}
}
else {
data <- resampleDataOne(
data = data,
seed = seed,
varToResample = varToResample,
resampleBy = resampleBy
)
}
return(data)
}
# This function resamples varToResample with replacement by altering the input data. Used in ResampleMeanLengthDistributionData(), ResampleBioticAssignment() and ResampleMeanNASCData():
resampleDataOne <- function(data, seed, varToResample, resampleBy, nextResampleBy = NULL) {
# Get the unique resampleBy, and sort in en_US_POSIX for consistency across platforms:
uniqueResampleBy <- stringi::stri_sort(unique(data[[resampleBy]]), locale = "en_US_POSIX")
# Build a table of resampleBy and Seed and merge with the data:
seedTable <- data.table::data.table(
resampleBy = uniqueResampleBy,
seed = RstoxBase::getSeedVector(seed, size = length(uniqueResampleBy))
)
data.table::setnames(seedTable, c(resampleBy, "seed"))
data <- merge(data, seedTable, by = resampleBy)
# Resample the data:
data <- data[, resampleOneGroup(
.SD,
seed = seed[1],
varToResample = varToResample,
nextResampleBy = nextResampleBy
), by = resampleBy]
data[, seed := NULL]
#return(MeanLengthDistributionData)
}
# Function to resample Hauls of one subset of the data:
resampleOneGroup <- function(subData, seed, varToResample, nextResampleBy = NULL) {
if(nrow(subData) <= 1) {
return(subData)
}
# Resample the unique:
if(length(varToResample) != 1) {
stop("StoX: varToResample must be a single string naming the variable to resample")
}
uniqueVarToResample <- unique(subData[[varToResample]])
if(length(uniqueVarToResample) <= 1) {
return(subData)
}
# Don't know how to set the name of a column using a variable, so we set the names after:
# Consider NA as a category here:
resampled <- as.data.table(
table(
RstoxBase::sampleSorted(
uniqueVarToResample,
seed = seed,
replace = TRUE
),
useNA = "ifany"
))
data.table::setnames(resampled, c(varToResample, "temporaryScaleFromResampling"))
# Merge in the resample factors:
originalColumnOrder <- names(subData)
subData <- merge(subData, resampled, by = varToResample)
temporaryScaleFromResampling <- subData$temporaryScaleFromResampling
repeatInd <- rep(seq_len(nrow(subData)), temporaryScaleFromResampling)
# ... and repeat the rows according to the factors:
subData <- subData[repeatInd, ]
# Add a suffix to the varToResample of the next step:
if(length(nextResampleBy)) {
for(this in nextResampleBy) {
subData[, eval(this) := paste(get(this), sep = "_", sequence(..temporaryScaleFromResampling))]
}
}
# Delete the temporary column:
subData[, temporaryScaleFromResampling := NULL]
# Restore column order
data.table::setcolorder(subData, originalColumnOrder)
return(subData)
}
#' Resamples biotic PSUs within Stratum in MeanLengthDistributionData
#'
#' This function resamples, with replacement, the BioticPSUs within Stratum in the \code{\link{MeanLengthDistributionData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The column WeightedNumber is scaled by the number of occurrences of each BioticPSUs from the resampling.
#'
#' @inheritParams RstoxBase::ModelData
#' @inheritParams general_arguments
#'
#' @export
#'
ResampleMeanLengthDistributionData <- function(MeanLengthDistributionData, Seed) {
# 2024-04: This function will be renamed to ResampleBioticPSUsInStratum
# Warn if there are strata with only one PSU, which may result in loss of variance:
onlyOneToResample_Warning(
MeanLengthDistributionData$Data,
toResample = "PSU",
within = "Stratum",
functionName = "ResampleMeanLengthDistributionData",
toResamplePrefix = "Biotic"
)
# Resample PSUs within Strata, modifying the weighting variable of MeanLengthDistributionData:
MeanLengthDistributionData$Data <- resampleData_UsingScaling(
data = MeanLengthDistributionData$Data,
seed = Seed,
# If any other values than varToResample = "PSU" and resampleBy = "Stratum" are used in the future, warnings for only one and not all varToResample in resampleBy should be added in RstoxBase!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!:
varToResample = "PSU",
resampleBy = "Stratum"
)
# Scale the WeightedNumber:
applyResamplingFactor(
MeanLengthDistributionData$Data,
varToScale = "WeightedNumber"
)
return(MeanLengthDistributionData)
}
applyResamplingFactor <- function(data, varToScale) {
# This function assumes that resampleData_UsingScaling() has been run to produce the resamplingFactor:
for(var in varToScale) {
data[, eval(var) := get(var) * resamplingFactor]
}
}
#' Resamples biotic PSUs within Stratum in MeanSpeciesCategoryCatchData
#'
#' This function resamples, with replacement, the BioticPSUs within Stratum in the \code{\link{MeanSpeciesCategoryCatchData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The columns TotalCatchWeight and TotalCatchNumber are scaled by the number of occurrences of each BioticPSUs from the resampling.
#'
#' @inheritParams RstoxBase::ModelData
#' @inheritParams general_arguments
#'
#' @export
#'
ResampleMeanSpeciesCategoryCatchData <- function(MeanSpeciesCategoryCatchData, Seed) {
# This function will be renamed to ResampleBioticPSUsInStratum
# Warn if there are strata with only one PSU, which may result in loss of variance:
onlyOneToResample_Warning(
MeanSpeciesCategoryCatchData$Data,
toResample = "PSU",
within = "Stratum",
functionName = "ResampleMeanSpeciesCategoryCatchData",
toResamplePrefix = "Biotic"
)
# Resample PSUs within Strata, modifying the weighting variable of MeanSpeciesCategoryCatchData:
MeanSpeciesCategoryCatchData$Data <- resampleData_UsingScaling(
data = MeanSpeciesCategoryCatchData$Data,
seed = Seed,
# If any other values than varToResample = "PSU" and resampleBy = "Stratum" are used in the future, warnings for only one and not all varToResample in resampleBy should be added in RstoxBase!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!:
varToResample = "PSU",
resampleBy = "Stratum"
)
# Scale the TotalCatchWeight and TotalCatchNumber:
applyResamplingFactor(
MeanSpeciesCategoryCatchData$Data,
varToScale = c("TotalCatchWeight", "TotalCatchNumber")
)
return(MeanSpeciesCategoryCatchData)
}
#' Resamples biotic PSUs within Stratum in MeanSpeciesCategoryCatchData
#'
#' This function resamples, with replacement, the BioticPSUs within Stratum, then the Individuals within Sample, and then the PreySpeciesCategory within Individual in the \code{\link{MeanSpeciesCategoryCatchData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The columns TotalPreyCatchWeight and TotalPreyCatchNumber are scaled by the number of occurrences of each BioticPSUs from the resampling.
#'
#' Note that this resampling function resamples also the individuals of samples that belong to BioticPSUs that are scaled by 0, and similarly resamples also PreySpeciesCategory in Individuals that belong to Individuals that are scaled by 0. Consequently, this is in essence a reversed hierarchical resampling of PreySpeciesCategory within Individual, then Individuals within Sample, then BioticPSUs within Stratum.
#'
#' @inheritParams RstoxBase::ModelData
#' @inheritParams general_arguments
#'
ResamplePreySpeciesCategoryCatchDataHierarchicalUsingScaling <- function(PreySpeciesCategoryCatchData, Seed) {
warning("It should be decided which of the ResamplingFunctions ResamplePreySpeciesCategoryCatchDataHierarchicalUsingScaling, ResamplePreySpeciesCategoryCatchDataHierarchical or ResamplePreySpeciesCategoryCatchDataHierarchicalNotUsingmakeUniqueVars should be included in StoX")
# Export this function when prey is official
# Warn if there are strata with only one PSU, which may result in loss of variance:
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "PSU",
within = "Stratum",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = "Biotic"
)
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "Individual",
within = "Sample",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = ""
)
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "PreySpeciesCategory",
within = "Individual",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = ""
)
# Resample PSUs within Strata, modifying the weighting variable of MeanSpeciesCategoryCatchData:
PreySpeciesCategoryCatchData <- resampleData_UsingScaling(
data = PreySpeciesCategoryCatchData,
seed = Seed,
varToResample = "PSU",
resampleBy = "Stratum"
)
PreySpeciesCategoryCatchData <- resampleData_UsingScaling(
data = PreySpeciesCategoryCatchData,
seed = Seed,
varToResample = "Individual",
resampleBy = "Sample"
)
PreySpeciesCategoryCatchData <- resampleData_UsingScaling(
data = PreySpeciesCategoryCatchData,
seed = Seed,
varToResample = "PreySpeciesCategory",
resampleBy = "Individual"
)
# Scale the TotalPreyCatchWeight and TotalPreyCatchNumber:
applyResamplingFactor(
PreySpeciesCategoryCatchData,
varToScale = c("TotalPreyCatchWeight", "TotalPreyCatchNumber")
)
return(PreySpeciesCategoryCatchData)
}
#' Resamples biotic PSUs within Stratum in MeanSpeciesCategoryCatchData
#'
#' This function resamples, with replacement, the BioticPSUs within Stratum, then the Individuals within Sample, and then the PreySpeciesCategory within Individual in the \code{\link{MeanSpeciesCategoryCatchData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The function performs actual resampling, as opposed to scaling a data variable as is done for most other resampling functions in StoX.
#'
#' @inheritParams RstoxBase::ModelData
#' @inheritParams general_arguments
#'
ResamplePreySpeciesCategoryCatchDataHierarchical <- function(PreySpeciesCategoryCatchData, Seed) {
warning("It should be decided which of the ResamplingFunctions ResamplePreySpeciesCategoryCatchDataHierarchicalUsingScaling, ResamplePreySpeciesCategoryCatchDataHierarchical or ResamplePreySpeciesCategoryCatchDataHierarchicalNotUsingmakeUniqueVars should be included in StoX")
# Export this function when prey is official
# Warn if there are strata with only one PSU, which may result in loss of variance:
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "PSU",
within = "Stratum",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = "Biotic"
)
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "Individual",
within = "Sample",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = ""
)
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "PreySpeciesCategory",
within = "Individual",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = ""
)
# Resample PSUs within Strata, modifying the weighting variable of MeanSpeciesCategoryCatchData:
PreySpeciesCategoryCatchData <- resampleData(
data = PreySpeciesCategoryCatchData,
seed = Seed,
varToResample = c("PSU", "Individual", "PreySpeciesCategory"),
resampleBy = c("Stratum", "Sample", "Individual"),
makeUniqueVars = TRUE
)
#PreySpeciesCategoryCatchData <- resampleData(
# data = PreySpeciesCategoryCatchData,
# seed = Seed,
# varToResample = "Individual",
# resampleBy = "Sample"
#)
#PreySpeciesCategoryCatchData <- resampleData(
# data = PreySpeciesCategoryCatchData,
# seed = Seed,
# varToResample = "PreySpeciesCategory",
# resampleBy = "Individual"
#)
return(PreySpeciesCategoryCatchData)
}
#' Resamples biotic PSUs within Stratum in MeanSpeciesCategoryCatchData
#'
#' This function resamples, with replacement, the BioticPSUs within Stratum, then the Individuals within Sample, and then the PreySpeciesCategory within Individual in the \code{\link{MeanSpeciesCategoryCatchData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The function performs actual resampling, as opposed to scaling a data variable as is done for most other resampling functions in StoX.
#'
#' Note that this function resamples only unique Individual in the resampling of Individual within Sample, ignoring that these individuals may have been duplicated in the resampling of BioticPSU within Stratum. The effect is that Individuals are resampled equally for all equal Samples, whereas the function \code{\link{ResamplePreySpeciesCategoryCatchDataHierarchical}} resamples differently for each Sample. The latter may suppress extreme outcomes, which should be considered when choosing the final resampling function to use for Prey.
#'
#' @inheritParams RstoxBase::ModelData
#' @inheritParams general_arguments
#'
ResamplePreySpeciesCategoryCatchDataHierarchicalNotUsingmakeUniqueVars <- function(PreySpeciesCategoryCatchData, Seed) {
# Export this function when prey is official
warning("It should be decided which of the ResamplingFunctions ResamplePreySpeciesCategoryCatchDataHierarchicalUsingScaling, ResamplePreySpeciesCategoryCatchDataHierarchical or ResamplePreySpeciesCategoryCatchDataHierarchicalNotUsingmakeUniqueVars should be included in StoX")
# Warn if there are strata with only one PSU, which may result in loss of variance:
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "PSU",
within = "Stratum",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = "Biotic"
)
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "Individual",
within = "Sample",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = ""
)
onlyOneToResample_Warning(
PreySpeciesCategoryCatchData$Data,
toResample = "PreySpeciesCategory",
within = "Individual",
functionName = "ResamplePreySpeciesCategoryCatchData",
toResamplePrefix = ""
)
# Resample PSUs within Strata, modifying the weighting variable of MeanSpeciesCategoryCatchData:
PreySpeciesCategoryCatchData <- resampleData(
data = PreySpeciesCategoryCatchData,
seed = Seed,
varToResample = c("PSU", "Individual", "PreySpeciesCategory"),
resampleBy = c("Stratum", "Sample", "Individual"),
makeUniqueVars = FALSE
)
#PreySpeciesCategoryCatchData <- resampleData(
# data = PreySpeciesCategoryCatchData,
# seed = Seed,
# varToResample = "Individual",
# resampleBy = "Sample"
#)
#PreySpeciesCategoryCatchData <- resampleData(
# data = PreySpeciesCategoryCatchData,
# seed = Seed,
# varToResample = "PreySpeciesCategory",
# resampleBy = "Individual"
#)
return(PreySpeciesCategoryCatchData)
}
#' Resamples biotic PSUs
#'
#' This function resamples biotic PSUs with replacement within each Stratum, changing the MeanLengthDistributionWeight.
#'
#' @inherit RstoxBase::ProcessData
#' @inheritParams general_arguments
#'
#' @export
#'
ResampleBioticAssignmentByStratum <- function(BioticAssignment, Seed) {
# This function will be renamed to ResampleAssignedHaulsInStratum
# Warn if there are strata with only one Haul, which may result in loss of variance:
onlyOneToResample_Warning(
BioticAssignment,
toResample = "Haul",
within = "Stratum",
functionName = "ResampleBioticAssignmentByStratum",
toResamplePrefix = "assigned "
)
# Resample Hauls within Strata, modifying the weighting variable of BioticAssignment:
BioticAssignment <- resampleData_UsingScaling(
#data = BioticAssignment$BioticAssignment,
data = BioticAssignment,
seed = Seed,
# If any other values than varToResample = "Haul" and resampleBy = "Stratum" are used in the future, warnings for only one and not all varToResample in resampleBy should be added in RstoxBase!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!:
varToResample = "Haul",
resampleBy = "Stratum"
)
# Scale the WeightingFactor:
applyResamplingFactor(
BioticAssignment,
varToScale = "WeightingFactor"
)
return(BioticAssignment)
}
#' Resamples biotic PSUs
#'
#' This function resamples biotic PSUs with replacement within each Stratum, changing the MeanLengthDistributionWeight.
#'
#' @inherit RstoxBase::ProcessData
#' @inheritParams general_arguments
#'
#' @export
#'
ResampleBioticAssignmentByAcousticPSU <- function(BioticAssignment, Seed) {
# This function will be renamed to ResampleAssignedHaulsInStratum
# Warn if there are PSUs with only one Haul, which may result in loss of variance:
onlyOneToResample_Warning(
BioticAssignment,
toResample = "Haul",
within = "PSU",
functionName = "ResampleBioticAssignmentByAcousticPSU",
toResamplePrefix = "assigned "
)
# Resample Hauls within Strata, modifying the weighting variable of BioticAssignment:
BioticAssignment <- resampleData_UsingScaling(
#data = BioticAssignment$BioticAssignment,
data = BioticAssignment,
seed = Seed,
# If any other values than varToResample = "Haul" and resampleBy = "Stratum" are used in the future, warnings for only one and not all varToResample in resampleBy should be added in RstoxBase!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!:
varToResample = "Haul",
resampleBy = "PSU"
)
# Scale the WeightingFactor:
applyResamplingFactor(
BioticAssignment,
varToScale = "WeightingFactor"
)
return(BioticAssignment)
}
#' Resamples acoustic PSUs
#'
#' This function resamples, with replacement, the AcousticPSUs within Stratum in the \code{\link{MeanNASCData}}, where Stratum is the stratum associated to the PSU, and not necessarily the actual stratum polygon. The column NASC is scaled by the number of occurrences of each AcousticPSUs from the resampling.
#'
#' @inheritParams RstoxBase::ModelData
#' @inheritParams general_arguments
#'
#' @export
#'
ResampleMeanNASCData <- function(MeanNASCData, Seed) {
# This function will be renamed to ResampleAcousticPSUsInStratum
# Warn if there are strata with only one PSU, which may result in loss of variance:
onlyOneToResample_Warning(
MeanNASCData$Data,
toResample = "PSU",
within = "Stratum",
functionName = "ResampleMeanNASCData",
toResamplePrefix = "Acoustic"
)
# Resample PSUs within Strata, modifying the weighting variable of MeanLengthDistributionData:
MeanNASCData$Data <- resampleData_UsingScaling(
data = MeanNASCData$Data,
seed = Seed,
varToResample = "PSU",
resampleBy = "Stratum"
)
# Scale the NASC:
applyResamplingFactor(
MeanNASCData$Data,
varToScale = "NASC"
)
return(MeanNASCData)
}
#ResampleIndividualAge <- function(SuperIndividualsData, AgeErrorMatrix, Seed) {
#
#
#
# return(SuperIndividualsData)
#}
##################################################
##################################################
#' Report Bootstrap
#'
#' Reports the sum, mean or other statistics on a variable of the \code{\link{BootstrapData}}.
#'
#' @inheritParams ModelData
#' @inheritParams RstoxBase::general_report_arguments
#' @param BaselineProcess A strings naming the baseline process to report from the \code{\link{BootstrapData}}. If a process with
#' @param BootstrapReportFunction The function to apply across bootstrap run, such as "cv" or "c".
#' @param Percentages The percentages to report Percentiles for when BootstrapReportFunction = "summaryStox".
#' @param AggregationFunction Deprecated, use ReportFunction instead. An alias for \code{ReportFunction}, kept for backward compatibility.
#'
#' @details This function works in two steps. First, the \code{ReportFunction} is applied to the \code{TargetVariable} of the table given by \code{BaselineProcess} for each unique combination of the \code{GroupingVariables} and for each bootstrap run. Second, a grid of all possible combinations of the \code{GroupingVariables} is formed and the result from the first step placed onto the grid. This creates 0 for each position in the grid where data from the first step are not present. E.g., if a particularly large fish is found in only one haul, and this haul by random is not selected in a bootstrap run, the \code{TargetVariable} will be 0 to reflect the variability in the data. To complete the second step, the \code{BootstrapReportFunction} is applied over the bootstrap runs for each cell in the grid.
#'
#' The parameter \code{RemoveMissingValues} should be used with extreme caution. The effect of setting \code{RemoveMissingValues} to TRUE is that missing values (NAs) are removed in both the first and second step. This can be dangerous both in the first and in the second step. E.g., if the Abundance of \code{\link[RstoxBase]{SuperIndividualsData}} is positive for super-individuals with missing IndividualWeight, then the Biomass of those super-individuals will be missing as well. If one the wants to sum the Biomass by using \code{ReportFunction} = "sum" one will get NA if \code{RemoveMissingValues} = FALSE. If \code{RemoveMissingValues} = TRUE one will ignore the missing Biomass, and the summed Biomass will only include the super-individuals that have non-missing IndividualWeight, effectively discarding a portion of the observed abundance. The summed Biomass will in this case be underestimated!
#'
#' In the second step, setting \code{RemoveMissingValues} to TRUE can be even more dangerous, as the only option currently available for the \code{BootstrapReportFunction} is the function RstoxBase::summaryStox(), which includes average and standard deivation which are highly influenced by removing missing data.
#'
#' Instead of setting \code{RemoveMissingValues} to TRUE, it is advised to apply the function \code{\link[RstoxBase]{ImputeSuperIndividuals}} to fill in e.g. IndividualWeight where missing. Missing values in the output of this function can also be avoided by adding variables to \code{GroupingVariables}, such as adding "Stratum" e.g. if there are strata that are known from Baseline to contain no fish. These strata will then be present but with missing values, but these missing values will not affect other strata if "Stratum" is included in \code{GroupingVariables}. It is also recommended to include "Survey" and "SpeciesCategory" in the \code{GroupingVariables}, as these are key variables for which summary statistics should rarely be computed across.
#'
#' @return
#' A \code{\link{ReportBootstrapData}} object.
#'
#' @export
#'
ReportBootstrap <- function(
BootstrapData,
BaselineProcess = character(),
TargetVariable = character(),
TargetVariableUnit = character(),
# The aggregation on Baseline:
ReportFunction = RstoxBase::getReportFunctions(use = "Baseline"),
GroupingVariables = character(),
InformationVariables = character(),
WeightingVariable = character(),
ConditionOperator = character(),
ConditionValue = character(),
FractionOverVariable = character(),
# The aggreggation over bootstraps:
BootstrapReportFunction = RstoxBase::getReportFunctions(use = "Bootstrap"),
Percentages = double(),
# General arguments:
Filter = character(),
RemoveMissingValues = FALSE,
AggregationFunction = character()
)
{
# AggregationFunction is kept for backward compatibility, but deprecated:
if(length(AggregationFunction)) {
warning("AggregationFunction is deprecated. Use ReportFunction instead. AggregationFunction is kept for backward compatibility and overrides ReportFunction.")
ReportFunction <- RstoxData::match_arg_informative(AggregationFunction, RstoxBase::getReportFunctions(use = "Baseline"))
}
else {
ReportFunction <- RstoxData::match_arg_informative(ReportFunction)
}
BootstrapReportFunction <- RstoxData::match_arg_informative(BootstrapReportFunction)
if(!length(BootstrapData)) {
return(NULL)
}
# Issue a warning if RemoveMissingValues = TRUE:
if(isTRUE(RemoveMissingValues)) {
warning(RstoxBase::getRstoxBaseDefinitions("RemoveMissingValuesWarning")(TargetVariable))
}
# Open the file:
if(length(unlist(BootstrapData))) {
nc <- ncdf4::nc_open(unlist(BootstrapData))
on.exit(ncdf4::nc_close(nc))
}
else {
warning("StoX: Bootstrap output NetCDF4 file missing.")
return(NULL)
}
# Read the baseline process names:
processNameAndTableName <- getProcessNameAndTableName(BaselineProcess, nc, na.rm = TRUE)
processNameSlashTableName <- paste(processNameAndTableName, collapse = "/")
# Read the relevant variables from the BootstrapData file:
toKeep <- c(TargetVariable, GroupingVariables, InformationVariables, WeightingVariable)
selection <- paste(processNameSlashTableName, toKeep, sep = "/")
# Get the data type:
dataType <- getDataTypeFromBootstrap(nc, BaselineProcess)
# Set the NetCDF4ChunkSize:
NumberOfBootstraps <- getNumberOfBootstraps(nc, selection = selection)
# Check the size of one bootstrap:
test1 <- readBootstrapData(nc = nc, selection = selection, BootstrapID = 1, close = FALSE)
size <- utils::object.size(test1)
maximumSize <- 1e10 # 10 GB
NetCDF4ChunkSize <- pretty(maximumSize / size)[1]
# Get the number of bootstraps:
bootstrapIDs <- seq_len(NumberOfBootstraps)
bootstrapIDs <- split(bootstrapIDs, ceiling(bootstrapIDs / NetCDF4ChunkSize))
# Run the initial aggregation on each bootstrap replicated, reading from the file at each step instead of reading all steps and then aggregating:
output <- lapply(bootstrapIDs, initialAggregateBootstrapDataOne,
nc = nc,
selection = selection,
ReportFunction = ReportFunction,
TargetVariable = TargetVariable,
TargetVariableUnit = TargetVariableUnit,
GroupingVariables = c(GroupingVariables, "BootstrapID"),
InformationVariables = InformationVariables,
RemoveMissingValues = RemoveMissingValues,
Specification = list(
WeightingVariable = WeightingVariable,
ConditionOperator = ConditionOperator,
ConditionValue = ConditionValue,
FractionOverVariable = FractionOverVariable
),
dataType = dataType
)
unit <- output[[1]]$unit
output <- lapply(output, "[[", "output")
# Rbind the output:
output <- data.table::rbindlist(output)
# Get the name of the new TargetVariable:
TargetVariableAfterInitialAggregation <- RstoxBase::getReportFunctionVariableName(
functionName = ReportFunction,
TargetVariable = TargetVariable
)
# Store the unique combinations of the GroupingVariables from the output:
if(length(GroupingVariables)) {
uniqueGroupingVariablesToKeep <- unique(subset(output, select = GroupingVariables))
setorder(uniqueGroupingVariablesToKeep)
}
else {
uniqueGroupingVariablesToKeep <- NULL
}
# Run the report function of the bootstraps:
output <- RstoxBase::aggregateBaselineDataOneTable(
stoxData = output,
TargetVariable = TargetVariableAfterInitialAggregation,
ReportFunction = BootstrapReportFunction,
GroupingVariables = GroupingVariables,
InformationVariables = InformationVariables,
na.rm = RemoveMissingValues,
padWithZerosOn = "BootstrapID",
Specification = list(
Percentages = Percentages
),
uniqueGroupingVariablesToKeep = uniqueGroupingVariablesToKeep
)
# Add the unit to the report:
if(length(TargetVariableUnit)) {
output <- cbind(output, Unit = TargetVariableUnit)
}
output <- RstoxBase::filterTable(output, filter = Filter)
# Add the grouping and information variables as attributes, for use e.g. when plotting:
attr(output, "GroupingVariables") <- GroupingVariables
attr(output, "InformationVariables") <- InformationVariables
return(output)
}
getProcessNameAndTableName <- function(BaselineProcess, nc, na.rm = FALSE) {
# Read the baseline process names:
processNamesAndTableNames <- getProcessNamesAndTableNames(nc)
processNames <- unique(processNamesAndTableNames$processName)
if(! BaselineProcess %in% processNames) {
# If the BootstrapData is empty, stop:
if(!length(processNames)) {
warning("Empty BootstrapData.")
return(NULL)
}
else {
stop("The BaselineProcess ", BaselineProcess, " is not one of the outputs of the Bootstrap. Possible values are ", paste(processNames, collapse = ", "), ".")
}
}
else {
# Expand the BaselineProcess with the table name, where for multi table outputs with Data and Resolution the Data are used:
processNamesAndTableNames <- subset(processNamesAndTableNames, processName %in% BaselineProcess)
if(nrow(processNamesAndTableNames) == 1) {
processNamesAndTableNames <- unlist(processNamesAndTableNames)
return(if(na.rm) stats::na.omit(processNamesAndTableNames) else processNamesAndTableNames)
}
else if("Data" %in% processNamesAndTableNames$tableName) {
return(c(processNamesAndTableNames$processName[1], "Data"))
}
else {
stop("StoX: For multi-table process requested by the BaselineProcess, the table \"Data\" must be present.")
}
}
}
initialAggregateBootstrapDataOne <- function(
bootstrapID,
nc,
selection,
ReportFunction,
TargetVariable,
TargetVariableUnit,
GroupingVariables,
InformationVariables,
RemoveMissingValues,
Specification,
dataType
) {
# Get the table of current bootstrapID:
relevantBootstrapDataOne <- readBootstrapData(nc = nc, selection = selection, BootstrapID = bootstrapID, unlistSingleTable = TRUE, close = FALSE)[[1]]
# Set the unit of the target variable:
relevantBootstrapDataOne[[TargetVariable]] <- RstoxBase::setUnitRstoxBase(
relevantBootstrapDataOne[[TargetVariable]],
dataType = dataType,
variableName = TargetVariable,
unit = TargetVariableUnit
)
# Store the unit for output:
unit <- RstoxData::getUnit(relevantBootstrapDataOne[[TargetVariable]], property = "shortname")
# Run the initial aggregation (only applicable for functions with output of length 1):
output <- RstoxBase::aggregateBaselineDataOneTable(
stoxData = relevantBootstrapDataOne,
TargetVariable = TargetVariable,
ReportFunction = ReportFunction,
GroupingVariables = GroupingVariables,
InformationVariables = InformationVariables,
na.rm = RemoveMissingValues,
Specification = Specification
)
output <- list(
output = output,
unit = unit
)
return(output)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.