#' Resample data
#'
#' Resample data from a data set using common resampling methods.
#' For bootstrap or jackknife resampling, package users usually do not need to
#' call this function directly but can use [resamplecSEMResults()] instead.
#'
#' The function `resampleData()` is general purpose. It simply resamples data
#' from a data set according to the resampling method provided
#' via the `.resample_method` argument and returns a list of resamples.
#' Currently, `bootstrap`, `jackknife`, `permutation`, and `cross-validation`
#' (both leave-one-out (LOOCV) and k-fold cross-validation) are implemented.
#'
#' The user may provide the data set to resample either explicitly via the `.data`
#' argument or implicitly by providing a [cSEMResults] object to `.object`,
#' in which case the original data used in the call that created the
#' [cSEMResults] object is used for resampling.
#' If both a [cSEMResults] object and a data set via `.data` are provided,
#' the former is ignored.
#'
#' As [csem()] accepts a single data set, a list of data sets, as well as data sets
#' that contain a column used to split the data into groups,
#' the [cSEMResults] object may contain multiple data sets.
#' In this case, resampling is done by data set or group. Note that depending
#' on the number of data sets/groups provided this computation may be slower
#' as resampling will be repeated for each data set/group.
#'
#' To split data provided via the `.data` argument into groups, the column name or
#' the column index of the column containing the group levels to split the data
#' must be given to `.id`. If data that contains grouping is taken from
#' a [cSEMResults] object, `.id` is taken from the object information. Hence,
#' providing `.id` is redundant in this case and therefore ignored.
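#'
#' For example, a grouped data set provided via `.data` can be resampled by
#' group as follows (a minimal sketch using simulated data; the column name
#' `"group"` is illustrative):
#'
#' ```r
#' dat <- data.frame(
#'   x1    = rnorm(100),
#'   x2    = rnorm(100),
#'   group = rep(c("A", "B"), each = 50)
#' )
#' # 499 bootstrap resamples per group; the result is a list of length 2
#' boot <- resampleData(.data = dat, .id = "group",
#'                      .resample_method = "bootstrap")
#' ```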
#'
#' The number of bootstrap or permutation runs as well as the number of
#' cross-validation repetitions is given by `.R`. The default is
#' `499` but should be increased in real applications. See e.g.,
#' \insertCite{Hesterberg2015;textual}{cSEM}, p.380 for recommendations concerning
#' the bootstrap. For jackknife, `.R` is ignored as jackknife is based on the `N` leave-one-out data sets.
#'
#' Choosing `.resample_method = "permutation"` for ungrouped data causes an error
#' as permutation would simply reorder the observations, which is usually not
#' meaningful. If a list of data sets is provided,
#' each list element is assumed to represent the observations belonging to one
#' group. In this case, data is pooled and group adherence permuted.
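#'
#' A sketch (assuming the data come as a list of two groups; the names and
#' dimensions are illustrative):
#'
#' ```r
#' dat_list <- list(
#'   "Group_A" = data.frame(x1 = rnorm(50), x2 = rnorm(50)),
#'   "Group_B" = data.frame(x1 = rnorm(60), x2 = rnorm(60))
#' )
#' # Data is pooled and group membership permuted; a list of length .R
#' perm <- resampleData(.data = dat_list,
#'                      .resample_method = "permutation", .R = 499)
#' ```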
#'
#' For cross-validation the number of folds (`k`) defaults to `10`. It may be
#' changed via the `.cv_folds` argument. Setting `k = 2` (not 1!) splits
#' the data into a single training and test data set. Setting `k = N` (where `N` is the
#' number of observations) produces leave-one-out cross-validation samples.
#' Note: 1.) at least 2 folds are required (`k > 1`); 2.) `k` cannot be larger than `N`;
#' 3.) if `N/k` is not an integer, the last fold will contain fewer observations.
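#'
#' A sketch of 5-fold cross-validation repeated 100 times (the data are
#' simulated for illustration):
#'
#' ```r
#' dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
#' cv  <- resampleData(.data = dat, .resample_method = "cross-validation",
#'                     .cv_folds = 5, .R = 100)
#' # cv[[1]] is a list of 5 folds of 20 observations each
#' ```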
#'
#' Random number generation (RNG) uses the L'Ecuyer-CMRG RNG stream as implemented
#' in the \href{https://github.com/futureverse/future.apply/}{future.apply package}
#' \insertCite{Bengtsson2018a}{cSEM}.
#' See [?future_lapply][future.apply::future_lapply] for details. By default
#' a random seed is chosen.
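#'
#' Supplying the same `.seed` reproduces the resamples:
#'
#' ```r
#' dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
#' b1  <- resampleData(.data = dat, .resample_method = "bootstrap", .seed = 123)
#' b2  <- resampleData(.data = dat, .resample_method = "bootstrap", .seed = 123)
#' identical(b1, b2) # TRUE: same seed, same resamples
#' ```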
#'
#' @usage resampleData(
#' .object = NULL,
#' .data = NULL,
#' .resample_method = c("bootstrap", "jackknife", "permutation",
#' "cross-validation"),
#' .cv_folds = 10,
#' .id = NULL,
#' .R = 499,
#' .seed = NULL
#' )
#'
#' @param .data A `data.frame`, a `matrix` or a `list` of data of either type.
#' Possible column types or classes of the data provided are:
#' "`logical`", "`numeric`" ("`double`" or "`integer`"), "`factor`" (ordered and unordered)
#' or a mix of several types. The data may also include
#' **one** character column whose column name must be given to `.id`.
#' This column is assumed to contain group identifiers used to split the data into groups.
#' If `.data` is provided, `.object` is ignored. Defaults to `NULL`.
#' @param .resample_method Character string. The resampling method to use. One of:
#' "*bootstrap*", "*jackknife*", "*permutation*", or "*cross-validation*".
#' Defaults to "*bootstrap*".
#' @param .R Integer. The number of bootstrap runs, permutation runs
#' or cross-validation repetitions to use. Defaults to `499`.
#' @inheritParams csem_arguments
#'
#' @return The structure of the output depends on the type of input and the
#' resampling method:
#' \describe{
#' \item{Bootstrap}{If a `matrix` or `data.frame` without grouping variable
#' is provided (i.e., `.id = NULL`), the result is a list of length `.R`
#' (default `499`). Each element of that list is a bootstrap (re)sample.
#' If a grouping variable is specified or a list of data is provided
#' (where each list element is assumed to contain data for one group),
#' resampling is done by group. Hence,
#' the result is a list of length equal to the number of groups
#' with each list element containing `.R` bootstrap samples based on the
#' `N_g` observations of group `g`.}
#' \item{Jackknife}{If a `matrix` or `data.frame` without grouping variable
#' is provided (`.id = NULL`), the result is a list of length equal to the number
#' of observations/rows (`N`) of the data set provided.
#' Each element of that list is a jackknife (re)sample.
#' If a grouping variable is specified or a list of data is provided
#' (where each list element is assumed to contain data for one group),
#' resampling is done by group. Hence,
#' the result is a list of length equal to the number of group levels
#' with each list element containing `N` jackknife samples based on the
#' `N_g` observations of group `g`.}
#' \item{Permutation}{If a `matrix` or `data.frame` without grouping variable
#' is provided an error is returned as permutation will simply reorder the observations.
#' If a grouping variable is specified or a list of data is provided
#' (where each list element is assumed to contain data of one group),
#' group membership is permuted. Hence, the result is a list of length `.R`
#' where each element of that list is a permutation (re)sample.}
#' \item{Cross-validation}{If a `matrix` or `data.frame` without grouping variable
#' is provided a list of length `.R` is returned. Each list element
#' contains a list containing the `k` splits/folds subsequently
#' used as test and training data sets.
#' If a grouping variable is specified or a list of data is provided
#' (where each list element is assumed to contain data for one group),
#' cross-validation is repeated `.R` times for each group. Hence,
#' the result is a list of length equal to the number of groups,
#' each containing `.R` list elements (the repetitions) which in turn contain
#' the `k` splits/folds.
#' }
#' }
#' @references
#' \insertAllCited{}
#'
#' @seealso [csem()], [cSEMResults], [resamplecSEMResults()]
#'
#' @example inst/examples/example_resampleData.R
#'
#' @export
#'
resampleData <- function(
.object = NULL,
.data = NULL,
.resample_method = c("bootstrap", "jackknife", "permutation",
"cross-validation"),
.cv_folds = 10,
.id = NULL,
.R = 499,
.seed = NULL
) {
.resample_method <- match.arg(.resample_method,
c("bootstrap", "jackknife", "permutation", "cross-validation"))
## Set plan on how to resolve futures to "sequential" as it is virtually always
## slower to resample data using "multiprocess"; reset at the end
oplan <- future::plan()
on.exit(future::plan(oplan), add = TRUE)
future::plan("sequential")
## Get data set
if(is.null(.data)) {
## Get information according to class of object
if(any(class(.object) %in% "cSEMResults_default")) {
data <- as.data.frame(.object$Information$Arguments$.data)
id <- NULL
} else if(any(class(.object) %in% "cSEMResults_multi")) {
data <- .object[[1]]$Information$Data_pooled
data_split <- lapply(.object, function(x) x$Information$Arguments$.data)
id <- ifelse(is.null(.object[[1]]$Information$Arguments$.id),
"id", .object[[1]]$Information$Arguments$.id)
} else if(any(class(.object) %in% "cSEMResults_2ndorder")) {
data <- .object$First_stage$Information$Arguments$.data
id <- NULL
} else {
stop2("The following error occurred in the `resampleData()` function:\n",
"`object` must be of class cSEMResults."
)
}
} else {
## Check data
if(!any(class(.data) %in% c("data.frame", "matrix", "list"))) {
stop2(
"The following error occurred in the `resampleData()` function:\n",
"Data must be provided as a `matrix`, a `data.frame` or a `list`. ",
".data has class: ", class(.data)
)
}
## Select cases
if(!is.null(.id) && !inherits(.data, "list")) {
if(length(.id) != 1) {
stop2(
"The following error occurred in the `resampleData()` function:\n",
"`.id` must be a character string or an integer identifying one single column."
)
}
if(is.matrix(.data)) {
.data <- as.data.frame(.data)
}
data <- .data
data_split <- split(.data, f = .data[, .id])
id <- .id
} else if(any(class(.data) == "list")) {
data <- do.call(rbind, .data)
if(is.null(names(.data))) {
data[, "id"] <- rep(paste0("Data_", 1:length(.data)),
times = sapply(.data, nrow))
} else {
data[, "id"] <- rep(names(.data), times = sapply(.data, nrow))
}
data_split <- .data
id <- "id"
##add names here
} else {
data <- .data
id <- NULL
}
}
# Save old seed and restore on exit! This is important since users may have
# set a seed before, in which case the global seed would be
# overwritten if not explicitly restored.
# Generate a random number first to make sure .Random.seed exists.
runif(1)
old_seed <- .Random.seed
on.exit({.Random.seed <<- old_seed}, add = TRUE)
## Create seed if not already set
if(is.null(.seed)) {
set.seed(seed = NULL)
# Note (08.12.2019): It's crucial to call set.seed(seed = NULL) before
# drawing a random seed out of .Random.seed. If set.seed(seed = NULL) is not
# called, sample(.Random.seed, 1) would result in the same random seed as
# long as .Random.seed remains unchanged. By resetting the seed we make
# sure that sample draws a different element every time it is called.
.seed <- sample(.Random.seed, 1)
}
## Set seed
set.seed(.seed)
## Choose resampling method
out <- switch (.resample_method,
"jackknife" = {
if(exists("data_split")) {
future.apply::future_lapply(data_split, function(y) future.apply::future_lapply(1:nrow(y), function(x) y[-x, ]))
} else {
future.apply::future_lapply(1:nrow(data), function(x) data[-x, ])
}
},
"bootstrap" = {
if(exists("data_split")) {
future.apply::future_lapply(data_split, function(y) {
future.apply::future_lapply(1:.R, function(x) {
y[sample(1:nrow(y), size = nrow(y), replace = TRUE), ]
}, future.seed = .seed)
})
} else {
future.apply::future_lapply(1:.R, function(x) {
data[sample(1:nrow(data), size = nrow(data), replace = TRUE), ]},
future.seed = .seed
)
}
},
"permutation" = {
if(is.null(id)) {
stop2(
"The following error occurred in the `resampleData()` function:\n",
"No id column specified to permute the data with."
)
} else {
future.apply::future_lapply(1:.R, function(x) {
cbind(data[,-which(colnames(data) == id)], "id" = sample(data[, id]))},
future.seed = .seed)
}
},
"cross-validation" = {
if(.cv_folds < 2) {
stop2(
"The following error occured in the `resampleData()` function:\n",
"A minimum of 2 cross-validation folds required.")
}
if((ceiling(nrow(data)/.cv_folds)*.cv_folds - nrow(data) >= ceiling(nrow(data)/.cv_folds))){
warning2(
"The following issue was encountered in the `resampleData()` function:\n",
"The number of .cv_folds is not plausible for the given sample size.\n",
"Change .cv_folds such that \n",
"ceiling(nrow(data)/.cv_folds)*.cv_folds - nrow(data) <= ceiling(nrow(data)/.cv_folds)"
)
}
# k-fold cross-validation (=draw k samples of equal size.).
# Note the last sample may contain less observations if equal sized
# samples are not possible
if(exists("data_split")) {
## The number of folds cannot be larger than the minimum number of
## rows/observations for all data sets in data_split
if(max(sapply(data_split, nrow)) < .cv_folds) {
stop2(
"The following error occurred in the `resampleData()` function:\n",
"The number of folds is larger than the number of observations",
" in at least one of the groups."
)
}
future.apply::future_lapply(data_split, function(y) {
future.apply::future_lapply(1:.R, function(x) {
# shuffle data set
y <- y[sample(1:nrow(y)), ]
# Fold sizes must be based on the group's data y, not the pooled data
if(ceiling(nrow(y)/.cv_folds)*.cv_folds - nrow(y) < ceiling(nrow(y)/.cv_folds)){
suppressWarnings(
split(as.data.frame(y), rep(1:.cv_folds,
each = ceiling(nrow(y)/.cv_folds)))
)
} else {
suppressWarnings(
split(as.data.frame(y), rep(1:.cv_folds,
each = floor(nrow(y)/.cv_folds)))
)
}
}, future.seed = .seed)
})
} else {
## The number of folds cannot be larger than the number of
## rows/observations of the data set
if(nrow(data) < .cv_folds) {
stop2(
"The following error occurred in the `resampleData()` function:\n",
"The number of folds is larger than the number of observations."
)
}
# shuffle data
future.apply::future_lapply(1:.R, function(x) {
data <- data[sample(1:nrow(data)), ]
if(ceiling(nrow(data)/.cv_folds)*.cv_folds - nrow(data) < ceiling(nrow(data)/.cv_folds)){
suppressWarnings(
split(as.data.frame(data), rep(1:.cv_folds,
each = ceiling(nrow(data)/.cv_folds)))
)
}else{
suppressWarnings(
split(as.data.frame(data), rep(1:.cv_folds,
each = floor(nrow(data)/.cv_folds)))
)
}
}, future.seed = .seed)
}
} # END cross-validation
) # END switch
## Return samples
out
}
#' Resample cSEMResults
#'
#' Resample a [cSEMResults] object using bootstrap or jackknife resampling.
#' The function is called by [csem()] if the user sets
#' `csem(..., .resample_method = "bootstrap")` or
#' `csem(..., .resample_method = "jackknife")` but may also be called directly.
#'
#' Given `M` resamples (for bootstrap `M = .R` and for jackknife `M = N`, where
#' `N` is the number of observations) based on the data used to compute the
#' [cSEMResults] object provided via `.object`, `resamplecSEMResults()` essentially calls
#' [csem()] on each resample using the arguments of the original call (ignoring any arguments
#' related to resampling) and returns estimates for each of a subset of
#' practically useful resampled parameters/statistics computed by [csem()].
#' Currently, the following estimates are computed and returned by default based
#' on each resample: Path estimates, Loading estimates, Weight estimates.
#'
#' In practical application users may need to resample a specific statistic (e.g,
#' the heterotrait-monotrait ratio of correlations (HTMT) or differences between path
#' coefficients such as beta_1 - beta_2).
#' Such statistics may be provided by a function `fun(.object, ...)` or a list of
#' such functions via the `.user_funs` argument. The first argument of
#' these functions must always be `.object`.
#' Internally, the function will be applied on each
#' resample to produce the desired statistic. Hence, arbitrary complicated statistics
#' may be resampled as long as the body of the function draws on elements contained
#' in the [cSEMResults] object only. Output of `fun(.object, ...)` should preferably
#' be a (named) vector but matrices are also accepted.
#' However, the output will be vectorized (columnwise) in this case.
#' See the examples section for details.
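#'
#' A sketch of such a user-defined function resampling the difference between
#' two path coefficients (the matrix indices and the object name
#' `a_cSEMResults_object` are illustrative and depend on the model at hand):
#'
#' ```r
#' path_diff <- function(.object) {
#'   p <- .object$Estimates$Path_estimates
#'   c("beta1_minus_beta2" = p[2, 1] - p[3, 1])
#' }
#' res <- resamplecSEMResults(.object = a_cSEMResults_object,
#'                            .user_funs = list("Path_diff" = path_diff))
#' ```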
#'
#' Both resampling the original [cSEMResults] object (call it "first resample")
#' and resampling based on a resampled [cSEMResults] object (call it "second resample")
#' are supported. Choices for the former
#' are "*bootstrap*" and "*jackknife*". Resampling based on a resample is turned off
#' by default (`.resample_method2 = "none"`) as this significantly
#' increases computation time (there are now `M * M2` resamples to compute, where
#' `M2` is `.R2` or `N`).
#' Resamples of a resample are required, e.g., for the studentized confidence
#' interval computed by the [infer()] function. Typically, bootstrap resamples
#' are used in this case \insertCite{Davison1997}{cSEM}.
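#'
#' A sketch of a double bootstrap subsequently used by [infer()] (the object
#' name is illustrative):
#'
#' ```r
#' res <- resamplecSEMResults(.object = a_cSEMResults_object,
#'                            .resample_method  = "bootstrap",
#'                            .resample_method2 = "bootstrap",
#'                            .R = 499, .R2 = 199)
#' infer(res, .quantity = "CI_t_interval") # studentized confidence interval
#' ```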
#'
#' As [csem()] accepts a single data set, a list of data sets, as well as data sets
#' that contain a column used to split the data into groups,
#' the [cSEMResults] object may contain multiple data sets.
#' In this case, resampling is done by data set or group. Note that depending
#' on the number of data sets/groups, the computation may be considerably
#' slower as resampling will be repeated for each data set/group. However, apart
#' from speed considerations, users do not need to worry about the type of
#' input used to compute the [cSEMResults] object as `resamplecSEMResults()`
#' is able to deal with each case.
#'
#' The number of bootstrap runs for the first and second run are given by `.R` and `.R2`.
#' The default is `499` for the first and `199` for the second run
#' but should be increased in real applications. See e.g.,
#' \insertCite{Hesterberg2015;textual}{cSEM}, p.380,
#' \insertCite{Davison1997;textual}{cSEM}, and
#' \insertCite{Efron2016;textual}{cSEM} for recommendations.
#' For jackknife, `.R` and `.R2` are ignored.
#'
#' Resampling may produce inadmissible results (as checked by [verify()]).
#' By default these results are dropped; however, users may choose to `"ignore"`
#' or `"replace"` inadmissible results, in which case resampling continues until
#' the necessary number of admissible results is reached.
#'
#' The \pkg{cSEM} package supports (multi)processing via the \href{https://github.com/futureverse/future/}{future}
#' framework \insertCite{Bengtsson2018}{cSEM}. Users may simply choose an evaluation plan
#' via `.eval_plan` and the package takes care of all the complicated backend
#' issues. Currently, users may choose between standard single-core/single-session
#' evaluation (`"sequential"`) and multiprocessing (`"multisession"` or `"multicore"`). The future package
#' provides other options (e.g., `"cluster"` or `"remote"`), however, they probably
#' will not be needed in the context of the \pkg{cSEM} package as simulations usually
#' do not require high-performance clusters. Depending on the operating system, the future
#' package will manage to distribute tasks to multiple R sessions (Windows)
#' or multiple cores. Note that multiprocessing is not necessarily faster
#' when only a "small" number of replications is required, as the overhead of
#' initializing new sessions or distributing tasks to different cores
#' is not immediately compensated by the availability of multiple sessions/cores.
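#'
#' A sketch of bootstrapping on multiple cores/sessions (argument values are
#' illustrative):
#'
#' ```r
#' res <- resamplecSEMResults(.object = a_cSEMResults_object,
#'                            .resample_method = "bootstrap", .R = 999,
#'                            .eval_plan = "multisession")
#' ```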
#'
#' Random number generation (RNG) uses the L'Ecuyer-CMRG RNG stream as implemented in the
#' \href{https://github.com/futureverse/future.apply/}{future.apply package} \insertCite{Bengtsson2018a}{cSEM}.
#' It is independent of the evaluation plan. Hence, setting e.g., `.seed = 123` will
#' generate the same random numbers and replicates
#' for `.eval_plan = "sequential"`, `.eval_plan = "multisession"`, and `.eval_plan = "multicore"` alike.
#' See [?future_lapply][future.apply::future_lapply] for details.
#'
#' @usage resamplecSEMResults(
#' .object = NULL,
#' .resample_method = c("bootstrap", "jackknife"),
#' .resample_method2 = c("none", "bootstrap", "jackknife"),
#' .R = 499,
#' .R2 = 199,
#' .handle_inadmissibles = c("drop", "ignore", "replace"),
#' .user_funs = NULL,
#' .eval_plan = c("sequential", "multicore", "multisession"),
#' .force = FALSE,
#' .seed = NULL,
#' .sign_change_option = c("none","individual","individual_reestimate",
#' "construct_reestimate"),
#' ...
#' )
#'
#' @inheritParams csem_arguments
#' @param .resample_method Character string. The resampling method to use. One of:
#' "*bootstrap*" or "*jackknife*". Defaults to "*bootstrap*".
#' @param ... Further arguments passed to functions supplied to `.user_funs`.
#'
#' @return The core structure is the same structure as that of `.object` with
#' the following elements added:
#' \itemize{
#' \item{ `$Estimates_resamples`: A list containing the `.R` resamples and
#' the original estimates for each of the resampled quantities (Path_estimates,
#' Loading_estimates, Weight_estimates, user defined functions).
#' Each list element is a list containing elements
#' `$Resamples` and `$Original`. `$Resamples` is a `(.R x K)` matrix with each
#' row representing one resample for each of the `K` parameters/statistics.
#' `$Original` contains the original estimates (vectorized by column if the output of
#' the user provided function is a matrix).}
#' \item{`$Information_resamples`: A list containing additional information.}
#' }
#' Use `str(<.object>, list.len = 3)` on the resulting object for an overview.
#'
#' @references
#' \insertAllCited{}
#'
#' @seealso [csem], [summarize()], [infer()], [cSEMResults]
#'
#' @example inst/examples/example_resamplecSEMResults.R
#'
#' @export
#'
resamplecSEMResults <- function(
.object = NULL,
.resample_method = c("bootstrap", "jackknife"),
.resample_method2 = c("none", "bootstrap", "jackknife"),
.R = 499,
.R2 = 199,
.handle_inadmissibles = c("drop", "ignore", "replace"),
.user_funs = NULL,
.eval_plan = c("sequential", "multicore", "multisession"),
.force = FALSE,
.seed = NULL,
.sign_change_option = c("none","individual","individual_reestimate",
"construct_reestimate"),
...
) {
## Match arguments
.resample_method <- match.arg(.resample_method)
.resample_method2 <- match.arg(.resample_method2)
.handle_inadmissibles <- match.arg(.handle_inadmissibles)
.eval_plan <- match.arg(.eval_plan)
.sign_change_option <- match.arg(.sign_change_option)
if(inherits(.object, "cSEMResults_multi")) {
out <- lapply(.object, function(x) {
resamplecSEMResults(
.object = x,
.resample_method = .resample_method,
.resample_method2 = .resample_method2,
.R = .R,
.R2 = .R2,
.handle_inadmissibles = .handle_inadmissibles,
.user_funs = .user_funs,
.eval_plan = .eval_plan,
.force = FALSE,
.seed = .seed,
.sign_change_option = .sign_change_option,
...
)
})
## Add/ set class
class(out) <- c(class(.object), "cSEMResults_resampled")
return(out)
}
## Does .object already contain resamples
if(inherits(.object, "cSEMResults_resampled")) {
if(.force) {
# Delete resamples
if(inherits(.object, "cSEMResults_default")) {
.object$Estimates[["Estimates_resample"]] <- NULL
.object$Information[["Information_resample"]] <- NULL
class(.object) <- setdiff(class(.object), "cSEMResults_resampled")
} else {
.object$Second_stage$Information[["Resamples"]] <- NULL
}
} else {
stop2(
"The following issue was encountered in the `resamplecSEMResults()` function:\n",
"The object provided already contains resamples. Use `.force = TRUE` to rerun.")
}
}
## Check for the minimum number of necessary resamples
if(.R < 3 || .R2 < 3) {
stop2("The following error occurred in the `resamplecSEMResults()` function:\n",
"At least 3 resamples are required.")
}
## Has the object to use the data to resample from produced admissible results?
if(sum(unlist(verify(.object))) != 0) {
warning2(
"The following issue was encountered in the `resamplecSEMResults()` function:\n",
"Estimation based on the original data has produced inadmissible results.\n",
"This may be a sign that something is wrong.",
" Resampling will continue but may not produce reliable results.")
}
# Set plan on how to resolve futures; reset at the end
oplan <- future::plan()
on.exit(future::plan(oplan), add = TRUE)
future::plan(.eval_plan)
### Process original data ----------------------------------------------------
## Select relevant statistics/parameters/quantities and vectorize
Est_original <- selectAndVectorize(.object)
## Apply user defined function if specified
user_funs <- if(!is.null(.user_funs)) {
applyUserFuns(.object, .user_funs = .user_funs, ...)
}
## Add output of the user functions to Est_original
if(!is.null(.user_funs)) {
Est_original <- c(Est_original, user_funs)
}
### Resample and compute -----------------------------------------------------
# Save old seed and restore on exit! This is important since users may have
# set a seed before, in which case the global seed would be
# overwritten if not explicitly restored
# Note (07.10.2020): for some tests performed by testthat, .Random.seed is not
# available because it has not yet been initialized. To initialize
# the .Random.seed object a random number is generated.
runif(1)
old_seed <- .Random.seed
on.exit({.Random.seed <<- old_seed}, add = TRUE)
## Create seed if not already set
if(is.null(.seed)) {
set.seed(seed = NULL)
# Note (08.12.2019): It's crucial to call set.seed(seed = NULL) before
# drawing a random seed out of .Random.seed. If set.seed(seed = NULL) is not
# called, sample(.Random.seed, 1) would result in the same random seed as
# long as .Random.seed remains unchanged. By resetting the seed we make
# sure that sample draws a different element every time it is called.
.seed <- sample(.Random.seed, 1)
}
out <- resamplecSEMResultsCore(
.object = .object,
.resample_method = .resample_method,
.resample_method2 = .resample_method2,
.R = .R,
.R2 = .R2,
.handle_inadmissibles = .handle_inadmissibles,
.handle_inadmissibles2 = .handle_inadmissibles,
.user_funs = .user_funs,
.eval_plan = .eval_plan,
.seed = .seed,
.sign_change_option = .sign_change_option,
...
)
# Check if at least 3 admissible results were obtained
n_admissibles <- length(out)
if(n_admissibles < 3) {
stop("The following error occurred in the `resamplecSEMResults()` function:\n",
"Fewer than 3 admissible results were produced.",
" Consider setting `.handle_inadmissibles = 'replace'` instead.",
call. = FALSE)
}
# Turn list "inside out" and bind bootstrap samples to matrix
# - columns are variables
# - rows are bootstrap runs
out <- purrr::transpose(out)
if(.resample_method2 != "none") {
out_2 <- out$Estimates2
out <- purrr::transpose(out$Estimates1)
}
out <- out %>%
lapply(function(x) do.call(rbind, x))
# Add estimated quantities based on the original sample/data
out <- mapply(function(l1, l2) list("Resampled" = l1, "Original" = l2),
l1 = out,
l2 = Est_original,
SIMPLIFY = FALSE)
## Return --------------------------------------------------------------------
# The desired output is: a list of two:
# 1. Resamples := a list of two
# 1.1 Estimates1 := the .R resamples of the "outer" resampling run.
# 1.2 Estimates2 := the .R2 resamples of the "inner" run or NA
# 2. Information := a list of four
# 2.1 Method := the method used to obtain the "outer" resamples
# 2.2 Method2 := the method used to obtain the "inner" resamples or NA
# 2.3 Number_of_observations := the number of observations.
# 2.4 Original_object := the original cSEMResults object
#
# Since resamplecSEMResults is called recursively within its body it is
# difficult to produce the desired output. There is no straightforward way
# to determine if a call is a recursive call or not.
# My workaround for now is to simply check if the argument provided for ".object"
# is "Est_temp" since this is the argument of the recursive call. This
# is of course not general but a dirty workaround. I guess it is safe enough
# though.
is_recursive_call <- eval.parent(as.list(match.call()))$.object == "Est_temp"
## Return
if(is_recursive_call) {
out
} else {
if(.resample_method2 != "none") {
out <- list("Estimates1" = out, "Estimates2" = out_2)
} else {
out <- list("Estimates1" = out, "Estimates2" = NA)
}
## Add resamples and additional information
if(inherits(.object, "cSEMResults_2ndorder")) {
resample_out <- c(.object$Second_stage$Information)
resample_out[[length(resample_out) + 1]] <- list(
"Estimates" = c(out),
"Information_resample" = list(
"Method" = .resample_method,
"Method2" = .resample_method2,
"Number_of_admissibles" = n_admissibles,
"Number_of_observations" = nrow(.object$Second_stage$Information$Data),
"Number_of_runs" = .R,
"Number_of_runs2" = .R2,
"Seed" = .seed,
"Handle_inadmissibles" = .handle_inadmissibles,
"Sign_change_option" = .sign_change_option
)
)
names(resample_out)[length(resample_out)] <- "Resamples"
out <- list(
"First_stage" = .object$First_stage,
"Second_stage" = list(
"Estimates" = .object$Second_stage$Estimates,
"Information" = resample_out
)
)
# Renew class for 2nd stage
class(out$Second_stage) <- class(.object$Second_stage)
} else {
# Estimates
estim <- c(.object$Estimates)
estim[[length(estim) + 1]] <- c(out)
names(estim)[length(estim)] <- "Estimates_resample"
# Information
info <- c(.object$Information)
info[[length(info) + 1]] <- list(
"Method" = .resample_method,
"Method2" = .resample_method2,
"Number_of_admissibles" = n_admissibles,
"Number_of_observations" = nrow(.object$Information$Data), # for infer()
"Number_of_runs" = .R,
"Number_of_runs2" = .R2,
"Seed" = .seed,
"Handle_inadmissibles" = .handle_inadmissibles,
"Sign_change_option" = .sign_change_option
)
names(info)[length(info)] <- "Information_resample"
out <- list(
"Estimates" = estim,
"Information" = info
)
}
## Add/ set class
class(out) <- c(class(.object), "cSEMResults_resampled")
return(out)
}
}
#' Core tasks of the resamplecSEMResults function
#' @noRd
#'
resamplecSEMResultsCore <- function(
.object = args_default()$.object,
.resample_method = args_default()$.resample_method,
.resample_method2 = args_default()$.resample_method2,
.R = args_default()$.R,
.R2 = args_default()$.R2,
.handle_inadmissibles = args_default()$.handle_inadmissibles,
.handle_inadmissibles2 = NULL,
.user_funs = args_default()$.user_funs,
.eval_plan = args_default()$.eval_plan,
.seed = args_default()$.seed,
.sign_change_option = args_default()$.sign_change_option,
...
) {
## Get arguments
if(inherits(.object, "cSEMResults_2ndorder")) {
info1 <- .object$First_stage$Information
info2 <- .object$Second_stage$Information
est1_normal <- .object$First_stage$Estimates
est2_normal <- .object$Second_stage$Estimates
summary_original <- summarize(.object)
est1 <- summary_original$First_stage$Estimates
est2 <- summary_original$Second_stage$Estimates
args <- info2$Arguments_original
## Append the 2ndorder approach to args
args$.approach_2ndorder <- info2$Approach_2ndorder
## Replace original data (which contains either an id column or is a list)
## by the data used for the group that is being resampled.
if(!is.null(args$.id) || inherits(args$.data, "list")) {
args$.data <- info1$Data
args$.id <- NULL
}
} else {
info1 <- .object$Information
est_normal <- .object$Estimates
summary_original <- summarize(.object)
est <- summary_original$Estimates
args <- info1$Arguments
}
## Resample jackknife
if(.resample_method == "jackknife") {
resample_jack <- resampleData(.object, .resample_method = "jackknife")
.R <- length(resample_jack)
}
### Start resampling loop ====================================================
progressr::with_progress({
if(.R >= 10) {
progress_bar_csem <- progressr::progressor(steps = floor(.R / 10))
} else {
progress_bar_csem <- progressr::progressor(steps = .R)
}
Est_ls <- future.apply::future_lapply(1:.R, function(i) {
if (i %% 10 == 0) {
progress_bar_csem(message = sprintf("Resample run = %g", i))
}
# Est_ls <- lapply(1:.R, function(i) {
# Replace the old data set by a resampled data set (resampleData always returns
# a list so for just one draw we need to pick the first list element)
data_temp <- if(.resample_method == "jackknife") {
resample_jack[[i]]
} else {
# We could use resampleData() here, but bootstrap resampling directly is
# faster (not surprising; both approaches were compared using
# microbenchmark).
data <- args[[".data"]]
data[sample(1:nrow(data), size = nrow(data), replace = TRUE), ]
}
args[[".data"]] <- data_temp
# Estimate model
Est_temp <- if(inherits(.object, "cSEMResults_2ndorder")) {
## NOTE: earlier benchmarks suggested do.call(csem, args) was much slower,
## but later comparisons showed no speed difference (see the note dated
## 15.05.2019 below), so do.call() is used.
do.call(csem, args)
} else {
# It is important to use foreman() here instead of csem() to allow for
# lapply(x, resamplecSEMResults_default) when x is of class
# cSEMResults_2ndorder.
## NOTE:
# 01.03.2019: Using do.call(foreman, args) would be more elegant but is
# much slower (especially for larger data sets).
#
# 15.05.2019: Apparently do.call(foreman, args) is not that bad after all.
# Several microbenchmark comparisons showed no speed difference (anymore?!).
# When do.call appeared slow, other parts of foreman were fixed as well;
# maybe they had been the real culprit. So do.call is used again, as it
# avoids redundancy.
do.call(foreman, args)
}
# Check status
status_code <- sum(unlist(verify(Est_temp)))
# Distinguish depending on how inadmissibles should be handled
if(status_code == 0 || .handle_inadmissibles == "ignore") {
## Select relevant statistics/parameters/quantities
x1 <- selectAndVectorize(Est_temp)
### Sign change correction for PLS-PM
if(.sign_change_option != "none" && info1$Arguments$.approach_weights == "PLS-PM") {
if(inherits(.object, "cSEMResults_2ndorder")) {
est1_temp_normal <- Est_temp$First_stage$Estimates
est2_temp_normal <- Est_temp$Second_stage$Estimates
### Individual_reestimate and construct_reestimate -------------------
if(.sign_change_option == "individual_reestimate" ||
.sign_change_option == "construct_reestimate") {
## First stage
# Is there a sign difference between the first-stage resample and the
# original weight estimates? If so, which weights differ?
sign_diff1 <- sign(est1_normal$Weight_estimates) !=
sign(est1_temp_normal$Weight_estimates)
# A correction is only needed if at least one sign differs
if(sum(sign_diff1) != 0) {
W1_new <- est1_temp_normal$Weight_estimates
# Individual_reestimate
if(.sign_change_option == "individual_reestimate"){
W1_new[sign_diff1] <- W1_new[sign_diff1] * (-1)
} # END individual_reestimate
if(.sign_change_option == "construct_reestimate"){
# In line with Tenenhaus et al. (2005, p. 177) the loadings are used,
# although they did not consider second-order models
Lambda1 <- est1_normal$Loading_estimates
Lambda1_temp <- est1_temp_normal$Loading_estimates
Lambda_diff1 <- abs(rowSums(Lambda1 - Lambda1_temp))
Lambda_sum1 <- abs(rowSums(Lambda1 + Lambda1_temp))
W1_new[Lambda_diff1 > Lambda_sum1, ] <- W1_new[Lambda_diff1 > Lambda_sum1, ] * (-1)
} # END construct_reestimate
# Put corrected weights in a list to be able to supply them as
# fixed weights to .PLS_modes
W1_list <- lapply(1:nrow(W1_new), function(x) {
temp <- W1_new[x, ]
temp[temp != 0]
})
names(W1_list) <- rownames(W1_new)
## Replace modes by fixed sign-corrected weights
args[[".PLS_modes"]] <- W1_list
## Reestimate
# Note: calling csem() directly would be faster; however, the sign
# change option will presumably not be used that often, so the more
# elegant, albeit slower, do.call() construction is used for now.
Est_temp <- do.call(csem, args)
## Update
est2_temp_normal <- Est_temp$Second_stage$Estimates
} # END first stage
## Second stage
# Is there a sign difference between the second-stage (sign-corrected)
# resample weight estimates and the original weight estimates?
# If so, which weights differ?
sign_diff2 <- sign(est2_normal$Weight_estimates) !=
sign(est2_temp_normal$Weight_estimates)
# A correction is only needed if at least one sign differs
if(sum(sign_diff2) != 0) {
W2_new <- est2_temp_normal$Weight_estimates
# Individual_reestimate
if(.sign_change_option == "individual_reestimate"){
W2_new[sign_diff2] <- W2_new[sign_diff2] * (-1)
} # END individual_reestimate
if(.sign_change_option == "construct_reestimate"){
# In line with Tenenhaus et al. (2005, p. 177) the loadings are used,
# although they did not consider second-order models
Lambda2 <- est2_normal$Loading_estimates
Lambda2_temp <- est2_temp_normal$Loading_estimates
Lambda_diff2 <- abs(rowSums(Lambda2 - Lambda2_temp))
Lambda_sum2 <- abs(rowSums(Lambda2 + Lambda2_temp))
W2_new[Lambda_diff2 > Lambda_sum2, ] <- W2_new[Lambda_diff2 > Lambda_sum2, ] * (-1)
} # END construct_reestimate
# Put corrected weights in a list to be able to supply them as
# fixed weights to .PLS_modes
W2_list <- lapply(1:nrow(W2_new), function(x){
temp <- W2_new[x, ]
temp[temp != 0]
})
names(W2_list) <- rownames(W2_new)
args[[".PLS_modes"]] <- W2_list
## Reestimate
# Note: calling csem() directly would be faster; however, the sign
# change option will presumably not be used that often, so the more
# elegant, albeit slower, do.call() construction is used for now.
Est_temp <- do.call(csem, args)
} # END second stage
## Update using estimates based on the sign-corrected weights
x1 <- selectAndVectorize(Est_temp)
} # END individual_reestimate, construct_reestimate
if(.sign_change_option == "individual") {
# Reverse the signs of ALL parameter estimates in a bootstrap run if
# their sign differs from the sign of the original estimation
x1 <- reverseSign(.Est_temp = Est_temp, .summary_original = summary_original)
}
} else {
est_temp_normal <- Est_temp$Estimates
sign_diff <- sign(est_normal$Weight_estimates) !=
sign(est_temp_normal$Weight_estimates)
# Is there a difference in the signs of the weights? If not, no sign
# correction is performed. This should not be a problem for the
# construct_reestimate approach, which only compares the loadings.
if(sum(sign_diff) != 0) {
# Sign change correction individual_reestimate and construct_reestimate
if(.sign_change_option == "individual_reestimate" ||
.sign_change_option == "construct_reestimate") {
W_new <- est_temp_normal$Weight_estimates
# Individual_reestimate
if(.sign_change_option == "individual_reestimate"){
W_new[sign_diff] <- W_new[sign_diff] * (-1)
} # END individual_reestimate
if(.sign_change_option == "construct_reestimate"){
Lambda <- est_normal$Loading_estimates
Lambda_temp <- est_temp_normal$Loading_estimates
Lambda_diff <- abs(rowSums(Lambda - Lambda_temp))
Lambda_sum <- abs(rowSums(Lambda + Lambda_temp))
W_new[Lambda_diff > Lambda_sum, ] <- W_new[Lambda_diff > Lambda_sum, ] * (-1)
}
# Put corrected weights in a list to be able to supply them as
# fixed weights to .PLS_modes
W_list <- lapply(1:nrow(W_new), function(x){
temp <- W_new[x, ]
temp[temp != 0]
})
names(W_list) <- rownames(W_new)
args[[".PLS_modes"]] <- W_list
## Reestimate
# Note: calling csem() directly would be faster; however, the sign
# change option will presumably not be used that often, so the more
# elegant, albeit slower, do.call() construction is used for now.
Est_temp <- do.call(csem, args)
## Update using estimates based on the sign-corrected weights
x1 <- selectAndVectorize(Est_temp)
} # end if individual_reestimate, construct_reestimate
if(.sign_change_option == "individual") {
# Reverse the signs of ALL parameter estimates in a bootstrap run if
# their sign differs from the sign of the original estimation
x1 <- reverseSign(.Est_temp = Est_temp, .summary_original = summary_original)
} # END .sign_change_option == "individual"
} # END if(sum(sign_diff) != 0)
} # END else (= not 2ndorder)
} # END .sign_change_option
## Apply user defined function if specified
user_funs <- if(!is.null(.user_funs)) {
applyUserFuns(Est_temp, .user_funs = .user_funs, ...)
}
## Add output of the user functions to x1
if(!is.null(.user_funs)) {
x1 <- c(x1, user_funs)
}
## Resampling from a bootstrap sample is required for the
## bootstrapped t-interval CI (studentized CI), hence the second run.
# No sign change option is applied in the second run. Applying the same
# correction as in the first run could be considered in the future.
if(.resample_method2 != "none") {
Est_resamples2 <- resamplecSEMResults(
.object = Est_temp,
.R = .R2,
.handle_inadmissibles = .handle_inadmissibles2,
.resample_method = .resample_method2,
.resample_method2 = "none",
.user_funs = .user_funs,
.seed = .seed,
.force = FALSE,
.sign_change_option = "none",
...
)
x1 <- list("Estimates1" = x1, "Estimates2" = Est_resamples2)
}
} else if(status_code != 0 && .handle_inadmissibles != "ignore") {
# Return NA if the status is not okay and .handle_inadmissibles == "drop"
# or "replace"
x1 <- NA
}
## Return
x1
}, future.seed = .seed)
})
## Process data --------------------------------------------------------------
# Delete potential NA's
out <- Filter(Negate(anyNA), Est_ls)
## Replace inadmissibles
if(length(out) != .R && .resample_method == "bootstrap" &&
.handle_inadmissibles == "replace") {
n_attempt <- 0
while (length(out) < .R) {
n_attempt <- n_attempt + 1
R_new <- .R - length(out)
if(is.null(.seed)) {
.seed <- sample(.Random.seed, 1)
} else {
.seed <- .seed + 1
}
Est_replace <- resamplecSEMResultsCore(
.object = .object,
.R = R_new,
.handle_inadmissibles = "drop",
.handle_inadmissibles2= .handle_inadmissibles,
.resample_method = .resample_method,
.resample_method2 = .resample_method2,
.R2 = .R2,
.user_funs = .user_funs,
.eval_plan = "sequential", # multiprocessing will generally
# not be meaningful here since we typically
# replace only a small subset of the total runs
.seed = .seed,
.sign_change_option = .sign_change_option,
...
)
out <- c(out, Est_replace)
if(n_attempt == 10*.R) {
warning2(
"The following issue was encountered in the `resamplecSEMResults()` function:\n",
"Attempting to replace inadmissibles failed after ", 10*.R, " attempts.",
" Returning all admissible results up to the last attempt."
)
break
}
}
}
out
}
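The construct-level sign correction used above can be illustrated in isolation. A minimal, self-contained sketch with toy matrices (all values are made up for illustration; rows are constructs, columns are indicators), assuming the same `rowSums`-based comparison as in the function body:

```r
## Illustrative sketch (not part of the package): construct-level sign
## correction as applied above. All values below are toy numbers.
# Original loadings (rows = constructs, columns = indicators)
Lambda <- rbind(
  c(0.80, 0.70, 0.00,  0.00),
  c(0.00, 0.00, 0.60,  0.50)
)
# Resample loadings; construct 2 came out with flipped signs
Lambda_temp <- rbind(
  c(0.79, 0.71, 0.00,  0.00),
  c(0.00, 0.00, -0.58, -0.52)
)
# A construct's weights are flipped if its resample loadings are, on the
# whole, closer to the negated original loadings than to the originals
Lambda_diff <- abs(rowSums(Lambda - Lambda_temp))
Lambda_sum  <- abs(rowSums(Lambda + Lambda_temp))
flip <- Lambda_diff > Lambda_sum
# Apply the correction to the resample weights
W_new <- rbind(
  c(0.50, 0.50, 0.00,  0.00),
  c(0.00, 0.00, -0.55, -0.45)
)
W_new[flip, ] <- W_new[flip, ] * (-1)
```

Here only the second construct is flipped, restoring its weights to positive values while the first construct is left untouched.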