R/para_intensity_CI.R
In parasiteR: A Theorical-Practical Approach to Parasitological Data Analysis

Documented in para_intensity_CI

#' Mean or median intensity estimation and confidence intervals
#'
#' This function calculates point estimates and confidence intervals (CIs) for parasite intensity, using either the mean or the median as a measure of central tendency. Confidence intervals are estimated via a non-parametric bootstrap approach based on resampling (permutations) of the observed data. Specifically, the function implements bias-corrected and accelerated (BCa) bootstrap intervals, which adjust for both bias and skewness in the bootstrap distribution. This approach does not assume a specific underlying distribution and is particularly robust for overdispersed and zero-inflated parasitological data.
#'
#' Parasite intensity is defined as the number of individuals of a given parasite taxon per infested host. For each taxon, intensity metrics are calculated based only on hosts with parasite counts greater than zero. The function reshapes the dataset into long format and computes intensity statistics for each parasite taxon and grouping combination (if specified).
#' The following are estimated:
#'
#' \itemize{
#'  \item \code{A} is the total parasite abundance
#'  \item \code{nH} is the number of hosts analyzed
#'  \item \code{nH_inf} is the number of infected hosts
#' }
#'
#' Depending on the argument \code{c_median}, the function calculates:
#'
#' \itemize{
#'  \item Mean intesity \code{MeanI}: average number of parasites per infested host
#'  \item Median intesity \code{MedI}: median number of parasites per infested host
#' }
#'
#' Confidence intervals are estimated using a non-parametric bootstrap approach. Specifically, bias-corrected and accelerated (BCa) bootstrap intervals are computed by resampling the observed intensity values with replacement a specified number of times \code{perm}. This method adjusts for both bias and skewness in the bootstrap distribution.
#' Statistical considerations: parasite intensity data are typically right-skewed and may exhibit high variability due to aggregation among hosts. The use of bootstrap methods allows robust estimation of confidence intervals without assuming normality. Mean intensity is sensitive to extreme values, whereas median intensity provides a more robust measure under highly skewed distributions. When the number of infested hosts is small, bootstrap confidence intervals may be unstable or wide, and results should be interpreted with caution. The interpretation of results remains the responsibility of the user.
#'
#' @usage
#' para_intensity_CI(dataset, c_median = TRUE, sp_cols, group_vars = NULL,
#'  perm = 2000, decimal_places = 2, combine_ci = FALSE,
#'  conf_level = 0.95, verbose = FALSE)
#'
#'
#'
#' @param dataset Data frame with parasitological data.
#' @param c_median Logical. If \code{TRUE}, the results will include the median as a central tendency of measure; if \code{FALSE}, the results will include the mean of the data.
#' @param sp_cols Vector with the names of the columns containing abundance of parasites (taxa) to calculate the parasitological descriptors.
#' @param group_vars Vector with the names of categorical variables used to define groups (e.g., "Sex", "Site"). Default is \code{NULL}.
#' @param perm Number of permutations to perform for confidence interval estimation. Default is \code{2000}.
#' @param decimal_places Number of decimal places to include in the calculation. Default is \code{2}.
#' @param combine_ci Logical. If \code{TRUE}, the interval is expressed as a single column (min - max). If \code{FALSE}, the interval is split into separate lower and upper limit columns
#' @param conf_level Confidence level for interval estimation (e.g., 0.95 for 95\% confidence intervals).
#' @param verbose A logical value indicating if progress messages should be given.
#'
#' @return A data frame containing abundance estimates and confidence intervals for each parasite taxon, either globally or by group. The following variables are returned:
#' \itemize{
#'   \item \code{nH}: Number of hosts analyzed
#'   \item \code{nH_inf}: Number of infected hosts
#'   \item \code{A}: Total parasite abundance
#'   \item \code{MeanA}:Mean parasite intensity
#'   \item \code{MedA}: Median parasite intensity
#'   \item \code{Lower_CI}: Lower bound of the bootstrap confidence interval
#'   \item \code{Upper_CI}: Upper bound of the bootstrap confidence interval
#'   \item \code{CI}: If \code{combine_ci = TRUE}, confidence interval expressed are store as a single column (Lower_CI - Upper_CI)
#'   \item \code{Observation}: Categorical description of the data context:
#'        \itemize{
#'          \item \code{"Not analyzed"}: No valid observations are available for the given combination (all values are missing or the combination is absent in the dataset); therefore, no estimates can be computed.
#'          \item \code{"One host analyzed"}: Only a single host analyzed is available for the given combination; thus, no population-level inference is possible and statistical summary measures are not estimated.
#'          \item \code{"No hosts infested"}: Hosts are present for the given combination, but none are infested (abundance = 0 for all observations); consequently, no statistical summary measures of abundance or intensity can be estimated.
#'          \item \code{"One host infested"}: Only a single infested host is recorded for the given combination; therefore, no sample-based estimation of intensity or related summary measures is possible.
#'          \item \code{"Multiple hosts infested"}: More than one infested host is recorded for the given combination, allowing the estimation of summary measures.
#' }
#' }
#' @examples
#'# Calculate of the CI for the median intensity
#'med_int_CI <- para_intensity_CI(para_data$dataset,
#'                                c_median = TRUE,
#'                                sp_cols =  c("Sp1"),
#'                                group_vars = c("Site"),
#'                                decimal_places = 2,
#'                                conf_level = 0.95,
#'                                combine_ci = TRUE,
#'                                verbose = TRUE)
#'med_int_CI
#'
# Calculate the CI for the mean intensity
#'mean_int_CI <- para_intensity_CI(para_data$dataset,
#'                                 c_median = FALSE,
#'                                 sp_cols =  c("Sp1"),
#'                                 group_vars = c("Site"),
#'                                 decimal_places = 2,
#'                                 conf_level = 0.95,
#'                                 combine_ci = TRUE,
#'                                 verbose = TRUE)
#'mean_int_CI
#'
#' @references
#' Bush, A.O., Lafferty, K.D., Lotz, J.M., Shostak, A.W. (1997). Parasitology meets ecology on its own terms:
#' Margolis revisited. \emph{Journal of Parasitology}, 83(4), 575–583.
#'
#' Reiczigel, J., Marozzi, M., Fabian, I., Rózsa, L. (2019). Biostatistics for parasitologists – a primer to
#' quantitative parasitology. \emph{Trends in Parasitology}, 35(4), 277–281.
#'
#' @author Juan Manuel Cabrera, Exequiel Furlan and Elisa Helman
#'
#' @export

para_intensity_CI <- function(dataset, c_median = TRUE, sp_cols, group_vars = NULL,
                              perm = 2000, decimal_places = 2, combine_ci = FALSE,
                              conf_level = 0.95, verbose = FALSE){

  Intensity <- NA
  CI <- NA
  MeanI <- NA
  MedI <- NA
  nH_inf <- NA
  nH <- NA
  A <- NA
  Sp <- NA
  boot_result <- NA
  bca_conf_interval <- NA
  Lower_CI <- NA
  Upper_CI <- NA
  Observation <- NA

  if (verbose) message("Checking function arguments...")

  # Validaciones
  if (is.null(sp_cols) || length(sp_cols) == 0) {
    stop("The species columns must be specified (sp_cols).")
  }
  if (!all(sp_cols %in% colnames(dataset))) {
    stop("Some of the specified species columns do not exist in the dataset.")
  }
  if (!is.null(group_vars) && !all(group_vars %in% colnames(dataset))) {
    stop("Some of the specified categorical variables do not exist in the dataset.")
  }

  # Función bootstrap (solo positivos)
  if (c_median) {
    ab_function <- function(dataset, indices) {
      sample_data <- dataset[indices]
      return(stats::median(sample_data, na.rm = TRUE))
    }
  } else {
    ab_function <- function(dataset, indices) {
      sample_data <- dataset[indices]
      return(mean(sample_data, na.rm = TRUE))
    }
  }
  if (verbose) message("Calculating intensity...")
  # Formato largo
  data_long <- dataset %>%
    tidyr::pivot_longer(cols = dplyr::all_of(sp_cols),
                        names_to = "Sp",
                        values_to = "Intensity")

  # 🔴 FILTRAR SOLO HOSTS INFESTADOS PARA INTENSIDAD
  data_long_inf <- data_long %>%
    dplyr::filter(Intensity > 0)

  # Cálculo
  intensity_results <- data_long %>%
    dplyr::group_by(dplyr::across(dplyr::all_of(group_vars)), Sp) %>%
    dplyr::summarise(
      nH = sum(!is.na(Intensity)),
      nH_inf = sum(Intensity > 0, na.rm = TRUE),
      A = sum(Intensity, na.rm = TRUE),
      .groups = 'drop'
    ) %>%
    dplyr::left_join(
      data_long_inf %>%
        dplyr::group_by(dplyr::across(dplyr::all_of(group_vars)), Sp) %>%
        dplyr::summarise(
          MeanI = ifelse(dplyr::n() > 1,
                         round(mean(Intensity, na.rm = TRUE), decimal_places),
                         NA_real_),
          MedI = ifelse(dplyr::n() > 1,
                        round(stats::median(Intensity, na.rm = TRUE), decimal_places),
                        NA_real_),
          boot_result = dplyr::if_else(MeanI || MedI && nH > 1,
                                       list(boot::boot(data = Intensity, statistic = ab_function, R = perm)),
                                       list(boot::boot(data = 0, statistic = ab_function, R = perm))),
          .groups = "drop"
        ),
      by = c(group_vars, "Sp")
    ) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
      bca_conf_interval = list(
        tryCatch(
          suppressWarnings(boot::boot.ci(boot_result, type = "bca")),
          error = function(e) NULL
        )
      ),
      Lower_CI = ifelse(!is.null(bca_conf_interval),
                        round(bca_conf_interval$bca[4], decimal_places),
                        NA_real_),
      Upper_CI = ifelse(!is.null(bca_conf_interval),
                        round(bca_conf_interval$bca[5], decimal_places),
                        NA_real_)
    ) %>%
    dplyr::ungroup()

  # Observaciones (alineadas con abundance)
  intensity_results <- intensity_results %>%
    dplyr::mutate(
      Observation = dplyr::case_when(
        nH == 0 ~ "Not analyzed",
        nH == 1 ~ "One host examined",
        nH_inf == 0 ~ "No hosts infested",
        nH_inf == 1 ~ "One host infested",
        TRUE ~ "Multiple hosts infested"
      )
    )

  # 🔴 WARNING 1
  failed_ci <- any(is.na(intensity_results$Lower_CI) |
                     is.na(intensity_results$Upper_CI))

  if (failed_ci) {
    warning("Confidence intervals could not be estimated due to insufficient variability in the data.")
  }

  # 🟠 WARNING 2
  unreliable_ci <- any(
    intensity_results$nH_inf > 0 &
      intensity_results$nH_inf <= 5
  )

  if (unreliable_ci) {
    warning("Some confidence intervals may be unreliable due to low variability or small sample size.
            Extreme bootstrap values may define the interval limits.")
  }

  # Combinar CI
  if (combine_ci) {
    intensity_results <- intensity_results %>%
      dplyr::mutate(
        CI = ifelse(nH_inf > 0,
                    paste(Lower_CI, Upper_CI, sep = " - "),
                    NA_character_)
      ) %>%
      dplyr::select(dplyr::all_of(group_vars), Sp, nH, nH_inf, A,
                    MeanI, MedI, CI, Observation)
  } else {
    intensity_results <- intensity_results %>%
      dplyr::select(dplyr::all_of(group_vars), Sp, nH, nH_inf, A,
                    MeanI, MedI, Lower_CI, Upper_CI, Observation)
  }

  # Selección final
  if (c_median) {
    intensity_results <- intensity_results %>% dplyr::select(-MeanI)
  } else {
    intensity_results <- intensity_results %>% dplyr::select(-MedI)
  }
  if (verbose) message("Calculation completed")
  return(intensity_results)
}