R/data_documentation.R

#' High-correlation dataset with contamination
#'
#' Synthetic dataset generated from a multivariate normal distribution with
#' strong correlation structure (\eqn{\rho = 0.8}). It contains 550 observations
#' and 10 variables of mixed type (continuous, categorical, binary, and weights).
#' The last 50 rows are contaminated observations, created by adding
#' perturbations equal to three times the standard deviation of each quantitative
#' variable to a subset of the 500 original units. This yields a controlled 10\%
#' contamination level relative to the original sample (50/500). These data
#' follow the design in \insertCite{boj2024robustification}{dbrobust}.
#'
#' @format A data frame with 550 rows and 10 variables:
#' \describe{
#'   \item{V1}{Continuous variable 1}
#'   \item{V2}{Continuous variable 2}
#'   \item{V3}{Continuous variable 3}
#'   \item{V4}{Continuous variable 4}
#'   \item{V5}{Categorical variable 1 (3 categories, approx. balanced)}
#'   \item{V6}{Categorical variable 2 (3 categories, approx. balanced)}
#'   \item{V7}{Categorical variable 3 (4 categories, uniform distribution)}
#'   \item{V8}{Binary variable 1 (40\% zeros, 60\% ones)}
#'   \item{V9}{Binary variable 2 (60\% zeros, 40\% ones)}
#'   \item{w_loop}{Observation weights derived from the joint distribution of
#'   V5 and V8, following a proportional frequency-based scheme.}
#' }
#'
#' @details
#' \itemize{
#'  \item Continuous variables were drawn directly from the multivariate normal sample.
#'  \item Binary and categorical variables were obtained by discretizing normal margins
#'   using percentile-based thresholds.
#'  \item Contaminated observations (rows 501–550) were generated by adding
#'   3 standard deviations to each quantitative variable of selected original cases.
#'  \item The weights in \code{w_loop} follow the joint frequencies of V5 and V8,
#'   so frequent category combinations are prioritized.
#' }
#' @references
#' \insertRef{boj2024robustification}{dbrobust}
#'
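#' @examples
#' # Illustrative sketch only: a base-R approximation of the design described
#' # above, NOT the code used to build the packaged data. The seed, the number
#' # of normal margins (6 instead of the full set), and all object names are
#' # assumptions made for this example.
#' set.seed(1)
#' n <- 500; p <- 6; rho <- 0.8
#' Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
#' Z <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows ~ N(0, Sigma)
#' sim <- data.frame(Z[, 1:4])
#' names(sim) <- paste0("V", 1:4)
#' # Percentile-based discretization of two further normal margins
#' sim$V5 <- cut(Z[, 5], quantile(Z[, 5], c(0, 1/3, 2/3, 1)),
#'               labels = FALSE, include.lowest = TRUE)  # 3 balanced categories
#' sim$V8 <- as.integer(Z[, 6] > quantile(Z[, 6], 0.40)) # 40% zeros, 60% ones
#' # Contaminate 50 original units by adding 3 SD to each quantitative column
#' idx <- sample(n, 50)
#' shift <- 3 * sapply(sim[1:4], sd)
#' cont <- sim[idx, ]
#' cont[1:4] <- sweep(as.matrix(cont[1:4]), 2, shift, "+")
#' sim_cont <- rbind(sim, cont)  # clean rows first, contaminated rows last
#' dim(sim_cont)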
"Data_HC_contamination"

#' Moderate-correlation dataset with contamination
#'
#' Synthetic dataset generated from a multivariate normal distribution with
#' moderate correlation structure (\eqn{\rho = 0.6}). It contains 525 observations
#' and 10 variables of mixed type (continuous, categorical, binary, and weights).
#' The last 25 rows are contaminated observations, created by adding
#' perturbations equal to three times the standard deviation of each quantitative
#' variable to a subset of the 500 original units. This yields a controlled 5\%
#' contamination level relative to the original sample (25/500). These data
#' follow the design in \insertCite{boj2024robustification}{dbrobust}.
#'
#' @format A data frame with 525 rows and 10 variables:
#' \describe{
#'   \item{V1}{Continuous variable 1}
#'   \item{V2}{Continuous variable 2}
#'   \item{V3}{Continuous variable 3}
#'   \item{V4}{Continuous variable 4}
#'   \item{V5}{Categorical variable 1 (3 categories, approx. balanced)}
#'   \item{V6}{Categorical variable 2 (3 categories, approx. balanced)}
#'   \item{V7}{Categorical variable 3 (4 categories, uniform distribution)}
#'   \item{V8}{Binary variable 1 (40\% zeros, 60\% ones)}
#'   \item{V9}{Binary variable 2 (60\% zeros, 40\% ones)}
#'   \item{w_loop}{Observation weights derived from the joint distribution of
#'   V5 and V8, following a proportional frequency-based scheme.}
#' }
#'
#' @details
#' \itemize{
#'  \item Continuous variables were drawn directly from the multivariate normal sample.
#'  \item Binary and categorical variables were obtained by discretizing normal margins
#'   using percentile-based thresholds.
#'  \item Contaminated observations (rows 501–525) were generated by adding
#'   3 standard deviations to each quantitative variable of selected original cases.
#'  \item The weights in \code{w_loop} follow the joint frequencies of V5 and V8,
#'   so frequent category combinations are prioritized.
#' }
#' @references
#' \insertRef{boj2024robustification}{dbrobust}
#'
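#' @examples
#' # Illustrative sketch, not taken from the reference. It assumes the package
#' # is attached with lazy-loaded data, so the dataset is available by name.
#' d <- Data_MC_contamination
#' quant <- c("V1", "V2", "V3", "V4")
#' is_cont <- seq_len(nrow(d)) > 500  # rows 501-525 hold the contaminated cases
#' # Compare column means of the clean and the contaminated block
#' rbind(clean = colMeans(d[!is_cont, quant]),
#'       contaminated = colMeans(d[is_cont, quant]))
#' # Contamination level relative to the 500 original units
#' sum(is_cont) / sum(!is_cont)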
"Data_MC_contamination"

#' High-correlation dataset without contamination
#'
#' Synthetic dataset generated from a multivariate normal distribution with
#' strong correlation structure (\eqn{\rho = 0.8}). It contains 500 observations
#' and 10 variables of mixed type (continuous, categorical, binary, and weights).
#' No contaminated cases were added in this version, so the dataset represents
#' a clean scenario with 0\% contamination. These data follow the design in
#' \insertCite{boj2024robustification}{dbrobust}.
#'
#' @format A data frame with 500 rows and 10 variables:
#' \describe{
#'   \item{V1}{Continuous variable 1}
#'   \item{V2}{Continuous variable 2}
#'   \item{V3}{Continuous variable 3}
#'   \item{V4}{Continuous variable 4}
#'   \item{V5}{Categorical variable 1 (3 categories, approx. balanced)}
#'   \item{V6}{Categorical variable 2 (3 categories, approx. balanced)}
#'   \item{V7}{Categorical variable 3 (4 categories, uniform distribution)}
#'   \item{V8}{Binary variable 1 (40\% zeros, 60\% ones)}
#'   \item{V9}{Binary variable 2 (60\% zeros, 40\% ones)}
#'   \item{w_loop}{Observation weights derived from the joint distribution of
#'   V5 and V8, following a proportional frequency-based scheme.}
#' }
#'
#' @details
#' \itemize{
#'  \item Continuous variables were drawn directly from the multivariate normal sample.
#'  \item Binary and categorical variables were obtained by discretizing normal margins
#'   using percentile-based thresholds.
#'  \item Unlike the contaminated datasets in this collection, no artificial
#'   contamination was introduced here.
#'  \item The weights in \code{w_loop} follow the joint frequencies of V5 and V8,
#'   so frequent category combinations are prioritized.
#' }
#' @references
#' \insertRef{boj2024robustification}{dbrobust}
#'
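#' @examples
#' # Illustrative sketch, not taken from the reference. It assumes the package
#' # is attached with lazy-loaded data, so the dataset is available by name.
#' d <- Data_HC_no_contamination
#' # Sample correlations of the continuous variables, for comparison with the
#' # design value rho = 0.8
#' round(cor(d[, c("V1", "V2", "V3", "V4")]), 2)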
"Data_HC_no_contamination"

#' Moderate-correlation dataset without contamination
#'
#' Synthetic dataset generated from a multivariate normal distribution with
#' moderate correlation structure (\eqn{\rho = 0.6}). It contains 500 observations
#' and 10 variables of mixed type (continuous, categorical, binary, and weights).
#' No contaminated cases were added in this version, so the dataset represents
#' a clean scenario with 0\% contamination. These data follow the design in
#' \insertCite{boj2024robustification}{dbrobust}.
#'
#' @format A data frame with 500 rows and 10 variables:
#' \describe{
#'   \item{V1}{Continuous variable 1}
#'   \item{V2}{Continuous variable 2}
#'   \item{V3}{Continuous variable 3}
#'   \item{V4}{Continuous variable 4}
#'   \item{V5}{Categorical variable 1 (3 categories, approx. balanced)}
#'   \item{V6}{Categorical variable 2 (3 categories, approx. balanced)}
#'   \item{V7}{Categorical variable 3 (4 categories, uniform distribution)}
#'   \item{V8}{Binary variable 1 (40\% zeros, 60\% ones)}
#'   \item{V9}{Binary variable 2 (60\% zeros, 40\% ones)}
#'   \item{w_loop}{Observation weights derived from the joint distribution of
#'   V5 and V8, following a proportional frequency-based scheme.}
#' }
#'
#' @details
#' \itemize{
#'  \item Continuous variables were drawn directly from the multivariate normal sample.
#'  \item Binary and categorical variables were obtained by discretizing normal margins
#'   using percentile-based thresholds.
#'  \item Unlike the contaminated datasets in this collection, no artificial
#'   contamination was introduced here.
#'  \item The weights in \code{w_loop} follow the joint frequencies of V5 and V8,
#'   so frequent category combinations are prioritized.
#' }
#' @references
#' \insertRef{boj2024robustification}{dbrobust}
#'
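#' @examples
#' # Illustrative sketch, not taken from the reference. It assumes the package
#' # is attached with lazy-loaded data and that w_loop is stored as numeric.
#' d <- Data_MC_no_contamination
#' # Joint frequencies of the two variables that drive the weights
#' table(V5 = d$V5, V8 = d$V8)
#' # Mean weight per (V5, V8) cell, to set against the frequencies above
#' aggregate(w_loop ~ V5 + V8, data = d, FUN = mean)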
"Data_MC_no_contamination"

dbrobust documentation built on Nov. 5, 2025, 6:24 p.m.