simdata_contaminated: Generates random variates from K multivariate contaminated...
In ramhiser/sortinghat: sortinghat


We generate n_k observations (k = 1, …, K) from each of K multivariate contaminated normal distributions. Let N_p(μ, Σ) denote the p-dimensional multivariate normal distribution with mean vector μ and positive-definite covariance matrix Σ. Then, let the kth population have a p-dimensional multivariate contaminated normal distribution:

simdata_contaminated(n, mean, cov, epsilon = rep(0, K), kappa = rep(1, K),
  seed = NULL)

`n`	a vector (of length K) of the sample sizes for each population
`mean`	a vector or a list (of length K) of mean vectors
`cov`	a symmetric matrix or a list (of length K) of symmetric covariance matrices
`epsilon`	a vector (of length K) indicating the probability of sampling a contaminated population (i.e., outlier) for each population
`kappa`	a vector (of length K) that determines the amount of scale contamination for each population
`seed`	seed for random number generation (if `NULL`, no seed is set)

(1 - ε_k) N_p(μ_k, Σ_k) + ε_k N_p(μ_k, κ_k Σ_k),

where ε_k ∈ [0, 1] is the probability of sampling from a contaminated population (i.e., outlier) and κ_k ≥ 1 determines the amount of scale contamination. The contaminated normal distribution can be viewed as a mixture of two multivariate normal distributions with the same mean, where the second has its covariance matrix scaled by κ_k. For sufficiently large κ_k, this second component can introduce extreme outliers.
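The mixture view above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `rcontam_sketch` is a hypothetical helper name, and the Cholesky-based sampling is one of several ways to draw multivariate normal variates.

```r
# Hypothetical helper, not part of sortinghat: draws n observations from
# (1 - eps) N_p(mu, Sigma) + eps N_p(mu, kappa * Sigma) using base R only.
rcontam_sketch <- function(n, mu, Sigma, eps = 0, kappa = 1) {
  p <- length(mu)
  # TRUE marks rows drawn from the contaminated (scaled-covariance) component
  contaminated <- runif(n) < eps
  # Standard normal draws, one row per observation
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # chol() returns an upper-triangular R with t(R) %*% R equal to Sigma,
  # so the rows of z %*% R have covariance Sigma
  x <- z %*% chol(Sigma)
  # Scaling a row by sqrt(kappa) scales its covariance by kappa
  x <- x * ifelse(contaminated, sqrt(kappa), 1)
  # Shift every row by the common mean vector
  x <- sweep(x, 2, mu, "+")
  list(x = x, contaminated = contaminated)
}

set.seed(42)
draws <- rcontam_sketch(n = 1000, mu = c(1, 0), Sigma = diag(2),
                        eps = 0.05, kappa = 10)
dim(draws$x)              # 1000 rows, 2 columns
mean(draws$contaminated)  # observed share of contaminated rows
```

With `eps = 0` or `kappa = 1` the sketch reduces to ordinary multivariate normal sampling, which matches the no-contamination case of the formula.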

The number of populations, K, is determined from the length of the vector of sample sizes, `n`. The mean vectors and covariance matrices can each be given in a list of length K. If one covariance matrix is given (as a matrix or a list having 1 element), then all populations share this common covariance matrix. The same logic applies to population means.

The contamination probabilities in epsilon can be given as a numeric vector or a single value, in which case the value is replicated K times. The same idea applies to the scale contamination in the kappa argument.

By default, epsilon is a vector of zeros, and kappa is a vector of ones. Hence, no contamination is applied by default.
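The recycling rules described above could look roughly like the following. This is a sketch, not code from the sortinghat source: `recycle_to_K` is a hypothetical helper, and raising an error on any other length mismatch is an assumption about reasonable behavior rather than documented behavior.

```r
# Hypothetical helper, not taken from the sortinghat source: expands a
# single value to length K and rejects any other length mismatch.
recycle_to_K <- function(value, K, name) {
  if (length(value) == 1) {
    return(rep(value, K))
  }
  if (length(value) != K) {
    stop(sprintf("'%s' must have length 1 or %d", name, K))
  }
  value
}

n <- c(10, 10, 10)
K <- length(n)  # number of populations, taken from the sample sizes

epsilon <- recycle_to_K(0.05, K, "epsilon")     # single value, replicated
kappa <- recycle_to_K(c(2, 5, 10), K, "kappa")  # already length K, kept as-is

# A single covariance matrix (or a 1-element list) is shared by all populations
cov <- diag(2)
if (is.matrix(cov)) cov <- list(cov)
cov_list <- recycle_to_K(cov, K, "cov")
```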

named list containing:

x: A matrix whose rows are the observations generated and whose columns are the p features (variables)
y: A vector denoting the population from which the observation in each row was generated
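A list of this shape can be summarized per population with base R. The `sim` object below is a small hand-built stand-in, not output from `simdata_contaminated`, and storing `y` as a factor is an assumption; the same calls work if `y` is an integer vector.

```r
# Hand-built stand-in for a returned list: 4 observations, p = 2, K = 2.
sim <- list(
  x = rbind(c(1.2, 0.1), c(0.8, -0.3), c(0.1, 1.4), c(-0.2, 0.9)),
  y = factor(c(1, 1, 2, 2))
)

# x has one row per label in y
stopifnot(nrow(sim$x) == length(sim$y))

# Per-population sample sizes
table(sim$y)

# Per-population feature means; each list element is a vector of length p
group_means <- lapply(split(as.data.frame(sim$x), sim$y), colMeans)
group_means
```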

# Generates 10 observations from each of two multivariate contaminated normal
# populations with equal covariance matrices. Each population has a
# contamination probability of 0.05 and scale contamination of 10.
mean_list <- list(c(1, 0), c(0, 1))
cov_identity <- diag(2)
data <- simdata_contaminated(n = c(10, 10), mean = mean_list,
                             cov = cov_identity, epsilon = 0.05, kappa = 10,
                             seed = 42)
dim(data$x)
table(data$y)

# Generates 10 observations from each of three multivariate contaminated
# normal populations with unequal covariance matrices. The contamination
# probabilities and scales differ for each population as well.
set.seed(42)
mean_list <- list(c(-3, -3), c(0, 0), c(3, 3))
cov_list <- list(cov_identity, 2 * cov_identity, 3 * cov_identity)
data2 <- simdata_contaminated(n = c(10, 10, 10), mean = mean_list,
                              cov = cov_list, epsilon = c(0.05, 0.1, 0.2),
                              kappa = c(2, 5, 10))
dim(data2$x)
table(data2$y)